Table of Contents
- Running scpca-nf as a Data Lab staff member
- Maintaining references for scpca-nf
- Running the merge workflow
This section provides instructions for running the main workflow, found in main.nf.
Note that there are three other workflows:
- build-index.nf for building reference indices (see #adding-additional-organisms)
- build-celltype-ref.nf for creating cell type annotation references (see #adding-additional-cell-type-references)
- merge.nf for merging processed objects produced by the main workflow (see #running-the-merge-workflow)
The instructions below assume that you are a member of the Data Lab with access to AWS.
Most of the workflow settings described are configured for the ALSF Childhood Cancer Data Lab computational infrastructure.
To process samples that are not part of the ScPCA project, please see the instructions on using scpca-nf with external data.
To process single-cell and single-nuclei samples using scpca-nf, you will need access to 24 GB of RAM and 12 CPUs, so we recommend using AWS Batch.
The first step in running the workflow is ensuring that your AWS credentials are configured.
You can then run the workflow with the batch profile, which is named in the nextflow.config file, with full settings defined in the profile_awsbatch.config file.
Note that you will also need to use the ccdl profile to set file paths; you can specify both profiles by separating them with a comma.
Run the workflow using the following command:
nextflow run AlexsLemonade/scpca-nf -profile ccdl,batch
There are several flags and/or parameters which you may additionally wish to specify, as follows.
- Nextflow flags:
  - -resume: Resume workflow from most recent checkpoint
  - -with-tower: Use Nextflow Tower to monitor workflow (requires separate Nextflow Tower registration)
- Workflow parameters:
  - --run_ids list,of,ids: A custom comma-separated list of ids (run, library, or sample) for this run. The default run ids are "SCPCR000001,SCPCS000101".
  - --project list,of,project_ids: A custom comma-separated list of project ids for this run.
  - --repeat_mapping: Use this flag to repeat mapping, even if results already exist.
    - By default, the workflow checks whether each library has existing alevin-fry or salmon mapping results, and skips mapping for libraries with existing results. Using this flag will override that default behavior and repeat mapping even if the given library's results exist.
    - For more implementation details, please refer to the external instructions.
  - --skip_genetic_demux: Use this flag to skip genetic demultiplexing, which is turned on by default.
    - Genetic demultiplexing requires mapping of both bulk and single-cell data, followed by SNP calling and genetic demultiplexing, which can be quite time consuming.
    - When genetic demultiplexing is skipped, the workflow will still perform cellhash-based demultiplexing, if available for a given library.
  - --repeat_genetic_demux: Use this flag to repeat genetic demultiplexing, even if results already exist.
    - By default, the workflow checks whether each library has existing genetic demultiplexing results, and skips genetic demultiplexing for libraries with existing results. Using this flag will override that default behavior and repeat genetic demultiplexing even if the given library's results exist.
  - --perform_celltyping: Use this flag to perform cell type annotation, which is turned off by default.
  - --repeat_celltyping: Use this flag to repeat cell type annotation, even if results already exist.
    - By default, the workflow checks whether each library has existing cell type annotation results for SingleR and/or CellAssign (depending on references for that library). Using this flag will override that default behavior and repeat cell type annotation even if the given library's results exist.
    - This flag is only considered if --perform_celltyping is also used.
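As an illustration of combining the flags above, the following sketch assembles a command that resumes a run for a custom set of ids and optionally turns on cell type annotation. The helper name build_scpca_cmd is hypothetical and not part of scpca-nf; it only prints the command so that it can be reviewed before being run by hand.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) an scpca-nf command for a set of ids.
# $1: comma-separated run, library, or sample ids
# $2: "celltype" to add --perform_celltyping (optional)
build_scpca_cmd() {
  cmd="nextflow run AlexsLemonade/scpca-nf -profile ccdl,batch -resume --run_ids $1"
  if [ "${2:-}" = "celltype" ]; then
    cmd="$cmd --perform_celltyping"
  fi
  printf '%s\n' "$cmd"
}

# Uses the default run ids listed above as example input.
build_scpca_cmd "SCPCR000001,SCPCS000101" celltype
```

Printing rather than executing keeps the sketch safe to try on a machine without AWS access.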
Please refer to nextflow.config and other configuration files for other parameters which can be modified.
To test the workflow with the stub profile, use the following command:
nextflow -log stub-run.log run main.nf -stub -profile stub
The run log will be saved to stub-run.log.
By default, the output directory (outdir parameter) is set as test/output.
You may wish to verify that output contents in test/output/checkpoints and test/output/results are as expected when running the stub workflow.
Please refer to our CONTRIBUTING.md for more information on maintaining the stub workflow.
When running the workflow for a project or group of samples that is ready to be released on the ScPCA Portal, please use the tag for the latest release:
nextflow run AlexsLemonade/scpca-nf -r v0.8.5 -profile ccdl,batch --project SCPCP000000
We provide external users with an example of the expected outputs from running scpca-nf.
If there have been major updates to the directory structure or the contents of the output, the example data should be re-processed so that the example output we provide mimics the current expected output from scpca-nf.
First, please check that the metadata files present in s3://scpca-references/example-data are up to date with changes in the workflow and reflect the contents of the files present in the examples directory of this repository.
Each of these files should be present, with the expected input columns as described in each documentation link.
- example_run_metadata.tsv (documentation)
- example_sample_metadata.tsv (documentation)
- example_project_celltype_metadata.tsv (documentation)
- example_multiplex_pools.tsv (documentation)
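One way to confirm that a downloaded metadata file has the expected columns is to compare its header against a list of names. The sketch below assumes the files are tab-separated with a header line. The helper has_columns is hypothetical, and the column names in the usage comment are placeholders that should be replaced with the columns given in the linked documentation.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every named column is in the TSV header.
# $1: path to a TSV file; remaining arguments: required column names
has_columns() {
  file="$1"; shift
  header="$(head -n 1 "$file")"
  for col in "$@"; do
    # Split the header into one name per line and look for an exact match.
    printf '%s\n' "$header" | tr '\t' '\n' | grep -qxF -- "$col" || {
      echo "missing column: $col" >&2
      return 1
    }
  done
}

# Example usage with placeholder column names (assumptions, not from this document):
# has_columns example_run_metadata.tsv scpca_run_id scpca_library_id
```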
Once you have confirmed that the metadata looks correct, use the following commands to run the workflow and re-process the example data:
# Obtain the latest development version
nextflow pull AlexsLemonade/scpca-nf -r development
# Run the workflow with the example config
nextflow run AlexsLemonade/scpca-nf -r development -profile example,batch
After successful completion of the run, the scpca_out folder containing the outputs from scpca-nf should be zipped up and stored at the following location: s3://scpca-references/example-data/scpca_out.zip.
Make sure to adjust the settings to make the zip file publicly accessible.
Inside the references folder are files and scripts related to maintaining the reference files available for use with scpca-nf.
- ref-metadata.tsv: Each row of this TSV file corresponds to a reference that is available for mapping with scpca-nf. The columns included specify the organism (e.g., Homo_sapiens), assembly (e.g., GRCh38), and version (e.g., 104) of the fasta obtained from Ensembl that was used to build the reference files. This file is used as input to the build-index.nf workflow, which will create all required index files for scpca-nf for the listed organisms in the metadata file, provided the fasta and gtf files are stored in the proper location on S3. See instructions for adding additional organisms for more details.
- scpca-refs.json: Each entry of this file contains a supported reference for mapping with scpca-nf and the name used to refer to that supported reference, e.g., Homo_sapiens.GRCh38.104. For each supported reference, a list of all the reference files that are needed to run scpca-nf will be included. This file is required as input to scpca-nf.
- celltype-reference-metadata.tsv: Each row of this TSV file corresponds to a supported cell type reference available for cell type annotation. This file is required as input to the build-celltype-ref.nf workflow to create and/or update cell type references. For all references, the following columns must be populated:
  - celltype_ref_name (e.g., BlueprintEncodeData or blood-compartment)
  - celltype_ref_source (e.g., celldex or PanglaoDB)
  - celltype_method (e.g., SingleR or CellAssign)
  - All references obtained from the PanglaoDB source also require an organs column containing the list of supported PanglaoDB organs to include when building the reference. This should be a comma-separated list of all organs to include. To find all possible organs, see the organs column of PanglaoDB_markers_2020-03-27.tsv.
  See instructions for adding additional cell type references for more details.
- PanglaoDB_markers_2020-03-27.tsv: This file is used to build the cell type references from PanglaoDB. This file was obtained by clicking the get tsv file button on the PanglaoDB Dataset page and replacing the date in the filename with a date in ISO8601 format. This file is required as input to the build-celltype-ref.nf workflow, which will create all required cell type references for the main workflow to use during cell type annotation.
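To see how rows of ref-metadata.tsv relate to the reference names used in scpca-refs.json, the sketch below joins the organism, assembly, and version columns with periods. This naming pattern is inferred from the single example given above (Homo_sapiens.GRCh38.104), and the helper ref_names is hypothetical. The repository's create-reference-json.R script remains the source of truth.

```shell
#!/bin/sh
# Hypothetical helper: print one reference name per row of a TSV with a header.
# Columns are located by header name, so their order in the file does not matter.
# $1: path to a TSV file with organism, assembly, and version columns
ref_names() {
  awk -F '\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) idx[$i] = i
      # Stop early if any expected column is absent from the header.
      if (!idx["organism"] || !idx["assembly"] || !idx["version"]) exit 1
      next
    }
    { print $(idx["organism"]) "." $(idx["assembly"]) "." $(idx["version"]) }
  ' "$1"
}

# Example usage, run from the directory that contains the file:
# ref_names ref-metadata.tsv
```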
Adding additional organisms is handled, in part, by the build-index.nf workflow.
Follow the steps below to add support for additional references:
- Download the desired fasta and gtf files for the organism of choice from Ensembl. Add these to the s3://scpca-references bucket with the following directory structure, where the root directory corresponds to the organism and the subdirectory corresponds to the Ensembl version:
  homo_sapiens
  └── ensembl-104
      ├── annotation
      │   └── Homo_sapiens.GRCh38.104.gtf.gz
      └── fasta
          ├── Homo_sapiens.GRCh38.dna.primary_assembly.fa.fai
          └── Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
- Add the organism, assembly, and version associated with the new reference to the ref-metadata.tsv file.
- Generate an updated scpca-refs.json by running the script, create-reference-json.R, located in the scripts directory.
- Generate the index files using nextflow run build-index.nf -profile ccdl,batch from the root directory of this repository.
- Ensure that the new reference files are public and in the correct location on S3 (s3://scpca-references).
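Before uploading, it can help to write out the locations where the input files are expected to land. The sketch below generalizes from the single homo_sapiens example above: a lowercase organism directory, an ensembl-<version> subdirectory, and file names built from organism, assembly, and version. That generalization is an assumption, and the helper expected_ref_paths is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the S3 locations expected for one reference,
# following the pattern of the homo_sapiens example above.
# $1: organism (e.g., Homo_sapiens)  $2: assembly  $3: Ensembl version
expected_ref_paths() {
  # The root directory in the example is the organism name in lowercase.
  root="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  base="s3://scpca-references/$root/ensembl-$3"
  printf '%s\n' \
    "$base/annotation/$1.$2.$3.gtf.gz" \
    "$base/fasta/$1.$2.dna.primary_assembly.fa.fai" \
    "$base/fasta/$1.$2.dna.primary_assembly.fa.gz"
}

expected_ref_paths Homo_sapiens GRCh38 104
```

Each printed path could then be checked by hand, for example with aws s3 ls, once the upload is complete.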
Adding additional references to use for cell type annotation is handled by the build-celltype-ref.nf workflow.
Reference files are created and automatically named by the build-celltype-ref.nf workflow.
We currently only support celldex and PanglaoDB as reference sources for SingleR and CellAssign cell type annotation, respectively.
Follow these steps to add support for additional cell type references.
- Add the celltype_ref_name, celltype_ref_source, celltype_method, and organs (if applicable) for the new reference to celltype-reference-metadata.tsv.
  - <celltype_ref_name> represents the reference dataset name. For use with SingleR, this should be taken directly from a celldex dataset. For CellAssign, names are established by the Data Lab as <tissue/organ>-compartment to represent a set of markers for a given tissue/organ.
  - <celltype_ref_source> represents the reference dataset source. Currently only celldex and PanglaoDB are supported for SingleR and CellAssign, respectively.
  - <celltype_method> represents which annotation method to use with the specified reference, either SingleR or CellAssign.
  - organs indicates which organs to include when creating references with PanglaoDB as the celltype_ref_source. This must be a comma-separated list of all organs to include.
- Generate the new cell type reference using nextflow run build-celltype-ref.nf -profile ccdl,batch from the root directory of this repository.
- Ensure that the new reference files are public and in the correct location on S3.
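When filling in the organs column, it can be useful to list the distinct organ values present in the PanglaoDB marker file. The following sketch prints the unique values of any named column in a TSV file. The helper unique_column_values is hypothetical, and the column name is passed as an argument because the exact header text in the downloaded file should be checked rather than assumed.

```shell
#!/bin/sh
# Hypothetical helper: print the sorted unique values of one named TSV column.
# $1: path to a TSV file; $2: column name exactly as it appears in the header
unique_column_values() {
  awk -F '\t' -v col="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
    c && $c != "" { print $c }
  ' "$1" | sort -u
}

# Example usage (the header name is an assumption; confirm it in the file first):
# unique_column_values PanglaoDB_markers_2020-03-27.tsv organs
```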
SingleR reference files, which are the full reference datasets from the celldex package, should be in s3://scpca-references/celltype/singler_references and named as <celltype_ref_name>_<celltype_ref_source>_<version>.rds.
Corresponding "trained" model files for use in the cell type annotation workflow should be stored in s3://scpca-references/celltype/singler_models, named as <celltype_ref_name>_<celltype_ref_source>_<version>_model.rds.
- <celltype_ref_name> is a given celldex dataset.
  - Note that the workflow parameter singler_label_name will determine which celldex dataset label is used for annotation; by default, we use "label.ont" (ontology labels).
- <celltype_ref_source> is celldex.
- <version> is the celldex version used during reference building, where we use dashes in place of periods (e.g., version x.y.z would be represented as x-y-z).
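The naming rule above can be sketched as a small function that swaps periods for dashes in the version and prints both the reference and model file names. The helper singler_file_names is hypothetical, and the version 1.10.1 is a placeholder rather than the celldex version actually in use.

```shell
#!/bin/sh
# Hypothetical helper: print the expected SingleR reference and model file names.
# $1: celltype_ref_name  $2: celltype_ref_source  $3: celldex version (x.y.z)
singler_file_names() {
  # Periods in the version are replaced with dashes, e.g. x.y.z becomes x-y-z.
  stem="${1}_${2}_$(printf '%s' "$3" | tr '.' '-')"
  printf '%s.rds\n%s_model.rds\n' "$stem" "$stem"
}

singler_file_names BlueprintEncodeData celldex 1.10.1
# prints:
# BlueprintEncodeData_celldex_1-10-1.rds
# BlueprintEncodeData_celldex_1-10-1_model.rds
```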
CellAssign organ-specific reference gene matrices should be stored in s3://scpca-references/celltype/cellassign_references and named as <celltype_ref_name>_<celltype_ref_source>_<date>.tsv.
- <celltype_ref_name> is a given reference name established by the Data Lab.
- <celltype_ref_source> is PanglaoDB.
- <date> is the PanglaoDB date, which serves as their version, in ISO8601 format.
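As a quick check of the CellAssign naming rule, the sketch below builds a file name and rejects dates that do not look like YYYY-MM-DD. The helper cellassign_ref_file is hypothetical, and the example values are taken from the reference name, source, and date mentioned elsewhere in this document. The resulting file name is illustrative and is not confirmed to exist on S3.

```shell
#!/bin/sh
# Hypothetical helper: print the expected CellAssign reference file name.
# $1: celltype_ref_name  $2: celltype_ref_source  $3: date as YYYY-MM-DD
cellassign_ref_file() {
  # Only the shape of the date is checked, not whether it is a real calendar date.
  case "$3" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "date must look like YYYY-MM-DD: $3" >&2; return 1 ;;
  esac
  printf '%s_%s_%s.tsv\n' "$1" "$2" "$3"
}

cellassign_ref_file blood-compartment PanglaoDB 2020-03-27
# prints: blood-compartment_PanglaoDB_2020-03-27.tsv
```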
The workflow in merge.nf merges (but does not integrate) processed objects produced by the main scpca-nf workflow.
For a given SCPCA project id, the workflow creates a merged SCE object, a merged AnnData object, and an associated merged object HTML report.
Note that the workflow will also merge any present alternative experiments (e.g., ADT data from CITE-seq) except for cell hashing alternative experiments.
The merge workflow uses the following parameters:
- project, the SCPCA project id whose objects should be merged. If running multiple projects, provide a comma-separated list of project ids.
- merge_run_ids, the set of run ids, library ids, or sample ids to include in the merged object. By default, all run ids for the given project will be included.
- run_metafile, the metadata file (scpca-library-metadata.tsv) which contains information about libraries to merge.
  - This is specified in the ccdl profile configuration file.
Data Lab members with access to AWS can run the workflow with the following command(s):
# Run a single project
nextflow run merge.nf -profile ccdl,batch --project SCPCP000000
# Run more than one project
nextflow run merge.nf -profile ccdl,batch --project SCPCP00000X,SCPCP00000Y
# Specify a set of run ids to use
nextflow run merge.nf -profile ccdl,batch --project SCPCP000000 --run_ids SCPCR00000X,SCPCR00000Y