Table of Contents
- Overview
- File organization
- Prepare the run metadata file
- Prepare the sample metadata file
- Configuring scpca-nf for your environment
- Cell type annotation
- Output files
- Special considerations for specific data types
- Additional workflow settings
- The merge.nf workflow
Using scpca-nf
to process your own single-cell and single-nuclei RNA-seq data requires access to a high performance computing (HPC) environment that can accommodate up to 24 GB of RAM and 12 CPU cores.
Some datasets and processes (genetic demultiplexing and spatial transcriptomics) may require additional resources, and our default configuration allows up to 96 GB of RAM and 24 CPU cores.
While the workflow does support scaling down requirements in lower-resource environments, we have not tested extensively in those conditions, and some components may fail.
After identifying the system that you will use to execute the Nextflow workflow, you will need to follow the steps outlined in this document to complete the setup process.
Here we provide an overview of the steps you will need to complete:
- Install the necessary dependencies. You will need to make sure you have the following software installed on the HPC where you plan to execute the workflow:
  - Nextflow, the main workflow engine that scpca-nf relies on. This can be downloaded and installed by any user, with minimal external requirements.
  - Docker or Singularity, which allows the use of container images that encapsulate other dependencies used by the workflow reproducibly. These usually require installation by system administrators, but most HPC systems have one available (usually Singularity).
  - Other software dependencies, as well as the workflow files themselves, are handled by Nextflow, which will download Docker or Singularity images as required. The scpca-nf workflow does not need to be downloaded separately. However, if nodes on your HPC do not have direct internet access, you will need to follow our instructions to download reference files and container images.
- Organize your files. You will need to have your files organized in a particular manner so that each folder contains only the FASTQ files that pertain to a single library. See the section below on file organization for more information on how to set up your files.
- Create a run metadata file and sample metadata file. Create two TSV (tab-separated values) files: one file with one sequencing library per row and pertinent information related to that sequencing run in each column (run metadata), and the other file with one sample per row and any relevant sample metadata (e.g., diagnosis, age, sex, cell line) (sample metadata). See the sections below on preparing a run metadata file and sample metadata file for more information on creating a metadata file for your samples.
- Create a configuration file and define a profile. Create a configuration file that stores user-defined parameters and a profile indicating the system and other system-related settings to use for executing the workflow. See the section below on configuring scpca-nf for your environment for more information on setting up the configuration files to run Nextflow on your system.
The standard configuration of the scpca-nf workflow expects that compute nodes will have direct access to the internet, and will download reference files and container images with any required software as needed.
If your HPC system does not allow internet access from compute nodes, you will need to download the required reference files and software before running, following the instructions we have provided.
Once you have set up your environment and created the metadata and configuration files, you will be able to start your run as follows, adding any additional optional parameters that you may choose:
nextflow run AlexsLemonade/scpca-nf \
-config <path to config file> \
-profile <name of profile>
Where <path to config file> is the relative path to the configuration file that you have set up and <name of profile> is the name of the profile that you chose when creating a profile.
This command will pull the scpca-nf workflow directly from GitHub, and run it based on the settings in the configuration file that you have defined.
Note: scpca-nf
is under active development.
Using the above command will run the workflow from the main
branch of the workflow repository.
To update to the latest released version you can run nextflow pull AlexsLemonade/scpca-nf
before the nextflow run
command.
To be sure that you are using a consistent version, you can specify a tagged release of the workflow with the -r flag, as shown below.
The command below will pull the scpca-nf workflow directly from GitHub using the v0.8.5 version.
Released versions can be found on the scpca-nf
repository releases page.
nextflow run AlexsLemonade/scpca-nf \
-r v0.8.5 \
-config <path to config file> \
-profile <name of profile>
For each library that is successfully processed, the workflow will return quantified gene expression data as a SingleCellExperiment
object stored in an RDS file along with a summary HTML report and any relevant intermediate files.
For a complete description of the expected output files, see the section describing output files.
You will need to have files organized so that all the sequencing files for each library are in their own directory or folder.
Each folder should be named with a unique ID, corresponding to the scpca_run_id
column of the metadata file.
Any sequencing runs that contain multiple libraries must be demultiplexed and FASTQ files must be placed into separate distinct folders, with distinct run IDs as the folder name.
If the same sequencing library was sequenced across multiple flow cells (e.g., to increase coverage), all FASTQ files should be combined into the same folder.
If a library has a corresponding ADT library and therefore a separate set of FASTQ files, the FASTQ files corresponding to the ADT library should be in their own folder, with a unique run ID.
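As a rough illustration of the layout described above, the following Python sketch reports run IDs whose folder is missing or contains no FASTQ files. The function name `find_missing_run_dirs` and the list of file suffixes are assumptions made for this example; neither is part of scpca-nf.

```python
from pathlib import Path

# Assumed FASTQ file suffixes (not defined by scpca-nf); adjust to match your files.
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def find_missing_run_dirs(base_dir, run_ids):
    """Return run IDs whose folder is missing or holds no FASTQ files."""
    problems = []
    for run_id in run_ids:
        run_dir = Path(base_dir) / run_id
        # A run is usable only if its folder exists and has at least one FASTQ file.
        has_fastq = run_dir.is_dir() and any(
            f.name.endswith(FASTQ_SUFFIXES) for f in run_dir.iterdir() if f.is_file()
        )
        if not has_fastq:
            problems.append(run_id)
    return problems
```

Calling `find_missing_run_dirs("/path/to/fastq", ["run01", "run02"])` with your own base directory and run IDs returns an empty list when every run folder is in place.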
Using scpca-nf
requires a run metadata file as a TSV (tab separated values) file, where each sequencing run to be processed is a row and columns contain associated information about that run.
For each sequencing run, you will need to provide a run ID (scpca_run_id), library ID (scpca_library_id), and sample ID (scpca_sample_id).
The run ID will correspond to the name of the folder that contains the FASTQ files associated with the sequencing run.
See the section on file organization above for more information.
The library ID will be unique for each set of cells that have been isolated from a sample and have undergone droplet generation. For single-cell/single-nuclei RNA-seq runs, the library ID should be unique for each sequencing run. For libraries that have corresponding ADT or cellhash runs, they should share the same library ID as the associated single-cell/single-nuclei RNA-seq run, indicating that the sequencing data has been generated from the same group of cells.
Finally, the sample ID will indicate the unique tissue or source from which a sample was collected. If you have two libraries that have been generated from the same original tissue, then they will share the same sample ID.
For more information on understanding the difference between library and sample IDs, see the FAQ on library and sample IDs in the ScPCA portal documentation.
Before using the workflow with data that you might plan to submit to ScPCA, please be sure to obtain a list of sample identifiers to use for your samples from the Data Lab.
We will provide IDs that can be used for scpca_run_id, scpca_library_id, and scpca_sample_id based on the number and types of samples that are being processed to avoid overlap with existing sample identifiers.
To run the workflow, you will need to create a tab separated values (TSV) metadata file with the following required columns:
column_id | contents |
---|---|
scpca_run_id | A unique run ID |
scpca_library_id | A unique library ID for each unique set of cells |
scpca_sample_id | A unique sample ID for each tissue or unique source. For multiplexed libraries, separate multiple samples with semicolons (;) |
scpca_project_id | A unique ID for each group of related samples. All results for samples with the same project ID will be returned in the same folder labeled with the project ID. |
technology | Sequencing/library technology used. For single-cell/single-nuclei libraries use either 10Xv2, 10Xv2_5prime, 10Xv3, or 10Xv3.1. For ADT (CITE-seq) libraries use either CITEseq_10Xv2, CITEseq_10Xv3, or CITEseq_10Xv3.1. For cellhash libraries use either cellhash_10Xv2, cellhash_10Xv3, or cellhash_10Xv3.1. For bulk RNA-seq use either single_end or paired_end. For spatial transcriptomics use visium. |
assay_ontology_term_id | Experimental Factor Ontology term ID associated with the tech_version |
seq_unit | Sequencing unit (one of: cell, nucleus, bulk, or spot) |
sample_reference | The name of the reference to use for mapping; available references include Homo_sapiens.GRCh38.104 and Mus_musculus.GRCm39.104 |
files_directory | The full path/uri to the directory containing FASTQ files (unique per run) |
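
Before launching a run, it can help to confirm that a metadata file has every required column. The sketch below is one possible pre-flight check in Python, not part of scpca-nf: the function name `check_run_metadata` is hypothetical, and the column names and allowed seq_unit values are copied from the table above.

```python
import csv
import io

# Required column names, copied from the required-columns table.
REQUIRED_COLUMNS = [
    "scpca_run_id", "scpca_library_id", "scpca_sample_id", "scpca_project_id",
    "technology", "assay_ontology_term_id", "seq_unit", "sample_reference",
    "files_directory",
]
SEQ_UNITS = {"cell", "nucleus", "bulk", "spot"}


def check_run_metadata(tsv_text):
    """Return a list of human-readable problems found in run metadata text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems  # row checks need all columns, so stop here
    seen_runs = set()
    for row in reader:
        run_id = row["scpca_run_id"]
        seq_unit = row["seq_unit"]
        if run_id in seen_runs:
            problems.append(f"duplicate run ID: {run_id}")
        seen_runs.add(run_id)
        if seq_unit not in SEQ_UNITS:
            problems.append(f"{run_id}: unexpected seq_unit '{seq_unit}'")
    return problems
```

To check a file on disk, read it first, for example `check_run_metadata(open("run_metadata.tsv").read())`; an empty list means none of these particular problems were found.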
The following optional columns may be necessary for running other data modalities (CITE-seq, spatial transcriptomics) or including existing cell type labels:
column_id | contents |
---|---|
feature_barcode_file | The full path/uri to TSV file containing the feature barcode sequences (only required for ADT and cellhash samples); for samples with ADT tags, this file can optionally indicate whether antibodies are targets or controls |
feature_barcode_geom | A salmon --read-geometry layout string. See https://github.com/COMBINE-lab/salmon/releases/tag/v1.4.0 for details (only required for ADT and cellhash samples) |
slide_section | The slide section for spatial transcriptomics samples (only required for spatial transcriptomics) |
slide_serial_number | The slide serial number for spatial transcriptomics samples (only required for spatial transcriptomics) |
submitter_cell_types_file | The full path/uri to TSV file containing cell labels if you have cell type annotation results to include. See instructions below for more information about preparing this file |
We have provided an example run metadata file for reference.
View example run_metadata.tsv file |
---|
Using scpca-nf
requires a sample metadata file as a TSV (tab separated values) file, where each unique sample that is present in the scpca_sample_id
column of the run metadata file is a row, and columns contain any relevant sample metadata (e.g., diagnosis, age, sex, cell line).
For each library that is processed, the corresponding sample metadata will be added to the SingleCellExperiment
and AnnData
objects output by the workflow (see the section on Output files).
At a minimum, all sample metadata tables must contain a column with scpca_sample_id
as the header.
The contents of this column should contain all unique sample IDs that are present in the scpca_sample_id
column of the run metadata file.
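One way to check this requirement is sketched below in Python. The function name `missing_sample_ids` is hypothetical, and the example assumes you have already read the scpca_sample_id column from each file into a list. Values from the run metadata are split on semicolons, since multiplexed libraries list several sample IDs in one field.

```python
def missing_sample_ids(run_sample_values, sample_metadata_ids):
    """Return sample IDs used in the run metadata but absent from the sample metadata.

    run_sample_values: values of the scpca_sample_id column in the run metadata;
        multiplexed libraries list several IDs separated by semicolons.
    sample_metadata_ids: values of the scpca_sample_id column in the sample metadata.
    """
    expected = set()
    for value in run_sample_values:
        # Split multiplexed entries and ignore empty fragments.
        expected.update(s.strip() for s in value.split(";") if s.strip())
    return sorted(expected - set(sample_metadata_ids))
```

An empty list means every sample in the run metadata has a row in the sample metadata.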
We encourage you to use standard terminology, such as ontology terms, to describe samples when possible.
There is no limit to the number of columns allowed for the sample metadata, and you may include as many metadata fields as you please.
Some suggested columns include diagnosis, tissue, age, sex, stage of disease, cell line.
Additionally, you may include columns is_cell_line and is_xenograft to indicate the sample type:
- is_cell_line: Use TRUE if the sample is from a cell line and FALSE otherwise. Cell type annotation will not be performed for samples that are TRUE.
- is_xenograft: Use TRUE if the sample is from a patient-derived xenograft and FALSE otherwise.
We have provided an example sample metadata file for reference.
View example sample_metadata.tsv file |
---|
Before using the workflow with data that you might plan to submit to ScPCA, please be sure to look at the guidelines for sample metadata.
Three workflow parameters are required for running scpca-nf
on your own data:
- run_metafile: the metadata file with library information, prepared according to the directions above.
  - This has a default value of run_metadata.tsv, but you will likely want to set your own file path.
- sample_metafile: the metadata file with sample information, prepared according to the directions above.
  - This has a default value of sample_metadata.tsv, but you will likely want to set your own file path.
- outdir: the output directory where results will be stored.
  - The default output is scpca_out, but again, you will likely want to customize this.
These parameters can be set at the command line using --run_metafile <path to run metadata file>
or --outdir <path to output>
, but we encourage you to set them in the configuration file, following the configuration file setup instructions below.
Note that workflow parameters such as --run_metafile
and --outdir
are denoted at the command line with double hyphen prefix, while options that affect Nextflow itself have only a single hyphen.
There are also a number of optional parameters that can be set, either at the command line or in a configuration file, including:
- max_cpus: the maximum number of CPU cores to use for a single process (default: 24)
- max_memory: the maximum amount of memory to use for a single process (default: 96.GB)
Other customizable parameters can be found in the nextflow.config
file in the repository.
Note that all parameters can be overridden with a user config file or at the command line; nextflow.config
itself should not need modification.
Workflow parameters can also be set in a configuration file by setting the values params.run_metafile
, params.sample_metafile
, and params.outdir
as follows.
We could first create a file my_config.config
(or a filename of your choice) with the following contents:
// my_config.config
params.run_metafile = '<path to run metadata file>'
params.sample_metafile = '<path to sample metadata file>'
params.outdir = '<path to output>'
params.max_cpus = 24
params.max_memory = 96.GB
The max_cpus
and max_memory
parameters should reflect the maximum number of CPUs and memory available for a single process in your environment.
This file is then used with the -config
(or -c
) argument at the command line:
nextflow run AlexsLemonade/scpca-nf \
-config my_config.config
For reference, we provide an example template configuration file, user_template.config
, which includes some other workflow parameters that may be useful, as well as an example of configuring a profile for executing the workflow on a cluster, discussed below.
Note: This example tells Nextflow to use the configuration set up in the configuration file, but it does not invoke a specific profile, and will use the standard
profile.
Under the standard
profile, Nextflow will attempt to run the workflow locally using Docker.
This will most likely result in an error unless the minimum computing requirements (24 GB of RAM and 12 CPUs) are met on the local machine.
For more on creating and using a profile see the section below.
See the Nextflow documentation and the below sections for more detail on creating your own configuration file.
Processing single-cell and single-nuclei samples requires access to 24 GB of RAM and 12 CPUs so you will most likely want to run your workflow in a high performance computing environment (HPC), such as an institutional computing cluster or on a cloud service like AWS.
To do this, we recommend using Nextflow profiles to encapsulate settings like the executor
that will be used to run each process and associated details that may be required, such as queue names or the container engine (i.e., Docker or Singularity) your system uses.
You will likely want to consult your HPC documentation and/or support staff to determine recommended settings.
Note: To use the default index files and default cell type reference files, which are stored on S3, compute nodes must have access to the internet.
You may also need to supply AWS credentials for S3 access, or set aws.client.anonymous = true
within the Nextflow profile.
In our example template file user_template.config
, we define a profile named cluster
which could be invoked with the following command:
nextflow run AlexsLemonade/scpca-nf \
-config user_template.config \
-profile cluster
At the Data Lab, we use Nextflow with the Amazon Web Services (AWS) Batch compute environment.
If you are interested in using AWS Batch with scpca-nf, we provide some basic instructions here to get you started.
Be aware, AWS management can be quite complex, with many interacting parts and mysterious acronyms.
We encourage you to read the official Nextflow instructions for running pipelines on AWS, which includes information about security and permission settings that are beyond the scope of this document.
To run scpca-nf
, you will need to set up at least one batch queue and an associated compute environment configured with a custom Amazon Machine Image (AMI) prepared according to the Nextflow instructions.
You will also need an S3 bucket path to use as the Nextflow work
directory for intermediate files.
As the intermediate files can get quite large, you will likely want to set up a lifecycle rule to delete files from this location after a fixed period of time (e.g., 30 days).
In most Batch queue setups, each AWS compute node has a fixed amount of disk space.
We found it useful to have two queues: one for general use and one for jobs that may require larger amounts of disk space.
The two compute environments use the same AMI, but use Launch Templates to configure the nodes on launch with different amounts of disk space.
Currently, our default queue is configured with a disk size of 128 GB for each node, and our "bigdisk"
queue has 1000 GB of disk space.
The queue used by each process is determined by Nextflow labels and associated profile settings.
The Data Lab's AWS Batch config file may be helpful as a reference for creating a profile for use with AWS, but note that the queues and file locations listed there are not publicly available, so these will need to be set to different values in your own profile.
Some HPC systems limit the network traffic of compute nodes for security reasons.
The standard configuration of scpca-nf, however, expects that reference files and container images (for Docker or Singularity) can be downloaded as needed.
If your system does not allow direct internet access, you will need to pre-download the required reference files to a local directory and adjust parameters to direct the workflow to use the local files.
We provide the script get_refs.py
to download these reference files and optionally pull container images to the location of your choice.
If you have downloaded the full scpca-nf
repository, this script is included in the base directory.
Alternatively, you can download this script on its own to the location of your choice with the following commands:
wget https://raw.githubusercontent.com/AlexsLemonade/scpca-nf/main/get_refs.py
chmod +x get_refs.py
Once you have downloaded the script and made it executable with the chmod
command, running the script will download the files required for mapping gene expression datasets to the subdirectory scpca-references
at your current location.
The script will also create a parameter file named localref_params.yaml
that defines the ref_rootdir
Nextflow parameter required to use these local data files.
To run the script with these default settings:
./get_refs.py
You can then direct Nextflow to use the parameters stored in localref_params.yaml
by using the -params-file
argument in a command such as the following:
nextflow run AlexsLemonade/scpca-nf \
-params-file localref_params.yaml \
-config user_template.config \
-profile cluster
Note that other configuration settings, such as profiles, must still be set in the configuration file directly.
However, you should not put params.ref_rootdir
in the configuration file, as Nextflow may not properly create the sub-paths for the various reference files due to Nextflow's precedence rules of setting parameters.
The ref_rootdir
parameter should only be specified in a parameter file or at the command line with the --ref_rootdir
argument.
If you will be performing genetic demultiplexing for hashed samples, you will need STAR index files as well as the ones included by default.
To obtain these files, you can add the --star_index
flag:
./get_refs.py --star_index
If you will be analyzing spatial expression data, you will also need the Cell Ranger index, which can be obtained by adding the --cellranger_index flag.
If your compute nodes do not have internet access, you will likely have to pre-pull the required container images as well.
When doing this, it is important to be sure that you also specify the revision (version tag) of the scpca-nf
workflow that you are using.
For example, if you would run nextflow run AlexsLemonade/scpca-nf -r v0.8.5
, then you will want to set -r v0.8.5
for get_refs.py
as well to be sure you have the correct containers.
By default, get_refs.py
will download files and images associated with the latest release.
If your system uses Docker, you can add the --docker
flag:
./get_refs.py --docker
For Singularity, you can similarly use the --singularity
flag to pull images and cache them for use by Nextflow.
These images will be placed by default in a singularity
directory at your current location.
If you would like to store them in a different location, use the --singularity_dir
argument to specify that path.
The example below stores the image files in $HOME/singularity
.
You will also need to set the singularity.cacheDir
variable to match this location in your configuration file profile.
./get_refs.py --singularity --singularity_dir "$HOME/singularity"
scpca-nf
can perform cell type annotation using two complementary methods: the reference-based method SingleR
and the marker-gene based method CellAssign
.
By default, no cell type annotation is performed. You can turn on cell type annotation by taking the following steps:
- Select appropriate reference dataset(s) to use with each method of interest.
- Prepare a project cell type metadata file to provide reference dataset information for each of SingleR and CellAssign to the workflow. You will need to provide the path/uri to this file as a workflow parameter (project_celltype_metafile), which you will need to define in your configuration file. For more information on adding parameters to your configuration file, see Configuring scpca-nf for your environment.
- Run the workflow with the --perform_celltyping flag.
Once you have followed the above steps and added the path/uri to the project cell type metadata file to your configuration file, you can use the following command to run the workflow with cell type annotation:
nextflow run AlexsLemonade/scpca-nf \
--perform_celltyping
The Data Lab has compiled several references, listed in celltype-reference-metadata.tsv
.
All references listed in this table are publicly available on S3 for use with cell type annotation.
It is possible to provide your own references as well.
Note that you must use one of the references described here to be eligible for inclusion in the ScPCA Portal.
If you wish to use your own cell type reference rather than one of those we have compiled, please refer to these instructions for creating custom references for use with SingleR
and/or CellAssign
.
The Data Lab has compiled SingleR
references from the celldex
package, as described in this TSV file.
In this file, the column filename
provides the reference file name, and the column reference_name
provides the name of the reference.
Please consult the celldex
documentation to determine which of these references, if any, is most suitable for your dataset.
The Data Lab has compiled CellAssign
marker gene references from PanglaoDB, as described in this TSV file.
In this file, the column filename
provides the reference file name, and the column reference_name
provides the name of the reference.
The Data Lab compiled each reference by combining marker gene lists from organ-specific sets of cell types described in PanglaoDB
.
The specific organs used to compile each reference are listed in celltype-reference-metadata.tsv
.
For example, the reference blood-compartment
includes cell types categorized in PanglaoDB
with the organ names Blood
, Bone
, and Immune system
.
All libraries within a given project will use the same reference dataset for each of SingleR
and CellAssign
, respectively.
The project cell type metadata file should contain these five columns with the following information:
column_id | contents |
---|---|
scpca_project_id | Project ID matching values in the run metadata file |
singler_ref_name | Reference name for SingleR annotation, e.g., BlueprintEncodeData. Use NA to skip SingleR annotation |
singler_ref_file | SingleR reference file name, e.g., BlueprintEncodeData_celldex_1-10-1_model.rds. Use NA to skip SingleR annotation |
cellassign_ref_name | Reference name for CellAssign annotation, e.g., blood-compartment. Use NA to skip CellAssign annotation |
cellassign_ref_file | CellAssign reference file name, e.g., blood-compartment_PanglaoDB_2020-03-27.tsv. Use NA to skip CellAssign annotation |
We have provided an example project cell type metadata file for reference.
View example project_celltype_metadata.tsv file |
---|
When cell typing is turned on with --perform_celltyping
, scpca-nf
will skip annotation for any libraries whose cell type annotation results already exist in the checkpoints
folder, as long as the cell type reference file name is unchanged.
The cell type annotations in the checkpoints
folder will have the following structure:
checkpoints
└── celltype
    ├── library01
    │   ├── library01_cellassign
    │   └── library01_singler
    └── library02
        ├── library02_cellassign
        └── library02_singler
This saves substantial processing time if the cell type annotation reference versions are unchanged. However, you may wish to repeat the cell typing process if there have been other changes to the data or analysis.
To force repeating the cell type annotation process, use the --repeat_celltyping
flag along with the --perform_celltyping
flag at the command line:
nextflow run AlexsLemonade/scpca-nf \
--perform_celltyping \
--repeat_celltyping
If you have already performed cell type annotation and wish to include these labels in the final workflow results, you can include the column submitter_cell_types_file
in your run metadata file.
This column should be filled with the path or uri to a TSV file containing cell type labels for the cells in the run.
The cell type label file is a TSV file with the following required columns:
column_id | contents |
---|---|
scpca_library_id | Library ID matching values in the run metadata file |
cell_barcode | The cell ID with the given annotation label |
cell_type_assignment | The annotation label for that cell |
Optionally, you can also include a column cell_type_ontology
with ontology labels corresponding to the given annotation label.
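If your existing annotations live in another format, a short script can produce a file with the required columns. The Python sketch below is an illustration only: the function name `cell_type_labels_tsv` and the barcode-to-label dictionary input are assumptions for this example, while the three column names come from the table above.

```python
import csv
import io


def cell_type_labels_tsv(library_id, labels):
    """Build TSV text with the three required columns from a barcode-to-label dict."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Header row uses the required column names.
    writer.writerow(["scpca_library_id", "cell_barcode", "cell_type_assignment"])
    for barcode, label in labels.items():
        writer.writerow([library_id, barcode, label])
    return out.getvalue()
```

The returned text can be written to a file, and that file's path can then be listed in the submitter_cell_types_file column of the run metadata.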
Upon completion of the scpca-nf
workflow, the results will be published to the specified outdir
.
Within the outdir
, two folders will be present, results
and checkpoints
.
The results
folder will contain the final output files produced by the workflow and the files that are typically available for download on the ScPCA portal.
Within the results
folder, all files pertaining to a specific sample will be nested within a folder labeled with the sample ID.
All files in that folder will be prefixed by the library ID.
The files with the suffixes _unfiltered.rds
, _filtered.rds
, and _processed.rds
provide quantified gene expression data as SingleCellExperiment
objects.
The files with the suffixes _unfiltered_rna.h5ad
, _filtered_rna.h5ad
, and _processed_rna.h5ad
provide the quantified gene expression data as AnnData
objects.
If the input data contains libraries with ADT tags, three additional files with the suffixes _unfiltered_adt.h5ad
, _filtered_adt.h5ad
, and _processed_adt.h5ad
will be provided for each library.
These files contain the quantified ADT tag data as an AnnData
object.
Note: We currently do not output AnnData
objects (.h5ad
files) for any multiplexed libraries.
Only SingleCellExperiment
objects (.rds
files) will be provided for multiplexed libraries.
For more information on the contents of these files, see the ScPCA portal docs section on single cell gene expression file contents.
See below for the expected structure of the results
folder:
results
└── sample_id
    ├── library_id_unfiltered.rds
    ├── library_id_filtered.rds
    ├── library_id_processed.rds
    ├── library_id_unfiltered_rna.h5ad
    ├── library_id_filtered_rna.h5ad
    ├── library_id_processed_rna.h5ad
    ├── library_id_metadata.json
    └── library_id_qc.html
If bulk libraries were processed, a bulk_quant.tsv
and bulk_metadata.tsv
summarizing the counts data and metadata across all libraries will also be present in the results
directory.
If you performed cell type annotation, an additional QC report specific to cell typing results called library_id_celltype-report.html
will also be present in the results
directory.
The checkpoints
folder will contain intermediate files that are produced by individual steps of the workflow, including mapping with salmon
.
The contents of this folder are used to allow restarting the workflow from internal checkpoints (in particular so the initial read mapping does not need to be repeated, see repeating mapping steps), and may contain log files and other outputs useful for troubleshooting or alternative analysis.
The rad
folder (nested inside the checkpoints
folder) contains the output from running salmon alevin
with the --rad
flag.
If bulk libraries are processed, there will be an additional salmon
folder that contains the output from running salmon quant
on each library processed.
All files pertaining to a specific library will be nested within a folder labeled with the library ID.
Additionally, for each run, all files related to that run will be inside a folder labeled with the run ID followed by the type of run (i.e., rna or features for libraries with ADT tags) and nested within the library ID folder.
See below for the expected structure of the checkpoints
folder:
checkpoints
├── rad
│   ├── library01
│   │   ├── run01-rna
│   │   └── run02-features
│   └── library02
│       └── run03-rna
└── salmon
By default, the direct output from running alevin-fry
is not provided.
Within scpca-nf, the counts matrix output from alevin-fry is directly imported into R as a SingleCellExperiment object and can be obtained in the _unfiltered.rds file.
If you would like to obtain all files typically output from running alevin-fry
, you may run the workflow with the --publish_fry_outs
option at the command line.
This will tell the workflow to save the alevin-fry
outputs to a folder labeled alevinfry
nested inside the checkpoints
folder.
nextflow run AlexsLemonade/scpca-nf \
--publish_fry_outs
If genetic demultiplexing was performed, there will also be a checkpoints folder called vireo
with the output from running vireo using genotypes identified from the bulk RNA-seq.
Note that we do not output the genotype calls themselves for each sample or cell, as these may contain identifying information.
If cell type annotation was performed, there will also be a checkpoints folder called celltype
with the output from running SingleR
and CellAssign
.
Libraries processed using multiple modalities, such as those that include runs with ADT or cellhash tags, will require a file containing the barcode IDs and sequences.
The file location should be specified in the feature_barcode_file
for each library as listed in the run metadata file; multiple libraries can and should use the same feature_barcode_file
if the same feature barcode sequences are expected.
The feature_barcode_file
itself is a tab-separated file with one line per barcode and no header.
The first column will contain the barcode or antibody ID and the second column the barcode nucleotide sequence.
For example:
TAG01 CATGTGAGCT
TAG02 TGTGAGGGTG
For libraries with ADT tags, you can optionally include a third column in the feature_barcode_file
to indicate the purpose of each antibody, which can take one of the following three values:
- target: antibody is a true target
- neg_control: a negative control antibody
- pos_control: a spike-in positive control
For example, the following shows that two antibodies are targets and one is a negative control:
TAG01 CATGTGAGCT target
TAG02 TGTGAGGGTG neg_control
TAG03 GTAGCTCCAA target
If this third column is not provided, all antibodies will be treated as targets. Similarly, if information in this column is not one of the allowed values, a warning will be printed, and the given antibodies will be treated as target(s).
If there are negative control antibodies, these will be taken into account during post-processing filtering and normalization. Positive controls are currently unused, but if provided, this label will be included in final output files.
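Before running the workflow, it can help to check a feature barcode file against the rules described above. The following is a minimal sketch, not the parser used by scpca-nf: the function name `read_feature_barcodes` is hypothetical, and the fallback to `target` simply mirrors the behavior described in the text.

```python
import csv
import io
import warnings

ALLOWED_PURPOSES = {"target", "neg_control", "pos_control"}


def read_feature_barcodes(handle):
    """Parse a headerless, tab-separated feature barcode file.

    Returns a list of (barcode_id, sequence, purpose) tuples.
    """
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"expected at least 2 columns, got: {row}")
        barcode_id, sequence = row[0], row[1]
        # The third column is optional; a missing value means "target".
        purpose = row[2] if len(row) > 2 else "target"
        if purpose not in ALLOWED_PURPOSES:
            warnings.warn(
                f"{barcode_id}: unknown purpose {purpose!r}, treating as target"
            )
            purpose = "target"
        records.append((barcode_id, sequence, purpose))
    return records


example = io.StringIO(
    "TAG01\tCATGTGAGCT\ttarget\n"
    "TAG02\tTGTGAGGGTG\tneg_control\n"
    "TAG03\tGTAGCTCCAA\n"
)
records = read_feature_barcodes(example)
```

In practice, you would open your own file (for example with `open(path, newline="")`) instead of using the in-memory example.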
When processing multiplexed libraries that combine multiple samples into a pooled single-cell or single-nuclei library, we perform cellhash-based demultiplexing for all libraries and genetic demultiplexing when reference bulk RNA-seq data is available.
To support demultiplexing, we currently require ALL of the following for multiplexed libraries:
- A single-cell RNA-seq run of the pooled samples
- A matched cellhash sequencing run for the pooled samples
- A TSV file, feature_barcode_file, defining the cellhash barcode sequences
  - This file should have one line per barcode and no header. The first column should contain the cellhash barcode ID, and the second column should contain the barcode nucleotide sequence.
- A TSV file, cellhash_pool_file, that defines the sample-barcode relationship for each library/pool of samples
  - This file should have one row for each library-sample pair, with the columns described below.
For genetic demultiplexing, we also require:
- Separate bulk RNA-seq libraries for each sample in the pool
If any sample in a pool is missing a matched bulk RNA-seq library, then genetic demultiplexing will be skipped and only cellhash-based demultiplexing will be performed.
To skip genetic demultiplexing for all libraries and perform only cellhash-based demultiplexing, use the --skip_genetic_demux flag at the command line:
nextflow run AlexsLemonade/scpca-nf \
--skip_genetic_demux
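The rules above for choosing demultiplexing methods can be summarized in a few lines. This is a sketch of the logic as described in this document, not the workflow's actual implementation; the function `demux_methods` and its arguments are hypothetical.

```python
def demux_methods(pool_sample_ids, samples_with_bulk, skip_genetic_demux=False):
    """Return the demultiplexing methods expected for one pooled library."""
    # Cellhash-based demultiplexing is performed for all multiplexed libraries.
    methods = ["cellhash"]
    # Genetic demultiplexing needs bulk RNA-seq for every sample in the pool.
    all_have_bulk = bool(pool_sample_ids) and set(pool_sample_ids) <= set(
        samples_with_bulk
    )
    if all_have_bulk and not skip_genetic_demux:
        methods.append("genetic")
    return methods


print(demux_methods(["sample01", "sample02"], {"sample01", "sample02"}))
print(demux_methods(["sample01", "sample02"], {"sample01"}))
```

The first call reports both methods, while the second reports only cellhash-based demultiplexing because `sample02` has no matched bulk library.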
The feature_barcode_file
for each library should be listed in the metadata file.
The cellhash_pool_file
location will be defined as a parameter in the configuration file, and should contain information for all libraries to be processed.
This file will contain one row for each library-sample pair (i.e. a library containing 4 samples will have 4 rows, one for each sample within), and should contain the following required columns:
column_id | contents |
---|---|
scpca_library_id | Multiplexed library ID matching values in the run metadata file. |
scpca_sample_id | Sample ID for a sample contained in the listed multiplexed library. |
barcode_id | The barcode ID used for the sample within the library, as defined in feature_barcode_file. |
Other columns may be included for reference (such as the feature_barcode_file
associated with the library), but these will not be used directly.
We have provided an example multiplex pool file for reference that can be found in examples/example_multiplex_pools.tsv
.
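A quick consistency check of the pool file against the barcode IDs in the feature_barcode_file might look like the sketch below. It assumes the pool file is tab-separated with a header row naming the required columns listed above; the function `check_pool_file` and the example IDs are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("scpca_library_id", "scpca_sample_id", "barcode_id")


def check_pool_file(handle, known_barcode_ids):
    """Return (pools, problems) for a tab-separated pool file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    pools = {}  # library ID -> list of sample IDs in that pool
    problems = []  # human-readable descriptions of inconsistent rows
    for row in reader:
        library, sample, barcode = (row[c] for c in REQUIRED_COLUMNS)
        if barcode not in known_barcode_ids:
            problems.append(f"{library}/{sample}: unknown barcode_id {barcode}")
        pools.setdefault(library, []).append(sample)
    return pools, problems


example = io.StringIO(
    "scpca_library_id\tscpca_sample_id\tbarcode_id\n"
    "library01\tsample01\tTAG01\n"
    "library01\tsample02\tTAG02\n"
    "library02\tsample03\tTAG09\n"
)
pools, problems = check_pool_file(example, known_barcode_ids={"TAG01", "TAG02"})
```

Here `problems` flags the row whose barcode ID does not appear in the set of known barcode IDs, which is the kind of mismatch worth catching before a long run.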
To process spatial transcriptomic libraries, all FASTQ files for each sequencing run and the associated .jpg
file must be inside the files_directory
listed in the metadata file.
The metadata file must also contain columns with the slide_section
and slide_serial_number
.
You will also need to provide a Docker image that contains the Space Ranger software from 10x Genomics. For licensing reasons, we cannot provide a Docker container with Space Ranger for you. As an example, the Dockerfile that we used to build Space Ranger can be found here.
After building the Docker image, you will need to push it to a private Docker registry and set params.SPACERANGER_CONTAINER to the registry location and image ID in the user_template.config file.
Note: The workflow is currently set up to work only with spatial transcriptomic libraries produced from the Visium Spatial Gene Expression protocol and has not been tested using output from other spatial transcriptomics methods.
By default, scpca-nf
is set up to skip the salmon
mapping steps for any libraries in which the output files from the mapping step exist in the checkpoints
folder of the output directory (i.e. the .rad
files from salmon alevin
and quant.sf
files from salmon quant
).
If the salmon version and transcriptome index are unchanged, skipping these steps saves substantial processing time and cost and avoids some of the sensitivity of the caching system used by nextflow -resume, which can sometimes result in rerunning steps unnecessarily.
However, if there have been updates to the scpca-nf
workflow that include changes to the salmon version or transcriptome index (or if you change those on your own), you may want to repeat the mapping process.
To force repeating the mapping process, use the --repeat_mapping
flag at the command line:
nextflow run AlexsLemonade/scpca-nf \
--repeat_mapping
In addition to the main scpca-nf
workflow, this repository contains a separate workflow called merge.nf
that will merge a set of processed ScPCA SingleCellExperiment
objects into a single merged SingleCellExperiment
object containing all counts from the specified libraries.
This workflow creates a merged SingleCellExperiment
object, a merged AnnData
object, and an associated merged object HTML report encompassing all libraries with the same scpca_project_id
.
This workflow only merges objects; it does not integrate libraries or perform any batch-correction.
This workflow is specifically designed to run on processed SingleCellExperiment
object files output by the scpca-nf
workflow.
Therefore, you will need to take the following steps to run the merge.nf
workflow:
- Follow the instructions above to prepare to run the scpca-nf workflow, including organizing your files, preparing both the run metadata and sample metadata files, and configuring your environment. The scpca_project_id values you specify in the metadata files will be used to determine which libraries should be merged together in the merge.nf workflow.
- Run the scpca-nf workflow.
- Run the merge.nf workflow, as described below.
The merge.nf
workflow requires two parameters to run:
- project, the scpca_project_id whose objects should be merged
  - A comma-separated list of scpca_project_id values can also be provided. In this case, a separate merged object will be created for each ID.
- run_metafile, the run metadata file which was previously prepared when running the main workflow
The merge.nf
workflow runs by first finding all libraries present, for each project, in the specified params.outdir
, which represents the output directory where scpca-nf
will have stored results from a prior run.
If you specified a different parameter value from the default scpca-out
for the outdir
parameter in your scpca-nf
configuration file, you will need to ensure that same value is provided to merge.nf
.
Results from running the merge.nf
workflow will also be added to this params.outdir
directory in a sub-directory called merged
.
The workflow can be run as shown:
nextflow run AlexsLemonade/scpca-nf/merge.nf \
-config <path to config file> \
-profile <name of profile> \
--project <project ID whose libraries should be merged>
To be sure that you are using a consistent version, you can specify a release-tagged version of the workflow with the -r flag.
The command below will pull the scpca-nf workflow directly from GitHub using the v0.7.2 version.
Released versions can be found on the scpca-nf
repository releases page.
nextflow run AlexsLemonade/scpca-nf/merge.nf \
-r v0.7.2 \
-config <path to config file> \
-profile <name of profile> \
--project <project ID whose libraries should be merged>
The merge.nf
workflow will output, for each specified project ID, an .rds
file containing a merged SingleCellExperiment
object, an .h5ad
file containing a merged AnnData
object, and a report which provides a brief summary of the types of libraries and their samples' diagnoses included in the merged object, as well as UMAP visualizations highlighting each library.
These output files will follow this structure:
merged
└── <project_id>
├── <project_id>_merged.rds
├── <project_id>_merged_rna.h5ad
└── <project_id>_merged-summary-report.html
There are some additional considerations to be aware of for libraries which contain additional modalities, such as ADT counts from CITE-seq or HTO counts from multiplexing.
If any libraries in a merge group have ADT counts, these counts will also be merged and included in the final merged object.
In the case of SingleCellExperiment
objects, ADT counts will be provided as an alternative experiment called "adt"
in the same object.
In the case of AnnData
, a separate file will be exported with the extension _adt.h5ad
that contains the merged ADT counts.
The output files will follow this structure if CITE-seq data is present:
merged
└── <project_id>
├── <project_id>_merged.rds
├── <project_id>_merged_rna.h5ad
├── <project_id>_merged_adt.h5ad
└── <project_id>_merged-summary-report.html
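The two layouts shown above can be expressed as a small helper that lists the files to expect for a project. This is an illustrative sketch only: the function `expected_merged_files` is hypothetical, and the output directory and project ID in the example are placeholders.

```python
from pathlib import PurePosixPath


def expected_merged_files(outdir, project_id, has_adt=False):
    """List the merged output files expected for one project (hypothetical helper)."""
    merged_dir = PurePosixPath(outdir) / "merged" / project_id
    names = [
        f"{project_id}_merged.rds",
        f"{project_id}_merged_rna.h5ad",
    ]
    if has_adt:
        # ADT counts are exported to a separate AnnData file.
        names.append(f"{project_id}_merged_adt.h5ad")
    names.append(f"{project_id}_merged-summary-report.html")
    return [merged_dir / name for name in names]


for path in expected_merged_files("scpca-out", "project01", has_adt=True):
    print(path)
```

Comparing this list against the contents of the output directory is one simple way to confirm that a merge run completed for every requested project.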
The merge.nf
workflow currently does not support merging HTO counts from multiplexed libraries.
If any libraries contain HTO counts, the RNA counts will still be merged and exported, but the HTO counts will not be included.