- call-sSNV
The call-sSNV nextflow pipeline performs somatic SNV calling given a pair of tumor/normal BAM files. Four somatic SNV callers are available: SomaticSniper, Strelka2, Mutect2 and MuSE. The user may request one or more callers, and each caller produces an independently generated filtered VCF file.
If two or more callers are requested, additional output includes both a VCF and an MAF file with the set of SNVs shared by two or more callers, and a Venn Diagram showing counts of shared and private SNVs.
SomaticSniper, Strelka2, and MuSE require there to be exactly one pair of input tumor/normal BAM files, but Mutect2 will take tumor-only input (no paired normal), as well as tumor/normal BAM pairs for multiple samples from the same individual.
-
SomaticSniper is an older tool yielding high specificity single nucleotide somatic variants.
-
Strelka2 uses candidate indels from
Manta
and calls somatic short mutations (single nucleotide and small indel), filtered with a random forest model. -
GATK Mutect2 calls somatic short mutations via local assembly of haplotypes.
-
MuSE accounts for tumor heterogeneity and calls single nucleotide somatic variants.
Below is a summary of how to run the pipeline. See here for more information on running Nextflow pipelines.
Note: Because this pipeline uses an image stored in the GitHub Container Registry, you must follow the steps listed in the Docker Introduction on Confluence to set up a PAT for your GitHub account and log into the registry on the cluster before running this pipeline.
-
The recommended way of running the pipeline is to directly use the source code located here:
/hot/software/pipeline/pipeline-call-sSNV/Nextflow/release/
, rather than cloning a copy of the pipeline.- The source code should never be modified when running our pipelines
-
Copy and edit the input config file
Make sure the reference .fa file in config file matches the reference genome in the input BAM files.
- Copy and edit the input YAML
- The pipeline can be executed locally using the command below:
nextflow run path/to/main.nf -config path/to/input.config -params-file input.yaml`
For example,
path/to/main.nf
could be:/hot/software/pipeline/pipeline-call-sSNV/Nextflow/release/5.0.0/main.nf
path/to/input.config
is the path to where you saved your project-specific copy of template.configpath/to/input.yaml
is the path to where you saved your project-specific copy of template.yaml
To submit to UCLAHS-CDS's Azure cloud, use the submission script here with the command below:
python path/to/submit_nextflow_pipeline.py \
--nextflow_script path/to/main.nf \
--nextflow_config path/to/input.config\
--nextflow_yaml path/to/input.yaml \
--pipeline_run_name <run_name> \
--partition_type F72 \
--email [email protected]
Note: Although --partition_type F2 is an available option for small data sets, Mutect2 and Muse will fail due to lack of memory.
Compare a pair of tumor and normal BAM files and output an unfiltered list of single nucleotide positions that are different between tumor and normal, in VCF format.
This takes several steps, listed below, and starts with the same input files given to SomaticSniper
. These are used to generate a list of high confidence indels to assist SNV filtering.
Summarize counts of reads that support reference, alternate and other alleles for given sites. This is done for both of the input BAM files and the results are used in the next step.
Use samtools.pl varFilter
to filter each pileup output (tumor and normal), then further filter each to keep only indels with QUAL > 20. samtools.pl
is packaged with SomaticSniper
.
Use snpfilter.pl
(packaged with SomaticSniper
):
i. filter VCF using normal indel pileup (from step b
).
ii. filter VCF output from step i
using tumor indel pileup (from step b
).
Extract positions from filtered VCF file and use with bam-readcount
to generate a summary of read alignment metrics for each position.
Use fpfilter.pl
and highconfidence.pl
(packaged with SomaticSniper), resulting in a final high confidence VCF file.
The input pair of tumor/normal BAM files are used by Manta to produce candidate small indels via the Manta
somatic configuration protocol. Note, larger (structural) variants are also produced and can be retrieved from the intermediate files directory if save intermediate files is enabled.
The input pair of tumor/normal BAM files, along with the candidate small indel file produced by Manta
are used by Strelka2
to create lists of somatic single nucleotide and small indel variants, both in VCF format. Lower quality variants that did not pass filtering are subsequently removed, yielding .SNV-pass.vcf.gz
and .Indel-pass.vcf.gz
files.
The params.intersect_regions
of the reference genome are split into x intervals for parallelization, where x is defined by params.scatter_count
.
Call somatic variants with Mutect2
.
Merge scattered outputs (VCFs, statistics).
Create artifact prior table based on read orientations with GATK's LearnReadOrientationModel
.
Filter variants with GATK's FilterMutectCalls
, using read orientation prior table and contamination table as well as standard filters.
Split filtered VCF into separate files for each variant type: SNVs, MNVs and INDELs.
Pre-filtering and calculating position-specific summary statistics using the Markov substitution model.
Computes tier-based cutoffs from a sample-specific error model.
MuSE
output has SNVs labeled as PASS
or one of Tier 1-5
for the lower confidence calls (Tier 5
is lowest). This step keeps only SNVs labeled PASS
.
If two or more algorithms were selected the Intersect workflow will run. Currently the resulting VCF and MAF files include any SNVs found by two or more algorithms.
Determines presence/absence of each SNV within each algorithm's set of filtered SNVs. Results are listed in the output files: isec-1-or-more/README.txt
and isec-1-or-more/sites.txt
, and are summarized in a Venn Diagram plot (TIFF format).
Determines presence/absence of SNVs found in two or more of each algorithm's set of filtered SNVs, and outputs a consensus
VCF for each algorithm containing SNVs found by that algorithm plus at least one other algorithm. Results are also listed in the output files: isec-2-or-more/README.txt
and isec-2-or-more/sites.txt
.
Concatenates the 2+ algorithm consensus
SNVs into one VCF (SNV-concat.vcf.gz). The output header is a uniquified concatenation of all input VCF headers. The output fields INFO
, FORMAT
, NORMAL
and TUMOR
are from the first listed VCF that has the SNV. Input VCFs are sorted alphanumerically by the algorithm name.
Converts SNV-concat.vcf.gz from step 3 into MAF format. Output includes allele counts and flanking basepairs, but most fields are blank. Details can be found here.
A stripplot is generated to display the distribution of allele frequencies for sets of SNVs categorized by the number of callers that include the variant.
To run the pipeline, one input.yaml
and one input.config
are needed, as follows.
input.yaml. (see template)
Input | Type | Description |
---|---|---|
patient_id | string | The name/ID of the patient |
tumor_BAM | path | The path to the tumor .bam file (.bai file must exist in same directory) |
normal_BAM | path | The path to the normal .bam file (.bai file must exist in same directory) |
contamination_table | path | Optional, but only for tumor samples. The path of the contamination.table , which is generated from the GATK's CalculateContamination in pipeline-call-gSNP . The contamination.table path can be found under pipeline-call-gSNP 's output QC folder |
input.yaml
should follow the standardized structure:
patient_id: 'patient_id'
input:
normal:
- BAM: /path/to/normal.bam
tumor:
- BAM: /path/to/tumor.bam
contamination_table: /path/to/contamination.table
Mutect2
can take other inputs: tumor-only sample and one patient's multiple samples. For tumor-only samples, remove the normal input ininput.yaml
, e.g. template_tumor_only.yaml. For multiple samples, put all the input BAMs in theinput.yaml
, e.g. template_multi_sample.yaml. Note, for these non-standard inputs, the configuration file must have 'mutect2' listed as the only algorithm.
input.config (see template)
Input | Required | Type | Description |
---|---|---|---|
algorithm |
yes | list | List containing a combination of somaticsniper, strelka2, mutect2 and muse |
reference |
yes | string | The reference .fa file (.fai and .dict file must exist in same directory) |
intersect_regions * |
yes | string | A bed file listing the genomic regions for variant calling. Excluding decoy regions is HIGHLY recommended * |
output_dir |
yes | string | The location where outputs will be saved |
dataset_id |
yes | string | The name/ID of the dataset |
exome |
yes | boolean | The option will be used by Strelka2 and MuSE . When true , it will add the --exome option to Manta and Strelka2, and -E option to MuSE |
save_intermediate_files |
yes | boolean | Whether to save intermediate files |
work_dir |
no | string | The path of working directory for Nextflow, storing intermediate files and logs. The default is /scratch with ucla_cds and should only be changed for testing/development. Changing this directory to /hot or /tmp can lead to high server latency and potential disk space limitations, respectively |
docker_container_registry |
no | string | Registry containing tool Docker images, optional. Default: ghcr.io/uclahs-cds |
base_resource_update |
optional | namespace | Namespace of parameters to update base resource allocations in the pipeline. Usage and structure are detailed in template.config and below. |
*Providing intersect_regions
is required and will limit the final output to just those regions. All regions of the reference genome could be provided as a bed
file with all contigs, however it is HIGHLY recommended to remove decoy
contigs from the human reference genome. Including these thousands of small contigs will require the user to increase available memory for Mutect2
and will cause a very long runtime for Strelka2
. See Discussion here. For uclahs-cds
users, a GRCh38 bed.gz
file can be found here: /hot/resource/tool-specific-input/pipeline-call-sSNV-6.0.0/GRCh38-BI-20160721/Homo_sapiens_assembly38_no-decoy.bed.gz
.
To optionally update the base resource (cpus or memory) allocations for processes, use the following structure and add the necessary parts to the input.config file. The default allocations can be found in the node-specific config files
base_resource_update {
memory = [
[['process_name', 'process_name2'], <multiplier for resource>],
[['process_name3', 'process_name4'], <different multiplier for resource>]
]
cpus = [
[['process_name', 'process_name2'], <multiplier for resource>],
[['process_name3', 'process_name4'], <different multiplier for resource>]
]
}
Note Resource updates will be applied in the order they're provided so if a process is included twice in the memory list, it will be updated twice in the order it's given.
Examples:
- To double memory of all processes:
base_resource_update {
memory = [
[[], 2]
]
}
- To double memory for
call_sSNV_Mutect2
and triple memory forrun_validate_PipeVal
andrun_sump_MuSE
:
base_resource_update {
memory = [
['call_sSNV_Mutect2', 2],
[['run_validate_PipeVal', 'run_sump_MuSE'], 3]
]
}
- To double CPUs and memory for
run_sump_MuSE
and double memory forrun_validate_PipeVal
:
base_resource_update {
cpus = [
['run_sump_MuSE', 2]
]
memory = [
[['run_sump_MuSE', 'run_validate_PipeVal'], 2]
]
}
Input | Required | Type | Description |
---|---|---|---|
bgzip_extra_args | no | string | The extra option used for compressing VCFs |
tabix_extra_args | no | string | The extra option used for indexing VCFs |
Input | Required | Type | Description |
---|---|---|---|
split_intervals_extra_args | no | string | Additional arguments for the SplitIntervals command |
mutect2_extra_args | no | string | Additional arguments for the Mutect2 command |
filter_mutect_calls_extra_args | no | string | Additional arguments for the FilterMutectCalls command |
gatk_command_mem_diff | yes | nextflow.util.MemoryUnit | How much to subtract from the task's allocated memory where the remainder is the Java heap max. (should not be changed unless task fails for memory related reasons) |
scatter_count | yes | int | Number of intervals to split the desired interval into. Mutect2 will call each interval seperately. |
germline_resource_gnomad_vcf | no | path | A stripped down version of the gnomAD VCF stripped of all unneeded INFO fields, keeping only AF, currently available for GRCh38:/hot/resource/tool-specific-input/GATK/GRCh38/af-only-gnomad.hg38.vcf.gz and GRCh37: /hot/resource/tool-specific-input/GATK/GRCh37/af-only-gnomad.raw.sites.vcf . |
panel_of_normals_vcf | no | path | VCF file of sites observed in normal. Currently available for GRCh38: /hot/resource/tool-specific-input/GATK/GRCh38/1000g_pon.hg38.vcf.gz . This could be useful for tumor only mode. |
Input | Required | Type | Description |
---|---|---|---|
dbSNP | yes | path | The path to NCBI's dbSNP database of known SNPs in VCF format, e.g. GCF_000001405.40.gz |
Input | Required | Type | Description |
---|---|---|---|
ncbi_build | yes | string | vcf2maf requires the reference genome build ID, e.g. GRCh38 |
vcf2maf_extra_args | no | string | additional arguments for the vcf2maf command |
Tool Outputs | Type | Description |
---|---|---|
SomaticSniper-{version}_{sample_id}_SNV.vcf.gz | .vcf.gz | Filtered SNV VCF (somaticsniper) |
Strelka2-{version}_{sample_id}_SNV.vcf.gz | .vcf.gz | Filtered SNV VCF(strelka2) |
Strelka2-{version}_{sample_id}_Indel.vcf.gz | .vcf.gz | Filtered Indel VCF (strelka2) |
Mutect2-{version}_{sample_id}_SNV.vcf.gz | .vcf.gz | Filtered SNV VCF (mutect2) |
Mutect2-{version}_{sample_id}_Indel.vcf.gz | .vcf.gz | Filtered Indel VCF (mutect2) |
Mutect2-{version}_{sample_id}_MNV.vcf.gz | .vcf.gz | Filtered MNV VCF (mutect2) |
Mutect2-{version}_{sample_id}_filteringStats.tsv | .tsv | FilterMutectCalls output (mutect2 QC) |
MuSE-{version}_{sample_id}_SNV.vcf.gz | .vcf.gz | Filtered SNV VCF (MuSE) |
report.html, timeline.html, trace.txt | .html, .txt | Nextflow logs |
Intersect Outputs | Type | Description |
---|---|---|
isec-1-or-more | directory | BCFtools isec README.txt and sites.txt, all variants |
isec-2-or-more | directory | BCFtools isec README.txt and sites.txt, variants shared by 2 or more tools |
SomaticSniper-{version}_{sample_id}_consensus-variants.vcf.gz | .vcf.gz | 2-or-more SNV VCF |
Strelka2-{version}_{sample_id}_consensus-variants.vcf.gz | .vcf.gz | 2-or-more SNV VCF |
Mutect2-{version}_{sample_id}_consensus-variants.vcf.gz | .vcf.gz | 2-or-more SNV VCF |
MuSE-{version}_{sample_id}_consensus-variants.vcf.gz | .vcf.gz | 2-or-more SNV VCF |
BCFtools-{version}_{sample_id}_Venn-diagram.tiff | .tiff | Venn Diagram with intersection counts for all variants (1-or-more ) |
BCFtools-{version}_{sample_id}_SNV-concat.vcf.gz | .vcf.gz | Single SNV VCF with all 2-or-more variants and mixed annotation |
BCFtools-{version}_{sample_id}_SNV-concat.maf.gz | .maf.gz | Single SNV MAF with all 2-or-more variants and mixed annotation |
BPG-{version}{dataset_id}{sample_id}_adjVAF.png | .png | Stripplot of adjusted VAFs with combinations of callers |
Testing was performed in the Boutros Lab SLURM Development cluster. Metrics below will be updated where relevant with additional testing and tuning outputs. Pipeline version used here is v4.0.0-rc.1
General estimates, with wide variation, are that whole exome sequences require 16 cpus and 32 GB of memory to run all of the pipeline algorithms. If MuSE is excluded 8 cpus and 16 GB of memory are sufficient. Run time for a test pair of exome tumor/normal input BAMs of 4 GB/5 GB was in both cases 1 to 2 hours.
General estimates, with wide variation, are that whole genome sequences require 72 cpus and 144 GB of memory to run all of the pipeline algorithms. If MuSE is excluded 8 cpus and 16 GB of memory are sufficient, but run time could be very long. Run time for a test pair of WGS tumor/normal input BAMs of 400 GB/200 GB was 15 hours for 72 cpus/144 GB, and 52 hours for 8 cpus/16 GB excluding MuSE.
Duration: 3h 25m 24s
- Process
call_sSNVInAssembledChromosomes_Mutect2
has been split into 50 intervals, so the following table shows one of those processes:
process_name | max_duration | max_cpu | max_peak_vmem |
---|---|---|---|
call_sSNVInNonAssembledChromosomes_Mutect2 | 32m 44s | 142.0% | 33.1 GB |
call_sSNVInAssembledChromosomes_Mutect2 | 1h 20m 12s | 123.8% | 7.8 GB |
run_LearnReadOrientationModel_GATK | 31m 5s | 106.8% | 10.2 GB |
Duration: 9h 21m 23s
process_name | max_duration | max_cpu | max_peak_vmem |
---|---|---|---|
convert_BAM2Pileup_SAMtools | 4h 18m 29s | 98.2% | 1.9 GB |
call_sSNV_SomaticSniper | 8h 48m 45s | 98.7% | 511.6 MB |
generate_ReadCount_bam_readcount | 29m 33s | 75.9% | 261.5 MB |
Strelka2's runtime will be significantly improved when using --callRegions
option to exclude the non-canoincal regions of the genome, here is the results of CPCG0196:
Sample: CPCG0196
Normal BAM: /hot/software/pipeline/pipeline-align-DNA/Nextflow/development/outputs/bwa-mem2_and_hisat2-2.2.1/bwa-mem2/bams/a-full-CPCG0196-B1/align-DNA-20210424-024139/pipeline-alignDNA.inputs.CPCG0196-B1.bam
Tumor BAM: /hot/resource/pipeline_testing_set/WGS/GRCh38/A/full/CPCG0000000196-T001-P01-F.bam
process_name | max_duration | max_cpu | max_peak_vmem |
---|---|---|---|
call_sIndel_Manta | 1h 24m 26s | 2724.2% | 23.2 GB |
call_sSNV_Strelka2 | 22h 32m 24s | 511.3% | 17.4 GB |
process_name | max_duration | max_cpu | max_peak_vmem |
---|---|---|---|
call_sIndel_Manta | 1h 35m 25s | 1848.6% | 11.7 GB |
call_sSNV_Strelka2 | 59m 19s | 3234.0% | 8.2 GB |
Therefore, we strongly suggest to use the --callRegions
if the non-canonical region is unnecessary. -callRegions
's input bed.gz
file can be found here: /hot/resource/tool-specific-input/Strelka2/GRCh38/strelka2_call_region.bed.gz
. For other genome version, you can use UCSC Liftover to convert.
MuSE v2.0 was tested with a normal/tumor paired CPCG0196 WGS sample on a F32 slurm-dev node. Duration: 1d 11h 6m 54s
process_name | max_duration | max_cpu | max_peak_vmem |
---|---|---|---|
call_sSNV_MuSE | 3h 44m 15s | 3181.7% | 60.4 GB |
run_sump_MuSE | 1d 7h 22m 2s | 100.0% | 41.6 GB |
- Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28, 311–317 (2012).
- Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591–594 (2018).
- McKenna, A. et al. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
- Fan, Y. et al. MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biol. 17, 178 (2016).
Authors: Mao Tian ([email protected]), Bugh Caden, Helena Winata ([email protected]), Sorel Fitz-Gibbon ([email protected]).
pipeline-call-sSNV is licensed under the GNU General Public License version 2. See the file LICENSE for the terms of the GNU GPL license.
This pipeline performs somatic SNV calling on a pair of normal/tumor BAMs, utilizing SomaticSniper, Strelka2, Mutect2 and MuSE.
Copyright (C) 2020-2024 University of California Los Angeles ("Boutros Lab") All rights reserved.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.