- Installation
- Before you start
- Single chromosome mode (test example included)
- WGS mode
- Merge VCF
- Use VolcanoSV-vc for other assemblies
- Improve assembly for regions enriched in segmental duplications
- Truvari evaluation
- Computation resource usage
- Citation
- Troubleshooting
git clone --recurse-submodules https://github.com/maiziezhoulab/VolcanoSV.git
cd VolcanoSV;sh Install.sh
VolcanoSV utilizes Python3.8.3 To set up the environment, you need to have conda installed. Then, simply run
conda env create -f VolcanoSV/requirement.yaml
Then you will have a virtual environment called volcanosv
created. Before running any VolcanoSV commands, please activate this environment first.
conda activate volcanosv
You can set
path_to_volcanosv=/path/to/VolcanoSV
for convenience or just use the full path of ${path_to_volcanosv}/bin/VolcanoSV-asm/volcanosv-asm.py
, ${path_to_volcanosv}/bin/VolcanoSV-vc/Large_INDEL/volcanosv-vc-large-indel.py
, ${path_to_volcanosv}/bin/VolcanoSV-vc/Complex_SV/volcanosv-vc-complex-sv.py
and ${path_to_volcanosv}/bin/VolcanoSV-vc/Small_Indel/volcanosv-vc-small-indel.py
.
Note: To ensure you have installed VolcanoSV successfully, we highly recommend trying the tool on the provided test data and checking whether you have the correct result before trying it on your data. For the guide on how to run the pipeline, please refer to Single chromosome mode (test example included).
1.VolcanoSV is designed only for autosome chromosomes since the phasing procedure for sex chromosomes is very different!
2.The reference file should have contig name in a format like "chr1, chr2, chr3 ..." and does not have descriptive field after it.
3.We highly recommend you to run the test data first before using VolcanoSV on your own data!
For the single chromosome mode, we provided the chr10 BAM file, contigs file and VCF file for Hifi, CLR and ONT data. You can download them from zenodo.
In the following sessions, we will provide the code to run the Hifi data. If you wish to reproduce the result for CLR data or ONT data, you can just simply change the input BAM file and the argument "dtype" to the corresponding data type (CLR/ONT).
The example data is aligned to hg19 reference. You can download the reference files(genome.fa and genome.fa.fai) from zenode(https://zenodo.org/records/10520476).
Alternatively, you can download the more complete reference data (including more indexing and meta data) using the command below
wget https://cf.10xgenomics.com/supp/genome/refdata-hg19-2.1.0.tar.gz
tar -xzvf refdata-hg19-2.1.0.tar.gz
Note: since translocation detection requires WGS BAM file as support, it does not make sense to run it on single chromosome level. Therefore, we only provide the complex SV pipeline in WGS mode.
The VolcanoSV assembly pipeline is designed to run by chromosomes. We integrated multiple state-of-the-art assemblers into the pipeline, including hifiasm,Flye,wtdbg2,miniasm,Shasta,NextDenovo,and Hicanu/Canu. Users can select the appropriate assembler based on their needs. The main script is ${path_to_volcanosv}/bin/VolcanoSV-asm/volcanosv-asm.py
. The input arguments for this script are explained below:
--bam_file INBAM, -bam INBAM, could be either wgs bam or single-chromosome bam file
--output_dir output_dir, -o output_dir
--reference REFERENCE, -ref REFERENCE
--n_thread N_THREAD, -t N_THREAD
--chrnum {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22}, -chr {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22}
--assembler {wtdbg2,canu,miniasm,shasta,nextdenovo,hifiasm,hicanu,flye}, -asm {wtdbg2,canu,miniasm,shasta,nextdenovo,hifiasm,hicanu,flye}
optional; if not set, VolcanoSV use hifiasm for Hifi data and flye for CLR and ONT data by default.
--data_type {CLR,ONT,Hifi}, -dtype {CLR,ONT,Hifi}
--pacbio_subtype {rs,sq}, -pb {rs,sq}
must provide when using wtdbg2 on CLR data (default: None)
--shasta_ont_config {Nanopore-OldGuppy-Sep2020}, -shacon {Nanopore-OldGuppy-Sep2020}
--prefix PREFIX, -px PREFIX
Please select from hifiasm and hicanu for Hifi data, and the rest of the assemblers are for CLR and ONT data.
By default, VolcanoSV uses hifiasm for Hifi data and Flye for CLR and ONT data.
After running the above code, you will have output contigs in <ouput_folder>/chr<chrnum>/assembly/final_contigs/<prefix>_final_contigs.fa
.
For example, if you want to use hifiasm for hifi data, you can use the below scripts
python3 ${path_to_volcanosv}/bin/VolcanoSV-asm/volcanosv-asm.py \
-bam Hifi_L2_hg19_minimap2_chr10.bam \
-o volcanosv_asm_output \
-ref refdata-hg19-2.1.0/fasta/genome.fa \
-t 10 \
-chr 10 \
-dtype Hifi \
-px Hifi_L2 \
-asm hifiasm
The final contig will be volcanosv_asm_output/chr10/assembly/final_contigs/Hifi_L2_final_contigs.fa
.
If the volcanosv-asm pipeline is executed successfully, your final contig file should have roughly the same size as the Hifi_L2_contigs.fa from zenodo.
VolcanoSV-asm already includes the executable version of all assemblers, so you do not need to install them individually.
However, if you want more detailed information on these assemblers, you can click here.
Different assemblers vary in their ability to assemble regions enriched in segmental duplications (SDs) and other complex regions. Therefore, it is often advantageous to utilize different assemblers for different genomic regions. We thus also provide a hybrid mode: users can input a BED file, and specify an "in-BED" assembler and an "out-BED" assembler. The phase blocks that overlap with the BED file will be assembled using the in-BED assembler, while the rest will be assembled by the out-BED assembler. The script for this mode is ${path_to_volcanosv}/bin/VolcanoSV-asm/volcanosv-asm_hybrid.py
.
For example, if you provide a segdups.bed
, and want to use hicanu for the segdup regions and hifiasm for the other rest regions, you can use the code below:
python3 ${path_to_volcanosv}/bin/VolcanoSV-asm/volcanosv-asm_hybrid.py \
-bam Hifi_L2_hg19_minimap2_chr10.bam \
-o volcanosv_asm_output \
-ref refdata-hg19-2.1.0/fasta/genome.fa \
-bed segdups.bed \
--inbed_assembler hicanu \
--outbed_assembler hifiasm \
-t 10 \
-chr 10 \
-dtype Hifi \
-px Hifi_L2
The final contig will be volcanosv_asm_output/chr10/assembly/final_contigs/Hifi_L2_final_contigs.fa
.
If the volcanosv-asm pipeline is executed successfully, your final contig file should have roughly the same size as the Hifi_L2_contigs.fa from zenodo.
The main code is ${path_to_volcanosv}/bin/VolcanoSV-vc/Large_INDEL/volcanosv-vc-large-indel.py
. The input arguments for this code are explained below:
--input_dir INPUT_DIR, -i INPUT_DIR
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
--data_type DATA_TYPE, -dtype DATA_TYPE
Hifi;CLR;ONT
--bam_file RBAM_FILE, -bam RBAM_FILE
reads bam file for reads signature extraction
--reference REFERENCE, -ref REFERENCE
wgs reference file
--chrnum {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22}, -chr {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22}
--n_thread N_THREAD, -t N_THREAD
--n_thread_align N_THREAD_ALIGN, -ta N_THREAD_ALIGN
--mem_per_thread MEM_PER_THREAD, -mempt MEM_PER_THREAD
Set maximum memory per thread for alignment; suffix K/M/G recognized; default = 768M
--prefix PREFIX, -px PREFIX
The input directory should be the output directory of volcanoSV-asm. This code is compatible with either single chromosome mode or wgs mode: when the argument "chrnum" is provided, it will execute in single chromosome mode, otherwise, it will assume the input_dir contains chr1-chr22 contigs and execute in wgs mode. Please note that prefix
should remain consistent with what is set in volcanosv-asm.
After running the above code, you will have output VCF in <ouput_folder>/volcanosv_large_indel.vcf
.
For example, if you want to reproduce the VCF file for large indels on Hifi_L2 data, you can use the following command:
python3 ${path_to_volcanosv}/bin/VolcanoSV-vc/Large_INDEL/volcanosv-vc-large-indel.py \
-i volcanosv_asm_output/ \
-o volcanosv_large_indel_output/ \
-dtype Hifi \
-bam Hifi_L2_hg19_minimap2_chr10.bam \
-ref refdata-hg19-2.1.0/fasta/genome.fa \
-chr 10 -t 10 \
-px Hifi_L2
The VCF file will be volcanosv_large_indel_output/chr10/Hifi_L2_volcanosv_large_indel_chr10.vcf
.
If the volcanosv-vc-large-indel pipeline is executed successfully, your VCF file should have roughly the same number of variants as the Hifi_L2_variants.vcf from zenodo.
Note that, due to the randomness in assembly and alignment procedure, your VCF file may have 1 or 2 variants more or less than the Hifi_L2_variants.vcf. If that happens, we may still consider the pipeline as executed successfully, as long as the difference is minor.
The main script is ${path_to_volcanosv}/bin/VolcanoSV-vc/Small_INDEL/volcanosv-vc-small-indel.py
. The input arguments for this script are explained below:
--input_dir INPUT_DIR, -i INPUT_DIR
--bam_file READ_BAM, -bam READ_BAM
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
--reference REFERENCE, -ref REFERENCE
--bedfile BEDFILE, -bed BEDFILE
optional; a high confidence bed file (default: None)
--region REGION, -r REGION
optional; exmaple: chr21:2000000-2100000 (default: None)
--n_thread N_THREAD, -t N_THREAD
--kmer_size KMER_SIZE, -k KMER_SIZE
--ratio RATIO, -rt RATIO
maximum bad kmer ratio (default: 0.3)
--min_support MIN_SUPPORT, -ms MIN_SUPPORT
maximum support for bad kmer (default: 5)
--prefix PREFIX, -px PREFIX
The input directory should be the output directory of volcanoSV-asm.
The example code is as below:
python3 ${path_to_volcanosv}/bin/VolcanoSV-vc/Small_INDEL/volcanosv-vc-small-indel.py \
-i volcanosv_asm_output/ \
-o volcanosv_small_indel \
-bam Hifi_L2_hg19_minimap2_chr10.bam \
-ref refdata-hg19-2.1.0/fasta/genome.fa \
-r chr10 \
-t 30 \
-px Hifi_L2
After running the above code, you will have output VCF in volcanosv_small_indel/Hifi_L2_volcanosv_small_indel.vcf
.
We adopt Longshot's SNP call result as the final SNP call. After successfully running the assembly pipeline, you will have the phased SNP VCF file: volcanosv_asm_output/<chromosome_name>/phasing_result/<prefix>_phased.vcf
.
The VolcanoSV assembly is designed to operate on a per-chromosome basis. If you have access to a distributed computing system that supports multiple job submissions, we recommend submitting one job per chromosome and running them concurrently. You can follow the template below to construct your job script:
python3 ${path_to_volcanosv}/bin/VolcanoSV-asm/volcanosv-asm.py \
-bam <wgs_bam> \
-o volcanosv_asm_output \
-ref <reference_file> \
-t 10 \
-chr <chromosome_number> \
-dtype <datatype> \
-px <prefix>
To create job scripts for different chromosomes, simply change the <chromosome_number> in each script. All scripts share the same output folder. After all jobs are finished, you will see 22 subfolders under the output folder you specified, which are the chr1-chr22 assembly results. This is the most efficient way to run volcanosv-asm pipeline on a distributed system.
However, if you do not have a distributed computing system, you may write a for loop to run different chromosomes:
for i in {1..22}
do
echo "***********************assembly for chr${i}*******************"
python3 ${path_to_volcanosv}/bin/VolcanoSV-asm/volcanosv-asm.py \
-bam <wgs_bam> \
-o volcanosv_asm_output \
-ref <reference_file> \
-t 10 \
-chr ${i} \
-dtype <datatype> \
-px <prefix>
done
The chr1-chr22 will be saved under volcanosv_asm_output
. This is slower but can still get the work done.
The main script is ${path_to_volcanosv}/bin/VolcanoSV-vc/Large_INDEL/volcanosv-vc-large-indel.py
. This script is designed for both single chromosome mode and wgs mode. To run it in WGS mode, you should first finish running the volcanosv-asm pipeline and provide the assembly output folder as the input folder for this code. You should not provide the `chr' argument in WGS mode.
An example command is as below:
python3 ${path_to_volcanosv}/bin/VolcanoSV-vc/Large_INDEL/volcanosv-vc-large-indel.py \
-i volcanosv_asm_output/ \
-o volcanosv_large_indel_output/ \
-dtype <datatype> \
-bam <wgs_reads_bamfile> \
-ref <reference> \
-t 11 \
-px <prefix>
After running the above code, you will have output VCF in volcanosv_large_indel_output/<prefix>_volcanosv_large_indel.vcf
.
The main code is ${path_to_volcanosv}/bin/VolcanoSV-vc/Complex_SV/volcanosv-vc-complex-sv.py
. The input arguments for this code are explained below:
--input_dir INPUT_DIR, -i INPUT_DIR
--indelvcf INDELVCF, -vcf INDELVCF
--bamfile BAMFILE, -bam BAMFILE
--reference REFERENCE, -ref REFERENCE
--datatype {Hifi,CLR,ONT}, -dtype {Hifi,CLR,ONT}
--output_dir output_dir, -o output_dir
--n_thread N_THREAD, -t N_THREAD
--prefix PREFIX, -px PREFIX
The input directory should be the output directory of volcanoSV-asm.
The example code is as below:
python3 ${path_to_volcanosv}/bin/VolcanoSV-vc/Complex_SV/volcanosv-vc-complex-sv.py \
-i volcanosv_asm_output/ \
-vcf volcanosv_large_indel_output/volcanosv_large_indel.vcf \
-o volcanosv_complex_sv \
-dtype <datatype> \
-bam <wgs_reads_bamfile> \
-ref <reference> \
-t 11 \
-px <prefix>
After running the above code, you will have output VCF in volcanosv_complex_sv/<prefix>_volcanosv_complex_SV.vcf
.
The main code is ${path_to_volcanosv}/bin/VolcanoSV-vc/Small_INDEL/volcanosv-vc-small-indel.py
. The input arguments for this code are explained below:
--input_dir INPUT_DIR, -i INPUT_DIR
--bam_file READ_BAM, -bam READ_BAM
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
--reference REFERENCE, -ref REFERENCE
--bedfile BEDFILE, -bed BEDFILE
optional; a high confidence bed file (default: None)
--region REGION, -r REGION
optional; exmaple: chr21:2000000-2100000 (default: None)
--n_thread N_THREAD, -t N_THREAD
--kmer_size KMER_SIZE, -k KMER_SIZE
--ratio RATIO, -rt RATIO
maximum bad kmer ratio (default: 0.3)
--min_support MIN_SUPPORT, -ms MIN_SUPPORT
maximum support for bad kmer (default: 5)
--prefix PREFIX, -px PREFIX
The input directory should be the output directory of volcanoSV-asm.
The example code is as below:
python3 ${path_to_volcanosv}/bin/VolcanoSV-vc/Small_INDEL/volcanosv-vc-small-indel.py \
-i volcanosv_asm_output/ \
-o volcanosv_small_indel \
-bam <wgs_reads_bamfile> \
-ref <reference> \
-t 30 \
-px <prefix>
After running the above code, you will have output VCF in volcanosv_small_indel/<prefix>_volcanosv_small_indel.vcf
.
We adopt Longshot's SNP call result as the final SNP call. After successfully running the assembly pipeline, you will have the phased SNP VCF file: volcanosv_asm_output/<chromosome_name>/phasing_result/<prefix>_phased.vcf
.
After running the whole assembly-variant calling process, you may want to collect variants of all type and all chromosomes into one single VCF file. We provided script to do that.
If you run large and small indel detection in WGS mode, then you can do this
python3 ${path_to_volcanosv}/bin/Utils/Merge_VCF.py VolcanoSV_Variants.vcf \
volcanosv_large_indel_output/${prefix}_volcanosv_large_indel.vcf \
volcanosv_complex_sv/${prefix}_volcanosv_complex_SV.vcf \
volcanosv_small_indel/${prefix}_volcanosv_small_indel.vcf \
volcanosv_asm_output/*/phasing_result/${prefix}_phased.vcf
If you run large and small indel detection in single chromosome mode, and the result is in a format like volcanosv_large_indel_output/<chromosome_name>/${prefix}_volcanosv_large_indel.vcf
, then you can use '*' to substitute different chromosome name. The code is like
python3 ${path_to_volcanosv}/bin/Utils/Merge_VCF.py VolcanoSV_Variants.vcf \
volcanosv_large_indel_output/*/${prefix}_volcanosv_large_indel.vcf \
volcanosv_complex_sv/${prefix}_volcanosv_complex_SV.vcf \
volcanosv_small_indel/*/${prefix}_volcanosv_small_indel.vcf \
volcanosv_asm_output/*/phasing_result/${prefix}_phased.vcf
Through either way, the merged result will be VolcanoSV_Variants.vcf
.
You may also use VolcanoSV-vc for other haplotype-resolved assemblies. Note: you need to reformat your assemblies file before you run it. You assemblies file should have "hp1" or "hp2" in the contig name to differentiate different haplotype.
Here is an example of using other assemblies, when the data type is Hifi.
python3 ${path_to_volcanosv}/bin/VolcanoSV-vc/Large_INDEL/volcanosv-vc-large-indel-otherasm.py \
-i contigs.fa \
-o out \
-d Hifi \
-bam ${reads_bamfile} \
-ref ${reference_file} \
-t 10 -px ${prefix}
python3 ${path_to_volcanosv}/bin/VolcanoSV-vc/Complex_SV/volcanosv-vc-complex-sv-otherasm.py \
-hp1 hp1.fa \
-hp2 hp2.fa \
-vcf volcanosv_large_indel_output/volcanosv_large_indel.vcf \
-o volcanosv_complex_sv \
-dtype <datatype> \
-bam <wgs_reads_bamfile> \
-ref <reference> \
-t 11 \
-px <prefix>
After WGS assembly, if you would like to evaluate assembly for SDs and further achieve better assembly in SD-enriched regions, you can run the below pipeline, which includes 3 steps.
Align reads to the contig fasta file, and then utilize Flagger to annotate assembly for collapse components (collapsed SD regions). To run this step, you need Java and Docker installed in your system.
python3 ${path_to_volcanosv}/bin/VolcanoSV-asm/Evaluate_Assembly.py \
--input_dir <volcanosv_output> \
--output_dir <SD_recovery_dir> \
--fastq_file <raw_reads_fastq> \
--data_type <DATA_TYPE> \
--n_thread <t> \
--mem_per_thread <mem> \
--sample_name <sample> \
--lib_name <lib>
After this step, you will have a <sample>_<lib>_collapsed_hp_namex.txt
file generated in the output folder. <sample>_<lib>_collapsed_hp_namex.txt
file contains the collapsed phase block names. In step 2, you will use this file as input to perform assembly only focusing on these collapsed phase blocks.
Perform assembly only in those collapsed regions using a specified assembler.
The main script is General_Assembly_Workflow.py
. The arguments are:
--hap_file HAP_FILE, -haps HAP_FILE
--fastq_dirs FASTQ_DIRS [FASTQ_DIRS ...], -fqds FASTQ_DIRS [FASTQ_DIRS ...]
the folder that includes FASTQ files. (default: None)
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
--assemblers {wtdbg2,canu,miniasm,shasta,nextdenovo,hifiasm,hicanu,flye} [{wtdbg2,canu,miniasm,shasta,nextdenovo,hifiasm,hicanu,flye} ...], -asms {wtdbg2,canu,miniasm,shasta,nextdenovo,hifiasm,hicanu,flye} [{wtdbg2,canu,miniasm,shasta,nextdenovo,hifiasm,hicanu,flye} ...]
the assemblers used for fastq_dirs; order corresponds to the order of fastq_dirs (default: None)
--data_type {CLR,ONT,Hifi}, -d {CLR,ONT,Hifi}
--pacbio_subtype {rs,sq}, -pb {rs,sq}
must provide when using wtdbg2 on CLR data (default: None)
--shasta_ont_config {Nanopore-OldGuppy-Sep2020}, -shacon {Nanopore-OldGuppy-Sep2020}
--n_thread N_THREAD, -t N_THREAD
--asm_thread ASM_THREAD, -ta ASM_THREAD
--clean, -cl
--prefix PREFIX, -px PREFIX
file prefix in the output folder (default: Sample)
Example Usage:
python3 ${path_to_volcanosv}/bin/VolcanoSV-asm/General_Assembly_Workflow_SD.py \
-hap <sample>_<lib>_collapsed_hp_namex.txt \
-i <volcanosv_output>/ \
-o <volcanosv_output>/SD_recovery
-asms <your_specified_assembler> \
-d <type> \
-t <t>
You will have a <volcanosv_output>/SD_recovery/final_contigs/final_contigs.fa
generated.
Use the newly generated contigs to replace the previously collapsed contigs.
python3 ${path_to_volcanosv}/bin/VolcanoSV-asm/Replace_Collapsed_Contigs.py \
-og <SD_recovery_dir>/assemblies.fa \
-new <volcanosv_output>/SD_recovery/final_contigs/final_contigs.fa \
-o <volcanosv_output>/SD_recovery/SD_recovered.fa \
-hap <sample>_<lib>_collapsed_hp_namex.txt
<SD_recovery_dir>/assemblies.fa is generated in Step1. <volcanosv_output>/SD_recovery/SD_recovered.fa is the SD recovered contig file.
We use truvari4.0.0 to perform benchmarking against the Genome in a Bottle (GIAB) gold standard set in a high confidence region. The parameter we use is
p=0.5 P=0.5 r=500 S=30 O=0.01
Data Type | Coverage | Memory | ncpu | Run time | CPU hours |
---|---|---|---|---|---|
Hifi | 56X | 168GB | 30 | 2-01:07:59 | 1474 |
CLR | 89x | 259GB | 20 | 5-09:05:59 | 2582 |
ONT | 48x | 85GB | 30 | 6-02:33:59 | 4397 |
Data Type | Coverage | Memory | ncpu | Run time | CPU hours |
---|---|---|---|---|---|
Hifi | 56X | 21GB | 50 | 00:25:11 | 21 |
CLR | 89x | 215GB | 50 | 02:27:35 | 123 |
ONT | 48x | 34GB | 50 | 00:38:24 | 32 |
Luo C, Liu YH, Zhou XM. VolcanoSV enables accurate and robust structural variant calling in diploid genomes from single-molecule long read sequencing. Nat Commun. 2024 Aug 13;15(1):6956. PubMed PMID: 39138168. https://doi.org/10.5281/zenodo.12671886