options
The workflow can be customized through the configuration file `vpipe.config`. This configuration file is a text file with a basic structure composed of sections, properties, and values. For instance, we suggest providing as input a tab-separated file specifying unique sample identifiers (e.g., patient identifiers) and dates for the different sequencing runs of the same patient. The name of this file (here, `samples.tsv`) can be provided by setting the property `samples_file` in the section `input`, as follows:
[input]
samples_file = samples.tsv
As shown above, section names are enclosed in square brackets, and each property is followed by its value.
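As a rough illustration of this section/property/value structure, the snippet below reads such a configuration with Python's standard `configparser` module. It is only a sketch for inspecting a configuration by hand, under the assumption that the file follows the usual INI conventions; it does not claim to show how V-pipe itself loads `vpipe.config`.

```python
import configparser

# Example configuration text, mirroring the snippet above.
CONFIG_TEXT = """
[input]
samples_file = samples.tsv
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Sections are looked up by name, properties by key within a section.
samples_file = parser["input"]["samples_file"]
print(samples_file)  # samples.tsv
```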
Below, we provide a comprehensive list of all user-configurable options stratified by sections.
This option should be used to specify the default number of threads for all multi-threaded rules. That is, unless the number of threads is specified for each rule, this value is set as default. Default value is 4.
There are three options for mapping reads: ngshmmalign (`ngshmmalign`, default), BWA MEM (`bwa`) [1], or Bowtie 2 (`bowtie`) [2]. To use an aligner other than the default, set the property `aligner`. E.g., to use Bowtie 2:
[general]
aligner = bowtie
There are two options available for calling single nucleotide variants: ShoRAH (`shorah`) or LoFreq (`lofreq`) [3]. ShoRAH is used by default. If you prefer to use LoFreq, indicate so in the configuration file as shown below:
[general]
snv_caller = lofreq
There are two options available for haplotype reconstruction, namely SAVAGE [4] or HaploClique. SAVAGE is used by default. If you wish to use HaploClique, indicate it in the configuration file as below:
[general]
haplotype_reconstruction = haploclique
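The three tool choices described here can be checked before launching a run. The following is a minimal sketch, not part of V-pipe: the function name `read_general` is hypothetical, and the value `savage` for the default haplotype reconstruction method is an assumption, since the text only names the tool and not its configuration value.

```python
import configparser

# Property -> (default, allowed values), as described in the text.
# NOTE: "savage" is an assumed spelling of the default value.
CHOICES = {
    "aligner": ("ngshmmalign", ["ngshmmalign", "bwa", "bowtie"]),
    "snv_caller": ("shorah", ["shorah", "lofreq"]),
    "haplotype_reconstruction": ("savage", ["savage", "haploclique"]),
}


def read_general(config_text):
    """Return the [general] tool choices, falling back to the defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    selected = {}
    for prop, (default, allowed) in CHOICES.items():
        # fallback covers both a missing section and a missing property
        value = parser.get("general", prop, fallback=default)
        if value not in allowed:
            raise ValueError(f"{prop} must be one of {allowed}, got {value!r}")
        selected[prop] = value
    return selected


print(read_general("[general]\naligner = bowtie\n"))
```

Unset properties fall back to their defaults, and an unknown value raises a `ValueError` instead of being silently accepted.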
Directory where samples are stored. By default, it is set to `samples`.
File containing unique sample identifiers and dates as tab-separated values, e.g.,
patient1 20100113
patient1 20110202
patient2 20081130
Here, we have two samples from patient 1 and one sample from patient 2.
By default, V-pipe searches for a file named `samples.tsv`. If this file does not exist, a list of samples is built by globbing the contents of the `datadir` directory.
Optionally, the samples file can contain a third column specifying the read length. This is particularly useful when samples are sequenced using protocols with different read lengths.
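To make the two- and three-column layouts concrete, here is a small sketch that parses such a file. The function name `read_samples` and the read length of 150 are illustrative assumptions; this is not V-pipe's own parser.

```python
import csv
import io


def read_samples(handle):
    """Parse rows of: sample id, date, and an optional read length."""
    samples = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        sample_id, date = row[0], row[1]
        # The third column is optional; None means "not specified".
        read_length = int(row[2]) if len(row) > 2 and row[2] else None
        samples.append((sample_id, date, read_length))
    return samples


# Hypothetical content: two rows with a read length, one without.
example = io.StringIO(
    "patient1\t20100113\t150\n"
    "patient1\t20110202\t150\n"
    "patient2\t20081130\n"
)
for record in read_samples(example):
    print(record)
```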
Indicate whether the input sequencing reads correspond to paired-end reads. By default, it is set to `True`.
Fastq files are expected to be stored in a subdirectory named `raw_data`. For example, for patient 1 with one paired-end read dataset, the hierarchy should look like
samples
└── patient1
    └── 20100113
        └── raw_data
            ├── patient1_20100113_R1.fastq
            └── patient1_20100113_R2.fastq
By default, V-pipe finds the fastq files matching the following pattern: `prefix + R + {1,2} + .(fastq|fq|fastq.gz|fq.gz)`. If a suffix should be introduced after R1 and R2, the user needs to specify it using this option.
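The naming pattern can be approximated with a regular expression. The sketch below is an interpretation rather than V-pipe's actual matching code: it treats the prefix as any leading text, accepts the plain and gzip-compressed extensions, and inserts the optional suffix directly after `R1`/`R2`. The function name `fastq_regex` and the suffix `_001` are hypothetical.

```python
import re


def fastq_regex(suffix=""):
    """Build a regex for: prefix + R + {1,2} + suffix + fastq extension."""
    return re.compile(
        r"^(?P<prefix>.*)R(?P<mate>[12])"
        + re.escape(suffix)
        + r"\.(?:fastq|fq)(?:\.gz)?$"
    )


# Without a suffix, the file names from the example hierarchy match.
match = fastq_regex().match("patient1_20100113_R1.fastq")
print(match.group("prefix"), match.group("mate"))

# A file with extra text after R2 only matches once the suffix is given.
print(fastq_regex().match("sample_R2_001.fastq.gz") is None)
print(fastq_regex("_001").match("sample_R2_001.fastq.gz") is not None)
```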
Using this parameter, the user can specify the read-length threshold that should be applied during quality trimming, expressed as a fraction of the read length (0 < `trim_percent_cutoff` < 1). This is particularly useful when samples are sequenced using protocols with different read lengths that differ from the default (i.e., 250 bp). In such a case, the `<samples_file>.tsv` file should be provided, and it is expected to contain 3 columns. The sample-specific read length is parsed from this file.
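As a worked example of how a relative cutoff translates into an absolute length threshold, the sketch below multiplies the read length by `trim_percent_cutoff`. That the threshold is computed as this product, truncated to an integer, is an assumption made for illustration; the function name and the cutoff values are hypothetical.

```python
def min_length_after_trimming(read_length, trim_percent_cutoff):
    """Return the minimum read length (bp) kept after quality trimming."""
    if not 0 < trim_percent_cutoff < 1:
        raise ValueError("trim_percent_cutoff must satisfy 0 < value < 1")
    # Assumed rule: threshold = cutoff * read length, truncated to an integer.
    return int(read_length * trim_percent_cutoff)


# Default read length of 250 bp with an illustrative cutoff of 0.8.
print(min_length_after_trimming(250, 0.8))  # 200
# A sample sequenced with 150 bp reads and an illustrative cutoff of 0.5.
print(min_length_after_trimming(150, 0.5))  # 75
```

With a per-sample read length taken from the third column of the samples file, the same cutoff yields a different absolute threshold for each protocol.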
Indicate whether single nucleotide variant calling (`snv`), local haplotype reconstruction (`local`), and/or global haplotype reconstruction (`global`) should be performed by setting the corresponding properties to `True`. At the moment, local haplotype reconstruction and SNV calling are performed by ShoRAH, and both steps are intertwined. Global haplotype reconstruction is performed using the software SAVAGE. In addition, we provide a module for detecting flow-cell cross-contamination, which can be enabled by setting `QA` to `True`. In the current implementation, the latter is specific to HIV applications.
The path to the different software packages can be specified using this section, e.g.,
[applications]
bwa = /path/to/bwa
haploclique = /path/to/haploclique
It is especially useful when dependencies are not obtained via conda, and when the software packages are not in the `PATH`.
NOTE We strongly recommend using conda environments by adding the `--use-conda` flag to the V-pipe execution command, e.g. `./vpipe --use-conda`. If you prefer to use your own installations, this section allows you to specify the location of the executables.
The resources required can vary with the input size (e.g. number of reads) and the number of samples. Therefore, users can specify memory and time requirements for all rules. For multi-threaded software packages, the number of threads can also be customized.
Available configurable options for cluster environments: `mem` and `time`.
Available configurable options for cluster environments: `mem` and `time`.
We use the software PRINSEQ [5] for quality control. By default, we use the options `-ns_max_n 4 -min_qual_mean 30 -trim_qual_left 30 -trim_qual_right 30 -trim_qual_window 10`, which trim reads using a sliding window of size 10 bp and remove bases whose quality scores are below 30. Additionally, reads are filtered out if their average quality score is below 30 or if they contain more than 4 N's. The user can choose to overwrite the default settings or pass additional parameters by using the property `extra`. E.g., if many reads are filtered out in this step, the user can choose to lower the quality threshold as follows:
[preprocessing]
extra = -ns_max_n 4 -min_qual_mean 20 -trim_qual_left 20 -trim_qual_right 20 -trim_qual_window 10
Please do not modify the PRINSEQ options `-out_format`, `-out_good`, or `-min_len`. Instead of using `-min_len` to define a threshold on the read length after trimming, use the property `trim_percent_cutoff`.
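Since some PRINSEQ options are reserved, it can help to check a candidate `extra` string before putting it in the configuration file. The following sketch is a hypothetical helper, not part of V-pipe; it only detects reserved options that appear as separate whitespace-delimited tokens.

```python
import shlex

# Options that the text asks users not to set through `extra`.
RESERVED = {"-out_format", "-out_good", "-min_len"}


def check_prinseq_extra(extra):
    """Split `extra` into tokens and reject reserved PRINSEQ options."""
    tokens = shlex.split(extra)
    clashes = sorted(RESERVED.intersection(tokens))
    if clashes:
        raise ValueError(f"do not set {', '.join(clashes)} via 'extra'")
    return tokens


relaxed = (
    "-ns_max_n 4 -min_qual_mean 20 -trim_qual_left 20 "
    "-trim_qual_right 20 -trim_qual_window 10"
)
print(check_prinseq_extra(relaxed))
```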
Available configurable options for cluster environments: `mem` and `time`.
Available configurable options for cluster environments: `mem` and `time`; for single-node as well as cluster environments: `threads`.
NOTE The conda environment for this rule doesn’t work properly. The package on the bioconda channel, mvicuna, is slightly different from VICUNA and it has different command-line arguments. Moreover, VICUNA and mvicuna are no longer maintained. In the future, this rule will be deprecated.
Available configurable options for cluster environments: `mem` and `time`; for single-node as well as cluster environments: `threads`.
NOTE Obtaining an initial reference de novo is implemented for more than one sample.
Available configurable options for cluster environments: `mem` and `time`; for single-node as well as cluster environments: `threads`. Other options:
Property | Explanation |
---|---|
extra | pass additional options to run ngshmmalign |
leave_msa_temp | this option is useful for debugging purposes |
V-pipe uses the option `-R <path/to/initial_reference>`; thus, the option `-r arg` is not allowed. Also, instead of passing `-l` via the property `extra`, set `leave_msa_temp` to `True`. Lastly, please do not modify the options `-o arg`, `-w arg`, `-t arg`, and `-N arg`. These are already managed by V-pipe.
Available configurable options for cluster environments: `mem` and `time`.
Available configurable options for cluster environments: `mem` and `time`; for single-node as well as cluster environments: `threads`.
Available configurable options for cluster environments: `mem` and `time`; for single-node as well as cluster environments: `threads`.
This rule takes all reads previously aligned by `hmm_align`. Therefore, resources should be allocated accordingly. Available configurable options for cluster environments: `mem` and `time`; for single-node as well as cluster environments: `threads`.
Available configurable options for cluster environments: `mem` and `time`; for single-node as well as cluster environments: `threads`.
Available configurable options for cluster environments: `mem` and `time`.
Available configurable options for cluster environments: `mem` and `time`. With the property `extra`, users can pass additional options to run BWA MEM. For more details on BWA MEM configurable options, refer to the software documentation.
Available configurable options for cluster environments: `mem` and `time`.
Property | Default | Explanation |
---|---|---|
phred | `--phred33` | indicate whether qualities are Phred+33 (default) or Phred+64 (`--phred64`) |
preset | `--local --sensitive-local` | specify Bowtie 2 presets |
extra | | pass additional options to run Bowtie 2. V-pipe handles the input and output files, as well as the reference sequence; thus, do not modify these options |
For more details on Bowtie 2 configurable options refer to the software documentation.
Available configurable options for cluster environments: `mem` and `time`. Other options:
Property | Default | Explanation |
---|---|---|
min_coverage | 50 | minimum read depth for reporting variants per locus |
qual_thrd | 15 | minimum phred quality score for a base to be included |
min_freq | 0.05 | minimum frequency for an ambiguous nucleotide |
Available configurable options for cluster environments: `mem` and `time`; for single-node as well as cluster environments: `threads`. Other options:
Property | Default | Explanation |
---|---|---|
min_coverage | 100 | minimum read depth for reporting variants per locus |
frequencies | False | output a numpy array file containing the frequencies of all bases, including gaps, as well as the most abundant base across samples |
Available configurable options for cluster environments: `mem` and `time`. This rule is used to find windows of the read alignment with relatively high coverage, i.e., higher than `coverage`.
Property | Default | Explanation |
---|---|---|
coverage | 50 | minimum read depth. A region spanning the reference genome is returned if coverage is set to 0 |
liberal | True | indicate whether to apply a more liberal shifting on intervals' right-endpoint |
overlap | False | construct intervals based on overlapping windows of the read alignment. By default, regions with high coverage are built based on the position-wise read depth |
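To illustrate the idea of building regions from the position-wise read depth, the sketch below collects maximal runs of positions whose depth exceeds the threshold. It is a simplified illustration and not the rule's actual implementation: it ignores the `liberal` and `overlap` properties and the special handling of `coverage` set to 0, and the function name and depth values are hypothetical.

```python
def high_coverage_intervals(depth, coverage):
    """Return half-open [start, end) intervals where depth > coverage."""
    intervals = []
    start = None
    for pos, d in enumerate(depth):
        if d > coverage and start is None:
            start = pos  # a new high-coverage run begins here
        elif d <= coverage and start is not None:
            intervals.append((start, pos))  # the run ended before pos
            start = None
    if start is not None:
        # the last run extends to the end of the alignment
        intervals.append((start, len(depth)))
    return intervals


# Hypothetical per-position read depths along a short reference.
depth = [10, 60, 80, 55, 20, 5, 70, 90]
print(high_coverage_intervals(depth, coverage=50))  # [(1, 4), (6, 8)]
```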
Available configurable options for cluster environments: `mem` and `time`.
Available configurable options for cluster environments: `mem` and `time`; for single-node as well as cluster environments: `threads`. Other options:
Property | Default | Explanation |
---|---|---|
alpha | 0.1 | hyperparameter used for instantiating a new cluster |
ignore_indels | False | ignore SNVs adjacent to indels |
coverage | 0 | omit windows with coverage less than this value |
shift | 3 | ShoRAH performs local haplotype reconstruction on windows of the read alignment. The overlap between these windows is defined by the window shifts. By default, it is set to 3, i.e., apart from flanking regions, each position is covered by 3 windows |
keep_files | False | indicate whether to move files produced in previous/interrupted runs to a subdirectory named `old` |
Available configurable options for cluster environments: `mem` and `time`. Additionally, the property `extra` allows passing additional options to `lofreq call`.
Available configurable options for cluster environments: `mem` and `time`. Other options:
Property | Default | Explanation |
---|---|---|
split | 20 | size of the batches of reads to be processed by SAVAGE. It is recommended that 500 < coverage/split < 1000 |
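The recommendation 500 < coverage/split < 1000 can be turned into a range of admissible `split` values for a given coverage. The sketch below does this with integer arithmetic; the function name is hypothetical, and interpreting `coverage` as an integer read depth is an assumption.

```python
def savage_split_range(coverage):
    """Return (lowest, highest) split with 500 < coverage/split < 1000.

    Returns None when no positive integer split satisfies the bounds.
    """
    # coverage / split < 1000  <=>  split > coverage / 1000
    lowest = coverage // 1000 + 1
    # coverage / split > 500   <=>  500 * split < coverage
    highest = (coverage - 1) // 500
    if lowest > highest:
        return None
    return lowest, highest


# With a hypothetical coverage of 10000, split should lie in 11..19.
print(savage_split_range(10000))  # (11, 19)
# Coverage of 500 or less leaves no admissible value.
print(savage_split_range(400))  # None
```

Under these bounds, a coverage of 10000 with the default `split` of 20 gives exactly 500 reads per batch, which sits on the boundary of the recommended range.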
NOTE This rule only works on Linux.
Available configurable options for cluster environments: `mem` and `time`. Other options:
Property | Default | Explanation |
---|---|---|
relax | True | if set to True (default) a predefined set of parameter values is used for drawing edges between reads in the read graph |
no_singletons | True | singletons are defined as proposed haplotypes which are supported by a single read. If this property is set to True , singletons are discarded |
no_prob0 | True | if set to True (default) probability of the overhangs is ignored |
clique_size_limit | 3 | sets a threshold to limit the size of cliques |
max_num_cliques | 10000 | indicates the maximum number of cliques to be considered in the next iteration |
Available configurable options for cluster environments: `mem` and `time`. Other options:
Property | Default | Explanation |
---|---|---|
region_start | 0 | use to specify a region of interest |
region_end | 9719 | use to specify a region of interest |
msa | | when the ground truth is available (e.g., simulation studies), a multiple sequence alignment of the types making up the population can be provided, and additional checks are performed |
[1] Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013.
[2] Langmead, B. and Salzberg, S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012.
[3] Wilm, A. et al. LoFreq: A sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012.
[4] Baaijens, J.A. et al. De novo assembly of viral quasispecies using overlap graphs. Genome Res. 2017.
[5] Schmieder, R. and Edwards, R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011.