options
The workflow can be customized through the configuration file vpipe.config. This configuration file is a text file with a basic structure composed of sections, properties, and values. For instance, we suggest providing as input a tabular file that specifies unique sample identifiers (e.g., patient identifiers) and dates for the different sequencing runs related to the same patient. The name of this file (here, samples.tsv) can be provided under the section input, using the property samples_file, as follows:
[input]
samples_file = samples.tsv
As shown above, section names are enclosed in square brackets, and each property is followed by its value.
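To illustrate the section/property/value structure, here is a minimal sketch that reads such a file with Python's standard configparser module. Whether V-pipe itself parses the file this way is an assumption; the configuration text is inlined so the example is self-contained.

```python
import configparser

# Inline copy of the example above; in practice this would live in vpipe.config.
CONFIG_TEXT = """
[input]
samples_file = samples.tsv
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Each section maps property names to values (all values are read as strings).
samples_file = parser["input"]["samples_file"]
print(samples_file)
```

Note that configparser returns every value as a string, so numeric or boolean properties need explicit conversion (for example with getint or getboolean).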
Below, we provide a comprehensive list of all user-configurable options stratified by sections.
This option specifies the default number of threads for all multi-threaded rules. That is, unless the number of threads is specified for a given rule, this value is used. The default value is 4.
There are two options for mapping reads, either using ngshmmalign (default) or bwa [1]. For the latter, the user should indicate in the configuration file that bwa is the aligner to be used, e.g.,
[general]
aligner = bwa
There are two options available for haplotype reconstruction, namely SAVAGE [2] or HaploClique. SAVAGE is used by default. If you wish to use HaploClique, indicate it in the configuration file as below,
[general]
haplotype_reconstruction = haploclique
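The way defaults and user overrides interact in the general section can be sketched as follows. This is only an illustration: the fallback strings "ngshmmalign" and "savage" are assumed spellings of the defaults named above, and the helper general_option is hypothetical, not part of V-pipe.

```python
import configparser

# Assumed fallback values: the text names ngshmmalign and SAVAGE as defaults,
# but the exact strings used internally are not given on this page.
DEFAULTS = {"aligner": "ngshmmalign", "haplotype_reconstruction": "savage"}


def general_option(parser, name):
    """Return a [general] property, falling back to the assumed default."""
    return parser.get("general", name, fallback=DEFAULTS[name])


parser = configparser.ConfigParser()
parser.read_string("[general]\naligner = bwa\n")

aligner = general_option(parser, "aligner")  # set by the user
reconstruction = general_option(parser, "haplotype_reconstruction")  # not set
print(aligner, reconstruction)
```

Only the properties that differ from the defaults need to appear in the configuration file.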
Directory where samples are stored. By default, it is set to samples.
File containing unique sample identifiers and dates as tab-separated values, e.g.,
patient1 20100113
patient1 20110202
patient2 20081130
Here, we have two samples from patient 1 and one sample from patient 2.
By default, V-pipe searches for a file named samples.tsv. If this file does not exist, a list of samples is built by globbing the contents of the datadir directory.
Optionally, the samples file can contain a third column specifying the read length. This is particularly useful when samples are sequenced using protocols with different read lengths. In this case, the option trim_cutoff should be a fraction between 0 and 1 (see below).
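As a rough sketch of how such a samples file could be read, the function below parses tab-separated lines into tuples and treats the third column as optional. The function name parse_samples and the read lengths in the example are illustrative assumptions, not taken from V-pipe.

```python
import csv
import io


def parse_samples(text):
    """Parse tab-separated sample lines into (patient, date, read_length) tuples.

    read_length is None when the optional third column is absent.
    """
    samples = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        patient, date = row[0], row[1]
        read_length = int(row[2]) if len(row) > 2 and row[2] else None
        samples.append((patient, date, read_length))
    return samples


# Illustrative content: two rows carry a read length, one does not.
example = "patient1\t20100113\t150\npatient1\t20110202\npatient2\t20081130\t250\n"
samples = parse_samples(example)
print(samples)
```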
Indicate whether the input sequencing reads are paired-end. By default, it is set to True.
Fastq files are expected to be stored in a subdirectory named raw_data. For example, for patient 1 and the first sample, the hierarchy should look like
samples
└── patient1
    └── 20100113
        └── raw_data
            ├── patient1_20100113_R1.fastq
            └── patient1_20100113_R2.fastq
By default, V-pipe finds the fastq files matching the following pattern: prefix + R + {1,2} + .fastq. If a suffix follows R1 and R2 in the file names, the user needs to specify it through this option.
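The naming pattern and directory hierarchy can be made concrete with a small sketch. It assumes that the prefix is "<patient>_<date>_", as in the example tree, and that the suffix is placed directly after R1 or R2; the helper fastq_paths and the suffix "_001" are hypothetical.

```python
from pathlib import PurePosixPath


def fastq_paths(datadir, patient, date, suffix=""):
    """Build expected paths following prefix + R + {1,2} + suffix + .fastq.

    Assumption: the prefix is "<patient>_<date>_", as in the example tree.
    """
    raw_data = PurePosixPath(datadir) / patient / date / "raw_data"
    return [
        str(raw_data / f"{patient}_{date}_R{mate}{suffix}.fastq")
        for mate in (1, 2)
    ]


paths = fastq_paths("samples", "patient1", "20100113")
with_suffix = fastq_paths("samples", "patient1", "20100113", suffix="_001")
print(paths)
print(with_suffix)
```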
Using this parameter, the user can specify the read-length threshold that should be applied during quality trimming as a fraction of the read length (0 < trim_cutoff < 1). This is particularly useful when samples are sequenced using protocols with different read lengths that differ from the default (250 bp). In such cases, the samples file (<samples_file>.tsv) should be provided, as the sample-specific read length is parsed from this file.
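One way to read the fractional cutoff is as a multiplier on the read length of each sample. The sketch below works through that arithmetic; the helper min_read_length and the rounding to the nearest integer are assumptions, since this page does not state how V-pipe rounds the result.

```python
DEFAULT_READ_LENGTH = 250  # default read length stated above, in bp


def min_read_length(trim_cutoff, read_length=None):
    """Length threshold in bp for a fractional trim_cutoff (0 < trim_cutoff < 1)."""
    if not 0 < trim_cutoff < 1:
        raise ValueError("trim_cutoff is expected to be a fraction between 0 and 1")
    length = DEFAULT_READ_LENGTH if read_length is None else read_length
    # Rounding to the nearest integer is an assumption made for this sketch.
    return round(trim_cutoff * length)


default_threshold = min_read_length(0.8)       # uses the 250 bp default
short_threshold = min_read_length(0.8, 150)    # read length from the samples file
print(default_threshold, short_threshold)
```

With a cutoff of 0.8, a sample sequenced at 250 bp gets a threshold of 200 bp, while a sample sequenced at 150 bp gets 120 bp.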
Indicate whether single nucleotide variant calling (snv), local (local), and/or global haplotype reconstruction (global) should be performed, by setting the corresponding properties to True. At the moment, local haplotype reconstruction and SNV calling are performed by ShoRAH, and both steps are intertwined. Global haplotype reconstruction is performed using the software SAVAGE. In addition, we provide a module for detecting flow-cell cross-contamination, which can be enabled by setting QA to True. In the current implementation, the latter is specific to HIV applications.
The path to the different software packages can be specified using this section, e.g.,
[applications]
bwa = /path/to/bwa
haploclique = /path/to/haploclique
This is especially useful when dependencies are not obtained via conda and the software packages are not in the PATH.
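The relationship between configured paths and the PATH can be sketched as a simple lookup: prefer the path from the applications section, and otherwise search the PATH. The helper resolve_application is hypothetical and only illustrates the idea; it is not V-pipe's own lookup logic.

```python
import configparser
import shutil


def resolve_application(parser, name):
    """Return the configured path for a tool, else whatever is found on PATH."""
    configured = parser.get("applications", name, fallback=None)
    # shutil.which returns None when the executable is not on PATH.
    return configured or shutil.which(name)


parser = configparser.ConfigParser()
parser.read_string("[applications]\nbwa = /path/to/bwa\n")

bwa = resolve_application(parser, "bwa")
print(bwa)
```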
Allocation of resources can vary with input size (e.g., number of reads) and number of samples. Therefore, users can specify memory and time requirements for all rules. For multi-threaded software packages, the number of threads can also be customized.
Available configurable options for cluster environments: mem and time.
Available configurable options for cluster environments: mem and time.
Available configurable options for cluster environments: mem and time. Other options:
Mean quality score used for filtering low-quality reads.
Reads shorter than min_len are filtered out.
Available configurable options for cluster environments: mem and time; for single-node as well as cluster environments: threads.
NOTE The conda environment for this rule doesn’t work properly. The package on the bioconda channel, mvicuna, is slightly different from VICUNA and it has different command-line arguments. Moreover, VICUNA and mvicuna are no longer maintained. In the future, this rule will be deprecated.
Available configurable options for cluster environments: mem and time; for single-node as well as cluster environments: threads.
NOTE Obtaining an initial reference de novo is implemented for more than one sample.
Available configurable options for cluster environments: mem and time; for single-node as well as cluster environments: threads. Other options:
This option is useful for debugging purposes.
Available configurable options for cluster environments: mem and time.
Available configurable options for cluster environments: mem and time; for single-node as well as cluster environments: threads.
Available configurable options for cluster environments: mem and time; for single-node as well as cluster environments: threads.
This rule takes all reads previously aligned by hmm_align. Therefore, resources should be allocated accordingly. Available configurable options for cluster environments: mem and time; for single-node as well as cluster environments: threads.
Available configurable options for cluster environments: mem and time; for single-node as well as cluster environments: threads.
Available configurable options for cluster environments: mem and time.
Available configurable options for cluster environments: mem and time. Other options for reporting consensus sequences:
Minimum read depth for reporting variants per locus.
Minimum phred quality score for a base to be included.
Minimum frequency for an ambiguous nucleotide.
Available configurable options for cluster environments: mem and time. Other options for reporting consensus sequences:
Minimum read depth for reporting variants per locus.
Minimum phred quality score for a base to be included.
Minimum frequency for an ambiguous nucleotide.
Available configurable options for cluster environments: mem and time; for single-node as well as cluster environments: threads.
Available configurable options for cluster environments: mem and time.
Available configurable options for cluster environments: mem and time.
Available configurable options for cluster environments: mem and time; for single-node as well as cluster environments: threads. Other options:
ShoRAH performs local haplotype reconstruction on windows of the read alignment. The overlap between these windows is defined by the window shifts. By default, it is set to 3.
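To see how the number of shifts relates to window overlap, consider the following sketch. It assumes that consecutive windows start window_length // shifts positions apart, which is our reading of how ShoRAH tiles the alignment; the window length of 201 and the region length of 603 are purely illustrative, and edge handling is simplified.

```python
def window_starts(region_length, window_length, shifts=3):
    """Start positions of windows placed window_length // shifts apart."""
    step = window_length // shifts
    last_start = max(region_length - window_length, 0)
    return list(range(0, last_start + 1, step))


starts = window_starts(region_length=603, window_length=201, shifts=3)
overlap = 201 - (starts[1] - starts[0])  # bases shared by consecutive windows
print(starts, overlap)
```

Under this assumption, a larger number of shifts yields a smaller step and therefore more overlap between neighbouring windows.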
ShoRAH can reuse results from previous (e.g., interrupted) runs. By default, this option is set to False.
Available configurable options for cluster environments: mem and time.
Available configurable options for cluster environments: mem and time. Other options:
Size of the batches of reads to be processed by SAVAGE. It is recommended that 500 < coverage/split < 1000. By default, it is set to 20.
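The recommended range 500 < coverage/split < 1000 can be turned into a quick check. The sketch below lists the integer split values that satisfy the inequality for a given coverage; the helper recommended_splits and the coverage of 10000 are illustrative assumptions.

```python
def recommended_splits(coverage):
    """Integer split values s satisfying 500 < coverage / s < 1000."""
    # No s above coverage // 500 can satisfy coverage / s > 500.
    return [s for s in range(1, coverage // 500 + 1) if 500 < coverage / s < 1000]


splits = recommended_splits(10000)
print(splits)
```

For a coverage of 10000, the values 11 through 19 satisfy the recommendation; 10 and 20 sit exactly on the excluded boundaries.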
Available configurable options for cluster environments: mem and time. Other options:
If set to True (default), a predefined set of parameter values is used for drawing edges between reads in the read graph.
Singletons are defined as proposed haplotypes which are supported by a single read. If this property is set to True (default), singletons are discarded.
If set to True (default), the probability of the overhangs is ignored.
Sets a threshold to limit the size of cliques. By default, it is set to 3.
Indicates the maximum number of cliques to be considered in the next iteration. By default, it is set to 10000.
Available configurable options for cluster environments: mem and time. Other options:
Use to specify a region of interest
Use to specify a region of interest
When the ground truth is available (e.g., simulation studies), a multiple sequence alignment of types making up the population can be provided, and additional checks are performed.
Defaults for user-configurable options are provided in vpipe.snake.
[1] Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013.
[2] Baaijens, J.A. et al. De novo assembly of viral quasispecies using overlap graphs. Genome Res. 2017.