options

V-pipe: user configurable options

The workflow can be customized through the configuration file vpipe.config. This is a plain-text file with a basic structure composed of sections, properties and values. For instance, we suggest providing as input a tabular file that specifies unique sample identifiers (e.g., patient identifiers) and the dates of the different sequencing runs for each patient. The name of this file (here, samples.tsv) is provided by setting the property samples_file in the section input, as follows:

[input]
samples_file = samples.tsv

As shown above, section names are enclosed in square brackets, and each property is followed by its value.
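The section/property/value layout can be read with a standard INI parser. The following is a minimal sketch, assuming that vpipe.config is compatible with Python's configparser module; the helper name read_option and the inline example text are illustrative and not part of V-pipe.

```python
import configparser

# Illustrative configuration text; in practice this would be the
# contents of vpipe.config.
EXAMPLE_CONFIG = """
[input]
samples_file = samples.tsv

[general]
aligner = bwa
"""


def read_option(config_text, section, prop, default=None):
    """Return the value of `prop` in `section`, or `default` if absent."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # `fallback` covers both a missing section and a missing property.
    return parser.get(section, prop, fallback=default)


if __name__ == "__main__":
    print(read_option(EXAMPLE_CONFIG, "input", "samples_file"))
    print(read_option(EXAMPLE_CONFIG, "general", "threads", default="4"))
```

Returning a default for missing entries mirrors the idea that every option has a built-in value unless the user overrides it.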

Below, we provide a comprehensive list of all user-configurable options stratified by sections.

general

threads

This option specifies the default number of threads for all multi-threaded rules. That is, unless the number of threads is specified for a given rule, this value is used. The default value is 4.
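The fallback order described above (rule-specific value first, then the general value, then the built-in default of 4) can be sketched as follows. This is an illustration only: the helper names parse and threads_for_rule are hypothetical, and the sketch assumes the configuration is readable by Python's configparser.

```python
import configparser

DEFAULT_THREADS = 4  # default value stated in the text


def parse(config_text):
    """Parse INI-style configuration text into a ConfigParser."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser


def threads_for_rule(parser, rule):
    """Resolve threads for `rule`: rule section, then [general], then 4."""
    general = parser.getint("general", "threads", fallback=DEFAULT_THREADS)
    return parser.getint(rule, "threads", fallback=general)


if __name__ == "__main__":
    cfg = parse("[general]\nthreads = 8\n\n[hmm_align]\nthreads = 16\n")
    print(threads_for_rule(cfg, "hmm_align"))
    print(threads_for_rule(cfg, "snv"))
```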

aligner

There are two options for mapping reads, either using ngshmmalign (default) or bwa [1]. For the latter, the user should indicate in the configuration file that bwa is the aligner to be used, e.g.,

[general]
aligner = bwa

haplotype_reconstruction

There are two options available for haplotype reconstruction, namely SAVAGE [2] or HaploClique. SAVAGE is used by default. If you wish to use HaploClique, indicate it in the configuration file as below,

[general]
haplotype_reconstruction = haploclique

input

datadir

Directory where samples are stored. By default, it is set to samples.

samples_file

File containing unique sample identifiers and dates as tab-separated values, e.g.,

patient1    20100113
patient1    20110202
patient2    20081130

Here, we have two samples from patient 1 and one sample from patient 2. By default, V-pipe searches for a file named samples.tsv. If this file does not exist, a list of samples is built by globbing the contents of the datadir directory.

Optionally, the samples file can contain a third column specifying the read length. This is particularly useful when samples are sequenced using protocols with different read lengths. In this case, the option trim_percent_cutoff should be set to a fraction between 0 and 1 (see below).
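To make the file layout concrete, here is a minimal sketch of a reader for such a file, including the optional third column. The function name parse_samples is hypothetical, and falling back to 250 bp when the column is missing is an assumption based on the default read length mentioned for trim_percent_cutoff.

```python
import csv
import io


def parse_samples(handle, default_read_length=250):
    """Read (patient, date, read_length) tuples from tab-separated text.

    The third column is optional; rows without it get the default
    read length (an assumption for this sketch).
    """
    samples = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        patient, date = row[0].strip(), row[1].strip()
        has_length = len(row) > 2 and row[2].strip()
        read_length = int(row[2]) if has_length else default_read_length
        samples.append((patient, date, read_length))
    return samples


if __name__ == "__main__":
    text = "patient1\t20100113\npatient1\t20110202\t150\npatient2\t20081130\n"
    for sample in parse_samples(io.StringIO(text)):
        print(sample)
```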

paired

Indicate whether the input sequencing reads correspond to paired-end reads. By default, it is set to True.

fastq_suffix

Fastq files are expected to be stored in a subdirectory named raw_data. For example, for patient 1 and the first sample, the hierarchy should look like

samples
└── patient1
    └── 20100113
        └── raw_data
            ├── patient1_20100113_R1.fastq
            └── patient1_20100113_R2.fastq

By default, V-pipe looks for fastq files matching the following pattern: prefix + R + {1,2} + .fastq. If a suffix follows R1 and R2 in the file names, the user needs to specify it through this option.
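The naming pattern can be illustrated with a short sketch that builds the expected paths for a paired-end sample. The function expected_fastq_paths is hypothetical, and the prefix form patient_date_ is inferred from the directory tree shown above rather than taken from the V-pipe source.

```python
from pathlib import PurePosixPath


def expected_fastq_paths(datadir, patient, date, fastq_suffix=""):
    """Build the two expected paired-end fastq paths for one sample.

    Pattern: prefix + R + {1,2} + fastq_suffix + .fastq, where the
    prefix is assumed to be "<patient>_<date>_".
    """
    raw_dir = PurePosixPath(datadir) / patient / date / "raw_data"
    prefix = f"{patient}_{date}_"
    return [str(raw_dir / f"{prefix}R{mate}{fastq_suffix}.fastq")
            for mate in (1, 2)]


if __name__ == "__main__":
    for path in expected_fastq_paths("samples", "patient1", "20100113"):
        print(path)
    # With a hypothetical suffix "_001" following R1 and R2:
    for path in expected_fastq_paths("samples", "patient1", "20100113", "_001"):
        print(path)
```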

trim_percent_cutoff

Using this parameter, the user can specify the read-length threshold that should be applied during quality trimming as a fraction of the read length (0 < trim_percent_cutoff < 1). This is particularly useful when samples are sequenced using protocols with different read lengths that differ from the default (250 bp). In such cases, the samples file (e.g., samples.tsv) should be provided, as the sample-specific read length is parsed from this file.
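As a worked illustration, the sketch below turns a fractional cutoff into an absolute length threshold per sample. It assumes the threshold is simply the read length multiplied by the fraction and rounded down; the exact rule used by V-pipe is not stated here, and the function name length_cutoff is hypothetical.

```python
def length_cutoff(read_length, trim_percent_cutoff):
    """Absolute length threshold for a given read length and fraction.

    Assumption for this sketch: threshold = floor(read_length * fraction).
    """
    if not 0 < trim_percent_cutoff < 1:
        raise ValueError("trim_percent_cutoff must lie strictly between 0 and 1")
    return int(read_length * trim_percent_cutoff)


if __name__ == "__main__":
    # Two samples sequenced with different read lengths, same fraction.
    for read_length in (250, 150):
        print(read_length, length_cutoff(read_length, 0.5))
```

Expressing the cutoff as a fraction lets one configuration value apply sensibly to samples with different read lengths.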

output

Indicate whether single nucleotide variant calling (snv), local haplotype reconstruction (local), and/or global haplotype reconstruction (global) should be performed, by setting the corresponding properties to True. At the moment, local haplotype reconstruction and SNV calling are performed by ShoRAH, and both steps are intertwined. Global haplotype reconstruction is performed using the software SAVAGE. In addition, we provide a module for detecting flow-cell cross-contamination, which can be enabled by setting QA to True. In the current implementation, the latter is specific to HIV applications.
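The following sketch shows how the properties of the output section could be interpreted as on/off switches. It assumes an INI-compatible file read with Python's configparser, and it treats properties that are not listed as disabled, which is an assumption of this sketch and not a statement about V-pipe's defaults. The function name enabled_steps is hypothetical.

```python
import configparser

# Property names taken from the description above.
OUTPUT_STEPS = ("snv", "local", "global", "QA")


def enabled_steps(config_text):
    """Return the output steps set to True in the [output] section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # configparser matches property names case-insensitively by default,
    # so "QA" and "qa" refer to the same property.
    return [step for step in OUTPUT_STEPS
            if parser.getboolean("output", step, fallback=False)]


if __name__ == "__main__":
    example = "[output]\nsnv = True\nlocal = True\nglobal = False\n"
    print(enabled_steps(example))
```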

applications

The path to the different software packages can be specified using this section, e.g.,

[applications]
bwa = /path/to/bwa
haploclique = /path/to/haploclique

It is especially useful when dependencies are not obtained via conda, and when the software packages are not in the PATH.

Allocation of resources can vary with input size (e.g., number of reads) and number of samples. Therefore, users can specify memory and time requirements for all rules. For multi-threaded software packages, the number of threads can also be customized.
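As an illustration of per-rule resource settings, the sketch below collects the mem, time and threads overrides given for one rule. The numeric values in the example are hypothetical, their units are not specified here, and the helper rule_overrides is not part of V-pipe; values are kept as strings so that no unit is implied.

```python
import configparser

# Hypothetical per-rule settings; values and units are placeholders.
EXAMPLE = """
[snv]
mem = 10000
time = 240
threads = 8
"""


def rule_overrides(config_text, rule, keys=("mem", "time", "threads")):
    """Return the resource properties explicitly set for `rule`."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section(rule):
        return {}
    return {key: parser.get(rule, key)
            for key in keys if parser.has_option(rule, key)}


if __name__ == "__main__":
    print(rule_overrides(EXAMPLE, "snv"))
    print(rule_overrides(EXAMPLE, "savage"))
```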

gunzip

Available configurable options for cluster environments: mem, and time.

extract

Available configurable options for cluster environments: mem, and time.

preprocessing

Available configurable options for cluster environments: mem, and time. Other options:

qual_threshold

Mean quality score used for filtering low-quality reads.

min_len

Reads shorter than min_len are filtered out.

initial_vicuna

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments: threads. NOTE: The conda environment for this rule does not work properly. The package on the bioconda channel, mvicuna, is slightly different from VICUNA and has different command-line arguments. Moreover, VICUNA and mvicuna are no longer maintained. In the future, this rule will be deprecated.

initial_vicuna_msa

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments: threads. NOTE: Obtaining an initial reference de novo is implemented for more than one sample.

hmm_align

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads. Other options:

leave_msa_temp

This option is useful for debugging purposes.

sam2bam

Available configurable options for cluster environments: mem, and time.

bwa_QA

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads.

coverage_QA

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads.

msa

This rule takes all reads previously aligned by hmm_align. Therefore, resources should be allocated accordingly. Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments: threads.

convert_to_ref

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads.

ref_index

Available configurable options for cluster environments: mem, and time.

bwa_align

Available configurable options for cluster environments: mem, and time. Other options for reporting consensus sequences:

min_coverage

Minimum read depth for reporting variants per locus.

qual_thrd

Minimum phred quality score for a base to be included.

min_freq

Minimum frequency for an ambiguous nucleotide.

bowtie_align

Available configurable options for cluster environments: mem, and time. Other options for reporting consensus sequences:

min_coverage

Minimum read depth for reporting variants per locus.

qual_thrd

Minimum phred quality score for a base to be included.

min_freq

Minimum frequency for an ambiguous nucleotide.

minor_variants

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads.

coverage_intervals

Available configurable options for cluster environments: mem, and time.

shorah_regions

Available configurable options for cluster environments: mem, and time.

snv

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads. Other options:

shift

ShoRAH performs local haplotype reconstruction on windows of the read alignment. The overlap between these windows is defined by the window shifts. By default, it is set to 3.
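To illustrate how the number of shifts relates to window overlap, the sketch below assumes that consecutive windows start window_length // shift positions apart, so that with shift = 3 each position is covered by up to three windows. Both this step rule and the window length of 201 used in the example are assumptions of this sketch, not values taken from this page.

```python
def window_starts(region_start, region_end, window_length, shift):
    """Start positions of overlapping windows over [region_start, region_end).

    Assumption: the step between consecutive windows is
    window_length // shift.
    """
    step = window_length // shift
    last_start = region_end - window_length
    return list(range(region_start, last_start + 1, step))


if __name__ == "__main__":
    # Hypothetical region of 402 positions, window length 201, shift 3.
    print(window_starts(0, 402, 201, 3))
```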

keep_files

ShoRAH can reuse results from previous (e.g., interrupted) runs. By default, this option is set to False.

aggregate

Available configurable options for cluster environments: mem, and time.

savage

Available configurable options for cluster environments: mem, and time. Other options:

split

Size of the batches of reads to be processed by SAVAGE. It is recommended that 500 < coverage/split < 1000. By default, it is set to 20.
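The recommendation 500 < coverage/split < 1000 can be rearranged to give bounds on split for a given coverage, namely coverage/1000 < split < coverage/500. The sketch below applies this arithmetic; the function names are hypothetical and the coverage values are examples only.

```python
def split_bounds(coverage):
    """Open interval (low, high) of split values satisfying
    500 < coverage / split < 1000."""
    return coverage / 1000, coverage / 500


def split_is_recommended(coverage, split):
    """Check the recommendation 500 < coverage / split < 1000."""
    return 500 < coverage / split < 1000


if __name__ == "__main__":
    coverage = 15000  # example coverage
    print(split_bounds(coverage))
    print(split_is_recommended(coverage, 20))
```

Under this rule, the default value of 20 falls inside the recommended range for coverages strictly between 10000 and 20000.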

haploclique

Available configurable options for cluster environments: mem, and time. Other options:

relax

If set to True (default), a predefined set of parameter values is used for drawing edges between reads in the read graph.

no_singletons

Singletons are defined as proposed haplotypes which are supported by a single read. If this property is set to True (default), singletons are discarded.

no_prob0

If set to True (default), the probability of the overhangs is ignored.

clique_size_limit

Sets a threshold to limit the size of cliques. By default, it is set to 3.

max_num_cliques

Indicates the maximum number of cliques to be considered in the next iteration. By default, it is set to 10000.

haploclique_visualization

Available configurable options for cluster environments: mem, and time. Other options:

region_start

Use to specify the start of a region of interest.

region_end

Use to specify the end of a region of interest.

msa

When the ground truth is available (e.g., simulation studies), a multiple sequence alignment of types making up the population can be provided, and additional checks are performed.

Defaults for user configurable options are provided in vpipe.snake.

References

[1] Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013.

[2] Baaijens JA et al. De novo assembly of viral quasispecies using overlap graphs. Genome Res. 2017.
