options

V-pipe: user configurable options

The workflow can be customized through the configuration file vpipe.config. This is a plain-text file with a basic structure composed of sections, properties, and values. For instance, we suggest providing as input a tabular file that specifies unique sample identifiers (e.g., patient identifiers) and the dates of the different sequencing runs for the same patient. The name of this file (here, samples.tsv) is set with the property samples_file in the section input, as follows:

[input]
samples_file = samples.tsv

As shown above, section names are enclosed in square brackets, and each property is followed by its value.
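Because the file follows the common INI layout, it can be read with a standard INI parser. The following is a minimal sketch using Python's configparser module. It only illustrates the section/property/value structure and is not a description of how V-pipe itself loads the file. The fallback of 4 threads mirrors the default documented for the general section.

```python
import configparser

# Example configuration content, as it might appear in vpipe.config.
CONFIG_TEXT = """
[input]
samples_file = samples.tsv
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Section "input", property "samples_file".
samples_file = parser.get("input", "samples_file")

# A property that is not set falls back to a default supplied by the caller.
threads = parser.getint("general", "threads", fallback=4)

print(samples_file, threads)
```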

Below, we provide a comprehensive list of all user-configurable options stratified by sections.

general

threads

This option specifies the default number of threads for all multi-threaded rules. That is, this value is used for any rule whose number of threads is not set explicitly. The default value is 4.

aligner

There are three options for mapping reads: ngshmmalign (ngshmmalign, default), BWA MEM (bwa) [1], or Bowtie 2 (bowtie) [2]. To use an aligner other than the default, set the property aligner. E.g., to use Bowtie 2:

[general]
aligner = bowtie

snv_caller

There are two options available for calling single nucleotide variants, either using ShoRAH (shorah) or LoFreq (lofreq) [3]. ShoRAH is used by default. If you prefer to use LoFreq, then indicate so in the configuration file as shown below,

[general]
snv_caller = lofreq

haplotype_reconstruction

There are two options available for haplotype reconstruction, namely SAVAGE [4] or HaploClique. SAVAGE is used by default. If you wish to use HaploClique, indicate it in the configuration file as below,

[general]
haplotype_reconstruction = haploclique
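The three choices in the general section can be checked before a run. The sketch below is a hypothetical helper, not part of V-pipe. It assumes that the value strings for the defaults are shorah and savage, by analogy with the values shown above.

```python
import configparser

# Allowed values per property; the first entry is the documented default.
# "shorah" and "savage" as value strings are assumptions made by analogy.
GENERAL_CHOICES = {
    "aligner": ("ngshmmalign", "bwa", "bowtie"),
    "snv_caller": ("shorah", "lofreq"),
    "haplotype_reconstruction": ("savage", "haploclique"),
}

def read_general(config_text):
    """Return the [general] choices, raising ValueError on unknown values."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    settings = {}
    for prop, choices in GENERAL_CHOICES.items():
        value = parser.get("general", prop, fallback=choices[0])
        if value not in choices:
            raise ValueError(f"{prop} must be one of {choices}, got {value!r}")
        settings[prop] = value
    return settings

settings = read_general("[general]\naligner = bowtie\nsnv_caller = lofreq\n")
print(settings)
```

Properties that are left out keep their default, so only deviations from the defaults need to appear in the file.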

input

datadir

Directory where samples are stored. By default, it is set to samples.

samples_file

File containing unique sample identifiers and dates as tab-separated values, e.g.,

patient1    20100113
patient1    20110202
patient2    20081130

Here, we have two samples from patient 1 and one sample from patient 2. By default, V-pipe searches for a file named samples.tsv. If this file does not exist, a list of samples is built by globbing the contents of the datadir directory.

Optionally, the samples file can contain a third column specifying the read length. This is particularly useful when samples are sequenced using protocols with different read lengths.
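To make the two- and three-column layouts concrete, the following sketch parses such a file with Python's csv module. The function read_samples is hypothetical and is not V-pipe's own parser. It assumes that a missing third column means the default read length of 250 bp mentioned under trim_percent_cutoff.

```python
import csv
import io

DEFAULT_READ_LENGTH = 250  # assumed default, in bp

def read_samples(handle, default_read_length=DEFAULT_READ_LENGTH):
    """Parse rows of: identifier <TAB> date [<TAB> read length]."""
    samples = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        patient, date = row[0], row[1]
        if len(row) > 2 and row[2].strip():
            read_length = int(row[2])
        else:
            read_length = default_read_length
        samples.append((patient, date, read_length))
    return samples

# In-memory stand-in for samples.tsv; the second row carries a read length.
example = io.StringIO(
    "patient1\t20100113\n"
    "patient1\t20110202\t150\n"
    "patient2\t20081130\n"
)
samples = read_samples(example)
for sample in samples:
    print(sample)
```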

paired

Indicate whether the input sequencing reads correspond to paired-end reads. By default, it is set to True.

fastq_suffix

Fastq files are expected to be stored in a subdirectory named raw_data. For example, for patient 1 with one paired-end read dataset, the hierarchy should look like

samples
└── patient1
    └── 20100113
        └── raw_data
            ├── patient1_20100113_R1.fastq
            └── patient1_20100113_R2.fastq

By default, V-pipe finds the fastq files matching the following pattern: prefix + R + {1,2} + .(fastq|fq|fastq.gz|fq.gz). If a suffix follows R1 and R2 in the file names, the user needs to specify it using this option.
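The file-name pattern can be written as a regular expression. The sketch below is an illustration under stated assumptions, not V-pipe's actual matching code. It assumes that the suffix is inserted directly after R1 or R2 and that the compressed extensions are .fastq.gz and .fq.gz. The suffix _001 is a made-up example.

```python
import re

def fastq_regex(prefix, fastq_suffix=""):
    """Build a regex for: prefix + R{1,2} + suffix + .(fastq|fq)[.gz]."""
    return re.compile(
        re.escape(prefix)
        + r"R[12]"
        + re.escape(fastq_suffix)
        + r"\.(fastq|fq)(\.gz)?$"
    )

default = fastq_regex("patient1_20100113_")
with_suffix = fastq_regex("patient1_20100113_", fastq_suffix="_001")

print(bool(default.match("patient1_20100113_R1.fastq")))
print(bool(with_suffix.match("patient1_20100113_R1_001.fastq.gz")))
```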

trim_percent_cutoff

Using this parameter, the user can specify the read-length threshold that should be applied during quality trimming, expressed as a fraction of the read length (0 < trim_percent_cutoff < 1). This is particularly useful when samples are sequenced using protocols with different read lengths that differ from the default (i.e., 250 bp). In such cases, the samples file should be provided, and it is expected to contain 3 columns. The sample-specific read length is parsed from this file.
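As a worked example of how a fractional cutoff translates into a length in base pairs, consider the sketch below. The helper min_read_length is hypothetical, and the rounding rule is an assumption. V-pipe may compute the threshold differently.

```python
def min_read_length(read_length, trim_percent_cutoff):
    """Return the shortest acceptable read length after trimming, in bp."""
    if not 0 < trim_percent_cutoff < 1:
        raise ValueError("trim_percent_cutoff must satisfy 0 < value < 1")
    # Assumption: the threshold is the rounded product of the two values.
    return round(read_length * trim_percent_cutoff)

# With a cutoff of 0.8, reads from a 250 bp protocol must keep 200 bp,
# whereas reads from a 150 bp protocol must keep 120 bp.
print(min_read_length(250, 0.8))
print(min_read_length(150, 0.8))
```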

output

Indicate whether single nucleotide variant calling (snv), local haplotype reconstruction (local), and/or global haplotype reconstruction (global) should be performed by setting the corresponding properties to True. At the moment, local haplotype reconstruction and SNV calling are both performed by ShoRAH, and the two steps are intertwined. Global haplotype reconstruction is performed using the software SAVAGE. In addition, we provide a module for detecting flow-cell cross-contamination, which can be enabled by setting QA to True. In the current implementation, the latter is specific to HIV applications.

applications

The path to the different software packages can be specified using this section, e.g.,

[applications]
bwa = /path/to/bwa
haploclique = /path/to/haploclique

It is especially useful when dependencies are not obtained via conda, and when the software packages are not in the PATH.

NOTE We strongly recommend using conda environments by adding the --use-conda flag to the V-pipe execution command, e.g., ./vpipe --use-conda. If you prefer to use your own installations, this section allows you to specify the location of the executables.

Allocation of resources can vary with the input size (e.g., number of reads) and the number of samples. Therefore, users can specify memory and time requirements for all rules. For multi-threaded software packages, the number of threads can also be customized.

gunzip

Available configurable options for cluster environments: mem, and time.

extract

Available configurable options for cluster environments: mem, and time.

preprocessing

We use PRINSEQ [5] for quality control. By default, we use the options -ns_max_n 4 -min_qual_mean 30 -trim_qual_left 30 -trim_qual_right 30 -trim_qual_window 10, which trim reads using a sliding window of size 10 bp and remove bases whose quality scores are below 30. Additionally, reads are filtered out if their average quality score is below 30 or if they contain more than 4 N's. The user can overwrite the default settings or pass additional parameters by using the property extra. E.g., if many reads are filtered out in this step, the user can lower the quality threshold as follows:

[preprocessing]
extra = -ns_max_n 4 -min_qual_mean 20 -trim_qual_left 20 -trim_qual_right 20 -trim_qual_window 10

Please do not modify the PRINSEQ options -out_format, -out_good, or -min_len. Instead of using -min_len to define a threshold on the read length after trimming, use the property trim_percent_cutoff.
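The effect of lowering -min_qual_mean can be illustrated with a small filter. This is a simplified sketch of the two filtering criteria only (mean quality and number of N's). It is not PRINSEQ's implementation, and it leaves out the window-based trimming entirely. The function name and its parameters are hypothetical.

```python
def passes_filter(sequence, qualities, min_qual_mean=30, ns_max_n=4):
    """Return True if a read survives the mean-quality and N-count filters."""
    n_count = sequence.upper().count("N")
    mean_quality = sum(qualities) / len(qualities)
    return n_count <= ns_max_n and mean_quality >= min_qual_mean

read = "ACGTACGT"
qualities = [25] * len(read)

# The read is rejected with the default threshold of 30 and kept when the
# threshold is lowered to 20.
print(passes_filter(read, qualities))
print(passes_filter(read, qualities, min_qual_mean=20))
```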

Available configurable options for cluster environments: mem, and time.

initial_vicuna

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads.

NOTE The conda environment for this rule does not work properly. The package on the bioconda channel, mvicuna, differs slightly from VICUNA and has different command-line arguments. Moreover, VICUNA and mvicuna are no longer maintained, and this rule will be deprecated in the future.

initial_vicuna_msa

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads.

NOTE Obtaining an initial reference de novo is implemented for more than one sample.

hmm_align

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads. Other options:

Property	Explanation
extra	pass additional options to run ngshmmalign
leave_msa_temp	this option is useful for debugging purposes

V-pipe uses option -R <path/to/initial_reference>, thus option -r arg is not allowed. Also, instead of passing -l via the property extra, set leave_msa_temp to True. Lastly, please do not modify options -o arg, -w arg, -t arg, and -N arg. These are already managed by V-pipe.
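A value for extra can be checked against the options listed above before it is written to the configuration file. The helper below is hypothetical and not part of V-pipe. It only looks for the option names mentioned in this section.

```python
import shlex

# Options that this section asks users not to pass through `extra`.
RESERVED_OPTIONS = {"-r", "-l", "-o", "-w", "-t", "-N"}

def reserved_in_extra(extra):
    """Return the reserved options found in an `extra` string, sorted."""
    tokens = shlex.split(extra)
    return sorted(set(tokens) & RESERVED_OPTIONS)

found = reserved_in_extra("-t 8 -l")
print(found)
```

An empty result means that none of the reserved options were found in the string.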

sam2bam

Available configurable options for cluster environments: mem, and time.

bwa_QA

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads.

coverage_QA

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads.

msa

This rule takes all previously aligned reads by hmm_align. Therefore, resources should be allocated accordingly. Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads.

convert_to_ref

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads.

ref_index

Available configurable options for cluster environments: mem, and time.

bwa_align

Available configurable options for cluster environments: mem, and time. With property extra, users can pass additional options to run BWA MEM. For more details on BWA MEM configurable options refer to the software documentation.

bowtie_align

Available configurable options for cluster environments: mem, and time.

Property	Default	Explanation
phred	--phred33	indicate if qualities are Phred+33 (default) or Phred+64 (`--phred64`)
preset	--local --sensitive-local	specify Bowtie 2 presets
extra		pass additional options to run Bowtie 2. V-pipe handles the input and output files, as well as the reference sequence. Thus, do not modify these options

For more details on Bowtie 2 configurable options refer to the software documentation.

consensus_sequences

Available configurable options for cluster environments: mem, and time. Other options:

Property	Default	Explanation
min_coverage	50	minimum read depth for reporting variants per locus
qual_thrd	15	minimum phred quality score for a base to be included
min_freq	0.05	minimum frequency for an ambiguous nucleotide

minor_variants

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads. Other options:

Property	Default	Explanation
min_coverage	100	minimum read depth for reporting variants per locus
frequencies	False	output a numpy array file containing frequencies of all bases, including gaps and also the most abundant base across samples

coverage_intervals

Available configurable options for cluster environments: mem, and time. This rule is used to find windows on the read alignment with relatively high coverage, i.e., higher than coverage.

Property	Default	Explanation
coverage	50	minimum read depth. A region spanning the reference genome is returned if `coverage` is set to 0
liberal	True	indicate whether to apply a more liberal shifting on intervals' right-endpoint
overlap	False	construct intervals based on overlapping windows of the read alignment. By default, regions with high coverage are built based on the position-wise read depth
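The default behaviour, building regions from the position-wise read depth, can be sketched as a scan for contiguous runs of positions that reach the threshold. This is an illustration only, not the code of the rule. It ignores the liberal and overlap properties and assumes that a position qualifies when its depth is at least coverage, so that a threshold of 0 yields one region spanning everything.

```python
def high_coverage_intervals(depth, coverage=50):
    """Return half-open (start, end) intervals where depth >= coverage."""
    intervals = []
    start = None
    for position, value in enumerate(depth):
        if value >= coverage:
            if start is None:
                start = position  # open a new interval
        elif start is not None:
            intervals.append((start, position))  # close the current interval
            start = None
    if start is not None:
        intervals.append((start, len(depth)))
    return intervals

depth = [10, 60, 70, 20, 80, 90, 100]
print(high_coverage_intervals(depth, coverage=50))
print(high_coverage_intervals(depth, coverage=0))
```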

shorah_regions

Available configurable options for cluster environments: mem, and time.

snv

Available configurable options for cluster environments: mem, and time, and options for single-node as well as cluster environments threads. Other options:

Property	Default	Explanation
alpha	0.1	hyperparameter used for instantiating a new cluster
ignore_indels	False	ignore SNVs adjacent to indels
coverage	0	omit windows with coverage less than this value
shift	3	ShoRAH performs local haplotype reconstruction on windows of the read alignment. The overlap between these windows is defined by the window shifts. By default, it is set to 3, i.e., apart from flanking regions each position is covered by 3 windows
keep_files	False	indicate whether to move files produced in previous/interrupted runs to subdirectory named `old`
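The relation between shift and the number of windows covering a position can be checked with a small example. The sketch assumes that consecutive windows start window_length // shift positions apart. The window length of 6 is chosen only to keep the numbers small and is not a ShoRAH default.

```python
def window_starts(region_length, window_length, shift):
    """Start positions of overlapping windows tiling a region."""
    step = window_length // shift  # assumed distance between window starts
    return list(range(0, region_length - window_length + 1, step))

def windows_covering(position, starts, window_length):
    """Count the windows that contain a given position."""
    return sum(1 for start in starts if start <= position < start + window_length)

starts = window_starts(region_length=30, window_length=6, shift=3)

# An interior position is covered by 3 windows, whereas the first position
# of the region (a flanking position) is covered by a single window.
print(windows_covering(10, starts, window_length=6))
print(windows_covering(0, starts, window_length=6))
```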

lofreq

Available configurable options for cluster environments: mem, and time. Additionally, the property extra allows users to pass additional options to lofreq call.

savage

Available configurable options for cluster environments: mem, and time. Other options:

Property	Default	Explanation
split	20	size of the batches of reads to be processed by SAVAGE. It is recommended that 500 < coverage/`split` < 1000

NOTE This rule only works on Linux.
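The recommendation for split can be turned into a quick check. The helpers below are hypothetical and simply restate the inequality 500 < coverage/split < 1000 given in the table. The coverage values are made-up examples.

```python
def split_is_recommended(coverage, split):
    """Check the recommendation 500 < coverage / split < 1000."""
    return 500 < coverage / split < 1000

def recommended_split_bounds(coverage):
    """Return the open interval (low, high) of recommended split values."""
    return coverage / 1000, coverage / 500

# With a coverage of 15000, the default split of 20 gives 750 reads per
# batch, which lies inside the recommended range.
print(split_is_recommended(15000, 20))
print(recommended_split_bounds(15000))
```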

haploclique

Available configurable options for cluster environments: mem, and time. Other options:

Property	Default	Explanation
relax	True	if set to `True` (default) a predefined set of parameter values is used for drawing edges between reads in the read graph
no_singletons	True	singletons are defined as proposed haplotypes which are supported by a single read. If this property is set to `True`, singletons are discarded
no_prob0	True	if set to `True` (default) probability of the overhangs is ignored
clique_size_limit	3	sets a threshold to limit the size of cliques
max_num_cliques	10000	indicates the maximum number of cliques to be considered in the next iteration

haploclique_visualization

Available configurable options for cluster environments: mem, and time. Other options:

Property	Default	Explanation
region_start	0	use to specify a region of interest
region_end	9719	use to specify a region of interest
msa		when the ground truth is available (e.g., simulation studies), a multiple sequence alignment of types making up the population can be provided, and additional checks are performed

References

[1] Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013.
[2] Langmead, B. and Salzberg, S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012.
[3] Wilm et al. LoFreq: A sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012.
[4] Baaijens, JA et al. De novo assembly of viral quasispecies using overlap graphs. Genome Res. 2017.
[5] Schmieder, R. and Edwards, R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011.
