This is a step-by-step tutorial how the WGS data of willows was analyzed from Illumina sequencing. Please notice that all analyses were run in a cluster due to the amount of data generated. This tutorial is supposed to be user-friendly but I will be delighted to answer any questions.
For this project, we need to install the pipeline (SECAPR; Andermann et al. 2018) and an additional program to trim the alignments (Gblocks; Castresana, 2002)
In our willow project, we used the pipeline called Sequence Capture Processor (hereafter SECAPR; Andermann et al. 2018). Please notice that the installation and setup here is just a copy of the original SECAPR repository on github. The complete documentation and installation files can be found at: http://antonellilab.github.io/seqcap_processor/
SECAPR is a pipeline written in python by Tobias Andermann. It comprises a number of steps that allows you to process your raw sequencing data from de novo assembly to multiple sequencing alignment. It also has a number of data visualization codes where it allows you to re-adjust your parameters or move forward with phylogenetic inference analyses.
SECAPR analytical workflow. The flowchart shows the basic SECAPR functions, which are separated into two separate steps (colored boxes). Blue box (1. reference library from raw data): in this step the raw reads are cleaned and assembled into contigs (de novo assembly); Orange box (2. reference based assembly with custom reference library): the contigs from the previous step are used for reference-based assembly, enabling allele phasing and additional quality control options, e.g. concerning read-coverage. Black boxes show SECAPR commands and white boxes represent the input and output data of the respective function. Boxes marked in grey represent multiple sequence alignments (MSAs) generated with SECAPR, which can be used for phylogenetic inference. **Extracted from SECAPR Github.
Download the Python2.7 version of Miniconda and install it by executing the downloaded sh-file (see commands below).
Download conda (MacOS 64bit):
wget https://repo.continuum.io/miniconda/Miniconda2-latest-MacOSX-x86_64.sh
Download conda (Linux 64bit):
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
Download conda (Linux 32bit):
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86.sh
Install conda:
sh Miniconda2-latest-*.sh
Activate conda (on metacentrum cluster):
module load conda-modules-py37
Add Bioconda channels (containing bioinformatics software):
conda config --add channels defaults; conda config --add channels conda-forge; conda config --add channels bioconda; conda config --add channels https://conda.anaconda.org/faircloth-lab
Conda automatically downloads and installs all necessary software dependencies. We strongly recommend to install SECAPR and all it's dependencies in a separate virtual environment, in order to not interfere with potentially already installed verisons of the software dependencies.
Install SECAPR in virtual environment (here named secapr_env):
wget https://raw.githubusercontent.com/AntonelliLab/seqcap_processor/master/recipe/install_secapr_env.sh
sh install_secapr_env.sh
To activate the newly created environment, type:
Activate environment:
source activate secapr_env
Check the installed version:
secapr_env --version
When the environment is activated, all the necessary software dependencies will be available in the standard path, e.g. when you type samtools the samtools version required for SECAPR will be executed. After you are done using secapr, you can deactivate the environment to switch back to your standard environment with this command:
source deactivate
Check if you are connected to the correct environment (there should eb a star in front of secapr_env in the output of this command):
Check active environment:
conda info --envs
For the purpose of this project, we are using the SECAPR version 2.0.2. Therefore we need to install the version by following the instructions below:
Connect to your secapr environment
source activate secapr_env
Remove the current secapr installation
pip uninstall secapr
Download the version 2.0.2 (Source code.zip) from github
wget https://github.com/AntonelliLab/seqcap_processor/releases/tag/v2.0.2
Unzip the downloaded file
unzip seqcap_processor-2.0.2.zip
Move the unzipped directory to a safe location on your computer (in our case to the metacentrum cluster), i.e. not on your Desktop or Download folder, since this will be the path where secapr will be executed from in the future
Enter the unzipped secapr directory
cd seqcap_processor-2.0.2
Install secapr from the folder
python -m pip install -e .
Check the version installed
secapr --version
SECAPR version 2.0.2 uses an updated version of Biophyton in which the Bio.alphabet is removed. Therefore, when we will have an error when doing the alignments saying that Bioalphabet is not installed. In order to have our alignments running we need to downgrade our version of Biopython.
First activate the SECAPR pipeline:
source activate secapr_env
Second downgrade the Biopython version to 1.77
conda install biopython==1.77
Gblocks is a software that eliminates poorly aligned positions and divergent regions of DNA or protein sequence allignments (Castresana, 2000; Talavera & Castresana, 2007).
For this project we are using Gblocks version 0.91. You can download this version through conda.
First activate conda:
conda install -c bioconda gblocks
Try to see if the Gblock is working by typing
Gblocks
Alternatively, you can download the software through its own website at http://molevol.cmima.csic.es/castresana/Gblocks.html
We used our Whole-genome sequencing data in order to propose a evolutionary hypothesis for the enigmatic genus Salix. To do this, we used the alignments from Wagner et al. 2020 as targets in our dataset. Wagner's dataset comprised 23,393 loci for 133 taxa.
- Download the DNA matrices and import to Geneious
- Select all alignments and create a consensus for every alignment by clicking on
Tools -> Generate Sequencing Consensus
- Export all the consensus sequences (fasta format) to a folder in your computer
- Open the folder in a terminal and join all the consensus sequences into a single file
cat * > reference.fasta
The file generated here will be used as the target.
Go to the folder comprising all the raw data reads generated from the last step and upload all of them to the metacentrum cluster:
scp * [email protected]:/storage/plzen1/home/paolaferreira/1.raw_data
You should also upload the fasta file reference in a new folder:
scp reference.fasta [email protected]:/storage/plzen1/home/paolaferreira/0.reference
Note our willow data is paired end, which means we sequenced sequence both ends of a fragment and generate high-quality, alignable sequence data. Therefore for every sample we have, SampleA_R1.fq.gz and SampleA_R2.fq.gz must have the same amount of reads. (Read more about Illumina Paired-End Sequencing at https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html)
for i in *.fq.gz; do echo $i; gunzip -c $i | wc -l | awk '{print $1/4}'; done
The results should look like this:
ACU2_1.fq.gz
48960515
ACU2_2.fq.gz
48960515
Lastly, save all the information for the number of reads because this is usually reported in the supplementary materials for any NGS publications.
In order to increase the depth coverage of our samples, we have sequenced S. glaucosericea (GSR7), S. lapponum (LAP1), S. mielichhoferi (MIE5), S. myrsinifolia (MYS5) twice. Before proceeding with further analyses, let's merge the sequences. For example, in S. myrsinifolia you will see four reads (two in every sequencing)
MYS5_1.fq.gz
46385628
MYS5_2_1.fq.gz
39397922
MYS5_2_2.fq.gz
39397922
MYS5_2.fq.gz
46385628
Merge the files MYS5_1.fq.gz and MYS5_2_1.fq.gz which represents both R1 sequencing:
cat MYS5_1.fq.gz MYS5_2_1.fq.gz > MYS5_R1.fq.gz
Do the same with the R2
cat MYS5_2.fq.gz MYS5_2_2.fq.gz > MYS5_R2.fq.gz
Check if the merging was performed correctly by checking the number of reads.
for i in MYS5*.fq.gz; do echo $i; gunzip -c $i | wc -l | awk '{print $1/4}'; done
The results should be the sum of both reads:
MYS5_R1.fq.gz
85783550
MYS5_R2.fq.gz
85783550
Since we are going to use the SECAPR pipeline, we need to have our files in a way that the pipeline will be able to recognize it. First unzip our .fastq files using the gunzip program:
gunzip *
Second rename your files: A simple sample ID is enough followed by _R1 for the forward reads and _R2 for the backward reads.
Our pipeline comprises of two main approaches. First, the clean reads were de novo assembled into contigs and mapped against the consensus alignment of 23,393 loci from Wagner et al. (2020). Target-enriched loci were aligned, trimmed and submitted to a new alignment consensus. Second, the consensus alignment was mapped against the clean reads. Multiple sequence alignments for each loci was performed which included 4 or more taxa and larger than 500 base pairs, and trimming unambiguous aligned regions. The resulting matrices used in downstream phylogenetic analyses using concatenation (IQtree), and coalescent approach (ASTRAL-III).
Supplementary Figure 1. Phylogenomics workflow.
In our case, the sequencing company was responsible for removing the adapters and do a quality control of the reads. However, most of the time you will receive the raw data. To convince yourself that the data is in a good quality, let's double check our reads:
secapr quality_check --input /storage/plzen1/home/paolaferreira/1.raw_data --output /storage/plzen1/home/paolaferreira/2.Checking_quality_raw_data
- If you are satisfied with the quality of your reads, move onto the next step.
After checking the quality (and/or cleaning and trimming the reads), we are now ready to use the fastq-reads for de-novo contig assembly. In this step, the overlap between fastq reads is being used to build long, uninterrupted sequences, with no apriori knowledge of the correct sequence or order of those fragments. We will use the contigs to hopefully find the target regions that were selected for during sequence capture. For our willow project, we used the development version of SECAPR which includes the SPADES assembler. This is a much faster program compared to ABySS and Trinity (mostly used for transcriptome analysis) and also allows you to test several k-mer sizes at once. But first, create a folder for each sample followed by "clean" and transfer both fastq reads to there. This is necessary since we skipped the cleaning step, otherwise SECAPR would create a folder. We ran our assembly using:
secapr assemble_reads --input /storage/plzen1/home/paolaferreira/1.raw_data --output /storage/plzen1/home/paolaferreira/3.assembling --assembler spades --kmer 21,33,55,77,99,127 --contig_length 50 --cores 16
Mapping is the process to align contigs (or reads) obtained by high-throughput genome sequencing to a reference (genome, genes, cds, etc). To do this we will be using the consensus sequence from the multiple sequence alignment uploaded previously.
Figure 1: Illustration of the mapping process. The input consists of a set of reads and a reference genome. In the middle, it gives the results of mapping: the locations of the reads on the reference genome. The first read is aligned at position 100 and the alignment has two mismatches. The second read is aligned at position 114. It is a local alignment with clippings on the left and right. The third read is aligned at position 123. It consists of a 2-base insertion and a 1-base deletion. Illustration and Figure legend extracted from https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/mapping/tutorial.html
We mapped our contigs using the following commands:
secapr find_target_contigs --contigs /storage/plzen1/home/paolaferreira/3.assembling --output /storage/plzen1/home/paolaferreira/4.mapping --reference /storage/plzen1/home/paolaferreira/0.reference/reference.fasta
Now, we have our selected contigs which were mapped against the RADseq references and we need to perform Multiple Sequence Alignments (MSAs) using mafft or muscle. This function builds a separate alignment for each locus with matching contigs for ≥3 samples.
secapr align_sequences --sequences /storage/plzen1/home/paolaferreira/4.mapping/combined_joined_unphased_fastas.fasta --output /storage/plzen1/home/paolaferreira/4.mapping/5.Alignment --aligner mafft --output-format fasta --no-trim --cores 12
Once we get our multiple sequence alignments, we usually need to remove poor aligned and divergent regions that may not be homologous or saturated by multiple substitutions.
Go to the folder where your aligments are and create a loop in bash using default parameters in Gblocks:
for i in *fasta; do Gblocks ${i} -t=y -p=y; done
This step is going to create two outputs: your_file_alignment.fasta-gb - This file is our trimmed multiple sequence alignment your_file_alignment.fasta-gb.htm - This file comprises the trimming statistics for each alignment.
Let's create a new folder and move the cleaned alignments:
paolaferreira@skirit:/storage/plzen1/home/paolaferreira$ cd ../
paolaferreira@skirit:/storage/plzen1/home/paolaferreira$ mkdir 6.cleaned_alignments
mv *gb ../6.cleaned_alignments
Usually an analysis of most NGS datasets comprises only one step approach and performing the previous steps that we have gone through (cleaning and trimming of the reads, de-novo assembly of contigs, mapping, multiple sequence alignments and trimming). However, the sequence targets used during the mapping were based on 23,393 alignments from a RADseq study (Wagner et al., 2020) which were only ~80 base pairs long. These short reference targets could cause some bias during the mapping and misleading phylogenies. In order to overcome these challenges, we generated new sequence targets from our trimmed alignments. Now, our new reference sequence averages around 2800 base pairs long and should be appropriated to infer a phylogenetic hypothesis. We ran our mapping using:
secapr reference_assembly --reads /storage/plzen1/home/paolaferreira/1.raw_data --reference_type alignment-consensus --reference /storage/plzen1/home/paolaferreira/6.cleaned_alignments --output /storage/plzen1/home/paolaferreira/8.reference_based_mapping --cores 16 --min_coverage 4
Multiple sequence alignments were again trimmed to remove poor aligned and divergent regions.
Go to the folder where your alignments are located and create a loop in bash using default parameters in Gblocks:
for i in *fasta; do Gblocks ${i} -t=y -p=y; done
Create a new folder and move the cleaned alignments:
paolaferreira@skirit:/storage/plzen1/home/paolaferreira$ cd ../
paolaferreira@skirit:/storage/plzen1/home/paolaferreira$ mkdir 9.cleaned_alignments
mv *gb ../9.cleaned_alignments
Now, we have our clean alignments ready to infer a phylogeny. For this study we import all the alignments to Geneious and only select those which were larger than ≥500 bp long and comprises ≥ four taxa. We ran our phylogenies using IQTREE version 2.1.2 (Minh et al., 2020).
Andermann, T., Cano, Á., Zizka, A., Bacon, C., & Antonelli, A. (2018). SECAPR-a bioinformatics pipeline for the rapid and user-friendly processing of targeted enriched Illumina sequences, from raw reads to alignments. PeerJ, 6, e5175. https://doi.org/10.7717/peerj.5175
Castresana, J. (2000). Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution 17, 540-552. https://doi.org/10.1093/oxfordjournals.molbev.a026334
Minh, B.Q., Schmidt, H.A., Chernomor, O., Schrempf, D., Woodhams, M.D., von Haeseler, A., Lanfear, R. 2020. IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol., 37, 1530-1534. https://doi.org/10.1093/molbev/msaa015
Talavera, G., and Castresana, J. (2007). Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Systematic Biology 56, 564-577. https://doi.org/10.1080/10635150701472164
Wagner, N. D., He, L., & Hörandl, E. (2020). Phylogenomic Relationships and Evolution of Polyploid Salix Species Revealed by RAD Sequencing Data. Frontiers in plant science, 11, 1077. https://doi.org/10.3389/fpls.2020.01077