-
Notifications
You must be signed in to change notification settings - Fork 11
/
Alignment.Rmd
291 lines (148 loc) · 10.2 KB
/
Alignment.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
# Sequences alignment {#seqs_aln}
## Heredity
Today, we are capable of reconstructing kinship relationships between organisms through bioinformatics analysis of biological sequences, particularly proteins, and DNA. This is possible thanks to the principle of heredity. The biological sequences are transmitted from one generation to the next through sexual and asexual reproduction. From one generation to the next, the sequences accumulate variation as a result of stochastic or selective processes. Despite this variation, sequences from two closely related organisms are expected to be more similar to each other than sequences from two distantly related organisms. Between two organisms with a distant relationship, more time has elapsed since the last shared ancestor, accumulating more variation. This is the principle on which the search for homology between sequences is based, two DNA or protein sequences are homologous if they share ancestry. This means that the similarity between these sequences is the product of sharing the same ancestor placed somewhere in the phylogenetic tree of life. The higher the sequence similarity, the more recent the common ancestor between the sequences is expected to be. However, the identification of a hit with a high similarity is not enough to determine the homology of a sequence; it is also necessary to model the evolutionary process through phylogenetic tools. This chapter will address the methodological principles of pairwise alignments, the alignment between two biological sequences. Multiple sequence alignment is the topic of the next chapter. Before introducing pairwise alignment, I would like to introduce some homology concepts that are essential to the application and interpretation of alignments and phylogenies.
## Orthologs and paralogs
Homology relationships between sequences are defined by two important events in the evolution of species: duplication and speciation. Two sequences (genes or proteins) are orthologous if they are the product of speciation. Two sequences are paralogs if they are the products of duplication events. When two different organisms present sequences that are the product of duplication events prior to speciation, these sequences are out-paralogs. When duplication occurs after the speciation process, the sequences are called in-paralogs.
## BLAST
### Conda Installation
One of the easy ways to install bioinformatics software is with [Conda](https://docs.conda.io/en/latest/).
Download miniconda for ubuntu (If you have another OS choose the appropriate file)
```
https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
```
Set the correct authorizations
```
chmod +x Miniconda3-latest-Linux-x86_64.sh
```
Install
```
bash Miniconda3-latest-Linux-x86_64.sh
```
Export path
```
export PATH="$HOME/miniconda3/bin:$PATH"
```
Reinitialize the session.
## Install [BLAST+](https://www.ncbi.nlm.nih.gov/books/NBK279690/) with Conda
```
conda install -c bioconda blast=2.12.0
```
## Install database reference sequences
The first step is to create the directories where the databases will be installed.
For example, in your home directory:
```
mkdir NCBI
```
```
cd NCBI
```
Create two subdirectories, one for nucleic acid sequences and one for protein sequences.
```
mkdir nt
```
```
mkdir nr
```
### There are different ways to install the pre-formatted BLAST databases:
#### The 'wget' command
Download NCBI nucleotide sequence database
```
cd nt
```
```
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz
```
Uncompress the database files. You will find the uncompress.sh script in the scripts directory of this repository
```
./uncompress.sh
```
Download NCBI amino acid sequence database (In the nr directory)
```
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.??.tar.gz
```
```
./uncompress.sh
```
#### The update_blastdb.pl script.
BLAST+ offers the [update_blastdb.pl](https://www.ncbi.nlm.nih.gov/books/NBK569850/) script that allows you to download the NCBI databases. This script uncompress the files for you.
Obtain information from the available databases to download
```
update_blastdb.pl --showall [*]
```
Download the nucleotide sequence database (NCBI_DB/nt directory)
```
update_blastdb.pl --decompress nt [*]
```
Download the amino acid database (NCBI_DB/nr directory)
```
update_blastdb.pl --decompress nr [*]
```
## Pairwise sequence alignment
### BLASTn
Here is an example of alignment with BLASTn. Remember that BLAST offers different softwares to perform alignments according to the type of sequence and the reference database ([BLAST software](https://blast.ncbi.nlm.nih.gov/Blast.cgi)). BLASTn involves nucleotide-to-nucleotide alignments, this means that both the query sequences and the reference database are nucleotide sequences.
```
blastn -db /home/alexarmero2022/shared/NCBI/nt/nt -evalue 0.0000000001 -num_threads 10 -outfmt '6 std qcovs slen' -query test.fasta > testout.csv
```
In the last command line there are several important points to keep in mind:
1. Specify the full directory path to the NCBI database. In this example I'm using my path (/home/alexarmero2022/shared/NCBI/nt), replace it with the appropriate directory path on your system.
2. The -db option expects the name of a database. It is for this reason that in addition to the path, it is also necessary to give the name of the database in the previous example nt.
3. The -num_threads option refers to the number of cores or logic CPUs that the user wants to assign to the task. Specify it according to availability in your system.
4. The results of an alignment can be reported in different types of formats (-outfmt). In the example, the tabular format (6) was used; this format is easy to manipulate. In this example the standard alignment information (std) was requested to be printed, but also some other information such as the alignment coverage in the query sequence (qcovs) and the length of the target sequence (slen). [Tabular format options](https://www.metagenomics.wiki/tools/blast/blastn-output-format-6).
### BLASTp
Here an example of protein-protein alignment with BLASTp. Notice that the path to the database changed; now it points to the nr directory, where the amino acid sequences were installed. In the same way the name of the database also changed (nr). The sequences in the test.fasta file must necessarily be amino acid sequences.
```
blastp -db /home/alexarmero2022/shared/NCBI/nr/nr -evalue 0.0000000001 -num_threads 10 -outfmt '6 std qcovs slen' -query test.fasta > testout.csv
```
### Configuration files and environmental variables
BLAST+ allows you to configure the blast runs. Configuration can be done through a .ncbirc file. In the scripts directory of this directory, there is an example of this file. Once the variables of interest have been modified, place this file in your home directory. For example, you could indicate where your NCBI database is located:
```
; Start the section for BLAST configuration
[BLAST]
; Specifies the path where BLAST databases are installed
BLASTDB=/home/alexarmero2022/shared/NCBI/
```
In this last example we modify the variable **BLASTDB** and we assign the directory path to the NCBI databases. Note, that is the path we use in the blastn and blastp alignment examples. With this modification the blastn command line would subtly change:
```
blastn -db nt/nt -evalue 0.0000000001 -num_threads 10 -outfmt '6 std qcovs slen' -query test.fasta > testout.csv
```
You could also use environmental variables. For example from your command line, you could write:
```
BLASTDB=/home/alexarmero2022/shared/NCBI/
```
There are several other [variables](https://www.ncbi.nlm.nih.gov/books/NBK569858/) that can modify.
This last line is equivalent to using the .ncbirc file.
Note. The .ncbirc is a hidden file. To be able to observe it in your directory, execute the following line:
```
ls -ad .*
```
## BLAST with high-throughput data
BLAST is perhaps the best software in the identification of homology relationships between biological sequences. The sensitivity of this tool has a cost both in the use of physical computing resources and in processing time. These limitations are particularly evident when are used NCBI databases. Several efforts have been made to use high performance computing (HPC) resources to efficiently run BLAST on large datasets. Here is one such tool: [Crocroblast](https://webchem.ncbr.muni.cz/Platform/App/CrocoBLAST). This software is presented as an example, and a starting point for higher performance strategies. Crocroblast determines what is the optimal file size to perform the alignment according to the HPC physical capabilities. In this way this software divides a large fasta file into files with sizes optimized to the available system resources. This is an easy to use tool; we recommend a good reading of the CrocroBLAST [documentation](https://webchem.ncbr.muni.cz/Wiki/CrocoBLAST:UserManual). The following example shows how to perform a CrocoBLAST blastn alignment.
### Installation
Get the right version for your [operating system](https://webchem.ncbr.muni.cz/Platform/App/CrocoBLAST).
Install the software in an appropriate directory
```
mkdir Crocroblast
cd Crocroblast
unzip CrocoBLAST_linux_64.zip
chmod +x crocoblast
```
### Add sequence database
In order to use CrocroBLAST you need to add the sequence database.
Please adapt the following line according to the directory path where your database is located:
```
./crocoblast -add_database --formatted_db /home/alexarmero2022/shared/NCBI/nt/nt.00.nsq
```
### Add the job to the queue
CrocroBLAST handles job queues; this software has different functions to stop, pause, delete and run a job. The next line adds a job to the queue:
```
./crocoblast -add_to_queue blastn nt data/test.fa output_test_4 --blast_options -outfmt 6
```
The following line will allow you to check if the job was added to the queue
```
./crocoblast -status
```
### Run the job
```
./crocoblast -run --num_threads 4
```
CrocoBLAST shows messages with the time remaining to complete the process. We hope that this example motivates you to carefully read the CrocoBLAST [wiki](https://webchem.ncbr.muni.cz/Wiki/CrocoBLAST:UserManual) to efficiently exploit the different possibilities of this nice tool.