forked from bixcop18/module_3_intro_NGS
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Exercise3_UsingOtherAlignmentMethods.txt
44 lines (27 loc) · 2.43 KB
/
Exercise3_UsingOtherAlignmentMethods.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
Exercise 3: Using other alignment methods
BLAST is a very useful tool for sequence alignment, but other programs may be better suited to certain applications.
For example, say we have the assembled genome sequences of multiple varieties of a certain species but only one variety has been annotated with gene models.
However, we would like to know where the genes are located in the other genomes. How should we go about finding the positions of these gene models in the genome assemblies of the other varieties?
In this exercise, we will take the genome assemblies from multiple different accessions of Arabidopsis (seattle, chi2, C24 and Bur) and try using BLAST and gmap to identify to locations of the Col-0 gene models in the other assemblies.
1. Download/copy the genome assemblies for the different Arabidopsis accessions
- you can copy the files from here /home/lgardiner/Jemima/Intro_to_NGS_exercises/Exercise_3/arabidopsis_accessions/
2. Produce some summary statistics about the genome sequences.
e.g. How many chromosomes?
Are the genomes the same size?
Are all the chromosomes the same length in the different accessions?
3. Download the Col-0 (Reference) Arabidopsis transcriptome fasta file.
4. How many transcripts in this file?
5. Choose one of the Arabidopsis accessions to test out the different aligners
6. Use BLAST to find the locations of the Col-0 transcripts in the genome of your chosen accession (this will take some time, move onto step 9 while this runs!)
TIP: As we are BLASTing many sequences, output to a file and in tabular format
TIP: Can you get BLAST to just give you the top hit?
7. What is the average sequence identity?
8. On average, what percentage of the gene length is covered in these BLAST hits?
9. Use gmap to find the locations of the Col-0 transcripts in your chosen accession.
HINT: You will need to use the gmap manual to find out how to use gmap.
10. Play around with the different output options and parameters in gmap.
11. Which aligner is better suited to finding the locations of the transcripts? Why?
Further exercises:
12. Use gmap to identify the locations of transcripts in all the different Arabidopsis accessions.
13. Extract the gDNA of the transcripts from the Arabidopsis accessions.
14. Choose a random set of 20 transcripts and write a script to perform multiple alignments of each gene using the gDNA from the different Arabidopsis accessions.