This repository has been archived by the owner on Apr 16, 2024. It is now read-only.
# Reference-Free Graphs with Cactus
## Cactus
https://github.com/ComparativeGenomicsToolkit/cactus
+ Reference-free whole genome MSA
+ Constructs graph based on MSA
## Cactus Graphs
Cactus Graphs “naturally decompose the common substructures in a set of related genomes into a hierarchy of chains that can be visualized as two-dimensional multiple alignments and nets that can be visualized in circular genome plots”
https://www.liebertpub.com/doi/abs/10.1089/cmb.2010.0252
![Cactus](./Figures/Cactus.png){width=100%}
## Cactus Algorithm
1. Multiple sequence aligner
2. Originally developed for multi-species alignments
3. Fast because it uses a guide tree (Newick format)
+ https://evolution.genetics.washington.edu/phylip/newicktree.html
4. Now supports a minigraph GFA in place of the guide tree for pangenome alignments
+ https://github.com/lh3/minigraph
![Yeast](./Figures/Yeast.png){width=100%}
## Reference-Free Graphs
https://academic.oup.com/bioinformatics/article/30/24/3476/2422268
![Input Genomes](./Figures/InputGenomes.png){width=100%}
## Pipeline
1. minigraph
2. Prepare the input
3. cactus
4. vg
5. View with Bandage
### Set up Directories
1. Make sure you're working in a **screen**
2. Make sure you've sourced the pangenomics environment file
```
source /home/pangenomics/pangenomics_env
```
3. Make Directory
```
mkdir cactus
```
4. Navigate to the Directory
```
cd cactus
```
5. Copy the data (don't symlink it)
```
cp -r /home/pangenomics/data/yprp/assemblies .
```
*Note:* Don't use "ln -s /home/pangenomics/data/yprp/ ."
## Yeast Data
Reference:
+ S288C
Using all 12 YPRP assemblies
## Preparing the Input
**(already done for you)**
1. FASTA files
+ Chromosome names should be unique across files
+ We’re using:
```
<strain name>.<chromosome>*
```
E.g. `>S288C.chrI`
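
The naming scheme above could be applied with a one-line `sed` rewrite of the FASTA headers. This is only a minimal sketch on a toy file: `example.fa` and its sequences are made up for illustration, and the workshop assemblies already follow the convention.

```shell
# Toy FASTA with bare chromosome names (made-up data for illustration)
strain="S288C"
printf '>chrI\nACGT\n>chrII\nGGCC\n' > example.fa

# Prefix every header with "<strain name>." so names are unique across files
sed "s/^>/>${strain}./" example.fa > "${strain}.example.fa"

# List the renamed headers
grep '^>' "${strain}.example.fa"
```
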
## minigraph
Use the graph we previously built: yprp.minigraph.gfa
## Preparing the Input (exercise)
1. Cactus seqFile tells Cactus where to load sequences from
+ Maps sequence names to file paths
+ We’re using:
```
<strain name>\t<path to sequence>
```
+ Must include a `_MINIGRAPH_` entry for the path to the minigraph GFA
E.g., a seqFile with two entries, one per line, with name and path separated by a tab:
+ `S288C ./S288C.genome.fa`
+ `_MINIGRAPH_ yprp.minigraph.gfa`
+ It’s recommended the minigraph contains all the sequences in the seqFile
2. Call it **yprp.seqFile.txt**
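
One way to build such a file is to loop over the FASTA files and print a name/path pair for each. The sketch below is an assumption-laden illustration, not the workshop's solution: `make_seqfile` is a hypothetical helper, it assumes the FASTAs are named `<strain>.genome.fa`, and the demo runs on empty placeholder files in a temporary directory.

```shell
# Print a Cactus seqFile: one "<name>\t<path>" line per FASTA, then the _MINIGRAPH_ entry.
# Assumes FASTAs are named <strain>.genome.fa (hypothetical layout).
make_seqfile() {
  fasta_dir="$1"
  gfa="$2"
  for fasta in "$fasta_dir"/*.genome.fa; do
    [ -e "$fasta" ] || continue   # skip the unexpanded glob when nothing matches
    printf '%s\t%s\n' "$(basename "$fasta" .genome.fa)" "$fasta"
  done
  printf '_MINIGRAPH_\t%s\n' "$gfa"
}

# Demo on placeholder files (empty stand-ins for real assemblies)
demo_dir=$(mktemp -d)
touch "$demo_dir/S288C.genome.fa" "$demo_dir/SK1.genome.fa"
make_seqfile "$demo_dir" yprp.minigraph.gfa > "$demo_dir/yprp.seqFile.txt"
cat "$demo_dir/yprp.seqFile.txt"
```

In the workshop directory the equivalent call would point at the copied `assemblies` directory instead of `$demo_dir`.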
## Cactus
1. Align each input FASTA to the minigraph (2min):
```
cactus-graphmap jobStore yprp.seqFile.txt yprp.minigraph.gfa \
    yprp.cactus.paf --outputFasta yprp.minigraph.gfa.fa \
    --maxCores 20
```
+ **jobStore**
+ a directory where intermediate files should be stored (shouldn’t exist)
+ **yprp.seqFile.txt**
+ text file mapping sequence names to file paths
+ **yprp.minigraph.gfa**
+ the graph constructed by minigraph
+ **yprp.cactus.paf**
+ what to name the output pairwise mapping file
+ **yprp.minigraph.gfa.fa**
+ a FASTA to output the GFA’s sequence to
*NOTE:* This command modifies the seqFile. Make a copy before running!
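
The two cautions above (the jobStore must not exist, and the seqFile gets modified) can be checked before launching the job. The following is a rough sketch only: `preflight` is a hypothetical helper, not part of Cactus, and the `.orig` suffix for the backup is an arbitrary choice.

```shell
# Pre-flight check before cactus-graphmap (hypothetical helper, not part of Cactus)
preflight() {
  seqfile="$1"
  jobstore="$2"
  if [ ! -f "$seqfile" ]; then
    echo "missing seqFile: $seqfile" >&2
    return 1
  fi
  if [ -e "$jobstore" ]; then
    echo "jobStore already exists: $jobstore" >&2
    return 1
  fi
  # Keep a pristine copy (made only once), since cactus-graphmap rewrites the seqFile
  [ -e "$seqfile.orig" ] || cp "$seqfile" "$seqfile.orig"
}

preflight yprp.seqFile.txt jobStore || echo "resolve the problem above before running cactus-graphmap"
```
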
2. Generate multiple alignment and VG graph (19min):
*NOTE*: Don’t run this command if you intend to use cactus-graphmap-split
```
cactus-align jobStore yprp.seqFile.txt yprp.cactus.paf \
    yprp.cactus.hal --pangenome --pafInput --outVG \
    --reference S288C --maxCores 20
```
+ **yprp.cactus.hal**
+ what to name the output multiple sequence alignment file
+ **--reference S288C**
+ the name of the reference in the seqFile (should be the same as the reference in the minigraph GFA)
## Preparing the Input (exercise)
1. The Cactus contigs file tells Cactus which contigs to split the graph on
+ A list of all the S288C contigs
+ Call it S288C.contigs.txt
### Preparing the Input (solution)
Make a reference contigs file:
```
grep -Po "^>\K.*" ~/cactus/yprp/assemblies/S288C.genome.fa \
    > S288C.contigs.txt
```
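
The `-P` option used above is a GNU grep extension. Where it is unavailable, `awk` offers a portable alternative. The sketch below runs on a made-up toy FASTA and, as an assumption, keeps only the first whitespace-delimited word of each header as the contig name.

```shell
# Toy FASTA standing in for S288C.genome.fa (made-up sequences)
printf '>S288C.chrI\nACGT\n>S288C.chrII\nGGCC\n' > toy.S288C.fa

# For header lines, strip the leading ">" and print only the first word
awk '/^>/ { sub(/^>/, "", $1); print $1 }' toy.S288C.fa > toy.contigs.txt
cat toy.contigs.txt
```
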
## Cactus Split
1. Align each input FASTA to the minigraph and split by reference chromosome (4min):
```
cactus-graphmap-split jobStore yprp.seqFile.txt \
    yprp.minigraph.gfa yprp.cactus.paf \
    --refContigsFile S288C.contigs.txt --reference S288C \
    --outDir chroms --maxCores 20
```
+ **yprp.seqFile.txt**
+ modified version from previous *cactus-graphmap* command
+ **yprp.cactus.paf**
+ output from previous *cactus-graphmap* command
+ **--refContigsFile**
+ the names of the chromosomes in the reference FASTA
+ **--reference**
+ the name of the reference in the seqFile
+ **--outDir**
+ where the split outputs should be placed
2. Generate multiple alignment and graph for a chromosome (35min total):
```
cactus-align jobStore chroms/seqfiles/S288C.chrI.seqFile \
    chroms/S288C.chrI/S288C.chrI.paf \
    yprp.S288C.chrI.cactus.hal --pangenome --pafInput --outVG \
    --reference S288C --maxCores 20
```
+ We recommend automating this for every chromosome with a bash script;
+ see: /home/pangenomics/scripts/cactus-align-chromosomes.sh
*NOTE*: Other options are:
+ cactus-align --batch
+ cactus-align-batch
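
A per-chromosome loop could look roughly like the sketch below. It is not the contents of the workshop script mentioned above: `print_align_commands` is a hypothetical helper that only prints the commands (a dry run), the file locations mirror the chrI example, and the separate `jobStore-<contig>` directory per chromosome is an assumption made so that each run starts with a jobStore that doesn't exist yet.

```shell
# Print one cactus-align command per reference contig (dry run; nothing is executed).
# Paths follow the chrI example; the per-contig jobStore name is an assumption.
print_align_commands() {
  contigs_file="$1"
  while IFS= read -r contig; do
    [ -n "$contig" ] || continue   # skip blank lines
    printf 'cactus-align jobStore-%s chroms/seqfiles/%s.seqFile chroms/%s/%s.paf yprp.%s.cactus.hal --pangenome --pafInput --outVG --reference S288C --maxCores 20\n' \
      "$contig" "$contig" "$contig" "$contig" "$contig"
  done < "$contigs_file"
}

# Demo with a two-contig list (stand-in for S288C.contigs.txt)
printf 'S288C.chrI\nS288C.chrII\n' > demo.contigs.txt
print_align_commands demo.contigs.txt
```

Once the printed commands look right, piping the output to `bash` would run the alignments one after another.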
## Viewing with Bandage
View **one chromosome at a time** with Bandage