This repository has been archived by the owner on Apr 16, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
10-reference-free-graphs-pggb.Rmd
262 lines (183 loc) · 7.53 KB
/
10-reference-free-graphs-pggb.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
# Reference-Free Graphs with *pggb*
## pggb
https://github.com/pangenome/pggb
+ All-pairs whole genome alignment
+ Induces a graph from the alignments
## PanGenome Graph Builder
pggb is built on the idea that a pangenome graph represents an alignment of the genomes in the graph (this is literally what Cactus does), but infers the graph from all pairwise alignments instead of a multiple alignment.
pggb computes all pairwise alignments efficiently by focusing on long, colinear homologies, instead of using the more traditional k-mer matching alignment approach.
Critically, pggb performs graph *normalization* to ensure that paths through the graph (e.g. chromosomes) have a linear structure while allowing for cyclic graph structures that capture structural variation.
![Input Genomes](./Figures/pggbFlowDiagram.png){width=100%}
## Reference-Free Graphs
https://academic.oup.com/bioinformatics/article/30/24/3476/2422268
![Input Genomes](./Figures/InputGenomes.png){width=100%}
## pggb Algorithm
1. All-pairs align genomes with [wfmash](https://github.com/waveygang/wfmash)
2. Convert alignments into a graph using [seqwish](https://github.com/ekg/seqwish)
3. Progressively normalize graph with [smoothxg](https://github.com/pangenome/smoothxg) and [gfaffix](https://github.com/marschall-lab/GFAffix)
## Pipeline
1. Prepare the input
3. pggb
4. View with Bandage
### Set up Directories
1. Make sure you're working in a **screen**
2. Load the pggb conda environment:
```
conda activate pggb
```
3. Make Directory
```
mkdir ~/pggb
```
4. Navigate to the Directory
```
cd ~/pggb
```
5. Link to data
```
ln -s /home/data/pangenomics-2402/yprp/ .
```
6. When you want to leave the conda environment, use:
```
conda deactivate
```
## Yeast Data
Reference:
+ S288C
Using all 12 YPRP assemblies
## Preparing the Input (exercise)
1. Create a FASTA file containing all the yprp assemblies. Call it `yprp.all.fa`.
2. Create a FASTA file containing chromosome VIII from every assembly. Call it `yprp.chrVIII.fa`.
3. How can you confirm that the contents of these files is correct?
### Preparing the Input (solution)
1. `cat yprp/assemblies/*.fa > yprp.all.fa`
2. `awk 'BEGIN{RS=">";FS="\n"} NR>1{fnme=$1".fa"; print ">" $0 > fnme; close(fnme);}' yprp/assemblies/*.genome.fa && cat *.chrVIII.fa > yprp.chrVIII.fa` && rm `ls | grep -v yprp`
3. `grep -c ‘>’ yprp.all.fa`
## Preparing the Input
[bgzip](https://www.htslib.org/doc/bgzip.html) the FASTA files (~1min):
```
bgzip -c yprp.all.fa > yprp.all.fa.gz
bgzip -c yprp.chrVIII.fa > yprp.chrVIII.fa.gz
```
+ **-c**
+ output the bgzipped file to standard output
+ **>**
+ redirect the standard output into a file
Index the bgzip files with [samtools](http://www.htslib.org/doc/samtools.html) [faidx](http://www.htslib.org/doc/samtools-faidx.html):
```
samtools faidx yprp.all.fa.gz
samtools faidx yprp.chrVIII.fa.gz
```
## Running pggb on Chromosome VIII
Build a graph containing all the yprp assemblies (2min):
```
pggb build -i yprp.chrVIII.fa.gz -o output_chrVIII -n 12 -t 20 -p 95
```
+ **-i yprp.chrVIII.fa**
+ an input FASTA containing all sequences
+ **-o output_chrVIIII**
+ the directory where all output files should be placed
+ **-n 12>**
+ the number of haplotypes (assemblies) in the input file
+ **-t 20**
+ the number of threads to use
+ **-p 95**
+ minimum sequence identity of alignment segments
+ **-s 5000**
+ nucleotide segment length when scaffolding the graph
NOTE: These arguments were taken from the [pggb paper](https://github.com/pangenome/pggb-paper/blob/main/workflows/AllSpecies.md). Refer to the paper for other species.
Create a copy of the output graph with a simpler name
```
cp output_chrVIII/yprp.chrVIII.fa.gz.<gibberish>.smooth.final.gfa yprp.chrVIII.pggb.gfa
```
## Viewing with Bandage
~~View your Chromsome VIII graph with Bandage (exercise):~~
1. ~~Find CUP1 and YHR054C~~
2. ~~Take a screenshot~~
**pggb graphs are too detailed to view all of Chromosome VIII in Bandage.**
Convert the GFA to VG (<1min):
```
vg convert yprp.chrVIII.pggb.gfa -v > yprp.chrVIII.pggb.vg
```
+ **-v**
+ convert input to vg format
Create a subgraph only containing the CUP1 region (<1min):
```
vg chunk -p S288C.chrVIII:212000-230000 -x yprp.chrVIII.pggb.vg -c 10 > yprp.chrVIII.pggb.cup1_chunk.vg
```
+ **-p S288C.chrVIII:212000-230000**
+ the path (region) to build the chunk from
+ **-x yprp.chrVIII.pggb.vg**
+ the graph to chunk
+ **-c 10**
+ expand the context of the chunk by 10 node steps
Convert the subgraph into a GFA for viewing (<1min):
```
vg view yprp.chrVIII.pggb.cup1_chunk.vg -g > yprp.chrVIII.pggb.cup1_chunk.gfa
```
View your Chromsome VIII chunk graph with Bandage (exercise):
1. ~~Find CUP1 and YHR054C~~
2. ~~Take a screenshot~~
## BLAST the graph manually
Create a FASTA file containing the graph sequence
```
gfatools gfa2fa yprp.chrVIII.pggb.gfa > yprp.chrVIII.pggb.fa
```
Copy the FASTA to you computer
```
scp pangenomics-2402:~/pggb/yprp.chrVIII.pggb.fa .
```
Build a BLAST database for the FASTA
```
makeblastdb -in yprp.chrVIII.pggb.fa -input_type fasta -dbtype nucl
```
+ **-in yprp.chrVIII.pggb.fa**
+ the file to build a database for
+ **-input_type fasta**
+ the input file is a FASTA
+ **-dbtype nucl**
+ the type of sequence in the input file is DNA
Query the database for [CUP1](https://www.yeastgenome.org/locus/S000001095) and [YHR054C](https://www.yeastgenome.org/locus/S000001096)
```
blastn -db yprp.chrVIII.pggb.fa -query S288C_YHR053C_CUP1-1_genomic.fsa
```
View your Chromsome VIII chunk graph with Bandage (exercise):
1. Find the CUP1 and YHR054C BLAST hits by node ID
2. Take a screenshot
## Exercises
1. Use vg to index the chromosome VIII graph
2. Use vg to map SK1 reads to the chromosome VIII graph
3. Use vg to call variants on chromosome VIII read mapping GAMS
## Running pggb on all Chromosomes
Partition the sequences before building the graph (<1min):
```
partition-before-pggb -i yprp.all.fa.gz -o output_all -n 12 -t 20 -p 95 -s 5000
```
This partitions the input FASTA into smaller FASTAs containing sequences that should be in the same subgraph:
+ Will likely correspond to chromosomes if you have complete assemblies
+ May improve run-time of normalization step and make downstream analysis easier
+ Consider skipping if your assemblies/organism has complex structure you want represented in the graph, e.g. polyploidy, translocations, etc.
This will print a `pggb` command for every partition to the command line.
+ Commands are also written to a log file: `output_all/yprp.all.fa.gz.<gibberish>.log`
### Exercise
1. Make a bash script to run all of the generated commands
+ Run the command **NOT DURING THE WORKSHOP** to build the graphs (~30min)
## How do you handle larger graphs? Nextflow
[Nextflow](https://www.nextflow.io/docs/latest/index.html) - Software for running workflows in parallel / distributed computing environments
+ Supports Singularity and Docker containers
+ Nextflow pipelines are open-source
+ You can use other people’s pipelines
+ You can build and share your own pipelines
+ [nf-core/pangenome](https://nf-co.re/pangenome) contains pggb along with other pangenome software
## Pros and Cons
Pros:
+ Graphs are truly reference-free
+ Includes normalization step
+ Especially good for species with complex structural variation, such as plants
Cons:
+ Normalization step is currently a bottleneck
+ Partitioning removes inter-chromosome structural variation
+ Graphs contain lots of variation
+ Makes the files large
+ Makes them hard to view with Bandage
+ Can't BLAST GFA within Bandage