This repository has been archived by the owner on Apr 16, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
11-cuttlefish.Rmd
199 lines (123 loc) · 6.02 KB
/
11-cuttlefish.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
# Cuttlefish
"Cuttlefish is a program to produce the compacted de Bruijn graph from sequencing reads or reference sequences."
<center>
![<font size="0.5">David Sim from London, United Kingdom, CC BY 2.0 <https://creativecommons.org/licenses/by/2.0>, via Wikimedia Commons</font>](./Figures/Cuttlefish.jpeg){width=50%}
</center>
Cuttlefish Code
https://github.com/COMBINE-lab/cuttlefish
Original Cuttlefish
https://academic.oup.com/bioinformatics/article/37/Supplement_1/i177/6319696?login=false
Cuttlefish 2
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02743-6
## Colored compact De Bruijn Graphs
Cuttlefish produces colored compact de Bruijn graphs. The "color" is the paths of each assembly through the graph. It is really fast, memory efficient, and highly scalable.
<center>
![<font size="0.5">https://www.biorxiv.org/content/10.1101/2020.03.06.980979v1.full.pdf</font>](./Figures/cdbg.png){width=50%}
</center>
## Kmers
It uses a version of KMC3, a fast and low memory algorithm for identifying and manipulating kmers. KMC3 and Cuttlefish have an overlaping set of authors.
<center>
![KMC3](./Figures/KMC3.png){width=80%}
</center>
Canonical kmers are used (kmers that are reverse complements of each other are treated as the same kmer).
Kmers must be odd. Max kmer sizes are 63 in the default installation but can be configured up to 127 when installing.
## Cuttlefish 1 vs 2
Cuttlefish 1 and Cuttlefish 2 are packaged together. Some parameters are specific for each version and you cannot choose parameters specific to version 1 and to version 2 in the same run. Cuttlefish 2 is also supposed to be faster, though it didn't seem to be on the yeast data.
**Cuttlefish 1**
Input = assembled reference sequences
Output = GFA (version 1 or 2), reduced GFA, fasta
**Cuttlefish 2**
Input = sequencing reads or reference sequences (not both)
Output = fasta (others coming soon)
A few notes on output formats:
+ The fasta format for both versions is the sequence of the maximal unitigs (non-branching paths) of the de Bruijn graph
+ When using GFA, we'll use GFA1, which is more appropriate for pangenomics. More info at the websites below if you want to learn more.
+ http://gfa-spec.github.io/GFA-spec/GFA2.html
+ https://github.com/GFA-spec/GFA-spec/issues/49
+ The reduced GFA format has 2 files with extensions .cf_seg (maximal unitigs = GFA1 segments) and .cf_seq (tiling with orientation = GFA1 paths).
"For moderate to large sized genomes, this [reduced-GFA] output format is preferrable to the GFA ones as the GFA formats can be quite verbose for this particular scenario, while the reduced representation provides effitively the same information, while taking much less space. For example, for the 7-human genome dataset (experimented with in the manuscripts) and using k = 31, the compacted graph takes 112 GB in GFA2, but only 29.3 GB in this reduced format."
https://github.com/COMBINE-lab/cuttlefish#io-formats
## Running Cuttlefish
### Increase the ulimit {-}
Cuttlefish often exceeds the open file-handle limit and gives an error that is hard to decipher. We'll increase it before we run.
```{bash, eval=FALSE}
ulimit -n 2048
```
### Input data {-}
Cuttlefish requires a single fasta with all the genomes (wildcards that point to all the indiviual genome fasta files result in only the first one being run).
```{bash, eval=FALSE}
mkdir ~/cuttlefish
cd ~/cuttlefish
cat /home/data/pangenomics-2402/yprp/assemblies/*fa > 12genomes.fa
```
### Parameters {-}
You can see all the parameters with:
```{bash, eval=FALSE}
cuttlefish build
```
or
```{bash, eval=FALSE}
cuttlefish build --help
```
Here are the parameters that we'll focus on:
>-s input fasta
>-t threads (we'll use 6; default 22)
>-k kmer size (we'll use 63)
>-o output
>-f format (0: FASTA, 1: GFA 1.0, 2: GFA 2.0, 3: GFA-reduced)
### GFA 1.0 {-}
```{bash, eval=FALSE}
cuttlefish build -s 12genomes.fa -t 6 -k 63 -o 12genomes.k63.f1 -f 1
```
List the files with the newest ones last
```{bash, eval=FALSE}
ls -ltr
```
Take a look at the json file
```{bash, eval=FALSE}
cat 12genomes.k63.f1.json
```
What line types are there in the GFA output and how many are there of each?
```{bash, eval=FALSE}
cut -f 1 12genomes.k63.f1.gfa1 | sort | uniq -c
```
How much overlap is there between each segment (Hint: look at the end of the link lines)?
Note that as we run different formats, we have to change the output name because it will recognize that the json file is already printed and it will exit.
### reduced GFA {-}
```{bash, eval=FALSE}
cuttlefish build -s 12genomes.fa -t 6 -k 63 -o 12genomes.k63.f3 -f 3
```
What files were output (list the files with the newest first)?
What is in each of the files?
### FASTA {-}
We'll try the FASTA output with both Cuttlefish 1 and Cuttlefish 2.
**Cuttlefish 1**
```{bash, eval=FALSE}
cuttlefish build -s 12genomes.fa -t 6 -k 63 -o 12genomes.k63.f0 -f 0
```
**Cuttlefish 2**
```{bash, eval=FALSE}
cuttlefish build -s 12genomes.fa -t 6 -k 63 -o 12genomes.k63.ref --ref
```
How many sequences are there in each?
```{bash, eval=FALSE}
grep -c '>' 12genomes.k63.f0.fa
grep -c '>' 12genomes.k63.ref.fa
```
How long are the shortest and longest sequences?
```{bash, eval=FALSE}
grep -v '>' 12genomes.k63.f0.fa | awk '{print length($1)}' | sort -n |head -1
grep -v '>' 12genomes.k63.f0.fa | awk '{print length($1)}' | sort -n |tail -1
```
Now find the shortest and longest sequences for the fasta from cuttlefish 2.
## 1011 yeast data (127 genomes)
Create GFA1 Cuttlefish output (k=63) for the 1011 yeast assemblies (127 genomes). We'll keep track of how much time it takes by using the time command.
Link to the data (make sure you are using the file with all the assemblies in it.)
```{bash, eval=FALSE}
ln -s /home/data/pangenomics-2402/1011yeast/assemblies/1011genomes.fasta.gz .
```
Now run cuttlefish.
```{bash, eval=FALSE}
time cuttlefish build -s 1011genomes.fasta.gz -t 6 -k 63 -o 1011genomes.k63.f1 -f 1
```
We'll do more with the cuttlefish files, including looking at the CUP1 region in the next section.