*** Frequently Asked Questions ***
** Contents **
-How do I run pindel?
-Compile time errors
-Runtime problems
-What do the terms and numbers in the Pindel output file (sample_D,
sample_SI, sample_TD etc.) and in the VCF output file mean?
-Contacting the authors
** How do I run pindel? **
Typing "./pindel" at the command line will show you all possible pindel
options, which are many.
What you usually need to do, however, is:
a) know where your bam file is/bam files are
b) create a small text file with one line for each bam file you want to
analyze. Each line consists of the bam file name, whitespace (like a tab),
the insert size, whitespace again, and your name for the sample. You
should be able to get the insert size from the person who sequenced the
data; if you really do not or cannot know it, 500 is a decent default
value.
Example:
pop1_ind1.bam 450 P1_1
pop1_ind2.bam 450 P1_2
pop1_ind3.bam 450 P1_3
c) run pindel with its basic parameters: -f to specify the reference fasta
file, -i to specify the input text file you just created, and -o
to specify the output filename. For example, with the test data delivered
with the Pindel package, this would look like:
./pindel -f test/SmallTest/sim1chrVs2.fa -i test/SmallTest/sim1chrVs2.conf_for_demo -o simulated_test.out
d) You will now see a lot of files appearing in the directory you started
Pindel from, for example simulated_test.out_D, which contains the
deletions found in the samples. _SI contains the short insertions, _INV
the inversions, and _TD the tandem duplications. There are also some
other output files, which are only filled when you use specific options
(for example, -l makes Pindel output long insertions as well). The output
files are ordinary text files, so you can view them with any text editor
of your choosing to get detailed information on the SVs detected. You can
also transform them into VCF files by using the pindel2vcf tool; how to
use it is explained later in this FAQ. You can find other examples of how
to run Pindel and pindel2vcf in the demo directory.
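Steps b) and c) above can be sketched in a few lines of Python. This is only
an illustration, not part of Pindel: the helper names (config_text,
pindel_command) and the file names reference.fa, pindel_config.txt and
my_run.out are made up for the example. Only the -f, -i and -o options
described above are used.

```python
import shlex

def config_text(samples):
    """Return the config file content: bam name, insert size and sample
    label on one tab-separated line per bam file."""
    lines = [f"{bam}\t{insert_size}\t{label}" for bam, insert_size, label in samples]
    return "\n".join(lines) + "\n"

def pindel_command(reference, config, output_prefix, pindel="./pindel"):
    """Assemble the basic Pindel call from step c) as an argument list."""
    return [pindel, "-f", reference, "-i", config, "-o", output_prefix]

# The three samples from the example under b).
samples = [
    ("pop1_ind1.bam", 450, "P1_1"),
    ("pop1_ind2.bam", 450, "P1_2"),
    ("pop1_ind3.bam", 450, "P1_3"),
]

text = config_text(samples)
print(text, end="")

# Hypothetical file names; replace them with your own.
cmd = pindel_command("reference.fa", "pindel_config.txt", "my_run.out")
print(shlex.join(cmd))
```

Save the printed config lines as the file you pass to -i, then run the
printed command from the directory that contains the pindel executable.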
** Compile time errors **
* Contents *
-Problems compiling pindel on OS X /
"fatal error: <omp.h> file not found"
-"fatal error: khash.h: No such file or directory"
* Problems compiling pindel on OS X * /
* "fatal error: <omp.h> file not found"
For speed, Pindel uses the OpenMP library to allow multithreaded performance.
However, the default compiler on OS X does not seem to support OpenMP.
This problem can be tackled by applying the following steps:
1) check if you have homebrew installed on your computer
[if you don't know whether you have homebrew, just type the command for
the next step, "brew reinstall gcc --without-multilib". If brew is an
unknown command, you need to install it.]
To install homebrew, google "homebrew", or go to http://brew.sh/, or use
the ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" command.
2) Ensure you have a proper gcc installation without multilib, by using the
following command:
"brew reinstall gcc --without-multilib"
3) Allow pindel to use the real GCC (of homebrew), instead of the 'fake' gcc
which is default for OS X, by going to pindel/src and using
make clean
make CXX=g++-4.9
[or any other number than 4.9, just whatever your homebrewed gcc version
number is]
4) go back to the ./pindel directory and again try the ./INSTALL command. If that
does not work, please contact us.
* "fatal error: khash.h: No such file or directory" *
At the time of this FAQ update (Feb 20, 2016) we do not know exactly what
causes this issue. It may be worth checking the htslib directory that you
give to Pindel's ./INSTALL, for example by changing the path into a relative
path, or by cloning htslib directly. In any case, one way to solve this is
to make a (relatively) clean htslib/pindel install. Try the following steps:
1) create a directory (for example "mkdir pindeltest", then
"cd pindeltest")
2) clone pindel into it ("git clone https://github.com/genome/pindel")
3) clone htslib into it ("git clone https://github.com/samtools/htslib")
4) go to the htslib directory ("cd htslib")
a) (optional) One user replaced the value of prefix in the htslib Makefile
by the htslib path (/usr/local/Apps/pindeltest/htslib). I did not do so
myself and am not certain that this is needed, but you could definitely
try it if the following steps do not work otherwise.
b) "make", "sudo make install"
5) go to the pindel directory and install ("cd ../pindel",
"./INSTALL ../htslib")
6) if all things have gone well, you can now run pindel ("./pindel")
** Runtime problems **
* Contents *
- Memory usage of Pindel is very high
- Pindel runs (too) slowly
- Pindel does not output VCF files
- pindel2vcf4tcga does not produce valid VCF files
* Memory usage of Pindel is very high *
The most common reason for high memory usage of Pindel is that Pindel tries
to process the centromere regions, which contain lots of 'weird' reads. In
such cases, it is best to use the -j or -J options to specify which
chromosomal regions should be searched by pindel (-j) or skipped (-J). Note
that pindel by default also checks for reads just outside the edges of the
officially specified regions, so in some cases you may need to enlarge the
excluded centromeres, telomeres or other problematic regions a bit (by 10k
or so).
The window size option (-W) can also reduce memory usage. This can be needed
for very large data sets.
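The advice above to enlarge excluded regions by about 10k can be sketched as
follows. This is a minimal illustration only: the function name pad_regions
and the coordinates are made up, and the sketch does not write the file that
-J expects (check ./pindel without arguments for that).

```python
def pad_regions(regions, margin=10_000):
    """Extend each (chromosome, start, end) region by `margin` bases on
    both sides, without letting the start drop below 0."""
    padded = []
    for chrom, start, end in regions:
        padded.append((chrom, max(0, start - margin), end + margin))
    return padded

# Hypothetical coordinates, for illustration only; these are NOT real
# centromere or telomere positions.
problem_regions = [("20", 26_000_000, 29_000_000), ("21", 5_000, 14_000_000)]

for chrom, start, end in pad_regions(problem_regions):
    print(chrom, start, end)
```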
* Pindel runs (too) slowly *
Pindel can run quite slowly if there are many reads to process; long
runtimes are in part caused by high coverage. However, there are some
options to speed things up:
1) eliminate 'unproductive' searching by excluding the centromeric and
telomeric regions from the search process. This can be done with the -j and
-J options.
2) Parallelize: there are two ways in which to speed up Pindel using
parallelization:
a) allow Pindel to use multithreading (multiple processors/processor cores)
by using the -T option (default is 1; setting it to 2, 4 or 6 may help,
depending on your system)
b) submit jobs per chromosome or half-chromosome in parallel to a cluster.
This can be done by using the -c option to specify a chromosome (like
-c 20) or a chromosome region (-c 20:500000-2000000)
3) lower the search thoroughness. This can be done in many ways, check the
pindel options (./pindel without any arguments) for further information.
Some of the options are lowering the sensitivity (-E), or disabling
genotyping (not using -g)
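As a rough sketch of option 2b), the following Python snippet splits a
region into chunks and builds one Pindel command per chunk, using only the
-f, -i, -o, -c and -T options mentioned in this FAQ. The helper names, the
file names and the output naming scheme are assumptions made for the
example; how you submit the commands depends on your cluster.

```python
import shlex

def split_region(chrom, start, end, n_chunks):
    """Split start..end on one chromosome into n_chunks consecutive,
    non-overlapping region strings in the form used by -c."""
    size = (end - start + 1) // n_chunks
    regions = []
    for i in range(n_chunks):
        chunk_start = start + i * size
        # The last chunk absorbs any remainder.
        chunk_end = end if i == n_chunks - 1 else chunk_start + size - 1
        regions.append(f"{chrom}:{chunk_start}-{chunk_end}")
    return regions

def region_commands(reference, config, output_prefix, regions, threads=4):
    """Build one Pindel argument list per region; each job gets its own
    output prefix so the jobs do not overwrite each other's files."""
    commands = []
    for region in regions:
        out = f"{output_prefix}_{region.replace(':', '_')}"
        commands.append(["./pindel", "-f", reference, "-i", config,
                         "-o", out, "-c", region, "-T", str(threads)])
    return commands

regions = split_region("20", 500000, 2000000, 2)
# Hypothetical file names; replace them with your own.
for cmd in region_commands("reference.fa", "pindel_config.txt", "my_run", regions):
    print(shlex.join(cmd))
```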
* Pindel does not output VCF files *
Pindel itself does not output VCF files. Instead, it outputs a file format
that gives far more detail on the called SVs than would be possible with a
VCF file. This has the advantage that one can manually check promising SVs
to see whether they are likely a real SV or just some artifact of
sequencing (like an extra base in a homopolymer). The disadvantage is of
course that most genomic tools need a standard format like VCF. Therefore,
we provide a conversion utility named pindel2vcf. You can find it as an
executable file in the same directory as Pindel itself.
It can be used as follows:
pindel2vcf -p sample3chr20_D -r human_g1k_v36.fasta -R 1000GenomesPilot-NCBI36
-d 20101123 -v sample3chr20_D.vcf
where the -p option is the name of the pindel output file, -r the name of the
reference fasta file, -R the official name of the reference, and -d the date
at which the reference was originally created. If you do not care about
submitting things to official archives, you can just make up the last two
and fill in things like -R x and -d 00000000, though I would recommend
taking the time and effort to find them out.
In any case, pindel2vcf will produce a vcf file that can be used by
other tools.
Note that pindel2vcf has many options for filtering SVs and also an
option for outputting a GATK-format VCF. Run ./pindel2vcf to see all the
available options.
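If you want to convert several Pindel output files in one go, the call above
can be repeated per file. The Python sketch below builds one pindel2vcf
command for each of the _D, _SI, _INV and _TD files, using only the -p, -r,
-R, -d and -v options described above. The helper name and the default list
of suffixes are choices made for this example, not something pindel2vcf
requires.

```python
import shlex

def pindel2vcf_commands(output_prefix, reference, reference_name,
                        reference_date,
                        suffixes=("_D", "_SI", "_INV", "_TD")):
    """Build one pindel2vcf argument list per Pindel output file."""
    commands = []
    for suffix in suffixes:
        pindel_file = output_prefix + suffix
        commands.append(["./pindel2vcf", "-p", pindel_file,
                         "-r", reference, "-R", reference_name,
                         "-d", reference_date,
                         "-v", pindel_file + ".vcf"])
    return commands

# Values taken from the example call above.
for cmd in pindel2vcf_commands("sample3chr20", "human_g1k_v36.fasta",
                               "1000GenomesPilot-NCBI36", "20101123"):
    print(shlex.join(cmd))
```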
* Pindel2vcf4tcga does not produce valid VCF files *
Pindel2vcf4tcga, despite its name, does not produce valid VCF files. This
is because valid TCGA vcf files require a lot of input data that Pindel2vcf
cannot know (like the version and settings of the alignment software).
Moreover, even if all that data were available, the file produced would
likely still be unacceptable to the TCGA database, as the TCGA database only
accepts files from 'accepted' groups and projects. Regular Pindel users
would therefore find pindel2vcf4tcga of no use at all.
For now, pindel2vcf4tcga is basically a 'half-converter' which converts
pindel files into a semi-TCGA-vcf format, which can be processed by certain
scripts that TCGA partners possess into full-blown TCGA files. This does not
mean that pindel2vcf4tcga will never work as a 'stand-alone' script, but
for now it is simply for use by certain TCGA-affiliated users; regular
Pindel users should not try to apply it to their own Pindel output.
** What do the terms and numbers in the Pindel output file (sample_D,
sample_SI, sample_TD etc.) and in the VCF output file mean? **
* Contents *
-In the Pindel output file, what do the six numbers after every sample name
mean?
* In the Pindel output file, what do the six numbers after every sample name mean? *
In the Pindel output file (sample_D, sample_SI etc.) each event is
represented by one line of data about the event, and multiple lines
representing raw reads. The data line starts with an index (the 0th event,
1st event, 2nd event etc.) followed by an SV type (D means deletion, for
example), and ends with a group of seven items for each sample,
consisting of the sample name and six numbers, like
"SIMCHRVS2 392 12 1 1 0 0".
What do those numbers mean?
The first number is the read depth (as calculated by samtools) at the
left breakpoint of the SV, which indicates how many reads nicely map to
(cover) the left breakpoint. If this number is high (or at least much higher
than the last four numbers), then this region maps 'normally', which
indicates that this SV may be a false positive, though there are still
cases in which you may want to check it further. Somatic mutations
in cancer, for example, may have much lower alt coverage than the
'reference coverage' given by the first two numbers, but still be a real
and possibly damaging mutation; it is just not present in all cells in
the sample.
Similarly, the second number represents the read depth at the right
breakpoint of the SV.
The third and fourth numbers represent the number of reads on the +
strand mapping to the alternative allele, and the number of unique reads
on the + strand mapping to the alternative allele. The unique count is
given because in some cases reads have exactly the same sequence and
the same start and end positions; this could be a coincidence, but often
it is just an artifact from PCR/sequencing, and does not indicate extra
certainty that the event exists. The fourth number basically gives
the number of truly independent reads supporting the event, so it may be
more reliable than the third number in ascertaining how likely it is that
the alternative allele really exists.
The fifth and sixth numbers are similar to the third and fourth numbers;
they represent the total and unique numbers of reads that map to the
- strand and support the alternative allele.
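To make the meaning of the six numbers concrete, here is a small Python
sketch that parses one per-sample group such as "SIMCHRVS2 392 12 1 1 0 0".
It is an illustration only: the class and field names are made up for this
example, and it assumes the group has already been cut out of the data line
and is whitespace-separated as shown above.

```python
from dataclasses import dataclass

@dataclass
class SampleSupport:
    name: str
    depth_left: int    # read depth at the left breakpoint
    depth_right: int   # read depth at the right breakpoint
    plus_reads: int    # + strand reads supporting the alternative allele
    plus_unique: int   # unique + strand reads supporting it
    minus_reads: int   # - strand reads supporting the alternative allele
    minus_unique: int  # unique - strand reads supporting it

    @property
    def unique_support(self):
        """Unique supporting reads summed over both strands."""
        return self.plus_unique + self.minus_unique

def parse_sample_group(text):
    """Parse a sample name followed by six whitespace-separated integers."""
    fields = text.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {text!r}")
    return SampleSupport(fields[0], *(int(value) for value in fields[1:]))

sample = parse_sample_group("SIMCHRVS2 392 12 1 1 0 0")
print(sample.name, sample.depth_left, sample.depth_right, sample.unique_support)
```

For the example group this gives a depth of 392 at the left breakpoint, 12
at the right breakpoint, and one unique read supporting the alternative
allele.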
** Contacting the authors **
For any pindel-related questions, please contact Kai Ye, [email protected]
If you have questions related to pindel2vcf, you may also contact Eric-Wubbo Lameijer, [email protected]