FAQ

*** Frequently Asked Questions ***

** Contents **

-How do I run pindel?
-Compile time errors
-Runtime problems
-What do the terms and numbers in the Pindel output file (sample_D,
 sample_SI, sample_TD etc.) and in the VCF output file mean?
-Contacting the author(s)

** How do I run pindel? **

Typing "./pindel" at the command line will show you all possible pindel 
options, which are many. 

What you usually need to do, however, is the following:
a) know where your bam file(s) are
b) create a small text file with one line for each bam file you want to
   analyze. Each line consists of the bam file name, whitespace (like a
   tab), the insert size, whitespace again, and your name for the sample.
   You should be able to get the insert size from the person who sequenced
   the data; if you really cannot find it out, 500 is a decent default
   value.

Example:

pop1_ind1.bam	450	P1_1
pop1_ind2.bam	450	P1_2
pop1_ind3.bam	450	P1_3

c) run pindel with its basic parameters: -f to specify the reference fasta
   file, -i to specify the input text file you just created, and -o
   to specify the output filename prefix. For example, with the test data
   delivered with the Pindel package, this would look like:

./pindel -f test/SmallTest/sim1chrVs2.fa -i test/SmallTest/sim1chrVs2.conf_for_demo -o simulated_test.out

d) You will now see several files appearing in the directory you started
   Pindel from, for example simulated_test.out_D, which contains the
   deletions found in the samples. _SI contains the short insertions, _INV
   the inversions, and _TD the tandem duplications. There are also some
   other output files, which are filled when you use specific options
   (for example, -l makes Pindel output long insertions as well). The
   output files are ordinary text files, so you can view them with any
   text editor to get detailed information on the SVs detected. You can
   also transform them into VCF files with the pindel2vcf tool; how to use
   it is explained later in this FAQ. You can find other examples of how
   to run Pindel and pindel2vcf in the demo directory.
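
Steps b) and c) can also be scripted. The Python sketch below is only an
illustration: the helper names config_text and pindel_command, and the file
names reference.fa, samples.conf and my_run.out, are made up for this FAQ and
are not part of Pindel. It builds the contents of the text file from step b)
and the argument list for the basic call from step c).

```python
# Illustrative helpers; config_text and pindel_command are hypothetical
# names, not part of the Pindel package.

def config_text(samples, default_insert_size=500):
    """Return the config file contents: bam file, insert size, sample name.

    `samples` is a list of (bam_file, insert_size, sample_name) tuples.
    An insert size of None falls back to the default suggested in this FAQ.
    """
    lines = []
    for bam_file, insert_size, sample_name in samples:
        if insert_size is None:
            insert_size = default_insert_size
        lines.append(f"{bam_file}\t{insert_size}\t{sample_name}")
    return "\n".join(lines) + "\n"


def pindel_command(reference, config_file, output_prefix, pindel="./pindel"):
    """Return the basic pindel call (-f, -i, -o) as an argument list."""
    return [pindel, "-f", reference, "-i", config_file, "-o", output_prefix]


samples = [
    ("pop1_ind1.bam", 450, "P1_1"),
    ("pop1_ind2.bam", 450, "P1_2"),
    ("pop1_ind3.bam", None, "P1_3"),  # insert size unknown: uses 500
]
text = config_text(samples)
# reference.fa, samples.conf and my_run.out are placeholder names.
command = pindel_command("reference.fa", "samples.conf", "my_run.out")
```

To use it, write text to the config file (for example with
open("samples.conf", "w")) and pass command to subprocess.run.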


** Compile time errors **

* Contents *

-Problems compiling pindel on OS X /
 "fatal error: <omp.h> file not found"

-"fatal error: khash.h: No such file or directory"

* Problems compiling pindel on OS X /
  "fatal error: <omp.h> file not found" *

For speed, Pindel uses the OpenMP library for multithreading. However, the
compiler that OS X provides by default does not seem to support OpenMP.

This problem can be tackled by applying the following steps:

1) Check whether you have homebrew installed on your computer.
   [If you don't know whether you have homebrew, just type the command for
   the next step, "brew reinstall gcc --without-multilib". If brew is an
   unknown command, you need to install it.]
   To install homebrew, google "homebrew", go to http://brew.sh/, or use
   the following command:
   ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

2) Ensure you have a proper gcc installation without multilib, by using the 
   following command:
   "brew reinstall gcc --without-multilib"

3) Allow pindel to use the real GCC (of homebrew), instead of the 'fake' gcc
   which is default for OS X, by going to pindel/src and using

   make clean
   make CXX=g++-4.9
   [or any other number than 4.9, just whatever your homebrewed gcc version
   number is]

4) Go back to the pindel directory and try the ./INSTALL command again. If
   that does not work, please contact us.


* "fatal error: khash.h: No such file or directory" *

At the time of this FAQ update (Feb 20, 2016) we do not know exactly what
causes this issue. It may help to check the htslib directory that you pass
to Pindel's ./INSTALL, for example by changing it into a relative path, or
to clone htslib directly. In any case, one way to solve this is to make a
(relatively) clean htslib/pindel install. Try the following steps:

1) create a directory (for example "mkdir pindeltest", then 
   "cd pindeltest")

2) clone pindel into it ("git clone https://github.com/genome/pindel") 

3) clone htslib into it ("git clone https://github.com/samtools/htslib")

4) go to the htslib directory ("cd htslib")
   a) (optional) One user replaced the value of prefix in the htslib
      Makefile by the htslib path (/usr/local/Apps/pindeltest/htslib).
      I did not do so myself and am not certain it is needed, but you
      could try it if the following steps do not work otherwise.
   b) "make", "sudo make install"

5) go to the pindel directory and install ("cd ../pindel", 
   "./INSTALL ../htslib")

6) if all things have gone well, you can now run pindel ("./pindel")
  

** Runtime problems **

* Contents *

- Memory usage of Pindel is very high
- Pindel runs (too) slowly
- Pindel does not output VCF files
- Pindel2vcf4tcga does not produce valid VCF files


* Memory usage of Pindel is very high *

The most common reason for high memory usage of Pindel is that Pindel tries
to process the centromere regions, which contain lots of 'weird' reads. In
such cases, it is best to use the -j or -J options to specify which
chromosomal regions should be searched by pindel (-j) or skipped (-J). Note
that pindel by default also checks for reads just outside the edges of the
specified regions, so you may need to enlarge the excluded centromeres,
telomeres or other problematic regions a bit (by 10k or so).
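
Enlarging the excluded regions can be done with a few lines of Python. This
is a sketch under assumptions, not part of Pindel: it assumes the file given
to -J lists one region per line as tab-separated chromosome, start and end
(check the option help of your pindel version for the exact format), the
helper names pad_regions and regions_text are hypothetical, and the
coordinates are made up for illustration.

```python
# Assumption: regions are (chromosome, start, end) tuples, and the file for
# -J is a tab-separated "chromosome start end" list. Check ./pindel for the
# exact format your version expects.

def pad_regions(regions, padding=10_000):
    """Widen every region by `padding` bases on both sides (start >= 0)."""
    return [
        (chrom, max(0, start - padding), end + padding)
        for chrom, start, end in regions
    ]


def regions_text(regions):
    """Format regions as tab-separated lines, one region per line."""
    return "".join(f"{chrom}\t{start}\t{end}\n" for chrom, start, end in regions)


# Made-up coordinates, for illustration only.
excluded = [("20", 5_000, 60_000), ("20", 26_000_000, 29_500_000)]
padded = pad_regions(excluded)
```

The string returned by regions_text(padded) can then be written to the file
that is passed to -J.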

The window size option (-W) can also reduce memory usage. This can be needed
for very large data sets.


* Pindel runs (too) slowly *

Pindel can run quite slowly if there are many reads to process; long
runtimes are partly caused by high coverage. However, there are some options
to speed things up:
1) Eliminate 'unproductive' searching by excluding the centromeric and
   telomeric regions from the search process. This can be done with the -j
   and -J options.
2) Parallelize. There are two ways to speed up Pindel using
   parallelization:
   a) allow Pindel to use multithreading (multiple processors/processor cores)
      by using the -T option (default is 1; setting it to 2, 4 or 6 may help,
      depending on your system)
   b) submit jobs per chromosome or half-chromosome in parallel to a cluster.
      This can be done by using the -c option to specify a chromosome (like
      -c 20) or a chromosome region (-c 20:500000-2000000)
3) Lower the search thoroughness. This can be done in many ways; check the
   pindel options (./pindel without any arguments) for further information.
   Some of the options are lowering the sensitivity (-E) or disabling
   genotyping (not using -g).
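
Splitting a run into one job per chromosome or region (option 2b), each with
its own thread count (option 2a), could be sketched as follows. The helper
name region_jobs and the file names are hypothetical; only the -f, -i, -o, -c
and -T options come from this FAQ. How the resulting commands are submitted
depends on your cluster and is not shown.

```python
# Hypothetical helper, not part of Pindel: builds one pindel call per region.

def region_jobs(reference, config_file, output_prefix, regions,
                threads=1, pindel="./pindel"):
    """Return one argument list per region, each with its own output prefix."""
    jobs = []
    for region in regions:
        # Give every job its own output prefix so results do not collide.
        tag = region.replace(":", "_")
        jobs.append([
            pindel, "-f", reference, "-i", config_file,
            "-o", f"{output_prefix}_{tag}",
            "-c", region, "-T", str(threads),
        ])
    return jobs


# reference.fa, samples.conf and my_run are placeholder names.
jobs = region_jobs("reference.fa", "samples.conf", "my_run",
                   ["19", "20:500000-2000000"], threads=4)
```

Each entry of jobs is an argument list that can be passed to subprocess.run
or joined into a line of a cluster submission script.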
 

* Pindel does not output VCF files *

Pindel itself does not output VCF files. Instead, it outputs a file format 
that gives far more detail on the called SVs than would be possible with a 
VCF file. This has the advantage that one can manually check promising SVs
to see whether they are likely a real SV or just some artifact of 
sequencing (like an extra base in a homopolymer). The disadvantage is of
course that most genomic tools need a standard format like VCF. Therefore,
we provide a conversion utility named pindel2vcf. You can find it as an
executable file in the same directory as Pindel itself.
   It can be used as follows:

   pindel2vcf -p sample3chr20_D -r human_g1k_v36.fasta -R 1000GenomesPilot-NCBI36
              -d 20101123 -v sample3chr20_D.vcf

where -p is the name of the pindel output file, -r the name of the
reference fasta file, -R the official name of the reference, -d the date
at which the reference was originally created, and -v the name of the VCF
file to write. If you don't care about submitting things to official
archives, you can just make up -R and -d and fill in things like -R x and
-d 00000000, though I would recommend just taking the time and effort to
find those things out.
   In any case, pindel2vcf will produce a VCF file that can be used by
other tools.
   Note that pindel2vcf has many options for filtering SVs and also an 
option for outputting a GATK-format VCF. Run ./pindel2vcf to see all the
available options.
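
Since Pindel writes one file per SV type, the conversion usually has to be
repeated for each of them. A minimal Python sketch, assuming the output files
_D, _SI, _INV and _TD exist and using only the options from the example
above; the helper name pindel2vcf_commands is made up for this FAQ:

```python
# Hypothetical helper, not part of Pindel: one pindel2vcf call per output file.

def pindel2vcf_commands(output_prefix, reference, reference_name,
                        reference_date,
                        suffixes=("_D", "_SI", "_INV", "_TD"),
                        pindel2vcf="./pindel2vcf"):
    """Return one pindel2vcf argument list per Pindel output file."""
    commands = []
    for suffix in suffixes:
        pindel_file = output_prefix + suffix
        commands.append([
            pindel2vcf, "-p", pindel_file, "-r", reference,
            "-R", reference_name, "-d", reference_date,
            "-v", pindel_file + ".vcf",  # VCF is named after the input file
        ])
    return commands


commands = pindel2vcf_commands("sample3chr20", "human_g1k_v36.fasta",
                               "1000GenomesPilot-NCBI36", "20101123")
```

The first entry of commands corresponds to the pindel2vcf call shown above.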


* Pindel2vcf4tcga does not produce valid VCF files *

Pindel2vcf4tcga, despite its name, does not produce valid VCF files. Valid
TCGA VCF files require a lot of input data that pindel2vcf cannot know
(like the version and settings of the alignment software). Even if all
that data were available, the file produced would likely still be
unacceptable to the TCGA database, as it only accepts files from
'accepted' groups and projects. Regular Pindel users would therefore find
pindel2vcf4tcga of no use at all.
   For now, pindel2vcf4tcga is basically a 'half-converter' which converts
pindel files into a semi-TCGA-VCF format, which certain scripts that TCGA
partners possess can process into full-blown TCGA files. This does not
mean that pindel2vcf4tcga will never work as a 'stand-alone' script, but
for now it is only meant for certain TCGA-affiliated users; regular
Pindel users should not try to apply it to their own Pindel output.

** What do the terms and numbers in the Pindel output file (sample_D,
 sample_SI, sample_TD etc.) and in the VCF output file mean? **

* Contents *

-In the Pindel output file, what do the six numbers after every sample name
 mean?


* In the Pindel output file, what do the six numbers after every sample name mean? *

In the Pindel output file (sample_D, sample_SI etc.) each event is 
represented by one line of data about the event, and multiple lines
representing raw reads. The data line starts with an index (the 0th event,
1st event, 2nd event etc.) followed by an SV type (D means deletion, for
example), and ends with a group of seven items for each sample,
consisting of the sample name and six numbers, like 
"SIMCHRVS2 392 12 1 1 0 0".
What do those numbers mean?
The first number is the read depth (as calculated by samtools) at the
left breakpoint of the SV, which indicates how many reads map to and
cover the left breakpoint. If this number is high (or at least much higher
than the last four numbers), then this region maps 'normally', which
indicates that this SV may be a false positive, though there are still
cases in which you may want to check it further. Somatic mutations
in cancer, for example, may have much lower alt coverage than the
'reference coverage' given by the first two numbers, but still be real
and possibly damaging mutations; they are just not present in all cells
in the sample.

Similarly, the second number represents the read depth at the right
breakpoint of the SV.

The third and fourth numbers represent the number of reads on the +
strand mapping to the alternative allele, and the number of unique reads
on the + strand mapping to the alternative allele. The unique count is
given because in some cases reads have exactly the same sequence and
the same start and end positions. This could be a coincidence, but often
it is just an artifact of PCR/sequencing and does not indicate extra
certainty that the event exists. The fourth number basically gives
the number of truly independent reads supporting the event, so it may be
more reliable than the third number in ascertaining how likely it is that
the alternative allele really exists.

The fifth and sixth numbers are similar to the third and fourth numbers;
they represent the total and unique numbers of reads that map to the
- strand and support the alternative allele.
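
The meaning of the six numbers can be summarized in a short Python sketch.
It is an illustration only: the names SampleSupport and parse_sample_groups
are made up for this FAQ, and the sketch assumes you have already cut the
trailing per-sample items (seven per sample) out of the data line, since the
other fields of that line are not described here.

```python
from typing import NamedTuple


class SampleSupport(NamedTuple):
    """One per-sample group: the sample name and its six numbers."""
    name: str
    left_depth: int    # read depth at the left breakpoint
    right_depth: int   # read depth at the right breakpoint
    plus_reads: int    # + strand reads supporting the alternative allele
    plus_unique: int   # unique + strand reads supporting it
    minus_reads: int   # - strand reads supporting the alternative allele
    minus_unique: int  # unique - strand reads supporting it

    @property
    def unique_support(self):
        """Independent reads supporting the event, both strands combined."""
        return self.plus_unique + self.minus_unique


def parse_sample_groups(tokens):
    """Turn a flat list of per-sample tokens into SampleSupport records."""
    if len(tokens) % 7 != 0:
        raise ValueError("expected groups of a sample name and six numbers")
    groups = []
    for i in range(0, len(tokens), 7):
        name, *numbers = tokens[i:i + 7]
        groups.append(SampleSupport(name, *map(int, numbers)))
    return groups


groups = parse_sample_groups("SIMCHRVS2 392 12 1 1 0 0".split())
```

For the example group above, groups[0].left_depth is 392 and
groups[0].unique_support is 1, so the reference coverage at the left
breakpoint is far higher than the support for the alternative allele.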


** Contacting the authors **
For any pindel-related questions, please contact Kai Ye, kaiye@xjtu.edu.cn
If you have questions related to pindel2vcf, you may also contact Eric-Wubbo Lameijer, e.m.w.lameijer@gmail.com