src
Folders and files
Name | Name | Last commit date | ||
---|---|---|---|---|
parent directory.. | ||||
SGA - String Graph Assembler ============================ SGA is a de novo assembler for DNA sequence reads. It is based on Gene Myers' string graph formulation of assembly and uses the FM-index/Burrows-Wheeler transform to efficiently find overlaps between sequence reads. The core algorithms are described in this paper: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/12/i367 ------------- Compiling SGA SGA dependencies: -google sparse hash library (https://github.com/sparsehash/sparsehash) -the bamtools library (https://github.com/pezmaster31/bamtools) -zlib (http://www.zlib.net/) -(optional but suggested) the jemalloc memory allocator (http://www.canonware.com/jemalloc/download.html) Additionally, the pipeline python scripts use the following modules. These are not required to build SGA but must be available if you want to use the relevant python helper scripts: -pysam (http://code.google.com/p/pysam/) -ruffus (http://www.ruffus.org.uk/) If you cloned the repository from github, run autogen.sh from the src directory to generate the configure file: ./autogen.sh If bamtools and the sparsehash have been installed in standard locations (like /usr/local) you can run configure without any parameters then run make: ./configure make If bamtools or the sparsehash are installed elsewhere, you can specify their locations as follows: ./configure --with-sparsehash=/home/jsimpson/ --with-bamtools=/home/jsimpson/software/bamtools These directories should be the root of the install (in other words, the directories have include/ and lib/ as subdirectories contained the header files and libraries, respectively). The program uses pthreads to parallelize most steps of the assembly. The use of a concurrent memory allocator like jemalloc can improve running time. If you would like to enable the jemalloc memory allocator, specify the path as follows: ./configure --with-jemalloc=/home/jsimpson/jemalloc/lib/ The Hoard and tcmalloc allocators are also supported and can be enabled using --with-hoard/--with-tcmalloc if desired. -------------- Installing SGA Running make install will install sga into /usr/local/bin/ by default. To specify the install location use the --prefix option to configure: ./configure --prefix=/home/jsimpson/ && make && make install This command will copy sga to /home/jsimpson/bin/sga ----------- Running SGA SGA consists of a number of subprograms, together which form the assembly pipeline. The subprograms can also be used to perform other interesting tasks, like read error correction or removing PCR duplicates. Each program and subprogram will print a brief description and its usage instructions if the --help flag is used. To get a listing of all subprograms, run sga --help. Examples of an SGA assembly are provided in the src/examples directory. It is suggested to look at these examples to become familar with the flow of data through the program. The major subprograms are: * sga preprocess Prepare reads for assembly. It can perform optional quality filtering/trimming. By default it will discard reads that have uncalled bases ('N' or '.'). If you wish to keep these reads, use the --permuteAmbiguous flag which will randomly change any uncalled bases to one of [ACGT]. It is mandatory to run this command on real data. If your reads are paired, the --pe-mode option should be specified. The paired reads can be input in two files (sga preprocess READS1 READS2) where the first read in READS1 is paired with the first read on READS2 and so on. Alternatively, they can be specified in a single file where the two reads are expected to appear in consecutive records. By default, output is written to stdout. * sga index -a ropebwt READS Build the FM-index for READS, which is a fasta or fastq file. This program is threaded (-t N). * sga correct READS Perform error correction on READS file. Overlap and kmer-based correction algorithms are implemented. By default, a k-mer based correction is performed. Many options exist for this program, see --help. * sga filter READS Remove exact-match duplicated sequences and reads with low-frequency k-mers. This program automatically regenerates the FM-index without the removed reads. * sga overlap -m N READS Find overlaps between reads to construct the string graph. The -m parameter specifies the minimum length of the overlaps to find. By default only non-transitive (irreducible) edges are output and edges between identical sequences. If all overlaps between reads are desired, the --exhaustive option can be specified. This program is threaded. The output file is READS.asqg.gz by default. * sga assemble READS.asqg.gz Assemble takes the output of the overlap step and constructs contigs. The output is in contigs.fa by default. Options exist for cleaning the graph before assembly which will substantially increase assembly continuity. See the --cut-terminal, --bubble, --resolve-small options. ------------------- Data quality issues Sequence assembly requires high quality data. It is worth assessing the quality of your reads using tools like FastQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/) to help guide the choice of assembly parameters. Low-quality data should be filtered or trimmed. Very highly-represented sequences (>1000X) can cause problems for SGA. This can happen when sequencing a small genome or when mitochondria or other contamination is present in the sequencing run. In these cases, it is worth considering pre-filtering the data or running an initial 'rmdup' step before error correction. ------- History The first SGA code check-in was August, 2009. The algorithms for directly constructing the string graph from the FM-index were developed and implemented in the fall of 2009. The initial public release was October 2010. ------- License SGA is licensed under GPLv3. See the COPYING file in the src directory. ---------------- Third party code SGA uses Bentley and Sedgwick's multikey quicksort code that can be found here: http://www.cs.princeton.edu/~rs/strings/demo.c It also uses zlib, the google sparse hash and gzstream by Deepak Bandyopadhyay and Lutz Kettner (see Thirdparty/README). SGA also uses the stdaln dynamic programming functions written by Heng Li. The ropebwt implementation is by Heng Li (https://raw.github.com/lh3/ropebwt). SGA also uses the rapidjson library by Milo Yip (http://code.google.com/p/rapidjson/). This Software includes an implementation of the algorithm published by Bauer et al. [Markus J. Bauer, Anthony J. Cox and Giovanna Rosone: Lightweight BWT Construction for Very Large String Collections. Lecture Notes in Computer Science 6661 (Proceedings of the 22nd Annual Conference on Combinatorial Pattern Matching), 2011, pp.219-231]. Intellectual property rights in the algorithm are owned and protected by Illumina Cambridge Ltd. and/or its affiliates (“Illumina”). Illumina is not a contributor or licensor under this GPL license and retains all rights, title and interest in the algorithm and intellectual property associated therewith. No rights or licenses to Illumina intellectual property, including patents, are granted to you under this license. ------- Credits Written by Jared Simpson. The algorithms were developed by Jared Simpson and Richard Durbin.