
Comparison with samtools

lomereiter edited this page Aug 16, 2012 · 9 revisions

Functionality

Viewing SAM/BAM

| Feature | sambamba view | samtools view | Notes |
|---|---|---|---|
| BAM support | Full | Full | |
| SAM support | Full | Full | sambamba skips (syntactically) invalid tags and sets invalid fields to default values |
| Error messages | Descriptive | Incomplete | Where samtools says just 'truncated file', sambamba prints a detailed error message describing what is wrong with the BAM file |
| Multithreaded BAM decompression | Yes | No | |
| Non-seekable file stream support | Yes | Yes | |
| Skipping invalid reads | Optional | No | The sambamba library also includes a module for creating custom validation tools |
| Filtering | Powerful | Limited | sambamba view comes with a simple query language for filtering alignments |
| JSON output | Yes | No | Useful for interacting with scripting languages |
| Progress bar | Optional | No | |
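
The JSON output listed in the table can be consumed from a script one line at a time. Here is a minimal sketch in Python, assuming that `sambamba view -f json` writes one JSON object per alignment per line and names its fields after the SAM columns (`qname`, `mapq` and so on). The field names, the helper `high_quality_names` and the sample records are all illustrative assumptions, so check them against the output of your sambamba version.

```python
import json

def high_quality_names(lines, min_mapq=50):
    """Return read names of records whose mapping quality is >= min_mapq."""
    names = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # "mapq" and "qname" are assumed field names (see the note above).
        if record.get("mapq", 0) >= min_mapq:
            names.append(record["qname"])
    return names

# Made-up records in the assumed one-object-per-line layout.
sample_output = [
    '{"qname": "read1", "flag": 99, "rname": "20", "pos": 100, "mapq": 60}',
    '{"qname": "read2", "flag": 147, "rname": "20", "pos": 250, "mapq": 12}',
    '{"qname": "read3", "flag": 83, "rname": "20", "pos": 400, "mapq": 50}',
]

print(high_quality_names(sample_output))  # ['read1', 'read3']
```

In practice the lines would come from the tool's standard output (for example by iterating over the `stdout` of a `subprocess.Popen` call) rather than from a hard-coded list.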

Other tools

| Feature | sambamba | samtools |
|---|---|---|
| Indexing | Yes, multithreaded | Yes, single-threaded |
| Merging BAM files | Yes, multithreaded decompression and compression | Yes, compression is multithreaded |
| Automatic SAM header merging | Yes | No |
| Multithreaded BAM file external sort | Yes | Yes |
| Flag statistics | Yes, multithreaded | Yes, single-threaded |

(other utilities available in samtools are not implemented in sambamba)

Performance

Here are some benchmarks on two configurations:

  • Intel Atom N450 @ 1.66GHz (1 core with hyperthreading), 1GB of RAM
  • 2x Intel Xeon E5310 @ 1.60GHz (8 cores without hyperthreading), 8GB of RAM

On both machines, sambamba was built with the GDC compiler (which is used for building the Debian packages), and samtools was built with its default makefile using gcc -O2.

Tools were tested on HG00125.chrom20.ILLUMINA.bwa.GBR.low_coverage.20111114.bam (denoted by $FILENAME in command lines), 301MB in size.

Indexing BAM file (empty file cache)

sambamba index $FILENAME
samtools index $FILENAME

| Configuration | Tool | Time | Memory usage | CPU load |
|---|---|---|---|---|
| Intel Atom N450 | sambamba | 12.29s | 32MB | 147% |
| Intel Atom N450 | samtools | 13.6s | 1.4MB | 92% |
| 2x Intel Xeon E5310 | sambamba | 6.96s | 32MB | 139% |
| 2x Intel Xeon E5310 | samtools | 8.73s | 1.4MB | 93% |

Indexing BAM file (file fully cached into RAM)

sambamba index $FILENAME
samtools index $FILENAME

| Configuration | Tool | Time | Memory usage | CPU load |
|---|---|---|---|---|
| Intel Atom N450 | sambamba | 9.43s | 32MB | 188% |
| Intel Atom N450 | samtools | 12.08s | 1.4MB | 99% |
| 2x Intel Xeon E5310 | sambamba | 2.21s | 32MB | 433% |
| 2x Intel Xeon E5310 | samtools | 7.98s | 1.4MB | 100% |

Filtering reads from a region, with BAM output (empty file cache)

sambamba view -f bam $FILENAME 20:10,000,000-20,000,000 -F "mapping_quality >= 50" -o test.bam
samtools view -b $FILENAME 20:10,000,000-20,000,000 -q50 -o test.bam

| Configuration | Tool | Time | Memory usage | CPU load |
|---|---|---|---|---|
| Intel Atom N450 | sambamba | 22.96s | 90MB | 98% |
| Intel Atom N450 | samtools | 23.16s | 1.8MB | 96% |
| 2x Intel Xeon E5310 | sambamba | 5.24s | 90MB | 250% |
| 2x Intel Xeon E5310 | samtools | 10.83s | 1.8MB | 98% |

Counting reads from a region (file fully cached into RAM)

sambamba view $FILENAME -c -F "[RG] == 'ERR016156' and proper_pair and first_of_pair and not duplicate" 20:1000000-3000000
samtools view $FILENAME -c -r 'ERR016156' -f66 -F1024 20:1000000-3000000

| Configuration | Tool | Time | Memory usage | CPU load |
|---|---|---|---|---|
| Intel Atom N450 | sambamba | 0.53s | 50MB | 144% |
| Intel Atom N450 | samtools | 0.42s | 1.3MB | 99% |
| 2x Intel Xeon E5310 | sambamba | 0.20s | 50MB | 208% |
| 2x Intel Xeon E5310 | samtools | 0.27s | 1.5MB | 100% |
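
The two commands in the counting benchmark express the same flag conditions in different ways: sambamba names them (`proper_pair and first_of_pair and not duplicate`), while samtools takes bit masks (`-f66 -F1024`). The sketch below shows how the masks follow from the flag bits defined in the SAM specification. The function name `passes_flag_filter` is illustrative, and the read group condition (`[RG] == 'ERR016156'` or `-r`) is not part of the flag field, so it is left out.

```python
# Flag bits as defined in the SAM specification.
PROPER_PAIR = 0x2
FIRST_OF_PAIR = 0x40
DUPLICATE = 0x400

REQUIRED = PROPER_PAIR | FIRST_OF_PAIR  # value passed to samtools -f
EXCLUDED = DUPLICATE                    # value passed to samtools -F

def passes_flag_filter(flag):
    """True if every required bit is set and no excluded bit is set."""
    return (flag & REQUIRED) == REQUIRED and (flag & EXCLUDED) == 0

print(REQUIRED, EXCLUDED)        # 66 1024
print(passes_flag_filter(99))    # True:  99 = 1 + 2 + 32 + 64
print(passes_flag_filter(1123))  # False: 1123 = 99 + 1024 (duplicate)
print(passes_flag_filter(147))   # False: 147 = 1 + 2 + 16 + 128 (second of pair)
```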

Conclusion

As you can see, sambamba exploits parallelism where samtools does not. The faster your storage, the greater the speedup (see the indexing results).
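
To put numbers on the indexing comparison, the speedup can be computed as samtools time divided by sambamba time. The timings in this short sketch are copied from the indexing benchmarks on this page; the dictionary layout and the function name `speedup` are only illustrative.

```python
# (sambamba seconds, samtools seconds), copied from the indexing benchmarks.
indexing_times = {
    ("Intel Atom N450", "empty cache"): (12.29, 13.6),
    ("Intel Atom N450", "cached"): (9.43, 12.08),
    ("2x Intel Xeon E5310", "empty cache"): (6.96, 8.73),
    ("2x Intel Xeon E5310", "cached"): (2.21, 7.98),
}

def speedup(sambamba_s, samtools_s):
    """How many times faster sambamba finished than samtools."""
    return samtools_s / sambamba_s

for (machine, cache), (sambamba_s, samtools_s) in indexing_times.items():
    print(f"{machine:20s} {cache:12s} {speedup(sambamba_s, samtools_s):.2f}x")
```

With these figures the speedup ranges from about 1.1x (Atom, empty cache) to about 3.6x (Xeon, file cached in RAM), so the gain is largest where disk reads are not the bottleneck and more cores are available.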

However, there are some drawbacks at the moment. Memory usage is higher due to extensive use of various buffers, and region queries are slower in some cases (though not by much, about 10-20%).