There may be performance benefits when compiling from source as the build can be optimized for the host system.
Install with bioconda (Linux)
conda install -c bioconda -c conda-forge raptor
Install with brew (Linux, macOS)
brew install brewsci/bio/raptor
Prerequisites (click to expand)
- CMake >= 3.16
- GCC 10, 11 or 12 (most recent minor version)
- git
Refer to the Seqan3 Setup Tutorial for more in depth information.
Download current main branch (click to expand)
git clone https://github.com/seqan/raptor
git submodule update --init
Download specific version (click to expand)
E.g., for version 1.1.0
:
git clone --branch raptor-v1.1.0 --recurse-submodules https://github.com/seqan/raptor
Or from within an existing repository
git checkout raptor-v1.1.0
Building (click to expand)
cd raptor
mkdir -p build
cd build
cmake ..
make
The binary can be found in bin
.
You may want to add the Raptor executable to your PATH:
export PATH=$(pwd)/bin:$PATH
raptor --version
By default, Raptor will be built with host specific optimizations (-march=native
). This behavior can be disabled by
passing -DRAPTOR_NATIVE_BUILD=OFF
to CMake.
A toy data set (124 MiB compressed, 983 MiB decompressed) can be found here.
wget https://ftp.imp.fu-berlin.de/pub/seiler/raptor/example_data.tar.gz
tar xfz example_data.tar.gz
After extraction, the example_data
will look like:
$ tree -L 2 example_data
example_data
├── 1024
│ ├── bins
│ └── reads
└── 64
├── bins
└── reads
The bins
folder contains a FASTA file for each bin and the reads
directory contains a FASTQ file for each bin
containing reads from the respective bin (with 2 errors).
Additionally, mini.fastq
(5 reads of all bins), all.fastq
(concatenation of all FASTQ files) and all10.fastq
(all.fastq
repeated 10 times) are provided in the reads
folder.
In the following, we will use the 64
data set.
To build an index over all bins, we first prepare a file that contains one file path per line
(a line corresponds to a bin) and use this file as input:
seq -f "example_data/64/bins/bin_%02g.fasta" 0 1 63 > all_bin_paths.txt
raptor build --kmer 19 --window 23 --size 8m --output raptor.index all_bin_paths.txt
You may be prompted to enable or disable automatic update notifications. For questions, please consult the SeqAn documentation.
Afterwards, we can search for some reads:
raptor search --error 2 --index raptor.index --query example_data/64/reads/mini.fastq --output search.output
The output starts with a header section (lines starting with #
). The header maps a number to each input file.
After the header section, each line of the output consists of the read ID (in the toy example these are numbers) and
the corresponding bins in which they were found:
#0 example_data/64/bins/bin_00.fasta
#1 example_data/64/bins/bin_01.fasta
...
#62 example_data/64/bins/bin_62.fasta
#63 example_data/64/bins/bin_63.fasta
#QUERY_NAME USER_BINS
0 0
1 0
2 0
3 0
4 0
16384 1
...
1015812 62
1032192 63
1032193 63
1032194 63
1032195 63
1032196 63
For a list of options, see the help pages:
raptor --help
raptor build --help
raptor search --help
raptor upgrade --help
We offer the option to precompute the minimisers of the input files. This is useful to build indices of big datasets (in the range of several TiB) and also allows an estimation of the needed index size since the amount of minimisers is known. Following above example, we would change the build step as follows:
First we precompute the minimisers and store them in a directory:
seq -f "example_data/64/bins/bin_%02g.fasta" 0 1 63 > all_bin_paths.txt
raptor build --kmer 19 --window 23 --compute-minimiser --output precomputed_minimisers all_bin_paths.txt
Then we run the build step again and use the computed minimisers as input:
seq -f "precomputed_minimisers/bin_%02g.minimiser" 0 1 63 > all_minimiser_paths.txt
raptor build --size 8m --output minimiser_raptor.index all_minimiser_paths.txt
The preprocessing applies the same cutoffs as used in Mantis
(Pandey et al., 2018).
This means that only minimisers that occur more often than the cutoff specifies are included in the output.
If you wish to process all minimisers, you can use --disable-cutoffs
.
To reduce the overall memory consumption, the index can be divided into multiple (a power of two) parts.
This can be done by passing --parts n
to raptor build
, where n
is the number of parts you want to create.
This will create n
files, each representing one part of the index. The --size
parameter describes the overall size
of the index. For example, --size 8g --parts 4
will create four 2 GiB indices. This will reduce the memory consumption
of raptor build
and raptor search
by approximately 6 GiB, since there will only be one part in memory at any given
time. raptor search
will automatically detect the parts, and does not need any special parameters.
An old index can be upgraded by running raptor upgrade
and providing some information about how the index was
constructed.
We implement the core interface of SOCKS. For a list of options, see the help pages:
raptor socks build --help
raptor socks lookup-kmer --help
Raptor is being developed by Enrico Seiler, but also incorporates much work from other members of SeqAn.
In your academic works (also comparisons and pipelines) please cite:
- Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences; Enrico Seiler, Svenja Mehringer, Mitra Darvish, Etienne Turc, and Knut Reinert; iScience 2021 24 (7): 102782. doi: https://doi.org/10.1016/j.isci.2021.102782
Raptor was presented at the 25th International Conference on Research in Computational Molecular Biology:
Please see the License section for information on allowed use.
The subdirectory util
contains applications and scripts related to the paper.
Raptor is open source software. However, certain conditions apply when you (re-)distribute and/or modify Raptor, please see the license.
Vercel is kind enough to build and host our documentation and even provide preview-builds within our pull requests. Check them out!