- The SRA is full of sequencing data. 🎉
- Tons of
  - sequencing platforms
  - experiment types (genomic, transcriptomic, metagenomic, you name it)
  - read qualities
- Great, lots of data to play around with, but…
  - often you don't want all the data from an experiment
  - saving 100s of read sets takes lots of space
  - files contain contaminants 😭
  - you only want individual genomes out of a metagenome
- The big question: How can we easily get only the interesting parts of SRA sets?
- Get reference genomes (of organisms of interest, or of contaminants) out of RefSeq to create a reference database
- Stream the data right out of the SRA and use `magicblast` to compare it to our reference database, so you only save the reads you actually want!
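To make the "only save the reads you actually want" idea concrete, here is a minimal Python sketch of such a filter. It is not the code this repository uses. It assumes the alignments arrive as SAM text (Magic-BLAST's default output format) and that "wanted" means either reads that align to the reference (genomes of interest) or reads that do not align (contaminant removal). The function name `filter_sam` and the demo records are made up for illustration.

```python
# Hypothetical sketch: keep or drop reads based on the SAM FLAG field.
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped" (SAM specification)


def filter_sam(lines, keep_mapped=True):
    """Yield SAM lines to keep.

    keep_mapped=True  -> keep reads that align to the reference (genomes of interest)
    keep_mapped=False -> keep reads that do NOT align (reference holds contaminants)
    Header lines (starting with '@') are always passed through.
    """
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip malformed records (SAM has 11 mandatory fields)
        is_mapped = not (int(fields[1]) & UNMAPPED)
        if is_mapped == keep_mapped:
            yield line


# Tiny made-up example: one aligned read, one unaligned read.
demo = [
    "@HD\tVN:1.0\n",
    "read1\t0\tref1\t100\t255\t4M\t*\t0\t0\tACGT\tFFFF\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tFFFF\n",
]
of_interest = list(filter_sam(demo, keep_mapped=True))
decontaminated = list(filter_sam(demo, keep_mapped=False))
```

Because the function works on any iterable of lines, it could in principle sit at the end of a pipe and process alignments as they stream in, without storing the full read set first.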
- clone this repository
git clone https://github.com/NCBI-Hackathons/STREAMclean
- install the required Python libraries
pip install -r requirements.txt
- download `magicblast` from NCBI
./mapper_wrapper.sh -d test1 -i bacteria -s SRR4420340
or more specifically
./mapper_wrapper.sh -d test1 -i "-t 199310 viral" -s SRR4420340
This will:
- Download specified reference genomes using the ncbi-genome-download package.
- Create a Magic-BLAST database of the collected reference genomes.
- Map the SRA accessions against the whitelist/blacklist reference database.
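The three steps above could look roughly like the commands built in the following Python sketch. This is an illustration under assumptions, not a description of what `mapper_wrapper.sh` actually runs: the helper `build_commands`, the file name `references.fna`, the output file name, and the use of `test1` as the database name are all hypothetical. The sketch only builds and prints the command lines; it does not execute anything.

```python
import shlex


def build_commands(genome_args, db_name, accession, fasta="references.fna"):
    """Return the three command lines (as argument lists) for one run.

    genome_args: arguments handed to ncbi-genome-download, e.g. "-t 199310 viral"
    db_name:     name of the BLAST database to create (hypothetical choice)
    accession:   SRA run accession to stream, e.g. "SRR4420340"
    fasta:       FASTA file assumed to hold all downloaded genomes; the download
                 yields compressed per-assembly files, so a decompress-and-
                 concatenate step (not shown) would be needed in between
    """
    download = ["ncbi-genome-download", "-F", "fasta"] + shlex.split(genome_args)
    make_db = ["makeblastdb", "-in", fasta, "-dbtype", "nucl",
               "-parse_seqids", "-out", db_name]
    # -sra lets magicblast read the run directly from the SRA
    map_reads = ["magicblast", "-sra", accession, "-db", db_name,
                 "-out", f"{accession}.sam"]
    return [download, make_db, map_reads]


commands = build_commands("-t 199310 viral", "test1", "SRR4420340")
for cmd in commands:
    print(shlex.join(cmd))
```

Keeping each command as an argument list means it could later be passed to `subprocess.run` without going through a shell, which avoids quoting problems with user-supplied arguments.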