Get microbial sequence data easier and faster
Traditionally if you saw an accession number in a manuscript, you would paste it into NCBI Search and then muck around trying to download the associated data. There were wizards who could use the Entrez interface and its associated command line tools, but it needs to be easier.
A variety of tools now exist to download data from NCBI and ENA:
Combined they are powerful. I use them. Some work with assemblies, some with reads, some with both, but often with confusing caveats and annoying parameters I don't feel I should have to think about.
I just want to do this:
% seeka PRJEB5167
% cd PRJEB5167
% ls
ERR405852 ERR405853 ERR405854 ERR405855 ERR405856 ERR405857 ERR405858
ERR405859 ERR405860 ERR405861 ERR405862 ERR405863 ERR405864 ERR405865
ERR405866 ERR405867 ERR405868 ERR405869 ERR405870 ERR405871 ERR405872
PRJEB5167.tsv
% head -n 1 PRJEB5167.tsv | tr "\t" "\n" | head | nl
1 study_accession
2 secondary_study_accession
3 sample_accession
4 secondary_sample_accession
5 experiment_accession
6 run_accession
7 submission_accession
8 tax_id
9 scientific_name
10 instrument_platform
% cd ERR405855
% ls
ERR405855_1.fastq.gz ERR405855_2.fastq.gz
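Since the metadata TSV has a header row, columns can be pulled out by name rather than by position. The following is a minimal sketch, assuming a tab-separated file with the column names shown above; `tsv_column` is a hypothetical helper written for illustration and is not part of seeka.

```shell
# Print the values of one named column from a tab-separated file with a header row.
# Usage: tsv_column FILE COLUMN_NAME   (hypothetical helper, not part of seeka)
tsv_column() {
  awk -F '\t' -v col="$2" '
    NR == 1 {
      # Header row: find the index of the requested column
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "no such column: " col > "/dev/stderr"; exit 1 }
      next
    }
    { print $idx }   # Data rows: print the matching field
  ' "$1"
}

# Example: list every run accession recorded in the project metadata
# tsv_column PRJEB5167.tsv run_accession
```

Looking columns up by name keeps the command working even if the column order in the TSV changes.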
% seeka --version
seeka 0.4.2
# download a single run
% seeka ERR405852
# get data for a biosample
% seeka SAMEA2297485
# get every read set in a project
% seeka PRJEB5167
GCA_nnnnnnnnn.v - Genbank assembly
[A-Z]{4}01000000 - Genbank assembly
GCF_nnnnnnnnn.v - Refseq assembly
NC_nnnnnn.v - Refseq assembly
PRJ{EB,NA} - SRA project
[SED]RRnnnnnnn - SRA read set (FASTQ)
[SED]RXnnnnnnn - SRA experiment
[SED]RPnnnnnnn - SRA study
[SED]RSnnnnnnn - SRA sample
SAM[NED] - Biosamples
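To illustrate how the accession patterns above map to data types, here is a rough sketch of a classifier in POSIX shell. It is not seeka's own code: `classify_accession` is a hypothetical name, and the shell globs are looser than the patterns listed (for example, `[0-9]*` only requires one leading digit), so treat it as a first-pass filter rather than a validator.

```shell
# Classify an accession string using glob versions of the patterns listed above.
# Hypothetical helper for illustration; not seeka's implementation.
classify_accession() {
  case "$1" in
    GCA_[0-9]*.[0-9]*)             echo "Genbank assembly" ;;
    GCF_[0-9]*.[0-9]*)             echo "Refseq assembly" ;;
    NC_[0-9]*.[0-9]*)              echo "Refseq assembly" ;;
    [A-Z][A-Z][A-Z][A-Z]01000000)  echo "Genbank assembly" ;;
    PRJEB[0-9]*|PRJNA[0-9]*)       echo "SRA project" ;;
    [SED]RR[0-9]*)                 echo "SRA read set" ;;
    [SED]RX[0-9]*)                 echo "SRA experiment" ;;
    [SED]RP[0-9]*)                 echo "SRA study" ;;
    [SED]RS[0-9]*)                 echo "SRA sample" ;;
    SAM[NED]*)                     echo "Biosample" ;;
    *)                             echo "unknown"; return 1 ;;
  esac
}

# Example:
# classify_accession ERR405852
```

The function returns a non-zero status for unrecognised input, so it can be used directly in an `if` test.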
seeka.ACCESSION.tsv - metadata TSV for search query
*.fastq.gz - any read data
*.fna.gz - any assemblies in FASTA
*.gbff.gz - any Genbank flat files
conda install -c conda-forge -c bioconda -c defaults seeka # COMING SOON
Install Homebrew (macOS) or Linuxbrew (Linux).
brew install brewsci/bio/seeka # COMING SOON
This will install the latest version direct from GitHub. You'll need to add the seeka bin directory to your $PATH, and also ensure all the dependencies are installed.
cd $HOME
git clone https://github.com/tseemann/seeka.git
$HOME/seeka/bin/seeka --help
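One way to put the seeka bin directory on your $PATH is sketched below. It assumes the repository was cloned into $HOME/seeka as in the commands above; `SEEKA_BIN` is just an illustrative variable name. Adding these lines to your shell startup file would make the change persistent.

```shell
# Assumes the clone lives in $HOME/seeka, as in the commands above.
SEEKA_BIN="$HOME/seeka/bin"

# Prepend the directory only if it is not already on PATH.
case ":$PATH:" in
  *":$SEEKA_BIN:"*) ;;                          # already present, nothing to do
  *) PATH="$SEEKA_BIN:$PATH"; export PATH ;;
esac
```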
perl >= 5.26
ascp from the Aspera Command Line Tools
rsync
esearch, efetch, elink from the Entrez edirect toolkit
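Before running seeka from a source checkout, it can help to confirm that the tools listed above are reachable. The sketch below only checks that each command is on your $PATH; it does not verify the Perl version. `missing_tools` is a hypothetical helper, not something seeka provides.

```shell
# Report which of the given commands are not on PATH.
# Hypothetical helper for illustration; not part of seeka.
missing_tools() {
  missing=""
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  # Strip the leading space and print the result (empty if nothing is missing)
  echo "${missing# }"
  # Succeed only when every tool was found
  [ -z "$missing" ]
}

# Example: check the dependencies listed above
# missing_tools perl ascp rsync esearch efetch elink
```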
seeka is free software, released under the GPL 3.0.
Please submit suggestions and bug reports to the Issue Tracker