sourmash database construction - current status and future thoughts #2015
Comments
more -
should we build RefSeq representative databases, per sourmash-bio/databases#13?
provide better database benchmarks - #2014
upgrading wort manifests - #1965
providing links in taxonomy databases - #1969
building less redundant databases with minimum set covers - #1852
old IMG databases that we actually could upgrade - #385
Sketching protein databases using note - suppressed record issue: #2037
With the merge of `StandaloneManifestIndex` #1891 and `sourmash sig check` #1907, and the advent of `sourmash sketch fromfile` #1885 and associated scripts in https://github.com/ctb/2022-sourmash-sketchfrom/, a new day is dawning for sourmash database construction 🎉
tl;dr `sig check` and `sketch fromfile`
The two key commands introduced in sourmash v4.4.0 are `sourmash sig check` and `sourmash sketch fromfile`. `sig check` helps identify and retrieve signatures that match lists of identifiers, while `sketch fromfile` helps coordinate the bulk construction of sketches.
Some functional examples of `sourmash sketch fromfile` are here. This includes examples of databases with private identifiers and also databases with NCBI-formatted identifiers.
`sig check` is used to find and extract signatures from wort collections based on identifier prefix matching, and can also be used to verify that all desired identifiers are in a database.
This issue is a consolidation of unresolved elements from previous issues, as well as a reference point for closure of previous issues and PRs.
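As a rough illustration of the bulk sketching that `sketch fromfile` coordinates, here is a minimal Python sketch that writes an input CSV and assembles (without running) a command line. It assumes the `name`, `genome_filename`, `protein_filename` column layout and the `-p`/`-o` options as I understand them from the sourmash docs; the accessions, paths, and helper function names are hypothetical, so check `sourmash sketch fromfile --help` before relying on it.

```python
import csv
import io

# Column names that `sourmash sketch fromfile` expects (assumption: taken from
# my reading of the sourmash documentation, not from this issue).
FROMFILE_COLUMNS = ["name", "genome_filename", "protein_filename"]


def write_fromfile_csv(records, out):
    """Write (name, genome_path, protein_path) tuples as a fromfile-style CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(FROMFILE_COLUMNS)
    for name, genome_path, protein_path in records:
        # An empty string means "no file of this type for this entry".
        writer.writerow([name, genome_path or "", protein_path or ""])


def fromfile_command(csv_path, output_zip, params):
    """Build (but do not run) a `sourmash sketch fromfile` argument list."""
    cmd = ["sourmash", "sketch", "fromfile", csv_path, "-o", output_zip]
    for param in params:
        cmd += ["-p", param]
    return cmd


# Hypothetical records: these accessions and paths are made up for illustration.
records = [
    ("GCA_000000001.1 Example organism A", "genomes/GCA_000000001.1.fna.gz", ""),
    ("GCA_000000002.1 Example organism B", "genomes/GCA_000000002.1.fna.gz", ""),
]

buffer = io.StringIO()
write_fromfile_csv(records, buffer)
csv_text = buffer.getvalue()
command = fromfile_command("datasets.csv", "example-db.zip", ["dna,k=31,scaled=1000"])

print(csv_text)
print(" ".join(command))
```

Building the argument list separately from running it makes it easy to inspect or log the exact command before handing it to `subprocess.run`.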
Relevant issues:
standardized database build and release process
This is now in a new repository, https://github.com/sourmash-bio/database-releases.
The idea is that for each new release, the scripts and/or workflow for building that release will be placed in a new directory in `database-releases`, and then a new release of that repository will be made along with the database update. This will provide a Zenodo DOI for each database script update.
constructing Genbank databases (DNA)
We do this from `wort-genomes`, using collection manifests (to track the genomes) and assembly summary files (to identify signature names to put in the collections).
See https://github.com/sourmash-bio/database-releases/tree/main/genbank-2022.03
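One way the assembly-summary step could be sketched in Python: read an NCBI assembly summary file, derive a signature name per row, and write a picklist CSV that `sourmash sig check` can match against a collection. This is not the script in the linked directory. The "accession plus organism name" naming rule, the `ident` picklist column, the file names, and the trimmed three-column sample are all assumptions made for illustration, and the `sig check` flags are as I understand them from the sourmash docs.

```python
import csv
import io

# Tiny stand-in for an NCBI assembly summary file. Real files have many more
# columns; the accessions and organism names here are made up.
SAMPLE_SUMMARY = (
    "#   See README_assembly_summary.txt for a description of the columns\n"
    "# assembly_accession\ttaxid\torganism_name\n"
    "GCA_000000001.1\t1\tExample organism A\n"
    "GCA_000000002.1\t2\tExample organism B\n"
)


def read_assembly_summary(handle):
    """Yield one dict per data row, using the last '#' line as the header."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None:
            raise ValueError("no header line found before data rows")
        yield dict(zip(header, line.split("\t")))


def signature_name(row):
    """Assumed naming rule: '<accession> <organism name>'."""
    return f"{row['assembly_accession']} {row['organism_name']}"


def write_picklist(rows, out):
    """Write a one-column picklist CSV of accessions."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ident"])
    for row in rows:
        writer.writerow([row["assembly_accession"]])


def sig_check_command(collection, picklist_csv, manifest_out, missing_out):
    """Build (but do not run) a `sourmash sig check` argument list."""
    return [
        "sourmash", "sig", "check", collection,
        "--picklist", f"{picklist_csv}:ident:identprefix",
        "--save-manifest-matching", manifest_out,
        "--output-missing", missing_out,
    ]


rows = list(read_assembly_summary(io.StringIO(SAMPLE_SUMMARY)))
names = [signature_name(row) for row in rows]

picklist = io.StringIO()
write_picklist(rows, picklist)
picklist_text = picklist.getvalue()

command = sig_check_command("wort-genomes.zip", "picklist.csv",
                            "matching.mf.csv", "missing.csv")
print(names)
print(" ".join(command))
```

An empty "missing" output would indicate that every requested identifier was found in the collection, which is the verification use of `sig check` described earlier.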
constructing GTDB databases (DNA)
We do this from `wort-genomes`, using collection manifests (to track the genomes) and assembly summary files (to identify signature names to put in the collections).
See https://github.com/sourmash-bio/database-releases/tree/main/gtdb-rs207.genomic-reps and https://github.com/sourmash-bio/database-releases/tree/main/gtdb-rs207.genomic.
building taxonomy CSVs
Genbank
I updated the scripts from the previous lineage stuff to use the assembly summary files; scripts here: https://github.com/ctb/2022-assembly-summary-to-lineages.
GTDB
@taylorreiter provided an R script for producing taxonomy spreadsheets from GTDB's taxonomy TSVs, here: #1941 (comment)
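The R script itself is not reproduced here. As a stand-in, this is a minimal Python sketch of the same kind of conversion, from a GTDB taxonomy TSV (identifier, tab, semicolon-separated lineage) to a lineage CSV with one column per rank. Stripping the `RS_`/`GB_` prefix, keeping the `d__`-style rank prefixes, and the exact output column names are assumptions about what the taxonomy spreadsheet should look like; the sample accessions are made up.

```python
import csv
import io

# Output rank columns (assumption: the usual seven-rank lineage layout).
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]

# Two made-up rows in GTDB taxonomy TSV layout.
SAMPLE_TSV = (
    "RS_GCF_000000001.1\td__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
    "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli\n"
    "GB_GCA_000000002.1\td__Archaea;p__Halobacteriota;c__Methanosarcinia;"
    "o__Methanosarcinales;f__Methanosarcinaceae;g__Methanosarcina;"
    "s__Methanosarcina mazei\n"
)


def gtdb_line_to_row(line):
    """Turn one GTDB taxonomy line into [ident, rank1, ..., rank7]."""
    ident, taxonomy = line.rstrip("\n").split("\t")
    # Drop the RefSeq/GenBank source prefix so idents match accession-style names.
    if ident.startswith(("RS_", "GB_")):
        ident = ident[3:]
    names = taxonomy.split(";")
    if len(names) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} ranks for {ident}, got {len(names)}")
    return [ident] + names


def convert(tsv_handle, csv_handle):
    """Write a lineage CSV from a GTDB taxonomy TSV; return the row count."""
    writer = csv.writer(csv_handle, lineterminator="\n")
    writer.writerow(["ident"] + RANKS)
    count = 0
    for line in tsv_handle:
        if line.strip():
            writer.writerow(gtdb_line_to_row(line))
            count += 1
    return count


out = io.StringIO()
n_rows = convert(io.StringIO(SAMPLE_TSV), out)
lineage_csv = out.getvalue()
print(lineage_csv)
```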
Follow-up questions
Questions:
probably not worth the trouble, but it should be as simple as `sourmash sig cat OLD_DB -o NEW_DB`.
see #1847
planning for protein databases
Soon-ish we will be releasing protein databases. Building these is more difficult because we don't have them in `wort-genomes`. I think we are planning to use the `sourmash sketch fromfile` approach together with a custom workflow for building and/or retrieving the protein files. cc @bluegenes
Other future things to think about
#991 - using BD Bags, and/or the datasets tool, and/or supporting incremental download of data releases.
Consider just providing the .zip files, together with workflows for constructing the other files as needed? First raised in #1511.
Provide a list of available databases in a computationally accessible format (along with, presumably, tools for retrieving them?) - #1005