sourmash database construction - current status and future thoughts #2015

Open
ctb opened this issue May 1, 2022 · 2 comments

@ctb

ctb commented May 1, 2022

With the merge of StandaloneManifestIndex #1891 and sourmash sig check #1907, and the advent of sourmash sketch fromfile #1885 and associated scripts in https://github.com/ctb/2022-sourmash-sketchfrom/, a new day is dawning for sourmash database construction 🎉

tl;dr sig check and sketch fromfile

The two key commands introduced in sourmash v4.4.0 are sourmash sig check and sourmash sketch fromfile. sig check helps identify and retrieve the signatures that match lists of identifiers, while sketch fromfile helps coordinate the bulk construction of sketches.

Some functional examples of sourmash sketch fromfile are here. This includes examples of databases with private identifiers and also databases with NCBI-formatted identifiers.
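As a rough illustration of the kind of input sketch fromfile consumes, here is a minimal sketch that writes a CSV with the name, genome_filename, and protein_filename columns. The helper name write_fromfile_csv, the example accession, and the file path are all made up for illustration; check the sourmash documentation for the exact format your version expects.

```python
import csv
import io

# Column names as described in the sourmash docs for `sketch fromfile`;
# verify against the version you are running.
FROMFILE_COLUMNS = ["name", "genome_filename", "protein_filename"]


def write_fromfile_csv(records, out):
    """Write (name, genome_path, protein_path) tuples as a fromfile-style CSV.

    A path of None is written as an empty cell.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(FROMFILE_COLUMNS)
    for name, genome, protein in records:
        writer.writerow([name, genome or "", protein or ""])


# Hypothetical record: the accession and path below are placeholders.
buf = io.StringIO()
write_fromfile_csv(
    [("GCA_000000000.1 Example organism", "genomes/GCA_000000000.1.fna.gz", None)],
    buf,
)
print(buf.getvalue())
```

Leaving the protein cell empty here mirrors the case where only a genome file is available for a given name.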

sig check is used to find and extract signatures from wort collections based on identifier prefix matching, and can also be used to verify that all desired identifiers are in a database:

sourmash sig check <wort collection> \
    --picklist <picklist> \
    -o missing.csv \
    --save-manifest matching.csv
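The picklist given to sig check is a CSV file. A minimal sketch of generating one from a list of accessions follows, assuming a single column named ident; the column name, the helper name write_picklist, and the accessions are illustrative only. The --picklist argument then names the file, the column, and the match type (something like idents.csv:ident:identprefix for prefix matching, per the sourmash picklist documentation).

```python
import csv
import io


def write_picklist(accessions, out, column="ident"):
    """Write a one-column CSV of unique identifiers, preserving input order.

    Returns the number of identifiers written.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([column])
    seen = set()
    for acc in accessions:
        acc = acc.strip()
        if acc and acc not in seen:  # skip blanks and duplicates
            seen.add(acc)
            writer.writerow([acc])
    return len(seen)


# Placeholder accessions, not real records.
picklist = io.StringIO()
n_written = write_picklist(
    ["GCA_000000001.1", "GCA_000000002.1", "GCA_000000001.1", ""],
    picklist,
)
print(n_written)
print(picklist.getvalue())
```

Deduplicating up front keeps the count of requested identifiers comparable with the number of rows reported in missing.csv and matching.csv.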

This issue is a consolidation of unresolved elements from previous issues, as well as a reference point for closure of previous issues and PRs.

Relevant issues:

standardized database build and release process

This is now in a new repository, https://github.com/sourmash-bio/database-releases.

The idea is that for each new database release, the scripts and/or workflow for building it will be placed in a new directory in database-releases, and a new release of that repository will then be made along with the database update. This will provide a Zenodo DOI for each database script update.

constructing Genbank databases (DNA)

We do this from wort-genomes, using collection manifests (to track the genomes) and assembly summary files (to identify signature names to put in the collections).

See https://github.com/sourmash-bio/database-releases/tree/main/genbank-2022.03

constructing GTDB databases (DNA)

We do this from wort-genomes, using collection manifests (to track the genomes) and assembly summary files (to identify signature names to put in the collections).

See https://github.com/sourmash-bio/database-releases/tree/main/gtdb-rs207.genomic-reps and https://github.com/sourmash-bio/database-releases/tree/main/gtdb-rs207.genomic.

building taxonomy CSVs

Genbank

I updated the scripts from the previous lineage stuff to use the assembly summary files; scripts here: https://github.com/ctb/2022-assembly-summary-to-lineages.

GTDB

@taylorreiter provided an R script for producing taxonomy spreadsheets from GTDB's taxonomy TSVs, here: #1941 (comment)

Follow-up questions

Questions:

  • do we want to update old databases?

probably not worth the trouble, but it should be as simple as sourmash sig cat OLD_DB -o NEW_DB.

  • can we / should we provide metadata for databases?

see #1847

planning for protein databases

soon-ish we will be releasing protein databases. Building these is more difficult because the protein sequences are not in wort-genomes. I think we are planning to use the sourmash sketch fromfile approach together with a custom workflow for building and/or retrieving the protein files. cc @bluegenes

Other future things to think about

#991 - using BD Bags, and/or the datasets tool, and/or supporting incremental download of data releases.

Consider just providing the .zip files, together with workflows for constructing the other files as needed? First raised in #1511.

Provide a list of available databases in a computationally accessible format (along with, presumably, tools for retrieving them?) - #1005

@ctb

ctb commented May 1, 2022

more -

should we build RefSeq representative databases, per sourmash-bio/databases#13?

provide better database benchmarks - #2014

upgrading wort manifests - #1965

providing links in taxonomy databases - #1969

building less redundant databases with minimum set covers - #1852

old IMG databases that we actually could upgrade - #385

@bluegenes

Sketching protein databases using fromfile took ~15 hours and ~1 GB of RAM once all faa.gz files were available. Since not all genomes have protein FASTA files available, the workflow requires checking for empty or missing proteome files, and using prodigal to generate protein FASTA files from genomes as necessary.
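The empty/missing proteome check could look something like the sketch below. This is not the actual workflow: the function names, the `_protein.faa.gz` suffix, and the directory layout are assumptions, and the prodigal step itself is not shown; the code only decides which identifiers would need it.

```python
import gzip
import tempfile
from pathlib import Path


def has_protein_records(path):
    """Return True if path is a gzipped file whose content starts a FASTA record."""
    path = Path(path)
    if not path.is_file() or path.stat().st_size == 0:
        return False
    try:
        with gzip.open(path, "rt") as fp:
            return fp.read(1) == ">"
    except OSError:
        # not a valid gzip file; treat it the same as missing
        return False


def split_by_proteome(idents, protein_dir, suffix="_protein.faa.gz"):
    """Partition identifiers into (proteome available, needs gene prediction)."""
    ready, todo = [], []
    for ident in idents:
        target = Path(protein_dir) / f"{ident}{suffix}"
        (ready if has_protein_records(target) else todo).append(ident)
    return ready, todo


# Small demonstration with placeholder identifiers A, B, C.
with tempfile.TemporaryDirectory() as tmp:
    with gzip.open(Path(tmp) / "A_protein.faa.gz", "wt") as fp:
        fp.write(">p1\nMKV\n")
    with gzip.open(Path(tmp) / "B_protein.faa.gz", "wt") as fp:
        pass  # valid gzip container, but no records inside
    ready, todo = split_by_proteome(["A", "B", "C"], tmp)
    print("ready:", ready)
    print("needs prodigal:", todo)
```

Checking the decompressed content rather than the file size matters because a gzip file with no records inside is still a few bytes long on disk.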

note - suppressed record issue: #2037
