tax in zip files; supporting multiple taxonomies with command-line switches #2216

Open
ctb opened this issue Aug 16, 2022 · 2 comments


ctb commented Aug 16, 2022

In #2154, we've been talking about how to include taxonomic information in zipfiles, and I've been trying to figure out how that would work at the command line.

But all the discussion happened in a now-closed issue and a now-merged PR ;). So here's a new issue!

Comments copied over from various other issues and PRs -

From #2195 (comment), @bluegenes:

As discussed on slack :) -

re: SOURMASH-TAXONOMY — would you consider GTDB-TAXONOMY and NCBI-TAXONOMY instead, with the default being gtdb?

OR, somewhere in database info/metadata (which we don’t have yet, but have talked about), add the default for that database? In this case, I'm thinking about database info/metadata as database version (e.g. gtdb-rs207), sourmash signature version, creation date, etc -- and then adding default-taxonomy.

From #2012 (comment), I wrote:

Trying to figure out how distributing multiple taxonomies in a zip file would work at the command line.

The most obvious idea is:

sourmash tax classify -g gather.csv -t gtdb-xyz.zip --gtdb

which would load GTDB-TAXONOMY.csv from gtdb-xyz.zip, vs

sourmash tax classify -g gather.csv -t gtdb-xyz.zip --ncbi

which would load NCBI-TAXONOMY.csv from gtdb-xyz.zip.

Then we could potentially add --lins later on for #1813.

An alternative would be a single switch such as --tax-type ncbi, but I feel like --ncbi and --gtdb are probably simplest and easiest to remember.
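The proposed switches can be sketched as follows. This is a minimal illustration, not sourmash's actual code: `build_parser` and `load_taxonomy_rows` are hypothetical names, the `--gtdb`/`--ncbi` flags are only the ones proposed in this thread, and the member names GTDB-TAXONOMY.csv and NCBI-TAXONOMY.csv are taken from the comments above. The default of gtdb follows @bluegenes's suggestion.

```python
import argparse
import csv
import io
import zipfile

# Member names as proposed in this thread; not an established convention.
TAXONOMY_MEMBERS = {
    "gtdb": "GTDB-TAXONOMY.csv",
    "ncbi": "NCBI-TAXONOMY.csv",
}


def build_parser():
    """Hypothetical parser that models only the proposed switches."""
    parser = argparse.ArgumentParser(prog="tax-classify-sketch")
    parser.add_argument("-g", "--gather-csv", required=True)
    parser.add_argument("-t", "--taxonomy", required=True)
    # --gtdb and --ncbi write to the same destination and cannot be combined.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--gtdb", dest="tax_type", action="store_const", const="gtdb")
    group.add_argument("--ncbi", dest="tax_type", action="store_const", const="ncbi")
    # Assumed default, per the suggestion that gtdb be the default.
    parser.set_defaults(tax_type="gtdb")
    return parser


def load_taxonomy_rows(zip_path, tax_type):
    """Read the taxonomy CSV for `tax_type` out of the zip as a list of dicts."""
    member = TAXONOMY_MEMBERS[tax_type]
    with zipfile.ZipFile(zip_path) as zf:
        # zf.open raises KeyError if the zip does not contain this member.
        with zf.open(member) as fp:
            text = io.TextIOWrapper(fp, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

Adding --lins later would mean one more entry in `TAXONOMY_MEMBERS` and one more `store_const` flag in the group, which is part of why one flag per taxonomy stays manageable while the number of taxonomies is small.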


which received @bluegenes' endorsement:

I like --gtdb and --ncbi, especially since I can't see us integrating so many taxonomies that having an argument per tax type would be unwieldy.

--lins definitely useful when we get there!

Also sorta connects with #2186, searching/selecting on taxonomic lineages?

ctb added the taxonomy label Aug 16, 2022
ctb changed the title from "supporting multiple taxonomies with command-line switches" to "tax in zip files; supporting multiple taxonomies with command-line switches" Aug 31, 2022

ctb commented Sep 5, 2022

Found this comment from @bluegenes, buried in a different issue. It appears to be the original we-should-have-tax-in-zip idea:

Additional thought: It would be handy to include the taxonomy file inside each database file (possible with zip, sbt.zip, and sqldb and not needed for lca, right?). That would reduce extra download code and the need to link the correct taxonomy file with each database. For taxonomy functions with official databases, users could provide the database on the command line (instead of needing to find/download the taxonomy file), and we could automatically find it. I would imagine TAXONOMY.csv, complementary to manifest file. We would still allow alternate taxonomies, of course, but at least each db would come with the official set for that db?
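The lookup described in that comment could work roughly like this. It is a sketch under stated assumptions: `resolve_taxonomy_source` is a hypothetical helper, the bundled member name TAXONOMY.csv is the one floated above, and only plain zip databases are handled (sbt.zip and sqldb would need their own logic). An explicitly supplied taxonomy file takes precedence, so alternate taxonomies still work.

```python
import zipfile

# Name floated in the comment above; hypothetical, not an existing convention.
BUNDLED_TAXONOMY = "TAXONOMY.csv"


def resolve_taxonomy_source(database_path, explicit_taxonomy=None):
    """Decide where taxonomy should be read from.

    Returns (path, member): member is None when `path` is a standalone
    taxonomy file, or the member name when taxonomy is bundled in the zip.
    """
    # A user-supplied taxonomy always overrides the bundled one.
    if explicit_taxonomy is not None:
        return (explicit_taxonomy, None)

    # Otherwise look for the bundled file inside the database zip.
    if zipfile.is_zipfile(database_path):
        with zipfile.ZipFile(database_path) as zf:
            if BUNDLED_TAXONOMY in zf.namelist():
                return (database_path, BUNDLED_TAXONOMY)

    raise ValueError(
        f"no bundled {BUNDLED_TAXONOMY} in {database_path!r}; "
        "supply a taxonomy file explicitly"
    )
```

Failing loudly when nothing is bundled keeps the current workflow intact for databases built before any taxonomy file was included.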


ctb commented Oct 18, 2023

also ref nf-core/taxprofiler#404, where it would clearly be nice to have just one file containing sketches + taxonomy CSV.
