Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] add sourmash tax crosscheck to validate taxonomies <=> database contents #2362

Draft
wants to merge 10 commits into
base: latest
Choose a base branch
from

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Nov 14, 2022

Fixes #2361

This PR adds sourmash tax crosscheck, which validates taxonomies against databases and databases against taxonomies (every identifier has a lineage, every lineage has at least one identifier).

This also adds some of the remaining lingering useful functionality in lca index to the tax submodule, in pursuit of the eventual consolidation of lca utility/parsing functionality into tax per #2198

tax crosscheck currently verifies:

  • every identifier has a taxonomic lineage
  • every taxonomic lineage is present at least once in database
  • there are no duplicate identifiers in the database

TODO:

  • add tests
  • add output of bad or missing identifiers to CSV file

Example usage

all is well

% sourmash tax crosscheck --taxonomy gtdb-rs207.taxonomy.sqldb --db gtdb-rs207.genomic.k31.zip
...
loading taxonomies...
...loaded 317542 entries.
loading sourmash databases...
...found 317542 sketches across 1 files

no duplicate identifiers found!
all identifiers have a taxonomic lineage associated with them!
all taxonomy entries have a matching sketch in the databases!

missing identifiers in databases

% sourmash tax crosscheck --taxonomy gtdb-rs207.taxonomy.sqldb --db gtdb-rs207.genomic-reps.dna.k31.zip
...
loading taxonomies...
...loaded 317542 entries.
loading sourmash databases...
...found 65703 sketches across 1 files

no duplicate identifiers found!
all identifiers have a taxonomic lineage associated with them!
found 251839 identifiers in the taxonomy that are NOT in any database.

@codecov
Copy link

codecov bot commented Nov 14, 2022

Codecov Report

Merging #2362 (e5a0560) into latest (1c8b165) will increase coverage by 7.30%.
The diff coverage is 17.18%.

@@            Coverage Diff             @@
##           latest    #2362      +/-   ##
==========================================
+ Coverage   84.09%   91.39%   +7.30%     
==========================================
  Files         130      102      -28     
  Lines       15048    11627    -3421     
  Branches     2208     2237      +29     
==========================================
- Hits        12654    10627    -2027     
+ Misses       2098      703    -1395     
- Partials      296      297       +1     
Flag Coverage Δ
python 91.39% <17.18%> (-0.83%) ⬇️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/lca/lca_utils.py 88.70% <ø> (ø)
src/sourmash/tax/tax_utils.py 98.12% <ø> (ø)
src/sourmash/tax/__main__.py 69.62% <2.80%> (-24.02%) ⬇️
src/sourmash/cli/tax/crosscheck.py 90.00% <90.00%> (ø)
src/sourmash/cli/tax/__init__.py 100.00% <100.00%> (ø)
src/core/src/sketch/nodegraph.rs
src/core/src/ffi/index/revindex.rs
src/core/src/ffi/mod.rs
src/core/src/ffi/hyperloglog.rs
src/core/src/sketch/hyperloglog/mod.rs
... and 24 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@ctb ctb changed the title [WIP] start adding sourmash tax crosscheck [WIP] add sourmash tax crosscheck to validate taxonomies <=> database contents Nov 15, 2022
@ctb
Copy link
Contributor Author

ctb commented Nov 15, 2022

@bluegenes @taylorreiter any additional checks or functionality to add beyond outputting the problematic identifiers etc?

@taylorreiter
Copy link
Contributor

Is it too much trouble to check if they differ by version (e.g. *.1 vs *.2) or db genome was downloaded from (GCF* vs. GCA*)?

@ctb
Copy link
Contributor Author

ctb commented Nov 15, 2022

Is it too much trouble to check if they differ by version (e.g. *.1 vs *.2) or db genome was downloaded from (GCF* vs. GCA*)?

ooh good ideas (presumably borne from experience 👀 )

@bluegenes
Copy link
Contributor

Can you optionally output the missing identifiers to a file?

@ctb
Copy link
Contributor Author

ctb commented Nov 15, 2022

Can you optionally output the missing identifiers to a file?

yep, is in plan!

@ctb
Copy link
Contributor Author

ctb commented Nov 16, 2022

A couple updates in e15ff89 this fine morning -

Outputting details of missing identifiers

FIRST, sourmash tax crosscheck now supports --output-details,

% sourmash tax crosscheck --taxonomy xxx.csv --db gtdb-rs207.genomic-reps.dna.k31.zip -o details.csv --ignore-genbank

which gives a file like this:

ident problem details
0 GCF_001591565 missing_from_tax this identifier is in a database but not in a taxonomy
1 GCF_000730565 missing_from_tax this identifier is in a database but not in a taxonomy
2 GCA_002862805 missing_from_tax this identifier is in a database but not in a taxonomy
3 ...
4 GCA_900510935 missing_from_db this identifier is in a taxonomy but not in a database
5 GCA_900177625 missing_from_db this identifier is in a taxonomy but not in a database
6 GCA_013608825 missing_from_db this identifier is in a taxonomy but not in a database
7 ...

Catching GenBank specific issues

SECOND, as you might infer from the above command line, by default crosscheck will attempt to resolve GCF/GCA identifiers that are present in a database but missing from a taxonomy by switching them to GCA/GCF identifiers - see output for default,

% sourmash tax crosscheck --taxonomy xxx.csv --db gtdb-rs207.genomic-reps.dna.k31.zip -o details.csv 
...
loading taxonomies...
...loaded 317542 entries.
loading sourmash databases...
...found 65703 sketches across 1 files

Missing identifiers; checking GenBank to see if we need to demux
GCA_ and GCF_ identifiers.
Found 65703 GenBank identifiers!
Turn off GenBank demuxification with --ignore-genbank.

Resolved all but 0 missing identifiers with GenBank demuxification.
no duplicate identifiers found!
all identifiers have a taxonomic lineage associated with them!
found 251839 identifiers in the taxonomy that are NOT in any database.
Output 251839 entries to 'details.csv'

vs with --ignore-genbank:

% sourmash tax crosscheck --taxonomy xxx.csv --db gtdb-rs207.genomic-reps.dna.k31.zip -o details.csv  --ignore-genbank
...
loading taxonomies...
...loaded 317542 entries.
loading sourmash databases...
...found 65703 sketches across 1 files

** NOTE: missing identifiers, but --ignore-genbank is set;
** we are NOT checking to see if GCF <=> GCA conversion would resolve.
no duplicate identifiers found!
found 65703 distinct identifiers with no associated taxonomic lineage.
found 317542 identifiers in the taxonomy that are NOT in any database.
Output 383245 entries to 'details.csv'

Pondering whether we want to do something clever in the taxonomies module overall and support this kind of interconversion automatically. In #2049 I suggested just making a tax spreadsheet with both 🤷 .

I see three basic options -

  1. automatically support GCF <=> GCA conversion in taxonomic lookups by default (but support turning it off);
  2. detect it in crosscheck but leave it up to the user to fix;
  3. detect it in crosscheck and output a new taxonomy spreadsheet with updated identifiers.

My CS/correctness hat suggests (2) or (3), but my UX/make-life-easier hat is thinking (1). I might give implementing (1) a try and see how hard it is to debug/test :). There's also a question of whether it's a major version update to do that, will think on't.

@ctb
Copy link
Contributor Author

ctb commented Nov 16, 2022

Thinking more on this -

  • automatically support GCF <=> GCA conversion in taxonomic lookups by default (but support turning it off);

on the one hand, this is Doing Things Automatically For the User which I dislike, but... it should be easy to implement by adjusting the way MultiLineageDB does lookups. So I think I'll give it a try.

  • detect it in crosscheck but leave it up to the user to fix;

this is safest, but also most annoying for the user.

  • detect it in crosscheck and output a new taxonomy spreadsheet with updated identifiers.

this is a decent middle ground and localizes all of the nasty conversion code into a single, discrete step. We could also support this so maybe I'll implement it too?

@ctb
Copy link
Contributor Author

ctb commented Nov 16, 2022

Is it too much trouble to check if they differ by version (e.g. *.1 vs *.2) or db genome was downloaded from (GCF* vs. GCA*)?

so now I'm thinking that we might potentially output a "fixed" taxonomy that updates version numbers and provides GCF to GCA conversion.

@taylorreiter
Copy link
Contributor

Is it too much trouble to check if they differ by version (e.g. .1 vs .2) or db genome was downloaded from (GCF vs. GCA)?

so now I'm thinking that we might potentially output a "fixed" taxonomy that updates version numbers and provides GCF to GCA conversion.

my knee jerk was that that was a bad idea...but the taxonomy should be consistent between versions, refseq, and genbank so...shrug? you might get duplicates tho, and it would be good to know if there are duplicates from that sort of thing in the input db

@ctb
Copy link
Contributor Author

ctb commented Nov 17, 2022

so now I'm thinking that we might potentially output a "fixed" taxonomy that updates version numbers and provides GCF to GCA conversion.

my knee jerk was that that was a bad idea...but the taxonomy should be consistent between versions, refseq, and genbank so...shrug? you might get duplicates tho, and it would be good to know if there are duplicates from that sort of thing in the input db

curious about the knee jerk reaction: is your concern about version numbers specifically, or the general update-the-taxonomy idea?

also, what do you mean by duplicates?

@taylorreiter
Copy link
Contributor

so now I'm thinking that we might potentially output a "fixed" taxonomy that updates version numbers and provides GCF to GCA conversion.

my knee jerk was that that was a bad idea...but the taxonomy should be consistent between versions, refseq, and genbank so...shrug? you might get duplicates tho, and it would be good to know if there are duplicates from that sort of thing in the input db

curious about the knee jerk reaction: is your concern about version numbers specifically, or the general update-the-taxonomy idea?

Ya the concern came from the content of a genome differing between one version of the genome and the next...but then I realized that doesn't really matter, because the taxonomy should be consistent between genome versions i think.

also, what do you mean by duplicates?

So where I encountered this was when I combined all the genomes from GTDB rs202 with all of the genomes from GTDB rs207. GTDB rs202 is not an exact subset of rs207 because some versions changed and because some genomes went from refseq to genbank and vice versa. So when I combined all of the rs202 with all of rs207, I ended up with duplicate sigs that had different names. (GCA_XXXXX -> GCF_XXXXX, and both being in my database)

does that make sense or should i try and explain it a different way?

@ctb
Copy link
Contributor Author

ctb commented Nov 17, 2022

so now I'm thinking that we might potentially output a "fixed" taxonomy that updates version numbers and provides GCF to GCA conversion.

my knee jerk was that that was a bad idea...but the taxonomy should be consistent between versions, refseq, and genbank so...shrug? you might get duplicates tho, and it would be good to know if there are duplicates from that sort of thing in the input db

curious about the knee jerk reaction: is your concern about version numbers specifically, or the general update-the-taxonomy idea?

Ya the concern came from the content of a genome differing between one version of the genome and the next...but then I realized that doesn't really matter, because the taxonomy should be consistent between genome versions i think.

this is certainly something we could check, too.

also, what do you mean by duplicates?

So where I encountered this was when I combined all the genomes from GTDB rs202 with all of the genomes from GTDB rs207. GTDB rs202 is not an exact subset of rs207 because some versions changed and because some genomes went from refseq to genbank and vice versa. So when I combined all of the rs202 with all of rs207, I ended up with duplicate sigs that had different names. (GCA_XXXXX -> GCF_XXXXX, and both being in my database)

does that make sense or should i try and explain it a different way?

makes perfect sense, thanks!

@ctb
Copy link
Contributor Author

ctb commented Nov 29, 2022

Dug into some accession differences a little bit because of the unicity project, and realized how messed up our GTDB and Genbank releases are, sigh.

It turns out we didn't harmonize GCA/GCF choices between GTDB rs207 and genbank-bacteria/genbank-archaea. So the issue of GCF vs GCA accessions raised in #2049 is really up close and personal. Double sigh.

Anyway, even after removing identifier versions,

  • we have 218,805 GTDB accessions that are not in the Genbank databases
  • we get down to 2104 GTDB accessions that are not in the Genbank database once we convert all GCA to GCF in both databases

also relevant to including taxonomies in zipfiles per #2216

So I will have to play around a bit with the code to see how I can make it most useful. Triple sigh.

@ctb
Copy link
Contributor Author

ctb commented Dec 16, 2022

This seems like a problem 🤔 - riffing off of #2407,

% sourmash tax crosscheck --taxonomy try2.lca.sql --database try2.lca.sql

== This is sourmash version 4.5.1.dev37+g2b087daf. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 19 entries.
loading sourmash databases...
...found 19 sketches across 1 files

Missing identifiers; checking GenBank to see if we need to demux
GCA_ and GCF_ identifiers.
Found 19 GenBank identifiers!
Turn off GenBank demuxification with --ignore-genbank.

no duplicate identifiers found!
found 19 distinct identifiers with no associated taxonomic lineage.
found 19 identifiers in the taxonomy that are NOT in any database.

I know why it's happening - the identifier handling in SQLite LCA databases is terrible - but whew, what a rat's nest.

@ctb
Copy link
Contributor Author

ctb commented Mar 25, 2023

see #2538 for some other things we could look at

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add tax crosscheck to compare databases and lineages for correctness
3 participants