[WIP] add `sourmash tax crosscheck` to validate taxonomies <=> database contents #2362

ctb · 2022-11-14T14:18:37Z

This PR adds sourmash tax crosscheck, which validates taxonomies against databases and databases against taxonomies (every identifier has a lineage, every lineage has at least one identifier).

This also adds some of the remaining lingering useful functionality in lca index to the tax submodule, in pursuit of the eventual consolidation of lca utility/parsing functionality into tax per #2198

tax crosscheck currently verifies:

every identifier has a taxonomic lineage
every taxonomic lineage is present at least once in database
there are no duplicate identifiers in the database

TODO:

add tests
add output of bad or missing identifiers to CSV file

Example usage

all is well

% sourmash tax crosscheck --taxonomy gtdb-rs207.taxonomy.sqldb --db gtdb-rs207.genomic.k31.zip
...
loading taxonomies...
...loaded 317542 entries.
loading sourmash databases...
...found 317542 sketches across 1 files

no duplicate identifiers found!
all identifiers have a taxonomic lineage associated with them!
all taxonomy entries have a matching sketch in the databases!

missing identifiers in databases

% sourmash tax crosscheck --taxonomy gtdb-rs207.taxonomy.sqldb --db gtdb-rs207.genomic-reps.dna.k31.zip
...
loading taxonomies...
...loaded 317542 entries.
loading sourmash databases...
...found 65703 sketches across 1 files

no duplicate identifiers found!
all identifiers have a taxonomic lineage associated with them!
found 251839 identifiers in the taxonomy that are NOT in any database.

codecov · 2022-11-14T14:27:12Z

Codecov Report

Merging #2362 (e5a0560) into latest (1c8b165) will increase coverage by 7.30%.
The diff coverage is 17.18%.

@@            Coverage Diff             @@
##           latest    #2362      +/-   ##
==========================================
+ Coverage   84.09%   91.39%   +7.30%     
==========================================
  Files         130      102      -28     
  Lines       15048    11627    -3421     
  Branches     2208     2237      +29     
==========================================
- Hits        12654    10627    -2027     
+ Misses       2098      703    -1395     
- Partials      296      297       +1

Flag	Coverage Δ
python	`91.39% <17.18%> (-0.83%)`	⬇️
rust	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/sourmash/lca/lca_utils.py	`88.70% <ø> (ø)`
src/sourmash/tax/tax_utils.py	`98.12% <ø> (ø)`
src/sourmash/tax/__main__.py	`69.62% <2.80%> (-24.02%)`	⬇️
src/sourmash/cli/tax/crosscheck.py	`90.00% <90.00%> (ø)`
src/sourmash/cli/tax/__init__.py	`100.00% <100.00%> (ø)`
src/core/src/sketch/nodegraph.rs
src/core/src/ffi/index/revindex.rs
src/core/src/ffi/mod.rs
src/core/src/ffi/hyperloglog.rs
src/core/src/sketch/hyperloglog/mod.rs
... and 24 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

…o add/tax_crosscheck

ctb · 2022-11-15T14:53:00Z

@bluegenes @taylorreiter any additional checks or functionality to add beyond outputting the problematic identifiers etc?

taylorreiter · 2022-11-15T15:04:19Z

Is it too much trouble to check if they differ by version (e.g. *.1 vs *.2) or db genome was downloaded from (GCF* vs. GCA*)?

ctb · 2022-11-15T15:08:00Z

Is it too much trouble to check if they differ by version (e.g. *.1 vs *.2) or db genome was downloaded from (GCF* vs. GCA*)?

ooh good ideas (presumably borne from experience 👀 )

bluegenes · 2022-11-15T15:42:07Z

Can you optionally output the missing identifiers to a file?

ctb · 2022-11-15T16:03:26Z

Can you optionally output the missing identifiers to a file?

yep, is in plan!

ctb · 2022-11-16T15:07:38Z

A couple updates in e15ff89 this fine morning -

Outputting details of missing identifiers

FIRST, sourmash tax crosscheck now supports --output-details,

% sourmash tax crosscheck --taxonomy xxx.csv --db gtdb-rs207.genomic-reps.dna.k31.zip -o details.csv --ignore-genbank

which gives a file like this:

	ident	problem	details
0	GCF_001591565	missing_from_tax	this identifier is in a database but not in a taxonomy
1	GCF_000730565	missing_from_tax	this identifier is in a database but not in a taxonomy
2	GCA_002862805	missing_from_tax	this identifier is in a database but not in a taxonomy
3	...
4	GCA_900510935	missing_from_db	this identifier is in a taxonomy but not in a database
5	GCA_900177625	missing_from_db	this identifier is in a taxonomy but not in a database
6	GCA_013608825	missing_from_db	this identifier is in a taxonomy but not in a database
7	...

Catching GenBank specific issues

SECOND, as you might infer from the above command line, by default crosscheck will attempt to resolve GCF/GCA identifiers that are present in a database but missing from a taxonomy by switching them to GCA/GCF identifiers - see output for default,

% sourmash tax crosscheck --taxonomy xxx.csv --db gtdb-rs207.genomic-reps.dna.k31.zip -o details.csv 
...
loading taxonomies...
...loaded 317542 entries.
loading sourmash databases...
...found 65703 sketches across 1 files

Missing identifiers; checking GenBank to see if we need to demux
GCA_ and GCF_ identifiers.
Found 65703 GenBank identifiers!
Turn off GenBank demuxification with --ignore-genbank.

Resolved all but 0 missing identifiers with GenBank demuxification.
no duplicate identifiers found!
all identifiers have a taxonomic lineage associated with them!
found 251839 identifiers in the taxonomy that are NOT in any database.
Output 251839 entries to 'details.csv'

vs with --ignore-genbank:

% sourmash tax crosscheck --taxonomy xxx.csv --db gtdb-rs207.genomic-reps.dna.k31.zip -o details.csv  --ignore-genbank
...
loading taxonomies...
...loaded 317542 entries.
loading sourmash databases...
...found 65703 sketches across 1 files

** NOTE: missing identifiers, but --ignore-genbank is set;
** we are NOT checking to see if GCF <=> GCA conversion would resolve.
no duplicate identifiers found!
found 65703 distinct identifiers with no associated taxonomic lineage.
found 317542 identifiers in the taxonomy that are NOT in any database.
Output 383245 entries to 'details.csv'

Pondering whether we want to do something clever in the taxonomies module overall and support this kind of interconversion automatically. In #2049 I suggested just making a tax spreadsheet with both 🤷 .

I see three basic options -

automatically support GCF <=> GCA conversion in taxonomic lookups by default (but support turning it off);
detect it in crosscheck but leave it up to the user to fix;
detect it in crosscheck and output a new taxonomy spreadsheet with updated identifiers.

My CS/correctness hat suggests (2) or (3), but my UX/make-life-easier hat is thinking (1). I might give implementing (1) a try and see how hard it is to debug/test :). There's also a question of whether it's a major version update to do that, will think on't.

ctb · 2022-11-16T15:26:23Z

Thinking more on this -

automatically support GCF <=> GCA conversion in taxonomic lookups by default (but support turning it off);

on the one hand, this is Doing Things Automatically For the User which I dislike, but... it should be easy to implement by adjusting the way MultiLineageDB does lookups. So I think I'll give it a try.

detect it in crosscheck but leave it up to the user to fix;

this is safest, but also most annoying for the user.

detect it in crosscheck and output a new taxonomy spreadsheet with updated identifiers.

this is a decent middle ground and localizes all of the nasty conversion code into a single, discrete step. We could also support this so maybe I'll implement it too?

ctb · 2022-11-16T16:17:57Z

Is it too much trouble to check if they differ by version (e.g. *.1 vs *.2) or db genome was downloaded from (GCF* vs. GCA*)?

so now I'm thinking that we might potentially output a "fixed" taxonomy that updates version numbers and provides GCF to GCA conversion.

taylorreiter · 2022-11-17T12:12:44Z

Is it too much trouble to check if they differ by version (e.g. .1 vs .2) or db genome was downloaded from (GCF vs. GCA)?

so now I'm thinking that we might potentially output a "fixed" taxonomy that updates version numbers and provides GCF to GCA conversion.

my knee jerk was that that was a bad idea...but the taxonomy should be consistent between versions, refseq, and genbank so...shrug? you might get duplicates tho, and it would be good to know if there are duplicates from that sort of thing in the input db

ctb · 2022-11-17T14:26:07Z

so now I'm thinking that we might potentially output a "fixed" taxonomy that updates version numbers and provides GCF to GCA conversion.

my knee jerk was that that was a bad idea...but the taxonomy should be consistent between versions, refseq, and genbank so...shrug? you might get duplicates tho, and it would be good to know if there are duplicates from that sort of thing in the input db

curious about the knee jerk reaction: is your concern about version numbers specifically, or the general update-the-taxonomy idea?

also, what do you mean by duplicates?

taylorreiter · 2022-11-17T14:33:22Z

so now I'm thinking that we might potentially output a "fixed" taxonomy that updates version numbers and provides GCF to GCA conversion.

my knee jerk was that that was a bad idea...but the taxonomy should be consistent between versions, refseq, and genbank so...shrug? you might get duplicates tho, and it would be good to know if there are duplicates from that sort of thing in the input db

curious about the knee jerk reaction: is your concern about version numbers specifically, or the general update-the-taxonomy idea?

Ya the concern came from the content of a genome differing between one version of the genome and the next...but then I realized that doesn't really matter, because the taxonomy should be consistent between genome versions i think.

also, what do you mean by duplicates?

So where I encountered this was when I combined all the genomes from GTDB rs202 with all of the genomes from GTDB rs207. GTDB rs202 is not an exact subset of rs207 because some versions changed and because some genomes went from refseq to genbank and vice versa. So when I combined all of the rs202 with all of rs207, I ended up with duplicate sigs that had different names. (GCA_XXXXX -> GCF_XXXXX, and both being in my database)

does that make sense or should i try and explain it a different way?

ctb · 2022-11-17T14:36:51Z

so now I'm thinking that we might potentially output a "fixed" taxonomy that updates version numbers and provides GCF to GCA conversion.

my knee jerk was that that was a bad idea...but the taxonomy should be consistent between versions, refseq, and genbank so...shrug? you might get duplicates tho, and it would be good to know if there are duplicates from that sort of thing in the input db

curious about the knee jerk reaction: is your concern about version numbers specifically, or the general update-the-taxonomy idea?

Ya the concern came from the content of a genome differing between one version of the genome and the next...but then I realized that doesn't really matter, because the taxonomy should be consistent between genome versions i think.

this is certainly something we could check, too.

also, what do you mean by duplicates?

So where I encountered this was when I combined all the genomes from GTDB rs202 with all of the genomes from GTDB rs207. GTDB rs202 is not an exact subset of rs207 because some versions changed and because some genomes went from refseq to genbank and vice versa. So when I combined all of the rs202 with all of rs207, I ended up with duplicate sigs that had different names. (GCA_XXXXX -> GCF_XXXXX, and both being in my database)

does that make sense or should i try and explain it a different way?

makes perfect sense, thanks!

ctb · 2022-11-29T01:31:29Z

Dug into some accession differences a little bit because of the unicity project, and realized how messed up our GTDB and Genbank releases are, sigh.

It turns out we didn't harmonize GCA/GCF choices between GTDB rs207 and genbank-bacteria/genbank-archaea. So the issue of GCF vs GCA accessions raised in #2049 is really up close and personal. Double sigh.

Anyway, even after removing identifier versions,

we have 218,805 GTDB accessions that are not in the Genbank databases
we get down to 2104 GTDB accessions that are not in the Genbank database once we convert all GCA to GCF in both databases

also relevant to including taxonomies in zipfiles per #2216

So I will have to play around a bit with the code to see how I can make it most useful. Triple sigh.

…o add/tax_crosscheck

ctb · 2022-12-16T14:47:56Z

This seems like a problem 🤔 - riffing off of #2407,

% sourmash tax crosscheck --taxonomy try2.lca.sql --database try2.lca.sql

== This is sourmash version 4.5.1.dev37+g2b087daf. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 19 entries.
loading sourmash databases...
...found 19 sketches across 1 files

Missing identifiers; checking GenBank to see if we need to demux
GCA_ and GCF_ identifiers.
Found 19 GenBank identifiers!
Turn off GenBank demuxification with --ignore-genbank.

no duplicate identifiers found!
found 19 distinct identifiers with no associated taxonomic lineage.
found 19 identifiers in the taxonomy that are NOT in any database.

I know why it's happening - the identifier handling in SQLite LCA databases is terrible - but whew, what a rat's nest.

ctb · 2023-03-25T14:58:04Z

see #2538 for some other things we could look at

start adding sourmash tax crosscheck

d1541d7

ctb added 3 commits November 15, 2022 06:07

add crosscheck functionality ;)

5041699

Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…

7cd9d45

…o add/tax_crosscheck

Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…

ae0310e

…o add/tax_crosscheck

ctb changed the title ~~[WIP] start adding sourmash tax crosscheck~~ [WIP] add sourmash tax crosscheck to validate taxonomies <=> database contents Nov 15, 2022

add genbank awareness and detail output

e15ff89

ctb added 2 commits November 17, 2022 10:56

add --ignore-genbank

c0a8661

minor fixes ;)

0762657

ctb added 3 commits November 28, 2022 18:02

Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…

bb2d439

…o add/tax_crosscheck

Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…

0f32f3d

…o add/tax_crosscheck

Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…

e5a0560

…o add/tax_crosscheck

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] add `sourmash tax crosscheck` to validate taxonomies <=> database contents #2362

[WIP] add `sourmash tax crosscheck` to validate taxonomies <=> database contents #2362

ctb commented Nov 14, 2022 •

edited

Loading

codecov bot commented Nov 14, 2022 •

edited

Loading

ctb commented Nov 15, 2022

taylorreiter commented Nov 15, 2022

ctb commented Nov 15, 2022

bluegenes commented Nov 15, 2022

ctb commented Nov 15, 2022

ctb commented Nov 16, 2022 •

edited

Loading

ctb commented Nov 16, 2022

ctb commented Nov 16, 2022

taylorreiter commented Nov 17, 2022

ctb commented Nov 17, 2022

taylorreiter commented Nov 17, 2022

ctb commented Nov 17, 2022

ctb commented Nov 29, 2022

ctb commented Dec 16, 2022

ctb commented Mar 25, 2023

[WIP] add sourmash tax crosscheck to validate taxonomies <=> database contents #2362

Are you sure you want to change the base?

[WIP] add sourmash tax crosscheck to validate taxonomies <=> database contents #2362

Conversation

ctb commented Nov 14, 2022 • edited Loading

Example usage

all is well

missing identifiers in databases

codecov bot commented Nov 14, 2022 • edited Loading

Codecov Report

ctb commented Nov 15, 2022

taylorreiter commented Nov 15, 2022

ctb commented Nov 15, 2022

bluegenes commented Nov 15, 2022

ctb commented Nov 15, 2022

ctb commented Nov 16, 2022 • edited Loading

Outputting details of missing identifiers

Catching GenBank specific issues

ctb commented Nov 16, 2022

ctb commented Nov 16, 2022

taylorreiter commented Nov 17, 2022

ctb commented Nov 17, 2022

taylorreiter commented Nov 17, 2022

ctb commented Nov 17, 2022

ctb commented Nov 29, 2022

ctb commented Dec 16, 2022

ctb commented Mar 25, 2023

[WIP] add `sourmash tax crosscheck` to validate taxonomies <=> database contents #2362

[WIP] add `sourmash tax crosscheck` to validate taxonomies <=> database contents #2362

ctb commented Nov 14, 2022 •

edited

Loading

codecov bot commented Nov 14, 2022 •

edited

Loading

ctb commented Nov 16, 2022 •

edited

Loading