[WIP] add sourmash tax crosscheck to validate taxonomies <=> database contents #2362
Codecov Report
@@ Coverage Diff @@
## latest #2362 +/- ##
==========================================
+ Coverage 84.09% 91.39% +7.30%
==========================================
Files 130 102 -28
Lines 15048 11627 -3421
Branches 2208 2237 +29
==========================================
- Hits 12654 10627 -2027
+ Misses 2098 703 -1395
- Partials 296 297 +1
@bluegenes @taylorreiter any additional checks or functionality to add beyond outputting the problematic identifiers etc?
Is it too much trouble to check if they differ by version (e.g.
ooh good ideas (presumably born from experience 👀)
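A minimal sketch of the version check suggested above, in plain Python rather than sourmash's own code. The function names (`split_version`, `version_only_mismatches`) and the example accessions are hypothetical and only illustrate the idea: an identifier that is missing from the taxonomy as written may still match once the trailing `.N` version is ignored.

```python
def split_version(ident):
    """Split 'GCF_000005845.2' into ('GCF_000005845', '2').

    The version part is '' when the identifier has no '.N' suffix.
    """
    base, _, version = ident.partition(".")
    return base, version


def version_only_mismatches(db_idents, tax_idents):
    """Return (db_ident, tax_ident) pairs that differ only by version."""
    # Index taxonomy identifiers by their version-less form.
    tax_by_base = {}
    for ident in tax_idents:
        tax_by_base.setdefault(split_version(ident)[0], []).append(ident)

    tax_set = set(tax_idents)
    pairs = []
    for ident in sorted(db_idents):
        if ident in tax_set:
            continue  # exact match, nothing to report
        base, _ = split_version(ident)
        for candidate in sorted(tax_by_base.get(base, [])):
            pairs.append((ident, candidate))
    return pairs


# Illustrative values only, not taken from any real database.
db = {"GCF_000005845.2", "GCA_000008865.2"}
tax = {"GCF_000005845.1", "GCA_000008865.2"}
result = version_only_mismatches(db, tax)
print(result)
```

Identifiers with no version-less match at all are simply not returned here, so they would still need to be reported as plainly missing.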
Can you optionally output the missing identifiers to a file?
yep, it's in the plan!
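One way the "missing identifiers to a file" request could look, as a hedged sketch. The single `ident` column is an assumption for illustration, not the format this PR writes, and `write_missing` is a hypothetical helper name.

```python
import csv
import io


def write_missing(missing_idents, fp):
    """Write one missing identifier per row to an open text file handle."""
    writer = csv.writer(fp)
    writer.writerow(["ident"])  # hypothetical column name
    for ident in sorted(missing_idents):
        writer.writerow([ident])
    return len(missing_idents)


# Illustrative values only; an in-memory buffer stands in for a real file.
buffer = io.StringIO()
count = write_missing({"GCF_000005845.2", "GCA_000008865.2"}, buffer)
output = buffer.getvalue()
print(output)
```

For a real file, pass a handle opened with `open(path, "w", newline="")`, which is what the `csv` module documentation recommends.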
A couple updates in e15ff89 this fine morning -
Outputting details of missing identifiers
FIRST,
which gives a file like this:
Catching GenBank specific issues
SECOND, as you might infer from the above command line, by default
vs with
Pondering whether we want to do something clever in the taxonomies module overall and support this kind of interconversion automatically. In #2049 I suggested just making a tax spreadsheet with both 🤷 . I see three basic options -
My CS/correctness hat suggests (2) or (3), but my UX/make-life-easier hat is thinking (1). I might give implementing (1) a try and see how hard it is to debug/test :). There's also a question of whether it's a major version update to do that, will think on't.
Thinking more on this -
on the one hand, this is Doing Things Automatically For the User which I dislike, but... it should be easy to implement by adjusting the way
this is safest, but also most annoying for the user.
this is a decent middle ground and localizes all of the nasty conversion code into a single, discrete step. We could also support this so maybe I'll implement it too?
so now I'm thinking that we might potentially output a "fixed" taxonomy that updates version numbers and provides GCF to GCA conversion.
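A rough sketch of what producing a "fixed" taxonomy could involve, under stated assumptions: the taxonomy is a plain dict of identifier to lineage, matching is done on a key with the version dropped and `GCF_` rewritten to `GCA_`, and `normalize` / `fix_taxonomy` are hypothetical names, not functions from sourmash.

```python
def normalize(ident):
    """Drop the '.N' version and rewrite a GCF_ prefix to GCA_."""
    base = ident.split(".", 1)[0]
    if base.startswith("GCF_"):
        base = "GCA_" + base[len("GCF_"):]
    return base


def fix_taxonomy(taxonomy, db_idents):
    """Re-key lineages onto the identifiers actually present in the database.

    Returns (fixed, unmatched): a dict of database identifier to lineage, and
    the database identifiers that still have no lineage after normalization.
    """
    by_key = {}
    for ident, lineage in taxonomy.items():
        # First entry wins if several taxonomy rows collapse to one key.
        by_key.setdefault(normalize(ident), lineage)

    fixed, unmatched = {}, []
    for ident in sorted(db_idents):
        lineage = by_key.get(normalize(ident))
        if lineage is None:
            unmatched.append(ident)
        else:
            fixed[ident] = lineage
    return fixed, unmatched


# Illustrative values only.
taxonomy = {"GCF_000005845.1": "lineage-A"}
db = ["GCA_000005845.2", "GCA_999999999.1"]
fixed, unmatched = fix_taxonomy(taxonomy, db)
print(fixed, unmatched)
```

The "first entry wins" choice silently hides conflicting lineages; a real implementation would probably want to report those instead.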
my knee-jerk reaction was that that was a bad idea...but the taxonomy should be consistent between versions, RefSeq, and GenBank so...shrug? you might get duplicates tho, and it would be good to know if there are duplicates from that sort of thing in the input db
curious about the knee-jerk reaction: is your concern about version numbers specifically, or the general update-the-taxonomy idea? also, what do you mean by duplicates?
Ya the concern came from the content of a genome differing between one version of the genome and the next...but then I realized that doesn't really matter, because the taxonomy should be consistent between genome versions, I think.
So where I encountered this was when I combined all the genomes from GTDB rs202 with all of the genomes from GTDB rs207. GTDB rs202 is not an exact subset of rs207 because some versions changed and because some genomes went from RefSeq to GenBank and vice versa. So when I combined all of rs202 with all of rs207, I ended up with duplicate sigs that had different names (GCA_XXXXX -> GCF_XXXXX, with both being in my database). Does that make sense or should I try and explain it a different way?
this is certainly something we could check, too.
makes perfect sense, thanks!
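The duplicate situation described above (GCA_XXXXX and GCF_XXXXX both ending up in one database) could be detected along these lines. This is a sketch, not code from the PR: it assumes paired accessions share the numeric part after the prefix, and `accession_key` / `find_gca_gcf_duplicates` are hypothetical names.

```python
from collections import defaultdict


def accession_key(ident):
    """Reduce 'GCA_000005845.2' or 'GCF_000005845.1' to '000005845'."""
    base = ident.split(".", 1)[0]
    prefix, _, number = base.partition("_")
    if prefix in ("GCA", "GCF") and number:
        return number
    return base  # not a GCA/GCF accession; leave it untouched


def find_gca_gcf_duplicates(idents):
    """Group identifiers that collapse to the same key; keep groups of 2+."""
    groups = defaultdict(set)
    for ident in idents:
        groups[accession_key(ident)].add(ident)
    return {key: sorted(members)
            for key, members in groups.items() if len(members) > 1}


# Illustrative values only.
combined = ["GCA_000005845.2", "GCF_000005845.2", "GCA_000008865.2"]
dups = find_gca_gcf_duplicates(combined)
print(dups)
```

Because versions are dropped too, this also groups two versions of the same accession, which may or may not count as a duplicate depending on the use case.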
Dug into some accession differences a little bit because of the unicity project, and realized how messed up our GTDB and GenBank releases are, sigh. It turns out we didn't harmonize GCA/GCF choices between GTDB rs207 and genbank-bacteria/genbank-archaea. So the issue of GCF vs GCA accessions raised in #2049 is really up close and personal. Double sigh. Anyway, even after removing identifier versions,
also relevant to including taxonomies in zipfiles per #2216. So I will have to play around a bit with the code to see how I can make it most useful. Triple sigh.
This seems like a problem 🤔 - riffing off of #2407,
I know why it's happening - the identifier handling in SQLite LCA databases is terrible - but whew, what a rat's nest.
see #2538 for some other things we could look at
Fixes #2361
This PR adds sourmash tax crosscheck, which validates taxonomies against databases and databases against taxonomies (every identifier has a lineage, every lineage has at least one identifier).
This also adds some of the remaining useful functionality in lca index to the tax submodule, in pursuit of the eventual consolidation of lca utility/parsing functionality into tax per #2198.
tax crosscheck currently verifies:
TODO:
Example usage
all is well
missing identifiers in databases
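The two-way check from the PR description (every identifier has a lineage, every lineage has at least one identifier) can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the implementation in this PR: the taxonomy is modeled as a dict of identifier to lineage string, and `crosscheck` is a hypothetical function name.

```python
def crosscheck(db_idents, taxonomy):
    """Compare database identifiers against a taxonomy, in both directions.

    Returns (no_lineage, unused_lineages):
      no_lineage      - database identifiers with no taxonomy entry
      unused_lineages - lineages with no identifier present in the database
    """
    db_set = set(db_idents)
    no_lineage = sorted(db_set - set(taxonomy))

    all_lineages = set()
    lineages_seen = set()
    for ident, lineage in taxonomy.items():
        all_lineages.add(lineage)
        if ident in db_set:
            lineages_seen.add(lineage)
    unused_lineages = sorted(all_lineages - lineages_seen)
    return no_lineage, unused_lineages


# Illustrative values only.
taxonomy = {
    "GCA_000000001.1": "lineage-A",
    "GCA_000000002.1": "lineage-B",
    "GCA_000000003.1": "lineage-B",
}
db = ["GCA_000000001.1", "GCA_000000004.1"]
no_lineage, unused_lineages = crosscheck(db, taxonomy)
print(no_lineage, unused_lineages)
```

Both lists being empty would correspond to the "all is well" case; anything else is a candidate for the missing-identifier report.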