25k signatures #4
Comments
yep, Donovan pointed that out to me this morning! Working on it.
The 25k dereplicated genomes seem like a good, fairly small set to use, though. More tomorrow.
I think a small index would be a great idea, because the "raw" GTDB contains about 10% E. coli :) Just for completeness, we recently dereplicated the GTDB automatically (in contrast to the manual approach of Parks et al.), see here. Classification accuracy was near-identical.
Closing in favor of sourmash-bio/sourmash#2015 and, well, many other things, but sourmash-bio/sourmash#1941 too.
hi @ctb,
I think the "real" GTDB has about 150k genomes, and 25k genomes sounds like a dereplicated set used by the GTDB toolkit, or am I missing something here? Maybe the 25k genomes correspond to this subset by the same group around Donovan Parks?
kind regards,
Adrian