25k signatures #4

Closed
phiweger opened this issue Dec 30, 2019 · 3 comments

Comments

@phiweger

hi @ctb,

I think the "real" GTDB has about 150k genomes, and 25k genomes sounds like a dereplicated set used by the GTDB toolkit, or am I missing something here? Maybe the 25k genomes correspond to this subset by the same group around Donovan Parks?

kind regards,
Adrian

@ctb
Contributor

ctb commented Dec 30, 2019 via email

@phiweger
Author

I think a small index would be a great idea, because the "raw" GTDB contains about 10% E. coli :)

Just for completeness, we recently dereplicated the GTDB non-manually (compared with Parks et al.), see here; classification accuracy was near-identical.

@ctb
Contributor

ctb commented May 1, 2022

closing in favor of sourmash-bio/sourmash#2015 and, well, many other things, but sourmash-bio/sourmash#1941 too.

@ctb ctb closed this as completed May 1, 2022