enable search via client/server or IPC interface #1484
Comments
https://greyhound.sourmash.bio/ might be a barebones PoC for this, using HTTP endpoints.
(jo -> https://jpmens.net/2016/03/05/a-shell-command-to-create-json-jo/) Adapted from the There is no error handling, need to receive a |
it's not pretty but it kinda works? https://github.com/ctb/2021-sourmash-jsonrpc cc @phiweger |
ah that is awesome! |
:) On reflection, I think the main advantage here is going to be for LCA databases, since they are the major "big memory/long load time" databases at the moment. PR #1644 supports loading them, but none of the LCA specific commands work - but that's straightforward to fix if you would like me to do so. Just lmk. |
yep - searching against
|
searching after database load, first search takes ~1m: for the first search, the server takes about a minute to reconstruct all 280,000 signatures.
second search takes a second: once the signatures are reconstructed, 💥 ~1 second to do search 🎉
|
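The load-once, query-many pattern behind those timings can be sketched with a small long-lived HTTP server. This is only an illustration using the Python standard library, not the code in the linked repositories: the `DATABASE` contents, the `/search` path, the JSON request shape, and the `search`, `start_server`, and `remote_search` helpers are all hypothetical.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical in-memory "database": signature name -> set of hashes.
# A real server would pay the expensive load here, once, at startup.
DATABASE = {"genomeA": {1, 2, 3, 4}, "genomeB": {3, 4, 5, 6}}


def search(query_hashes, threshold):
    """Return hits whose Jaccard similarity to the query meets the threshold."""
    query = set(query_hashes)
    hits = []
    for name, hashes in DATABASE.items():
        union = query | hashes
        score = len(query & hashes) / len(union) if union else 0.0
        if score >= threshold:
            hits.append({"name": name, "jaccard": score})
    return sorted(hits, key=lambda hit: -hit["jaccard"])


class SearchHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumed request body: {"hashes": [...], "threshold": 0.1}
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        hits = search(request["hashes"], request.get("threshold", 0.1))
        body = json.dumps(hits).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging in this sketch


def start_server(port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), SearchHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def remote_search(port, hashes, threshold=0.1):
    """Client side: send the query hashes and return the decoded hits."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
    try:
        payload = json.dumps({"hashes": list(hashes), "threshold": threshold})
        conn.request("POST", "/search", body=payload.encode("utf-8"),
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()
```

Because the server process stays alive, every client call after startup skips the load step, which is where the minute-versus-second difference would come from in a setup like this.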
what do you mean by "reconstruct" @ctb ? I noticed this difference btw/ 1st and 2nd search just loading the db into memory using the API, too. reconstruct == load into mem? |
LCA databases don't contain complete signatures - they store hashes, with an associated list of signature names. This allows sourmash to reconstruct signatures from the database. This reconstruction is done once and then all the signatures are cached. And, when you do it for 10s of thousands of signatures, the reconstruction takes a bit of time :). The database was designed the way it was because when we implemented the LCA module functionality, we were focused on applying LCA functionality to individual k-mers and doing various taxonomic things that way. We then later decided to make these databases into databases that supported search and gather as well, and had to retrofit signature recreation. Now that we have random-access zip databases with manifests, as well as different (and better) ways of doing taxonomy, I'll probably re-engineer the LCA databases to simply point to existing signatures. I expect that'll speed things up ;). See #1591 for the issue. |
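As a rough sketch of the reconstruction described above: the toy index below stores only a hash-to-names mapping, rebuilds per-signature hash sets by inverting it on first use, and caches the result. `ToyLCAIndex` and its methods are hypothetical names for illustration and are not sourmash's actual LCA database class or API.

```python
from collections import defaultdict


class ToyLCAIndex:
    """Hypothetical stand-in for an LCA-style database (not sourmash's class)."""

    def __init__(self):
        # hash value -> set of signature names that contain that hash
        self.hash_to_names = defaultdict(set)
        self._cache = None  # reconstructed signatures, filled lazily

    def insert(self, name, hashes):
        for h in hashes:
            self.hash_to_names[h].add(name)
        self._cache = None  # any insert invalidates the reconstruction

    def signatures(self):
        """Rebuild name -> set of hashes once, then reuse the cached result."""
        if self._cache is None:
            rebuilt = defaultdict(set)
            for h, names in self.hash_to_names.items():
                for name in names:
                    rebuilt[name].add(h)
            self._cache = dict(rebuilt)
        return self._cache

    def search(self, query_hashes, threshold=0.1):
        """Jaccard search over the reconstructed signatures, best hit first."""
        query = set(query_hashes)
        results = []
        for name, hashes in self.signatures().items():
            union = len(query | hashes)
            jaccard = len(query & hashes) / union if union else 0.0
            if jaccard >= threshold:
                results.append((name, jaccard))
        return sorted(results, key=lambda r: -r[1])
```

The first call to `search` pays for the inversion over every stored hash; later calls reuse the cache, which mirrors the slow-first, fast-second behavior reported earlier in the thread.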
err, question for @phiweger - do you use/are you interested in using the (1) taxonomic functions on LCA databases, (2) the in-memory aspects of the databases, or (3) both? |
(3) of course ;) |
🙄 😆 well, hopefully we can convince you that the new |
the search and gather functionality is now sufficiently well laid out for us to make it work over remote connections in a nice standard way...