enable search via client/server or IPC interface #1484

Open
ctb opened this issue Apr 24, 2021 · 11 comments

@ctb
Contributor

ctb commented Apr 24, 2021

the search and gather functionality is now sufficiently well laid out for us to make it work over remote connections in a nice standard way...

@luizirber
Member

luizirber commented Apr 25, 2021

https://greyhound.sourmash.bio/ might be a barebones PoC for this, using HTTP endpoints.

curl --data-binary @<(jo similarity=false threshold=0.5 [email protected]) https://greyhound.sourmash.bio/search

(jo -> https://jpmens.net/2016/03/05/a-shell-command-to-create-json-jo/)

Adapted from the gather example from this tweet:
curl -X POST --data-binary "@greyhound.sig" https://greyhound.sourmash.bio/gather

There is no error handling, and it needs to receive a k=21, scaled=1000 sig... it's a PoC =]
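
For anyone who would rather script this than use curl and jo, here is a minimal Python sketch of the same POST to the search endpoint. The `similarity` and `threshold` fields come from the command above. The `signature` payload key, the JSON content type, and the `build_search_request` helper are assumptions for illustration, not a documented greyhound API. The request is only built here, not sent.

```python
import json
import urllib.request

GREYHOUND_SEARCH_URL = "https://greyhound.sourmash.bio/search"


def build_search_request(signature, threshold=0.5, similarity=False,
                         url=GREYHOUND_SEARCH_URL):
    """Build (but do not send) a JSON POST request for the search endpoint.

    `signature` is the parsed JSON content of a sourmash signature file.
    The "signature" payload key is an assumption, not a confirmed field name.
    """
    payload = {
        "similarity": similarity,
        "threshold": threshold,
        "signature": signature,  # hypothetical key name
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # Assumed content type; the curl call above does not set one explicitly.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To actually send it, pass the result to `urllib.request.urlopen(req)`. Since the PoC has no error handling, wrapping that call in your own `try`/`except` is probably wise.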

@ctb
Contributor Author

ctb commented Jun 27, 2021

it's not pretty but it kinda works? https://github.com/ctb/2021-sourmash-jsonrpc

cc @phiweger

@phiweger

ah that is awesome!

@ctb
Contributor Author

ctb commented Jun 27, 2021

> ah that is awesome!

:)

On reflection, I think the main advantage here is going to be for LCA databases, since they are the major "big memory/long load time" databases at the moment. PR #1644 supports loading them, but none of the LCA-specific commands work yet. That's straightforward to fix if you would like me to do so. Just lmk.

@ctb
Contributor Author

ctb commented Jun 28, 2021

yep - searching against gtdb-rs202.genomic-reps.k31.lca.json.gz (48,000 genomes) is sub-second after it is loaded and the signatures are constructed. w00t!

% time sourmash search podar-ref/1.sig.gz -k 31 http://localhost:5000

== This is sourmash version 4.1.3.dev16+g9dbd8b5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting specified query k=31
loaded query: CP001941.1 Aciduliprofundum bo... (k=31, DNA)
loaded 1 databases.

1 matches:
similarity   match
----------   -----
100.0%       GCF_000025665.1 Aciduliprofundum boonei T469 strain=T469,...

real    0m0.769s
user    0m0.569s
sys     0m0.182s

@ctb
Contributor Author

ctb commented Jun 28, 2021

searching gtdb-rs202.genomic.k31.lca.json.gz - 280,000 signatures

after database load, first search takes ~1m

for the first search, the server takes about a minute to reconstruct all 280,000 signatures:

% time sourmash search podar-ref/1.sig.gz -k 31 http://localhost:5000

== This is sourmash version 4.1.3.dev16+g9dbd8b5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting specified query k=31
loaded query: CP001941.1 Aciduliprofundum bo... (k=31, DNA)
loaded 1 databases.

2 matches:
similarity   match
----------   -----
100.0%       GCF_000025665.1 Aciduliprofundum boonei T469 strain=T469,...
 95.0%       GCA_013329755.1 Aciduliprofundum boonei, ASM1332975v1

real    1m7.083s
user    0m0.612s
sys     0m0.224s

second search takes a second

once the signatures are reconstructed, 💥 ~1 second to do search 🎉

% time sourmash search podar-ref/1.sig.gz -k 31 http://localhost:5000

== This is sourmash version 4.1.3.dev16+g9dbd8b5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting specified query k=31
loaded query: CP001941.1 Aciduliprofundum bo... (k=31, DNA)
loaded 1 databases.

2 matches:
similarity   match
----------   -----
100.0%       GCF_000025665.1 Aciduliprofundum boonei T469 strain=T469,...
 95.0%       GCA_013329755.1 Aciduliprofundum boonei, ASM1332975v1

real    0m1.065s
user    0m0.615s
sys     0m0.222s

@phiweger

what do you mean by "reconstruct", @ctb? I noticed this difference between the 1st and 2nd search when just loading the db into memory using the API, too. does reconstruct == load into mem?

@ctb
Contributor Author

ctb commented Jun 28, 2021

LCA databases don't contain complete signatures - they store hashes, with an associated list of signature names. This allows sourmash to reconstruct signatures from the database. This reconstruction is done once and then all the signatures are cached.

And, when you do it for tens of thousands of signatures, the reconstruction takes a bit of time :).
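
A toy sketch of that reconstruct-once-then-cache idea, in case it helps. This is not the sourmash implementation: the `HASH_TO_NAMES` table, the function names, and the hash values are all made up, and plain Jaccard over hash sets stands in for the real search.

```python
from collections import defaultdict
from functools import lru_cache

# Toy stand-in for an LCA-style database: each hash maps to the names of
# the signatures that contain it. Names and hash values are made up.
HASH_TO_NAMES = {
    101: ["genome_A", "genome_B"],
    202: ["genome_A"],
    303: ["genome_B", "genome_C"],
}


@lru_cache(maxsize=None)
def reconstruct_all():
    """Invert hash -> names into name -> frozenset of hashes (computed once)."""
    name_to_hashes = defaultdict(set)
    for hashval, names in HASH_TO_NAMES.items():
        for name in names:
            name_to_hashes[name].add(hashval)
    return {name: frozenset(h) for name, h in name_to_hashes.items()}


def jaccard_search(query_hashes, threshold=0.0):
    """Return (similarity, name) pairs at or above threshold, best first."""
    query = frozenset(query_hashes)
    results = []
    for name, hashes in reconstruct_all().items():
        union = len(query | hashes)
        similarity = len(query & hashes) / union if union else 0.0
        if similarity >= threshold:
            results.append((similarity, name))
    return sorted(results, reverse=True)
```

The first call to `jaccard_search` pays for the inversion over every hash in the table. Later calls reuse the cached result, which matches the slow-first-search, fast-second-search pattern in the timings above.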

The database was designed the way it was because when we implemented the LCA module functionality, we were focused on applying LCA functionality to individual k-mers and doing various taxonomic things that way. We later decided to make these databases support search and gather as well, and had to retrofit signature recreation.

Now that we have random-access zip databases with manifests, as well as different (and better) ways of doing taxonomy, I'll probably re-engineer the LCA databases to simply point to existing signatures. I expect that'll speed things up ;). See #1591 for the issue.
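
To illustrate why pointing at existing signatures could be faster, here is a rough sketch of random access into a zip archive through a manifest. The layout is hypothetical (a `manifest.json` mapping names to member paths, with signatures stored as JSON lists of hashes) and is not the actual sourmash zip format.

```python
import io
import json
import zipfile

MANIFEST_NAME = "manifest.json"  # hypothetical file name for this sketch


def build_zip(signatures):
    """Write each signature to its own zip member plus a manifest; return bytes."""
    buffer = io.BytesIO()
    manifest = {}
    with zipfile.ZipFile(buffer, "w") as zf:
        for name, hashes in signatures.items():
            member = f"signatures/{name}.json"
            zf.writestr(member, json.dumps(sorted(hashes)))
            manifest[name] = member
        zf.writestr(MANIFEST_NAME, json.dumps(manifest))
    return buffer.getvalue()


def load_one(zip_bytes, name):
    """Read the manifest, then open only the member for the requested name."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        manifest = json.loads(zf.read(MANIFEST_NAME))
        return set(json.loads(zf.read(manifest[name])))
```

With this kind of layout, fetching one signature reads the manifest and a single member, so nothing has to be rebuilt from a hash table first.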

@ctb
Contributor Author

ctb commented Jun 28, 2021

err, question for @phiweger - do you use/are you interested in using the (1) taxonomic functions on LCA databases, (2) the in-memory aspects of the databases, or (3) both?

@phiweger

(3) of course ;)

@ctb
Contributor Author

ctb commented Jun 28, 2021

🙄

😆

well, hopefully we can convince you that the new sourmash tax being released with 4.2 (any day now...) is a good replacement for sourmash lca!
