SBT loading in memory #475

Closed
phiweger opened this issue May 19, 2018 · 12 comments · Fixed by #2025
Labels
faq things to add to an FAQ or docs sbt

Comments

@phiweger

Is there a way to load an SBT into memory once and then keep it there for multiple queries? I know that a Redis backend was toyed with at some point, but I am unsure whether it was integrated into v2.0.

Thank you,
Adrian

@luizirber
Member

luizirber commented May 20, 2018

Not yet... I need something like this for wort, and was talking with @psdehal (since they implemented this for KBase, but there are still issues to solve).

@phiweger
Author

What is wort?

@luizirber
Member

@phiweger it's a webservice for computing/retrieving/searching sourmash signatures. I don't have much yet, but an overview is available at https://github.com/dib-lab/wort/blob/master/docs/arch.md and it is online (and horrible to navigate...) at https://wort.oxli.org

@luizirber luizirber added the sbt label Jun 7, 2018
@ctb
Contributor

ctb commented Jun 20, 2020

kind of related to #909, sharing an LCA index once for many processes.

@phiweger
Author

I now run into this problem a lot, especially when querying many signatures as part of larger workflows. The workflow manager will usually start many processes searching signatures, but with any reasonably sized SBT this crashes pretty quickly because each process tries to load its own copy of the SBT into memory. @ctb if I use the API to load the SBT and then run Python multiprocessing queries against it, what will happen? ;)

@phiweger
Author

@luizirber how do you manage queries w/ wort? I assume you don't load the index once for each query?

@luizirber
Member

luizirber commented Sep 30, 2020

> how do you manage queries w/ wort? I assume you don't load the index once for each query?

So, the SRA search is cheating =]
I load all the queries in memory, and then each thread processes a chunk of the metagenome sigs by loading each metagenome sig, comparing it to all query sigs, and unloading the metagenome sig. In this way the memory consumption is pretty low.
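
The pattern described above can be sketched roughly as follows. This is a toy illustration using only the Python standard library, not wort's or sourmash's actual code: sketches are modeled as plain sets of hashes, and `load_sketch`, `search_chunk`, and `streaming_search` are hypothetical names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def jaccard(a, b):
    """Jaccard similarity of two hash sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def load_sketch(store, name):
    # Hypothetical loader: stands in for reading one signature from disk.
    return frozenset(store[name])


def search_chunk(store, names, queries, threshold):
    """Compare every sketch in this chunk against all in-memory queries."""
    hits = []
    for name in names:
        sketch = load_sketch(store, name)  # only one large sketch held at a time
        for qname, qsketch in queries.items():
            score = jaccard(qsketch, sketch)
            if score >= threshold:
                hits.append((qname, name, score))
        del sketch  # drop the reference before loading the next one
    return hits


def streaming_search(store, queries, threshold=0.1, n_workers=4):
    """Split the collection into chunks and search each chunk in a thread."""
    names = sorted(store)
    chunks = [names[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(
            lambda chunk: search_chunk(store, chunk, queries, threshold), chunks
        )
        return sorted(hit for chunk_hits in results for hit in chunk_hits)
```

The queries stay resident for the whole run, while each worker holds at most one collection sketch at a time, so peak memory is roughly the size of the queries plus one sketch per worker.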

> if I use the API to load the SBT and then python multiprocess queries against it, what will happen? ;)

It will probably mostly-sorta-kinda work. I'm a bit nervous because the SBT code loads data from disk dynamically, and in a multithreaded context this can lead to data races and other weirdness (there is no locking at any point).
(Incidentally, a large push for Rust in sourmash was exactly for this sort of case, but the SBT impl in Rust is not complete enough for general use yet =/)
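
To make the locking concern concrete, here is a minimal sketch of lazy loading guarded by a lock. It is not the sourmash SBT code: `LazyNodeStore` and the `read_node` callable are hypothetical stand-ins for an index that reads nodes from disk on first access, and the sketch only covers threads within one process.

```python
import threading


class LazyNodeStore:
    """Toy stand-in for an index that reads nodes from disk on first access."""

    def __init__(self, read_node):
        self._read_node = read_node  # hypothetical callable: node_id -> data
        self._cache = {}
        self._lock = threading.Lock()
        self.reads = 0  # counts actual "disk" reads

    def get(self, node_id):
        # Hold the lock across the check-then-load sequence so two threads
        # cannot both miss the cache and load the same node twice.
        with self._lock:
            if node_id not in self._cache:
                self._cache[node_id] = self._read_node(node_id)
                self.reads += 1
            return self._cache[node_id]
```

A single coarse lock like this serializes all loads, which is simple but can become a bottleneck; it is shown only to illustrate where unguarded check-then-load code can race.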

@luizirber
Member

the latest branch has a --cache-size parameter in sourmash gather that can help with controlling how much memory is used: #1161

@luizirber
Member

hey, I think greyhound and #1226 actually help solve this too

@ctb
Contributor

ctb commented Jun 25, 2021

with sourmash v4.1.0 the memory usage of SBTs has dramatically decreased; see #1370 (comment) specifically.

for in-memory single-process stuff, LCA_Database is a good choice, and it's fairly fast to create them dynamically for small to medium sized databases.

Finally, for read-only SBTs, I very much doubt there would be any problems with sharing them between processes from disk.

@phiweger
Author

@ctb when I load an LCA db into mem with sourmash.load_file_as_index() and then use the .search() method, are LCA and SBT interchangeable? Like, when would I use one over the other? Thank you for clarifying.

@ctb
Contributor

ctb commented Jun 26, 2021

Yep, they are interchangeable from an API perspective!

We have some (minimal :) documentation here,

https://sourmash.readthedocs.io/en/latest/command-line.html#indexed-databases

and there's an example of using/constructing an in-memory LCA_Database here:

https://github.com/dib-lab/charcoal/blob/latest/charcoal/compare_taxonomy.py#L177

Note that the lca_db.insert(...) function used there takes an optional identifier as an argument, but if the signatures are named sensibly you don't need to pass that in.
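
For readers who want a feel for the general idea without opening the linked file, here is a toy, standard-library-only sketch of an in-memory index built by inserting sketches and then searched. It is not sourmash's `LCA_Database` and does not reproduce its API: `ToyInMemoryIndex`, its method signatures, and the use of plain hash sets with Jaccard scoring are all assumptions made for illustration. The only detail borrowed from the note above is that the identifier is optional and falls back to the name.

```python
from collections import Counter, defaultdict


class ToyInMemoryIndex:
    """Toy inverted index: hash value -> identifiers of sketches containing it."""

    def __init__(self):
        self._hash_to_idents = defaultdict(set)
        self._sizes = {}  # identifier -> number of hashes in that sketch

    def insert(self, name, hashes, ident=None):
        # The identifier is optional; fall back to the name if it is not given.
        ident = ident if ident is not None else name
        if ident in self._sizes:
            raise ValueError(f"duplicate identifier: {ident}")
        hashes = set(hashes)
        self._sizes[ident] = len(hashes)
        for h in hashes:
            self._hash_to_idents[h].add(ident)

    def search(self, query_hashes, threshold=0.1):
        """Return (jaccard, identifier) pairs at or above threshold, best first."""
        query = set(query_hashes)
        shared = Counter()
        for h in query:
            for ident in self._hash_to_idents.get(h, ()):
                shared[ident] += 1
        results = []
        for ident, n_shared in shared.items():
            union = len(query) + self._sizes[ident] - n_shared
            score = n_shared / union
            if score >= threshold:
                results.append((score, ident))
        return sorted(results, reverse=True)
```

Because everything lives in ordinary dictionaries, repeated queries within one process never touch disk, which is the property that makes an in-memory database attractive for the single-process case.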

@ctb ctb added the faq things to add to an FAQ or docs label Mar 30, 2022
@ctb ctb closed this as completed in #2025 May 4, 2022