-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Thinking about further gather
performance improvements.
#1110
Comments
gather performance is improved dramatically by greyhound, as mentioned here #1226 (comment). Practically speaking we can now do ~million bacterial signature searches in under 2 hours even in Python (see genome-grist for some functioning code) |
ref #1370 and beyond for massive performance increases coming soon. Maybe we can close this issue soon! |
from slack, luiz sayeth:
|
we've moved beyond most of this discussion; between mastiff updates and the multithreading work in https://github.com/sourmash-bio/pyo3_branchwater, we're in good shape. |
This is a summarization of remaining issues in #838, which has become unreadable :)
from @luizirber in #838 (comment):
I'll dump here some other pointers to
gather
improvements, and they should probably become other issues eventually (and this one stays as a meta/tracking issue?). I'll break it in two sections: Engineering for making what already exists faster, and Research for deeper changes that involve rethinking data structures and methods.Engineering
We can continue running
py-spy
andheaptrack
for optimizing specific routines or lowering overall memory consumption.Nodegraph
from memory (avoid the tempfile dance when loading data from storage).Nodegraph
from the query, and use it to query internal nodes (instead of checking every hash in the query). This branch has this somewhat implemented, but depends on an unreleased khmer branch. Should also be doable in [MRG] replace khmer.Nodegraph with rust nodegraph #799gather
or long-running processes (in these cases you might prefer to keep the full index in memory), but it can lower the memory consumption substantially. There is also [WIP] cache top node results for gather #520 which tries to cache internal nodes, but I think it needs to cut a bit deeper to work well..unload()
from Expose an unload method for SBT nodes #784, I think.Research
SBT indexing is not very smart: it finds the next available position in the tree, and put the signature there. Ideally we would want signatures to be clustered by some metric (probably similarity, but can also just be common hashes), because the search algorithms (be it depth-first search like in
gather
, or breadth-first search like insearch
) strongly benefit from prunning the search as early as possible (and avoid loading data from disk).So, one direction to solve
gather
issues: better organization of the SBT nodes. At least two ways to do it:howdesbt cluster
command). There is some support for this in sourmash, hidden in the Rust crate.Another direction, but depending on the first one: change the search strategy.
gather
. Eventually theDFS
will find the same result, but the best-first can reach it faster (by taking the path with largest similarity first). But this only works if the index is well-behaved...results
dict a bit, so it breaks some tests with search functions that don't take extra kwargs.more engineering / multi-threading
@satishv:
@ctb:
Maybe, maybe... It's not trivial, but I've been thinking about doing something like this, using rayon:
https://github.com/oconnor663/bao/blob/a2fedd649487de63e4d0c0a9e45f5a22d1e46365/src/hash.rs#L224L270
Index
Python abc/Rust trait bridge (3.3). I want to focus on improving compute (3.2, check Compute oxidation #845) before doing the bridge.reorganizing SBT structure
larger bloom filters?
Important note here: the compressed BF are on DISK, not on MEMORY. Using larger BFs might incur more memory usage. To fix that:
search
the nodes are being unloaded after they are traversed (as per Expose an unload method for SBT nodes #784 (comment)), but forgather
simply unloading them is too aggressive (since they might be traversed again). Having a cache that can call.unload()
when the node is evicted would be perfect, but I didn't find any available with a good API for that. Probably will have to write our own. (Calling.unload()
is fine, because the nodes know how to reload the data; but it would involve reading data from storage again, which will be slower)fixedbitset
API is super nice to work on, so having something like it but with internal RRR encoding would be perfectchanging the way we do search
@luizirber:
@ctb:
yes! but of course it's more complicated - see #930 for a slightly rambling discussion of how this would have to work across multiple databases.
The text was updated successfully, but these errors were encountered: