`downsample_*` functions in minhash.rs _always_ downsample, even when downsampling is not necessary #3343

ctb · 2024-10-06T12:39:09Z

The root cause of the 10x slowdown in the branchwater plugin (sourmash-bio/sourmash_plugin_branchwater#463) was that downsample_scaled was being called on every against sketch, and it was creating a new sketch every time it was called - even when no downsampling needed to be done, because the scaled values match.

This is fixed in sourmash-bio/sourmash_plugin_branchwater#464 by only doing the downsampling if the scale does not match.

I think the sourmash code base behavior is somewhat unexpected and we should avoid doing expensive work when it's not needed. But I'm not sure how to modify the function signatures appropriately.

ctb · 2024-10-06T12:42:00Z

From my question on slack:

Hi all, a Rust conundrum that I don’t quite know how to resolve. Maybe writing it out will help.
So, over in sourmash minhash.rs, we have a function downsample_max_hash (implemented on KmerMinHash) that currently always returns a new copy of KmerMinHash, suitably modified.

In my ideal world, we would change this function to only return a new sketch if it has to change the sketch, but would return itself (unmodified) if not - if the requested max_hash is the same as the current one, there’s no need to create a new one.

Now obviously you can’t have a function return both KmerMinHash and &KmerMinHash.

So how do I do this, structurally?

Is this a place where Clone on Write would be useful?
And/or can I change the function to return just a reference, and then would I need to use lifetimes or something?

You can see what I had to do in a function that uses these functions, here:

https://github.com/sourmash-bio/sourmash_plugin_branchwater/blob/453f943351c6c702235e1b085cd04d3616b1a09a/src/manysearch.rs#L219

the call to query.inflated_abundances is the only thing that is different in the if/else statement between doing the downsampling and not - that’s because against_ds here is a new copy in the if branch, while we’re using against in the else.
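For reference, the clone-on-write idea asked about above could look roughly like the sketch below. This is an illustration only, not sourmash code: `Sketch` is a hypothetical stand-in for `KmerMinHash`, and the `u64::MAX / scaled` conversion to a hash threshold is an assumption made to keep the example short. The point is that `Cow` lets one return type cover both "borrowed, unchanged" and "owned, downsampled".

```rust
use std::borrow::Cow;

// Hypothetical stand-in for KmerMinHash; only the fields needed here.
#[derive(Clone, Debug, PartialEq)]
struct Sketch {
    scaled: u32,
    mins: Vec<u64>,
}

impl Sketch {
    // Borrow `self` and only allocate when the sketch actually changes.
    // Assumes `scaled >= 1`; the threshold formula is an assumption,
    // not necessarily what sourmash computes.
    fn downsample_scaled(&self, scaled: u32) -> Cow<'_, Sketch> {
        if scaled == self.scaled {
            // Nothing to do: hand back a reference to the existing sketch.
            return Cow::Borrowed(self);
        }
        let max_hash = u64::MAX / u64::from(scaled);
        Cow::Owned(Sketch {
            scaled,
            mins: self
                .mins
                .iter()
                .copied()
                .filter(|&h| h <= max_hash)
                .collect(),
        })
    }
}
```

Because `Cow<'_, Sketch>` dereferences to `Sketch`, a caller could use the result the same way in both cases, which would remove the need for a duplicated if/else at the call site. The cost is that the returned value borrows from the original sketch, so lifetimes become part of the API.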

ctb · 2024-10-06T17:36:11Z

luiz writes:

CoW would be a solution, but what about changing the fn sig to

    pub fn downsample_max_hash(self, max_hash: u64) -> Result<KmerMinHash, Error>

?

This avoids possible lifetime issues by giving control to whoever calls the function

then the check for downsampling or not would be done inside the downsample_max_hash function

and if nothing needs to be done, just return self

kind of how select() works

sourmash/src/core/src/manifest.rs

Line 262 in 34001e7

fn select(self, selection: &Selection) -> Result<Self> {
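A minimal sketch of the by-value approach suggested here, under stated assumptions: `Sketch` is a hypothetical stand-in for `KmerMinHash`, and the `Result` wrapper from the proposed signature is left out for brevity. It only illustrates how taking `self` lets the function return the same object untouched when the threshold already matches.

```rust
// Hypothetical stand-in for KmerMinHash.
#[derive(Clone, Debug, PartialEq)]
struct Sketch {
    max_hash: u64,
    mins: Vec<u64>, // hash values, all <= max_hash
}

impl Sketch {
    // Takes `self` by value so an unchanged sketch can be handed straight back.
    // No check against a larger `max_hash` is made in this sketch.
    fn downsample_max_hash(self, max_hash: u64) -> Sketch {
        if max_hash == self.max_hash {
            return self; // moved, not copied: `mins` keeps its buffer
        }
        // Reuse the existing vector and drop hashes above the new threshold.
        let mut mins = self.mins;
        mins.retain(|&h| h <= max_hash);
        Sketch { max_hash, mins }
    }
}
```

With this shape the caller decides whether a copy is needed: code that still wants the original would call `sketch.clone().downsample_max_hash(...)`, while code that is done with it just passes ownership.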

ctb · 2024-10-12T10:03:36Z

resolved in #3342

@luizirber

…ather` bug around `scaled`. (#3342)

This PR does five things:

First, it swaps the implementation of `KmerMinHash::downsample_max_hash` with `KmerMinHash::downsample_scaled`, and the same for `KmerMinHashBTree`. Previously a call to `downsample_scaled` calculated the right `max_hash` from `scaled`, then called `downsample_max_hash`, which then converted `max_hash` back to `scaled`. This reverses the logic so that (slightly) less work is done and, more importantly, the code is a bit more straightforward.

Second, it changes the `downsample_*` functions so that they do not downsample when no downsampling is needed. As part of this the method signatures are changed to take an object, rather than a reference. This lets the functions return an unmodified `KmerMinHash` when no downsampling is needed.

Third, it turns out the `downsample_*` functions didn't check to make sure that the new `scaled` value was larger than the old one, i.e. they didn't prevent upsampling. That check was added and a new error, `CannotUpsampleScaled`, was added to sourmash core.

Fourth, this uncovered a bug in `RevIndex::gather` where the query was downsampled to the match, even when the match was lower scaled. This PR rejiggers the code so that downsampling is done appropriately in `gather` and `calculate_gather_stats`. Since `RevIndex::gather` isn't used in the sourmash CLI, the bug only presented in the test suite and in the branchwater plugin; see sourmash-bio/sourmash_plugin_branchwater#468 and sourmash-bio/sourmash_plugin_branchwater#467, where a fastmultigather test had to be fixed because of the incorrect scaled values output by `RevIndex::gather`.

Fifth, it includes #3348 from @luizirber, which adds a `Signature::try_into()` to `KmerMinHash` to support the elimination of some clones.

Because of the method signature change for the `downsample_*` functions, the sourmash-core version needs to be bumped to a new major version, 0.16.0.

It's been a fun journey! 😅

Fixes #3343

Some notes on further changes and performance implications:

As a consequence of the `RevIndex::gather` changes, redundant downsampling has to be done in `RevIndex::gather` and `calculate_gather_stats`, unless we want to change the method signature of `calculate_gather_stats`. I decided the PR was big enough that I didn't want to do that in addition. It should not affect most use cases where `scaled` is the same, and we will see if it results in any slowdowns over in the branchwater plugin. See #3196 for an issue on all of this.

We could also just insist that the query scaled is the one to pay attention to, per #2951. This would simplify the code in Python-land as well.

Overall, the performance implications of this PR are not clear. Previously downsampling was being done even when it wasn't needed, so this may speed things up quite a lot for our typical use case! On the other hand, redundant downsampling will happen in cases where there are scaled mismatches. We just need to benchmark it, I think.

Some preliminary benchmarking reported in sourmash-bio/sourmash_plugin_branchwater#430 (comment) suggests that fastgather is now much more memory efficient 🎉 so that's good!

TODO:

- [x] resolve the scaled mismatch stuff. do we return an `Err` or what if the downsampling can't be performed?
- [x] update PR description
- [x] add more tests for downsampling, and maybe for gather
- [x] play with this code over in the branchwater plugin too! sourmash-bio/sourmash_plugin_branchwater#467

---------

Co-authored-by: Luiz Irber <[email protected]>
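The second and third points of the PR description can be illustrated together with a small sketch. This is not the code from the PR: `Sketch` and the `Error` enum are hypothetical stand-ins, the `CannotUpsampleScaled` variant is shown without any payload as an assumption, and the `u64::MAX / scaled` threshold is a simplification of however sourmash converts `scaled` to `max_hash`.

```rust
// Hypothetical error type; only the variant name comes from the PR description.
#[derive(Debug, PartialEq)]
enum Error {
    CannotUpsampleScaled,
}

// Hypothetical stand-in for KmerMinHash.
#[derive(Debug, PartialEq)]
struct Sketch {
    scaled: u32,
    mins: Vec<u64>,
}

impl Sketch {
    fn downsample_scaled(self, scaled: u32) -> Result<Sketch, Error> {
        if scaled == self.scaled {
            // Same resolution: return the sketch unmodified.
            return Ok(self);
        }
        if scaled < self.scaled {
            // A smaller `scaled` would need hashes that were already discarded.
            return Err(Error::CannotUpsampleScaled);
        }
        // Here scaled > self.scaled, so scaled >= 1 and the division is safe.
        let max_hash = u64::MAX / u64::from(scaled); // assumed conversion
        let mut mins = self.mins;
        mins.retain(|&h| h <= max_hash);
        Ok(Sketch { scaled, mins })
    }
}
```

Returning an error for the upsampling case, rather than silently returning the input, makes scaled mismatches visible to callers such as a gather routine.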

ctb mentioned this issue Oct 11, 2024

bug when downsampling matches in fastmultigather against rocksdb sourmash-bio/sourmash_plugin_branchwater#468

Closed

ctb mentioned this issue Oct 12, 2024

MRG: improve downsampling behavior on KmerMinHash; fix RevIndex::gather bug around scaled. #3342

Merged

4 tasks

ctb closed this as completed in #3342 Oct 15, 2024
