bug when downsampling matches in fastmultigather against rocksdb #468

ctb · 2024-10-11T23:47:58Z

This is a highly specific bug 😆 that was revealed in sourmash-bio/sourmash#3342 as part of refactoring the code and fixing sourmash-bio/sourmash#3343.

In brief, KmerMinHash::downsample_scaled(...) allowed up-sampling: whenever resampling was needed, a sketch could be resampled to a scaled value lower than its original one. This behavior was then being used by gather as implemented for RocksDB in disk_revindex.rs.
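To make the problem concrete, here is a minimal Python sketch of why a scaled sketch cannot be meaningfully up-sampled. It is an illustration only, not sourmash code: `hash_item`, `sketch`, the use of SHA-256, and the simplified `MAX_HASH // scaled` threshold are all assumptions chosen to keep the example self-contained (sourmash uses its own hash function and threshold calculation).

```python
import hashlib

MAX_HASH = 2**64 - 1  # assumption: a 64-bit hash space


def hash_item(i):
    """Hypothetical stand-in for k-mer hashing: a deterministic 64-bit hash."""
    digest = hashlib.sha256(str(i).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(items, scaled):
    """Keep only hashes at or below the threshold implied by `scaled` (simplified)."""
    limit = MAX_HASH // scaled
    return {h for h in map(hash_item, items) if h <= limit}


items = range(5000)
fine = sketch(items, scaled=2)     # keeps roughly 1 in 2 hashes
coarse = sketch(items, scaled=10)  # keeps roughly 1 in 10 hashes

# Going from scaled=2 to scaled=10 only requires discarding hashes.
assert coarse == {h for h in fine if h <= MAX_HASH // 10}

# Going the other way is impossible: these hashes were never stored in `coarse`.
missing = fine - coarse
print(f"fine={len(fine)} coarse={len(coarse)} unrecoverable={len(missing)}")
```

Relabeling `coarse` as a scaled=2 sketch would leave its hashes unchanged while changing what every hash is taken to represent, which is the kind of mismatch this issue describes.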

Long story short, the following command yields incorrect gather output:

sourmash scripts fastmultigather src/python/tests/test-data/SRR606249.sig.gz list3.rocksdb -o fastmultigather.csv -s 100000

where list3.rocksdb is created from:

src/python/tests/test-data/47.fa.sig.gz
src/python/tests/test-data/63.fa.sig.gz
src/python/tests/test-data/2.fa.sig.gz

This incorrectly outputs k-mer values such as intersect_bp that are based on the scaled value of the match, not of the query.
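The effect on a value like intersect_bp can be shown with toy numbers. The sketch below assumes intersect_bp is estimated as the number of overlapping hashes multiplied by scaled; the tiny `HASH_SPACE`, the `downsample` helper, and the hash values are hypothetical and are not taken from the test data above.

```python
HASH_SPACE = 1000  # assumption: a toy hash space so the arithmetic is easy to follow


def downsample(hashes, scaled):
    """Keep only hashes at or below the threshold implied by `scaled`."""
    limit = HASH_SPACE // scaled
    return {h for h in hashes if h <= limit}


query_scaled, match_scaled = 10, 2
query = {10, 50, 70}                 # consistent with scaled=10 (all <= 100)
match = {10, 50, 90, 200, 300, 450}  # consistent with scaled=2 (all <= 500)

# Buggy behavior (sketched): the query is treated as if it had the match's
# lower scaled, so the overlap is multiplied by the match's scaled.
buggy_bp = len(query & match) * match_scaled

# Intended behavior (sketched): bring both sides to the larger scaled first,
# then multiply the overlap by that scaled.
common = max(query_scaled, match_scaled)
correct_bp = len(downsample(query, common) & downsample(match, common)) * common

print(buggy_bp, correct_bp)  # 4 20
```

Both calculations find the same two shared hashes; only the multiplier differs, which is why the reported base-pair values were off by the ratio of the two scaled values in this toy case.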

This is being tackled in sourmash-bio/sourmash#3342 and #467; see especially the changes to test_fastmultigather.py in #467.

@luizirber

…ather` bug around `scaled`. (#3342)

This PR does five things:

First, it swaps the implementation of `KmerMinHash::downsample_max_hash` with `KmerMinHash::downsample_scaled`, and the same for `KmerMinHashBTree`. Previously a call to `downsample_scaled` calculated the right `max_hash` from `scaled`, then called `downsample_max_hash`, which then converted `max_hash` back to `scaled`. This reverses the logic so that (slightly) less work is done and, more importantly, the code is a bit more straightforward.

Second, it changes the `downsample_*` functions so that they do not downsample when no downsampling is needed. As part of this the method signatures are changed to take an object, rather than a reference. This lets the functions return an unmodified `KmerMinHash` when no downsampling is needed.

Third, it turns out the `downsample_*` functions didn't check to make sure that the new `scaled` value was larger than the old one, i.e. they didn't prevent upsampling. That check was added and a new error, `CannotUpsampleScaled`, was added to sourmash core.

Fourth, this uncovered a bug in `RevIndex::gather` where the query was downsampled to the match, even when the match had a lower `scaled`. This PR rejiggers the code so that downsampling is done appropriately in `gather` and `calculate_gather_stats`. Since `RevIndex::gather` isn't used in the sourmash CLI, the bug only presented in the test suite and in the branchwater plugin; see sourmash-bio/sourmash_plugin_branchwater#468 and sourmash-bio/sourmash_plugin_branchwater#467, where a fastmultigather test had to be fixed because of the incorrect scaled values output by `RevIndex::gather`.

Fifth, it includes #3348 from @luizirber, which adds a `Signature::try_into()` to `KmerMinHash` to support the elimination of some clones.

Because of the method signature change for the `downsample_*` functions, the sourmash-core version needs to be bumped to a new major version, 0.16.0.

It's been a fun journey! 😅

Fixes #3343

Some notes on further changes and performance implications:

As a consequence of the `RevIndex::gather` changes, redundant downsampling has to be done in `RevIndex::gather` and `calculate_gather_stats`, unless we want to change the method signature of `calculate_gather_stats`. I decided the PR was big enough that I didn't want to do that in addition. It should not affect most use cases where `scaled` is the same, and we will see if it results in any slowdowns over in the branchwater plugin. See #3196 for an issue on all of this.

We could also just insist that the query scaled is the one to pay attention to, per #2951. This would simplify the code in Python-land as well.

Overall, the performance implications of this PR are not clear. Previously downsampling was being done even when it wasn't needed, so this may speed things up quite a lot for our typical use case! On the other hand, redundant downsampling will happen in cases where there are scaled mismatches. We just need to benchmark it, I think.

Some preliminary benchmarking reported in sourmash-bio/sourmash_plugin_branchwater#430 (comment) suggests that fastgather is now much more memory efficient 🎉 so that's good!

TODO:

- [x] resolve the scaled mismatch stuff. do we return an `Err` or what if the downsampling can't be performed?
- [x] update PR description
- [x] add more tests for downsampling, and maybe for gather
- [x] play with this code over in the branchwater plugin too! sourmash-bio/sourmash_plugin_branchwater#467

---------

Co-authored-by: Luiz Irber <[email protected]>
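The second and third changes described in the commit message can be sketched in Python as follows. This is a loose analogue under stated assumptions, not the Rust implementation: `ToySketch` is hypothetical, the `CannotUpsampleScaled` class here is a Python stand-in that only borrows the name of the error mentioned above, and the `MAX_HASH // scaled` threshold is a simplification.

```python
from dataclasses import dataclass

MAX_HASH = 2**64 - 1  # assumption: a 64-bit hash space


class CannotUpsampleScaled(ValueError):
    """Python stand-in for the error of the same name mentioned in the PR."""


@dataclass(frozen=True)
class ToySketch:
    """Hypothetical minimal sketch: a scaled value plus the retained hashes."""
    scaled: int
    hashes: frozenset

    def downsample_scaled(self, new_scaled):
        # No work and no copy when the sketch already has the requested scaled.
        if new_scaled == self.scaled:
            return self
        # A lower scaled would need hashes that were never stored: refuse.
        if new_scaled < self.scaled:
            raise CannotUpsampleScaled(
                f"cannot go from scaled={self.scaled} to scaled={new_scaled}"
            )
        limit = MAX_HASH // new_scaled  # simplified threshold
        kept = frozenset(h for h in self.hashes if h <= limit)
        return ToySketch(new_scaled, kept)
```

A possible usage, with arbitrary hash values:

```python
sk = ToySketch(1000, frozenset({1, 2**40, MAX_HASH // 1000}))
same = sk.downsample_scaled(1000)      # returns `sk` itself, untouched
coarse = sk.downsample_scaled(100000)  # drops the hash above the new threshold
# sk.downsample_scaled(100) would raise CannotUpsampleScaled
```

Returning the same object in the no-op case loosely mirrors the stated motivation for taking an object rather than a reference in the Rust signatures.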

This was referenced Oct 11, 2024

MRG: update to code for forthcoming sourmash release #467

Merged

MRG: improve downsampling behavior on KmerMinHash; fix RevIndex::gather bug around scaled. sourmash-bio/sourmash#3342

Merged

validate gather with mismatched scaled values #469

Open

ctb closed this as completed in #467 Oct 15, 2024
