Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Counter-based gather #1311

Closed
wants to merge 2 commits into from
Closed

[WIP] Counter-based gather #1311

wants to merge 2 commits into from

Conversation

luizirber
Copy link
Member

@luizirber luizirber commented Feb 7, 2021

This is a Python-level implementation of what greyhound does for gather. I exposed as another method in the Index abc, with a blanket (and not very optimized...) implementation that depends on the Index.signatures() method.

TODO

  • SBT supports a faster impl (using search), but needs some way of indexing signatures by identifier (I used the index in the signature list as index in greyhound/blanket impl, but that forces loading all the signatures from disk in the SBT...)
  • For testing I replaced the call to obj.gather with obj.counter_gather in src/sourmash/search.py, but that's sub-optimal (because the gather in search.py redoes a lot of the work that counter_gather is doing). Need to move more functionality around to avoid redoing the work

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov
Copy link

codecov bot commented Feb 7, 2021

Codecov Report

Merging #1311 (02adcfa) into latest (5e66db9) will increase coverage by 5.26%.
The diff coverage is 87.17%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1311      +/-   ##
==========================================
+ Coverage   88.84%   94.10%   +5.26%     
==========================================
  Files         123       96      -27     
  Lines       18264    14687    -3577     
  Branches     1409     1420      +11     
==========================================
- Hits        16226    13821    -2405     
+ Misses       1800      624    -1176     
- Partials      238      242       +4     
Flag Coverage Δ
python 94.10% <87.17%> (-0.04%) ⬇️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/index.py 92.62% <86.84%> (-2.68%) ⬇️
src/sourmash/search.py 91.22% <100.00%> (ø)
src/sourmash/sbtmh.py 81.30% <0.00%> (-1.87%) ⬇️
src/core/src/ffi/nodegraph.rs
src/core/src/signature.rs
src/core/src/ffi/signature.rs
src/core/src/cmd.rs
src/core/src/index/mod.rs
src/core/src/ffi/mod.rs
src/core/tests/test.rs
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5e66db9...02adcfa. Read the comment docs.

@ctb
Copy link
Contributor

ctb commented Feb 22, 2021

Digging in a bit more, what about splitting the counter_gather function in two - one that does the counter building (the prefetch), and one that uses it to do the gather?

@ctb
Copy link
Contributor

ctb commented Mar 6, 2021

Looking at this code, I can't help but think there are strong similarities with the _find_signatures method in LCA_Database.

@ctb
Copy link
Contributor

ctb commented Mar 7, 2021

@luizirber, see CounterGatherIndex in #1371 for a mashup of the "counter gather" approach with a separate prefetch function. There are still some problems with the tests that I haven't tracked down yet, but it mostly works. Curious what you think of the approach.

# Prepare counter for finding the next match by decrementing
# all hashes found in the current match in other datasets
for (dataset_id, _) in most_common:
counter[dataset_id] -= signatures[dataset_id].minhash.count_common(match.minhash, True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whups, this subtraction needs to be done for the overlap with intersection of the match and the query, not the overlap with the query (which may be far larger).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see the copypasta code in #1371, src/sourmash/index.py::CounterGatherIndex.gather(...), which I fixed to pass gather tests)

@ctb
Copy link
Contributor

ctb commented May 22, 2021

should this be closed, now that #1370 has been merged?

@luizirber
Copy link
Member Author

Yup.

@luizirber luizirber closed this May 24, 2021
@luizirber luizirber deleted the counter_gather branch March 27, 2022 22:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants