[WIP] Counter-based gather #1311
Codecov Report
@@ Coverage Diff @@
## latest #1311 +/- ##
==========================================
+ Coverage 88.84% 94.10% +5.26%
==========================================
Files 123 96 -27
Lines 18264 14687 -3577
Branches 1409 1420 +11
==========================================
- Hits 16226 13821 -2405
+ Misses 1800 624 -1176
- Partials 238 242 +4
Digging in a bit more, what about splitting the
Looking at this code, I can't help but think there are strong similarities with the
@luizirber, see
# Prepare counter for finding the next match by decrementing
# all hashes found in the current match in other datasets
for (dataset_id, _) in most_common:
    counter[dataset_id] -= signatures[dataset_id].minhash.count_common(match.minhash, True)
Whups, this subtraction needs to be done for the overlap with the intersection of the match and the query, not the overlap with the query (which may be far larger).
(see the copypasta code in #1371, src/sourmash/index.py::CounterGatherIndex.gather(...), which I fixed to pass gather tests)
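For readers following the thread, here is a minimal sketch of the decrement step discussed above. It is not the code from this PR or from #1371: plain Python sets stand in for MinHash sketches, and the names `counter_gather`, `query`, `datasets`, and `remaining` are hypothetical. It only illustrates subtracting the overlap with the intersection of the match and the (remaining) query, instead of the overlap with everything in `match.minhash` as the quoted loop does.

```python
from collections import Counter


def counter_gather(query, datasets):
    """Greedy gather over plain sets of hashes (illustrative only).

    query:    set of hashes (stand-in for the query MinHash; an assumption).
    datasets: dict mapping a dataset id to a set of hashes.
    Returns a list of (dataset_id, n_query_hashes_assigned) in match order.
    """
    remaining = set(query)

    # Initial overlap of every dataset with the query.
    counter = Counter()
    for dataset_id, hashes in datasets.items():
        overlap = len(hashes & remaining)
        if overlap:
            counter[dataset_id] = overlap

    results = []
    while counter:
        best_id, _ = counter.most_common(1)[0]

        # Hashes that are in the match AND still unassigned in the query.
        intersect = datasets[best_id] & remaining
        results.append((best_id, len(intersect)))
        remaining -= intersect

        # Decrement each dataset by its overlap with `intersect`, not with
        # the whole match: match hashes outside the query never counted.
        for dataset_id in list(counter):
            counter[dataset_id] -= len(datasets[dataset_id] & intersect)
            if counter[dataset_id] <= 0:
                del counter[dataset_id]

    return results
```

With this bookkeeping each counter entry stays equal to the dataset's overlap with the still-unassigned part of the query, so the best match itself always drops to zero and is removed. Subtracting the overlap with the full match would also subtract hashes that two datasets share but the query lacks, undercounting later matches.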
Should this be closed, now that #1370 has been merged?
Yup.
This is a Python-level implementation of what greyhound does for gather. I exposed it as another method in the Index abc, with a blanket (and not very optimized...) implementation that depends on the Index.signatures() method.
TODO
search), but needs some way of indexing signatures by identifier (I used the index in the signature list as the index in the greyhound/blanket impl, but that forces loading all the signatures from disk in the SBT...)
obj.gather with obj.counter_gather in src/sourmash/search.py, but that's sub-optimal (because the gather in search.py redoes a lot of the work that counter_gather is doing). Need to move more functionality around to avoid redoing the work.
Checklist
make test: Did it pass the tests?
make coverage: Is the new code covered?
without a major version increment. Changing file formats also requires a major version number increment.
changes were made?