[WIP] Index.gather is not doing gather? #1263

Closed
wants to merge 2 commits into from


luizirber
Member

The default implementation of Index.gather returns all signatures with a containment above a threshold, which... is not what gather is supposed to do =]

SBT and LCA reimplement the method, but LinearIndex reuses it and generates the wrong results.

There is also a discussion about the proper intended behavior of Index.gather. Sometimes it returns the best match (only), sometimes it returns a list of matches. I think we should have a separate method for the first use case (best_match?), and another with a simple gather based on linear scans over the signatures.

cc @ctb

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov

codecov bot commented Dec 22, 2020

Codecov Report

Merging #1263 (fe6152f) into latest (ca201cf) will increase coverage by 5.24%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #1263      +/-   ##
==========================================
+ Coverage   88.71%   93.95%   +5.24%     
==========================================
  Files         125       98      -27     
  Lines       18238    14615    -3623     
  Branches     1434     1434              
==========================================
- Hits        16180    13732    -2448     
+ Misses       1812      637    -1175     
  Partials      246      246              
Flag     Coverage Δ
python   93.95% <100.00%> (ø)
rust     ?

Impacted Files          Coverage Δ
src/sourmash/index.py   95.40% <100.00%> (+0.10%) ⬆️
tests/test_index.py     100.00% <100.00%> (ø)
src/core/src/ffi/signature.rs
src/core/src/index/storage.rs
src/core/src/index/sbt/mod.rs
src/core/src/index/bigsi.rs
src/core/src/index/sbt/mhbt.rs
src/core/src/encodings.rs
src/core/src/sketch/minhash.rs
src/core/src/index/mod.rs
... and 19 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ca201cf...fe6152f.

Comment on lines 134 to -139
  matches = lidx.gather(ss47)
- assert len(matches) == 2
+ assert len(matches) == 1
  assert matches[0][0] == 1.0
  assert matches[0][1] == ss47
- assert round(matches[1][0], 2) == 0.49
- assert matches[1][1] == ss63
Member Author

This test is a good example of the 'weird' behavior: if doing gather with ss47, it should match ss47 completely, and nothing will be left to match other sigs. But the test is asserting that there is a second match (ss63), at 0.49 containment. That's... not gather, that's a regular search using containment.
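
To make the difference concrete, here is a rough sketch using plain Python sets in place of signatures. Everything in it is an assumption for illustration: the function names (`containment`, `search_containment`, `linear_gather`) are hypothetical and not sourmash's API, and the toy hash sets were chosen only so that the second containment comes out at 0.49.

```python
# Toy model: a "signature" is just a set of integer hashes.
# All names below are hypothetical, not sourmash's API.

def containment(query, subject):
    """Fraction of the query's hashes that also appear in the subject."""
    return len(query & subject) / len(query) if query else 0.0


def search_containment(query, sigs, threshold=0.1):
    """Return every (containment, name) at or above the threshold, best first."""
    hits = [(containment(query, hashes), name) for name, hashes in sigs]
    return sorted((hit for hit in hits if hit[0] >= threshold), reverse=True)


def linear_gather(query, sigs, min_overlap=1):
    """Greedy gather by linear scan: repeatedly take the signature that covers
    the most of what is left of the query, then remove the covered hashes."""
    if not sigs or not query:
        return []
    remaining = set(query)
    results = []
    while remaining:
        name, hashes = max(sigs, key=lambda item: len(remaining & item[1]))
        overlap = remaining & hashes
        if not overlap or len(overlap) < min_overlap:
            break
        # Report the fraction of the original query explained by this match.
        results.append((len(overlap) / len(query), name))
        remaining -= overlap
    return results


# Toy stand-ins for the two signatures in the test (not the real data).
ss47 = set(range(0, 100))
ss63 = set(range(51, 151))   # shares 49 of ss47's 100 hashes
sigs = [("ss47", ss47), ("ss63", ss63)]

search_hits = search_containment(ss47, sigs)
gather_hits = linear_gather(ss47, sigs)
print(search_hits)   # two hits: containment search
print(gather_hits)   # one hit: nothing is left of the query after ss47
```

In this toy setup the containment search reports both signatures, while the greedy loop stops after ss47 because the remaining query is empty.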

@ctb
Contributor

ctb commented Dec 22, 2020

I need to look at this more closely (and I thought there was an issue on this, or maybe just a long-winded discussion in some PR...)

I do vaguely recall running into a mental roadblock around the behavior of gather() on an Index object. The gather algorithm is a multi-index procedure by nature, so it didn't make sense to have it do anything other than find the best containment. Our improved mental model around min-set-cov and/or max-containment may clarify this now.

Which is all to say... yeah, I agree it's probably messed up, and it'd be great to revisit, and I'll do so :)

@ctb
Contributor

ctb commented Feb 22, 2021

what about a function prefetch that does a scan across an Index class preparatory to search or gather?
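
A minimal sketch of what such a scan might look like, again with plain sets standing in for signatures. The name `prefetch`, its `threshold_hashes` parameter, and the tuple it yields are assumptions for illustration only, not an existing or proposed signature.

```python
# Hypothetical sketch: one linear pass that keeps only signatures
# overlapping the query, so a later search or gather sees fewer candidates.

def prefetch(query, sigs, threshold_hashes=1):
    """Yield (overlap_size, name, hashes) for every signature that shares
    at least threshold_hashes hashes with the query."""
    for name, hashes in sigs:
        overlap = len(query & hashes)
        if overlap >= threshold_hashes:
            yield overlap, name, hashes


query = {1, 2, 3, 4, 5, 6}
sigs = [("a", {1, 2, 3, 4}), ("b", {4, 5}), ("c", {100, 200})]

candidates = list(prefetch(query, sigs))
candidate_names = [name for _, name, _ in candidates]
print(candidate_names)   # "c" shares nothing with the query and is dropped
```

The output of the scan could then be handed to either a containment search or a greedy gather loop, which is one way to keep the two behaviors separate.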

ref #1310 for the proposed CLI functionality.

(this may be me just appropriating your idea under a different name, @luizirber. apologies if so - on a bit of a holiday & wanted to write this down while I had the thought at the tip of my brain, so am not taking time to remind myself of the whole conversation :)

@ctb
Contributor

ctb commented Apr 29, 2021

See #1489 for an alternate approach to resolving this conundrum.

@ctb
Contributor

ctb commented May 22, 2021

this can probably be closed, yah? viz #1370

@luizirber
Member Author

yup.

@luizirber luizirber closed this May 24, 2021
@luizirber luizirber deleted the index_gather_fix branch September 23, 2021 00:06