Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sourmash index does not flatten the signatures when building an SBT #1454

Closed
ctb opened this issue Apr 11, 2021 · 15 comments · Fixed by #1392
Closed

sourmash index does not flatten the signatures when building an SBT #1454

ctb opened this issue Apr 11, 2021 · 15 comments · Fixed by #1392

Comments

@ctb
Copy link
Contributor

ctb commented Apr 11, 2021

I've always assumed that sourmash index only stores flattened signatures, since there's no way to do an abundance-search on the SBT. I was wrong!

I guess I'm not clear on whether this is good, or bad; intentional, or not. Hence - issue!

% sourmash index xyz.sbt.zip tests/test-data/gather-abund/reads-s10-s11.sig

== This is sourmash version 4.0.1.dev21+gc49a8d84.d20210405. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading 1 files into SBT
loaded 1 sigs from 'tests/test-data/gather-abund/reads-s10-s11.sig'

loaded 1 sigs; saving SBT under "xyz.sbt.zip"
Finished saving nodes, now saving SBT index file.
Finished saving SBT index, available at /Users/t/dev/sourmash/xyz.sbt.zip

% sourmash sig describe xyz.sbt.zip

== This is sourmash version 4.0.1.dev21+gc49a8d84.d20210405. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---reading from file 'xyz.sbt.zip'
signature filename: xyz.sbt.zip
signature: 1-1
source file: r3.fa
md5: 43e3b5d6f298a181e32d0244eac643a3
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 770
signature license: CC0

loaded 1 sigs from 'xyz.sbt.zip'
loaded 1 signatures total.
@ctb

This comment has been minimized.

@ctb

This comment has been minimized.

@ctb
Copy link
Contributor Author

ctb commented Apr 13, 2021

After thinking about this a bit, I do not yet want to go down the road of storing leaves that contain information that is disparate from the information contained in the internal SBT nodes/bloom filters. So in #1392, I am introducing a flatten into sourmash index.

It's tempting to regard this a proof of concept for storing complete signature in leaves per #198, but I don't want to formally support it yet 😆 .

@ctb
Copy link
Contributor Author

ctb commented Apr 13, 2021

(Note that only one test - the one that exposed this problem in the first place -test_sbt_categorize_ignore_abundance - fails. So it's mostly undocumented/untested behavior anyway.)

@luizirber
Copy link
Member

I've always assumed that sourmash index only stores flattened signatures, since there's no way to do an abundance-search on the SBT. I was wrong!

I always thought of the abundance-search as using the flat query (don't consider abundances, only presence) for internal nodes, but then use the abundance query against leaves (and only report if they are over the threshold). The assumption here is that the abundance info doesn't change the search process until it reaches the leaves (but many more leaves might be reached, because presence/absence might have a higher threshold than abundance).

Isn't that what you're seeing?

@ctb
Copy link
Contributor Author

ctb commented Apr 13, 2021 via email

@ctb
Copy link
Contributor Author

ctb commented Apr 14, 2021

note also that LCA databases do not currently support storing abundances, so this behavior would be impossible with LCA databases as they currently exist. We could fix that, I suppose ;).

@luizirber
Copy link
Member

I always thought of the abundance-search as using the flat query (don't consider abundances, only presence) for internal nodes, but then use the abundance query against leaves (and only report if they are over the threshold). The assumption here is that the abundance info doesn't change the search process until it reaches the leaves (but many more leaves might be reached, because presence/absence might have a higher threshold than abundance).

This absolutely and utterly bit me in #1137... I added a test that returned 17 matches with the current method, but my changes always returned 15 matches. When I started tracking results across the SBT, I noticed the similarity numbers were off, and sure enough the abundances changed the similarity (higher values than flat similarity). Sigh.

@ctb
Copy link
Contributor Author

ctb commented Apr 17, 2021

so... we agree? 😁

I think there's a lot of opportunity around storing full signatures in the leaves a la #198, but I think it should be intentional rather than accidental 😆

@luizirber
Copy link
Member

luizirber commented Apr 17, 2021

so... we agree? grin

on the problem, yes; not sure we agree on the solution 🙃

Should we be flattening the query on search instead? Or more generally, can we do a similar analysis to #1137 (comment) but for the angular similarity in order to bound searches with abundance too?

@luizirber
Copy link
Member

luizirber commented Apr 17, 2021

OK, I think I found something: fc2ee33 implements an angular similarity upper bound method for comparing Nodegraph/MinHash. It might overestimate (and I still need to check properly how much it is overestimating), but I made https://github.com/luizirber/2021-04-17-angular-bound (Binder) to try out implementations and throw hypothesis on it to generate falsifiable test cases.

The benefit of this approach is that #1137 works and we can continue supporting abundance queries for search in SBTs 🙃

@ctb
Copy link
Contributor Author

ctb commented Apr 17, 2021

cool!

I would still like to get #1392 in soonish, and would be happier to enable abundance queries for SBTs in a separate PR. It's not something we've ever advertised or robustly supported, so 🤷 seems like it shouldn't be a big deal to do that, yah?

Separately should look at supporting abundances in LCA databases.

@ctb
Copy link
Contributor Author

ctb commented Apr 17, 2021

ok, got a chance to think about this more. Will see if I can explain my thinking --

note here that the #1370/1371 prefetch functionality explicitly supports pruning of search trees (to make full use of SBTs) as well as efficient search using the reverse index (needed for LCA Databases) and I think it can do so with very little modification.

@luizirber
Copy link
Member

Separately should look at supporting abundances in LCA databases.

On the LCA with abundance side, I tried out abundances in greyhound but ended up going in another more memory-frugal direction (using colors), but the code is still alive in sourmap. I can bring it back to #1238 as a separate index (let's avoid the confusion with data structures that support both abundance and no_abundance use cases =P)

I would still like to get #1392 in soonish, and would be happier to enable abundance queries for SBTs in a separate PR. It's not something we've ever advertised or robustly supported, so shrug seems like it shouldn't be a big deal to do that, yah?

yeah, I was just worried about breaking a use case that we are not testing but someone depends on ("every bug in your software will be someone else's feature"). The angular similarity solution in #1137 can be made faster (eventually), but for now it guarantees that everything still works the same.

@ctb
Copy link
Contributor Author

ctb commented Apr 18, 2021

I would still like to get #1392 in soonish, and would be happier to enable abundance queries for SBTs in a separate PR. It's not something we've ever advertised or robustly supported, so shrug seems like it shouldn't be a big deal to do that, yah?

yeah, I was just worried about breaking a use case that we are not testing but someone depends on ("every bug in your software will be someone else's feature"). The angular similarity solution in #1137 can be made faster (eventually), but for now it guarantees that everything still works the same.

side note, in #1392 where I disable this functionality (which is only exposed in sourmash categorize), I put in an error that tells the user that we can't do this search unless --ignore-abundance is specified. That seemed like the most responsible thing to do ;). But it sounds like we can aim to have #1137 in the next release, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants