sourmash index does not flatten the signatures when building an SBT #1454
After thinking about this a bit, I do not yet want to go down the road of storing leaves that contain information that is disparate from the information contained in the internal SBT nodes/bloom filters. So in #1392, I am introducing a […]. It's tempting to regard this as a proof of concept for storing complete signatures in leaves per #198, but I don't want to formally support it yet 😆 .
(Note that only one test - the one that exposed this problem in the first place - |
I always thought of the abundance-search as using the flat query (don't consider abundances, only presence) for internal nodes, but then use the abundance query against leaves (and only report if they are over the threshold). The assumption here is that the abundance info doesn't change the search process until it reaches the leaves (but many more leaves might be reached, because presence/absence might have a higher threshold than abundance). Isn't that what you're seeing?
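To make the procedure described above concrete, here is a minimal sketch in plain Python. It does not use the sourmash API: `Leaf`, `Node`, `angular_similarity`, and `search` are hypothetical names, a plain `set` stands in for the Bloom filter of an internal node, and the internal-node test (fraction of flattened query hashes present) is an assumption chosen for illustration. Whether a flat test like this can ever prune a leaf that would have passed the abundance-weighted threshold is exactly the question raised later in this thread, so treat it as an illustration of the idea rather than a correct pruning rule.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Leaf:
    name: str
    abunds: dict  # hash -> abundance (the un-flattened signature)


@dataclass
class Node:
    children: list
    # Union of all hashes below this node; a set stands in for a Bloom filter.
    hashes: set = field(init=False)

    def __post_init__(self):
        self.hashes = set()
        for child in self.children:
            if isinstance(child, Leaf):
                self.hashes |= set(child.abunds)
            else:
                self.hashes |= child.hashes


def angular_similarity(a, b):
    """Abundance-weighted similarity: 1 - 2*acos(cosine)/pi."""
    dot = sum(v * b.get(h, 0) for h, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    cosine = min(1.0, dot / (norm_a * norm_b))  # clamp rounding error
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


def search(node, query, threshold):
    """Flat (presence-only) test at internal nodes, abundance test at leaves."""
    if not query:
        return []
    if isinstance(node, Leaf):
        score = angular_similarity(query, node.abunds)
        return [(node.name, score)] if score >= threshold else []
    flat_query = set(query)  # flattening: drop abundances, keep hashes
    if len(flat_query & node.hashes) / len(flat_query) < threshold:
        return []  # prune this subtree
    results = []
    for child in node.children:
        results.extend(search(child, query, threshold))
    return results
```

Calling `search(root, query, threshold)` on a tree built from `Node` and `Leaf` objects returns `(name, score)` pairs only for leaves whose abundance-weighted score meets the threshold; abundances are never consulted above the leaf level.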
On Tue, Apr 13, 2021 at 02:38:45PM -0700, Luiz Irber wrote:
> > I've always assumed that sourmash index only stores flattened signatures, since there's no way to do an abundance-search on the SBT. I was wrong!
> I always thought of the abundance-search as using the flat query (don't consider abundances, only presence) for internal nodes, but then use the abundance query against leaves (and only report if they are over the threshold). The assumption here is that the abundance info doesn't change the search process until it reaches the leaves (but many more leaves might be reached, because presence/absence might have a higher threshold than abundance).
> Isn't that what you're seeing?
it is! are we confident that this always finds the "best" match across the
database, using the internal nodes to find the relevant leaves? I wasn't sure
that SBT search or gather would do this, but I am quite positive that a
prefetch-based approach would.
note also that LCA databases do not currently support storing abundances, so this behavior would be impossible with LCA databases as they currently exist. We could fix that, I suppose ;).
This absolutely and utterly bit me in #1137... I added a test that returned |
so... we agree? 😁 I think there's a lot of opportunity around storing full signatures in the leaves a la #198, but I think it should be intentional rather than accidental 😆
on the problem, yes; not sure we agree on the solution 🙃 Should we be flattening the query on |
OK, I think I found something: fc2ee33 implements an angular similarity upper bound method for comparing Nodegraph/MinHash. It might overestimate (and I still need to check properly how much it is overestimating), but I made https://github.com/luizirber/2021-04-17-angular-bound (Binder) to try out implementations and throw The benefit of this approach is that #1137 works and we can continue supporting abundance queries for search in SBTs 🙃 |
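For readers who want to see what an angular similarity upper bound can look like, here is one possible formulation in plain Python. It is not taken from fc2ee33 and may differ from what that commit does; `angular_upper_bound` is a hypothetical name, and a plain `set` stands in for the Nodegraph. The idea, which follows from the Cauchy-Schwarz inequality, is that an internal node only says which hashes may be present below it, so the best cosine any leaf under that node could reach against the query is the norm of the query restricted to those hashes divided by the norm of the full query. False positives in the filter only enlarge the restricted set, so the value stays an upper bound (it can overestimate, but not underestimate).

```python
import math


def angular_upper_bound(query_abunds, node_hashes):
    """Upper bound on the angular similarity between an abundance query and
    any signature whose hashes all lie in node_hashes.

    query_abunds: dict mapping hash -> abundance
    node_hashes:  set of hashes possibly present below the node
                  (assumption: stands in for a Bloom filter / Nodegraph)
    """
    total = sum(v * v for v in query_abunds.values())
    if total == 0:
        return 0.0
    # Squared norm of the query restricted to hashes the node may contain.
    shared = sum(v * v for h, v in query_abunds.items() if h in node_hashes)
    max_cosine = min(1.0, math.sqrt(shared / total))
    # Angular similarity increases with cosine, so the bound carries over.
    return 1.0 - 2.0 * math.acos(max_cosine) / math.pi
```

A search could prune a subtree whenever this value falls below the requested threshold, since no leaf underneath can score higher against the same query.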
cool! I would still like to get #1392 in soonish, and would be happier to enable abundance queries for SBTs in a separate PR. It's not something we've ever advertised or robustly supported, so 🤷 seems like it shouldn't be a big deal to do that, yah? Separately should look at supporting abundances in LCA databases.
ok, got a chance to think about this more. Will see if I can explain my thinking --
note here that the #1370/1371 prefetch functionality explicitly supports pruning of search trees (to make full use of SBTs) as well as efficient search using the reverse index (needed for LCA Databases) and I think it can do so with very little modification.
On the LCA with abundance side, I tried out abundances in greyhound but ended up going in another more memory-frugal direction (using colors), but the code is still alive in sourmap. I can bring it back to #1238 as a separate index (let's avoid the confusion with data structures that support both abundance and no_abundance use cases =P)
yeah, I was just worried about breaking a use case that we are not testing but someone depends on ("every bug in your software will be someone else's feature"). The angular similarity solution in #1137 can be made faster (eventually), but for now it guarantees that everything still works the same.
side note, in #1392 where I disable this functionality (which is only exposed in |
I've always assumed that sourmash index only stores flattened signatures, since there's no way to do an abundance-search on the SBT. I was wrong!
I guess I'm not clear on whether this is good, or bad; intentional, or not. Hence - issue!