`sourmash index` does not flatten the signatures when building an SBT #1454

ctb · 2021-04-11T22:49:51Z

I've always assumed that sourmash index only stores flattened signatures, since there's no way to do an abundance-search on the SBT. I was wrong!

I guess I'm not clear on whether this is good, or bad; intentional, or not. Hence - issue!

% sourmash index xyz.sbt.zip tests/test-data/gather-abund/reads-s10-s11.sig

== This is sourmash version 4.0.1.dev21+gc49a8d84.d20210405. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading 1 files into SBT
loaded 1 sigs from 'tests/test-data/gather-abund/reads-s10-s11.sig'

loaded 1 sigs; saving SBT under "xyz.sbt.zip"
Finished saving nodes, now saving SBT index file.
Finished saving SBT index, available at /Users/t/dev/sourmash/xyz.sbt.zip

% sourmash sig describe xyz.sbt.zip

== This is sourmash version 4.0.1.dev21+gc49a8d84.d20210405. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---reading from file 'xyz.sbt.zip'
signature filename: xyz.sbt.zip
signature: 1-1
source file: r3.fa
md5: 43e3b5d6f298a181e32d0244eac643a3
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 770
signature license: CC0

loaded 1 sigs from 'xyz.sbt.zip'
loaded 1 signatures total.


ctb · 2021-04-13T17:14:59Z

After thinking about this a bit, I do not yet want to go down the road of storing leaves that contain information that is disparate from the information contained in the internal SBT nodes/bloom filters. So in #1392, I am introducing a flatten into sourmash index.
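For illustration, here is a minimal pure-Python sketch of what "flatten" means here. It assumes a toy model where a sketch with abundances is a dict of hash -> abundance; the `flatten` helper and the hash values are hypothetical and are not sourmash's actual code or storage format.

```python
# Toy model (an assumption for illustration): a sketch with abundances is a
# dict mapping hash value -> abundance. This is not sourmash's internal format.
def flatten(sketch_with_abund):
    """Drop abundances, keeping only which hashes are present."""
    return frozenset(sketch_with_abund)

leaf = {101: 7, 202: 1, 303: 12}   # made-up hashes and counts
flat_leaf = flatten(leaf)

# An internal SBT node only records presence (a Bloom filter in practice,
# a plain set here), so a flattened leaf carries the same kind of
# information as the nodes above it.
internal_node = set(flat_leaf)
print(sorted(flat_leaf))
```

With unflattened leaves, the leaf holds counts that no internal node can see, which is the mismatch described above.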

It's tempting to regard this as a proof of concept for storing complete signatures in leaves per #198, but I don't want to formally support it yet 😆 .

ctb · 2021-04-13T17:19:00Z

(Note that only one test fails: `test_sbt_categorize_ignore_abundance`, the one that exposed this problem in the first place. So it's mostly undocumented/untested behavior anyway.)

luizirber · 2021-04-13T21:38:31Z

> I've always assumed that sourmash index only stores flattened signatures, since there's no way to do an abundance-search on the SBT. I was wrong!

I always thought of the abundance-search as using the flat query (don't consider abundances, only presence) for internal nodes, but then use the abundance query against leaves (and only report if they are over the threshold). The assumption here is that the abundance info doesn't change the search process until it reaches the leaves (but many more leaves might be reached, because presence/absence might have a higher threshold than abundance).

Isn't that what you're seeing?
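A rough sketch of that search strategy, as a toy in plain Python rather than the real SBT code. The tree layout (dicts with `hashes`, `children`, `abund`), the `search` function, and both thresholds are assumptions made up for this example; internal nodes are plain sets standing in for Bloom filters.

```python
import math

def angular_similarity(a, b):
    """Angular similarity between two hash -> abundance dicts."""
    dot = sum(a[h] * b[h] for h in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    if norm == 0:
        return 0.0
    return 1.0 - 2.0 * math.acos(min(1.0, dot / norm)) / math.pi

def search(node, query, node_threshold, leaf_threshold):
    """Flat check at internal nodes, abundance-aware score at leaves."""
    if "abund" in node:                      # leaf: stores hash -> abundance
        score = angular_similarity(query, node["abund"])
        return [(node["name"], score)] if score >= leaf_threshold else []
    # Internal node: presence/absence only (a Bloom filter in a real SBT).
    shared = len(set(query) & node["hashes"]) / len(query)
    if shared < node_threshold:
        return []                            # prune this whole subtree
    hits = []
    for child in node["children"]:
        hits.extend(search(child, query, node_threshold, leaf_threshold))
    return hits

leaf_a = {"name": "A", "abund": {1: 10, 2: 10, 3: 1}}
leaf_b = {"name": "B", "abund": {7: 5, 8: 5}}
root = {"hashes": {1, 2, 3, 7, 8}, "children": [leaf_a, leaf_b]}

query = {1: 10, 2: 10, 4: 1}
hits = search(root, query, node_threshold=0.1, leaf_threshold=0.5)
print(hits)   # only leaf "A" passes the leaf threshold
```

Whether this finds every abundance match depends on how `node_threshold` relates to `leaf_threshold`: the flat check at internal nodes has to be loose enough not to prune a subtree containing a good abundance match.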

ctb · 2021-04-13T21:41:22Z

On Tue, Apr 13, 2021 at 02:38:45PM -0700, Luiz Irber wrote:

> > I've always assumed that sourmash index only stores flattened signatures, since there's no way to do an abundance-search on the SBT. I was wrong!
>
> I always thought of the abundance-search as using the flat query (don't consider abundances, only presence) for internal nodes, but then use the abundance query against leaves (and only report if they are over the threshold). The assumption here is that the abundance info doesn't change the search process until it reaches the leaves (but many more leaves might be reached, because presence/absence might have a higher threshold than abundance).
>
> Isn't that what you're seeing?

it is! are we confident that this always finds the "best" match across the database, using the internal nodes to find the relevant leaves? I wasn't sure that SBT search or gather would do this, but I am quite positive that a prefetch-based approach would.

ctb · 2021-04-14T00:25:47Z

note also that LCA databases do not currently support storing abundances, so this behavior would be impossible with LCA databases as they currently exist. We could fix that, I suppose ;).

luizirber · 2021-04-16T23:56:47Z

> I always thought of the abundance-search as using the flat query (don't consider abundances, only presence) for internal nodes, but then use the abundance query against leaves (and only report if they are over the threshold). The assumption here is that the abundance info doesn't change the search process until it reaches the leaves (but many more leaves might be reached, because presence/absence might have a higher threshold than abundance).

This absolutely and utterly bit me in #1137... I added a test that returned 17 matches with the current method, but my changes always returned 15 matches. When I started tracking results across the SBT, I noticed the similarity numbers were off, and sure enough the abundances changed the similarity (higher values than flat similarity). Sigh.
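A small made-up example of how this can happen, assuming flat similarity is Jaccard over the hash sets and abundance similarity is angular similarity over the abundance vectors. The two sketches and the 0.6 threshold are invented for illustration and are not the data from the #1137 test.

```python
import math

def jaccard(a, b):
    """Flat (presence/absence) similarity between two sketches."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def angular_similarity(a, b):
    """Abundance-weighted similarity between two hash -> abundance dicts."""
    dot = sum(a[h] * b[h] for h in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    if norm == 0:
        return 0.0
    return 1.0 - 2.0 * math.acos(min(1.0, dot / norm)) / math.pi

# Two sketches sharing their high-abundance hashes and differing
# only in one low-abundance hash each.
query = {1: 10, 2: 10, 4: 1}
match = {1: 10, 2: 10, 3: 1}

flat = jaccard(query, match)               # 2 shared out of 4 total = 0.5
abund = angular_similarity(query, match)   # about 0.94

threshold = 0.6
print(flat >= threshold, abund >= threshold)   # False True
```

Here a search that prunes on flat similarity at 0.6 would drop a match that the abundance comparison reports at about 0.94.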

ctb · 2021-04-17T00:00:27Z

so... we agree? 😁

I think there's a lot of opportunity around storing full signatures in the leaves a la #198, but I think it should be intentional rather than accidental 😆

luizirber · 2021-04-17T02:23:33Z

> so... we agree? 😁

on the problem, yes; not sure we agree on the solution 🙃

Should we be flattening the query on search instead? Or more generally, can we do a similar analysis to #1137 (comment) but for the angular similarity in order to bound searches with abundance too?

luizirber · 2021-04-17T19:53:01Z

OK, I think I found something: fc2ee33 implements an angular similarity upper bound method for comparing Nodegraph/MinHash. It might overestimate (and I still need to check properly how much it is overestimating), but I made https://github.com/luizirber/2021-04-17-angular-bound (Binder) to try out implementations and throw `hypothesis` at it to generate falsifiable test cases.

The benefit of this approach is that #1137 works and we can continue supporting abundance queries for search in SBTs 🙃
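To make the idea of such a bound concrete, here is one simple upper bound that follows from Cauchy-Schwarz. It is not necessarily the one implemented in fc2ee33, and the function names are hypothetical. If every hash of a leaf x is present in a node's hash set S, then q·x only involves the components of the query q that fall in S, so cos(q, x) <= |q restricted to S| / |q|. Angular similarity is monotone in the cosine, so the same bound carries over. Bloom filter false positives only enlarge S, so they can loosen the bound but not invalidate it.

```python
import math

def angular_from_cosine(cos):
    """Map a cosine in [0, 1] to angular similarity."""
    cos = max(0.0, min(1.0, cos))
    return 1.0 - 2.0 * math.acos(cos) / math.pi

def angular_similarity(a, b):
    """Angular similarity between two hash -> abundance dicts."""
    dot = sum(a[h] * b[h] for h in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return angular_from_cosine(dot / norm) if norm else 0.0

def angular_upper_bound(query, node_hashes):
    """Upper bound on the angular similarity between `query` and any leaf
    whose hashes are all contained in `node_hashes` (hypothetical helper)."""
    total = sum(v * v for v in query.values())
    inside = sum(v * v for h, v in query.items() if h in node_hashes)
    if total == 0:
        return 0.0
    return angular_from_cosine(math.sqrt(inside / total))

query = {1: 10, 2: 10, 4: 1}
leaf = {1: 10, 2: 10, 3: 1}
node_hashes = {1, 2, 3, 7, 8}      # union of the leaves below this node

bound = angular_upper_bound(query, node_hashes)
actual = angular_similarity(query, leaf)
print(bound >= actual)             # the bound never undercuts a real leaf
```

A search could prune a subtree whenever `bound` falls below the reporting threshold, at the cost of visiting some nodes that turn out to hold no match.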

ctb · 2021-04-17T20:14:21Z

cool!

I would still like to get #1392 in soonish, and would be happier to enable abundance queries for SBTs in a separate PR. It's not something we've ever advertised or robustly supported, so 🤷 seems like it shouldn't be a big deal to do that, yah?

Separately should look at supporting abundances in LCA databases.

ctb · 2021-04-17T21:59:59Z

ok, got a chance to think about this more. Will see if I can explain my thinking --

- the find refactoring in #1392 (Rework the find functionality for Index classes) is (imo) a nice & clean & robust way to do searching w/o abundance;
- #1370 (refactor gather functionality for speed & modularity; provide prefetch functionality) and #1371 (add a prefetch linear search function to Index) add prefetch/greyhound style functionality that makes clean use of #1392 (see #1465, test-merge of new find code and prefetch);
- once they are merged, I think it is straightforward to extend the find functionality in #1392 to then implement abundance search on top of prefetch;

note here that the #1370/1371 prefetch functionality explicitly supports pruning of search trees (to make full use of SBTs) as well as efficient search using the reverse index (needed for LCA Databases) and I think it can do so with very little modification.

luizirber · 2021-04-17T23:12:47Z

> Separately should look at supporting abundances in LCA databases.

On the LCA with abundance side, I tried out abundances in greyhound but ended up going in another more memory-frugal direction (using colors), but the code is still alive in sourmap. I can bring it back to #1238 as a separate index (let's avoid the confusion with data structures that support both abundance and no_abundance use cases =P)

> I would still like to get #1392 in soonish, and would be happier to enable abundance queries for SBTs in a separate PR. It's not something we've ever advertised or robustly supported, so 🤷 seems like it shouldn't be a big deal to do that, yah?

yeah, I was just worried about breaking a use case that we are not testing but someone depends on ("every bug in your software will be someone else's feature"). The angular similarity solution in #1137 can be made faster (eventually), but for now it guarantees that everything still works the same.

ctb · 2021-04-18T13:22:47Z

> > I would still like to get #1392 in soonish, and would be happier to enable abundance queries for SBTs in a separate PR. It's not something we've ever advertised or robustly supported, so 🤷 seems like it shouldn't be a big deal to do that, yah?
>
> yeah, I was just worried about breaking a use case that we are not testing but someone depends on ("every bug in your software will be someone else's feature"). The angular similarity solution in #1137 can be made faster (eventually), but for now it guarantees that everything still works the same.

side note, in #1392 where I disable this functionality (which is only exposed in sourmash categorize), I put in an error that tells the user that we can't do this search unless --ignore-abundance is specified. That seemed like the most responsible thing to do ;). But it sounds like we can aim to have #1137 in the next release, too.


ctb mentioned this issue Apr 13, 2021: [MRG] Rework the find functionality for Index classes #1392 (merged)

luizirber closed this as completed in #1392 Apr 22, 2021

ctb mentioned this issue Apr 23, 2021: Draft release notes for v4.1.0 #1391 (closed)
