Lazy loading LinearIndex? #1097

Closed
luizirber opened this issue Jul 11, 2020 · 10 comments

@luizirber
Member

Related to the use case in #1096: LinearIndex could potentially be lazy-loaded: during insertion, take note of the existing file location, and only load the data when required (during iteration, for example), and potentially .unload() it after use.

Problems:

  • Might need another method in the LinearIndex, because .insert() takes a Signature, which has already lost the location of the existing file...
  • Need a Storage that supports linking to an existing file instead of trying to save it to a new place (FSStorage and ZipStorage sadly don't fit this usage...).
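A minimal sketch of the idea, assuming a hypothetical LazyLinearIndex class with an insert_location() method (neither exists in sourmash as far as this issue shows) and plain JSON files standing in for signatures: the index records only paths at insertion and reads each file when iteration reaches it.

```python
import json
import os
import tempfile


class LazyLinearIndex:
    """Hypothetical index that remembers file locations, not loaded data."""

    def __init__(self):
        self._locations = []

    def insert_location(self, path):
        # Record only where the signature lives; nothing is read yet.
        self._locations.append(path)

    def signatures(self):
        # Each file is opened only when iteration reaches it.
        for path in self._locations:
            with open(path) as fp:
                yield json.load(fp)

    def __len__(self):
        return len(self._locations)


# Usage: write two tiny stand-in "signature" files, then iterate lazily.
tmpdir = tempfile.mkdtemp()
index = LazyLinearIndex()
for name in ("a", "b"):
    path = os.path.join(tmpdir, name + ".sig")
    with open(path, "w") as fp:
        json.dump({"name": name, "ksize": 31}, fp)
    index.insert_location(path)

loaded_names = [sig["name"] for sig in index.signatures()]
```

Because the index points at files where they already are, this sidesteps the second problem above only for local paths; a real Storage would still need a "link, don't copy" mode.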
@ctb
Contributor

ctb commented Feb 28, 2021

Random thought - is this a place where some kind of signature manifest would be useful, so that we can support not loading signatures that don't match a selector (ksize, moltype, etc.)?

@luizirber
Member Author

Hmm, I don't know why I didn't mention this when I opened the issue, but SigLeaf kind of works for this too. On the Rust side I started calling it SigStore, and it knows how to load/unload the data. SigStore/SigLeaf doesn't exist as a separate file; it is saved directly in the JSON file describing the SBT (and could be saved in the LinearIndex too), which also matches the idea of the signature manifest.
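Roughly what that load/unload behavior could look like in Python. This is only a sketch: SigStoreSketch, its field names, and the record layout are made up for illustration and are not the actual SigLeaf or Rust SigStore API.

```python
import json
import os
import tempfile


class SigStoreSketch:
    """Hypothetical container: metadata stays in memory, data loads on demand."""

    def __init__(self, name, location):
        self.name = name
        self.location = location
        self._data = None  # not loaded yet

    @property
    def data(self):
        # Read the file on first access, then reuse the cached copy.
        if self._data is None:
            with open(self.location) as fp:
                self._data = json.load(fp)
        return self._data

    def unload(self):
        # Drop the cached data; the location is enough to reload later.
        self._data = None

    @property
    def is_loaded(self):
        return self._data is not None

    def to_record(self):
        # The part that would be written into the index/SBT JSON.
        return {"name": self.name, "filename": self.location}


# Usage with a stand-in signature file.
sig_path = os.path.join(tempfile.mkdtemp(), "a.sig")
with open(sig_path, "w") as fp:
    json.dump({"mins": [1, 2, 3]}, fp)

store = SigStoreSketch("a", sig_path)
loaded_before = store.is_loaded
mins = store.data["mins"]
loaded_after = store.is_loaded
store.unload()
loaded_after_unload = store.is_loaded
```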

@ctb
Contributor

ctb commented Mar 11, 2021

Right, that's the leaf.data stuff, right? I think that container approach is perfectly appropriate, too, where the signature is just the signature, and any "metadata" about the signature is in an object containing the signature.

location strings

This is maybe a bit off topic from this issue, but what about including an optionally-set .location property (or perhaps better, .storage) on SourmashSignature that you can use to get a handle to that signature, if one is available? I'm not sure I'm thinking about this right, but let's see --

for ss in db.signatures():
   print(ss.location)

would give you the sourmash-loadable location of that signature, if available.

locations = set()
for ss in db.signatures():
   locations.add(ss.location)

would give you the (presumably lightweight :) set of strings to use to load those signatures, and

sigs = []
for loc in locations:
   ss = storage.load(loc)
   sigs.append(ss)

would load all those signatures. Perhaps obvious, but the location would be relative to the database storage location.

You could do something similar with Index classes/databases - maybe add a .storage object,

for ss in db.signatures():
   assert ss.storage == db  # or db.storage, or something?

and then have db.location optionally point to a sourmash-loadable storage URL that can be understood by index.load or something.

Then you could do:

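from urllib.parse import urljoin

sig_locs = []
for ss in db.signatures():
    # db.location and ss.location are the proposed (not yet existing) properties
    full_loc = urljoin(db.location, ss.location)
    sig_locs.append(full_loc)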
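One detail worth noting if the join is done with the standard library's urllib.parse.urljoin: the database location needs a trailing slash, otherwise its last path component is replaced rather than extended. The URLs below are placeholders, not real sourmash locations.

```python
from urllib.parse import urljoin

# Placeholder locations, for illustration only.
with_slash = urljoin("https://example.org/db/", "sigs/a.sig")
without_slash = urljoin("https://example.org/db", "sigs/a.sig")

print(with_slash)     # https://example.org/db/sigs/a.sig
print(without_slash)  # https://example.org/sigs/a.sig
```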

intersection with picklists/shopping carts

I am thinking about defining manifests as "must be complete", i.e. they contain info on every signature in a collection; then we can have separate "pick lists" or "shopping carts" that are output by search, gather, prefetch, compare, etc. The idea is that these would support CLI and API access to (quickly) load subsets of signatures from collections, i.e.

siglist = []
for sigloc in manifest.locations:
    # load_signature_from_url is a proposed function, not an existing one
    siglist.append(sourmash.load_signature_from_url(sigloc))

and

sourmash signature extract --picklist xyz.txt genbank-k31.sbt.zip

intersection with manifests

Then, if you wanted to support lazy loading, you would have a manifest with ksize, moltype, scaled/num, abund, etc., and the select command would narrow those down to pick lists (see below :), and databases would do the obvious and hopefully not-overly-clever thing of only loading the signatures in the picklist.
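A sketch of that flow, assuming the manifest is just a list of dicts and that select(), the column names, and the loader are placeholders rather than the eventual sourmash API: selection happens on the lightweight manifest rows, and only the locations that end up in the picklist are ever loaded.

```python
# Hypothetical manifest: one lightweight row per signature in the collection.
manifest = [
    {"location": "sigs/a.sig", "ksize": 31, "moltype": "DNA", "scaled": 1000},
    {"location": "sigs/b.sig", "ksize": 21, "moltype": "DNA", "scaled": 1000},
    {"location": "sigs/c.sig", "ksize": 31, "moltype": "protein", "scaled": 200},
]


def select(rows, **criteria):
    """Keep only the rows whose columns match every given criterion."""
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in criteria.items())
    ]


def load_selected(rows, picklist, loader):
    """Call the loader only for rows whose location is in the picklist."""
    for row in rows:
        if row["location"] in picklist:
            yield loader(row["location"])


# Narrow the manifest down to a picklist without touching any signature data.
picklist = {row["location"] for row in select(manifest, ksize=31, moltype="DNA")}

# Stand-in loader that records which locations were actually read.
touched = []


def fake_loader(location):
    touched.append(location)
    return {"loaded_from": location}


sigs = list(load_selected(manifest, picklist, fake_loader))
```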

@luizirber
Member Author

Before I answer the longer comment, an observation about how this is turning out to be a fractal problem =]


SBTs are organized using two parts:

  • a JSON describing the structure of the tree
  • a storage containing the data for each node in the tree (internal nodes and signatures)
    The structure doesn't contain the data, only a record of the location/path that can be loaded from or saved to a storage (the SigLeaf/Node classes in SBTs).
    This separation allows swapping storages (hidden dir, Zip, IPFS), and also avoids loading all the SBT data in-memory (lazy loading).

Signatures are organized as one (JSON, possibly compressed) file, and the content (minhash sketches) is present in this file. This is easy to distribute (because it is only one file), but requires reading the full file to extract metadata or a subset of the signature.


It seems that we are moving toward adopting a more SBT-style layout for signatures, turning the current Signature JSON into something more like a manifest (a la SBT JSON?), and maybe storing signatures in a format that allows lazy loading (zip files? sqlite?). In this case, do we want to separate the minhash sketches from the signature JSON? Or support a mixed case where it can be either inlined or available in an external storage? Inlined is convenient, especially for small sketches, but external is VERY useful for gigantic signatures.
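The mixed case could look something like the sketch below. The record fields ("mins", "location") and the dict standing in for a storage are assumptions for illustration, not the current signature JSON format.

```python
import json


def resolve_sketch(record, storage):
    """Return the hashes for a record, inlined or fetched from storage."""
    if "mins" in record:
        # Small sketch: the data travels inside the manifest-like record.
        return record["mins"]
    # Large sketch: the record only points at where the data lives.
    return json.loads(storage[record["location"]])["mins"]


# A dict stands in for an external storage (directory, zip file, ...).
storage = {"big.sig": json.dumps({"mins": [10, 20, 30, 40]})}

inlined = {"name": "small", "ksize": 31, "mins": [1, 2]}
external = {"name": "big", "ksize": 31, "location": "big.sig"}

small_mins = resolve_sketch(inlined, storage)
big_mins = resolve_sketch(external, storage)
```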

@ctb
Contributor

ctb commented Apr 30, 2021

@bluegenes writes:

If I want to (not so hypothetically) pick certain sigs to compare, (by name), across 300k sigs, I would currently build a name::sigfile csv (that can be used as a lookup dict), then load sigs as needed for comparisons. Is there a better way to do this with Zipfiles yet?

to which my answer was: no, there is not - yet :)
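For reference, the CSV workaround described in the quote might look like this. The column names, file paths, and loader are placeholders; only the name-to-file lookup idea comes from the comment.

```python
import csv
import io

# Stand-in for a name,sigfile CSV built ahead of time.
csv_text = "name,sigfile\ngenomeA,sigs/a.sig\ngenomeB,sigs/b.sig\ngenomeC,sigs/c.sig\n"

# Lookup dict: signature name -> file that contains it.
lookup = {
    row["name"]: row["sigfile"]
    for row in csv.DictReader(io.StringIO(csv_text))
}


def load_by_name(names, lookup, loader):
    """Load only the requested signatures, leaving the rest on disk."""
    return {name: loader(lookup[name]) for name in names}


# Placeholder loader; a real one would parse the signature file.
picked = load_by_name(["genomeA", "genomeC"], lookup, lambda path: "loaded:" + path)
```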

@ctb
Contributor

ctb commented May 8, 2021

sourmash has evolved quite a bit in this area since this issue was created, most recently with #1370 (which is not yet merged, but probably will be). I think the class to look at would now be MultiIndex instead of LinearIndex for lazy loading, but I haven't thought too much more about this functionality.

Note too that the prefetch functionality in #1493 essentially supports the end goal here because it makes only a single pass across as many files as you give it.

@ctb
Contributor

ctb commented Jun 19, 2021

See LazyMultiIndex in #1590 and comment here for something that would be useful in conjunction with #1590 -

if we had a LazyLinearIndex class with a manifest attached that only loaded the signatures when .signatures() was called, it could interact with LazyMultiIndex in #1590 to support full lazy loading of .sig files. The only tricky bit in implementation is that we need to figure out how to specify a manifest location 🤷.

@ctb
Contributor

ctb commented Mar 25, 2022

@bluegenes writes:

If I want to (not so hypothetically) pick certain sigs to compare, (by name), across 300k sigs, I would currently build a name::sigfile csv (that can be used as a lookup dict), then load sigs as needed for comparisons. Is there a better way to do this with Zipfiles yet?

to which my answer was: no, there is not - yet :)

UPDATE Mar 25, 2022: yes, this works great with picklists and also pattern matching (--include-db-pattern/--exclude-db-pattern)

@ctb
Contributor

ctb commented Mar 26, 2022

I think this issue resolves now back to the original question at the top: can we direct-index into on-disk JSON files to pull out just the signature we want? And I think the answer is yes, via manifests and their internal_location entry. But it seems also like zip files are a better answer to this.
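As a rough illustration of why zip files fit here, the sketch below pulls a single member out of an in-memory zip using a manifest row's internal_location. The member paths, the "name" column, and the JSON payloads are assumptions, not the actual sourmash zip layout.

```python
import io
import json
import zipfile

# Build a small zip in memory with two stand-in signature files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("signatures/a.sig", json.dumps({"name": "a"}))
    zf.writestr("signatures/b.sig", json.dumps({"name": "b"}))

# Manifest rows map a signature name to its location inside the zip.
manifest = [
    {"name": "a", "internal_location": "signatures/a.sig"},
    {"name": "b", "internal_location": "signatures/b.sig"},
]


def load_one(zip_bytes, manifest, name):
    """Open only the zip member that the manifest points to."""
    row = next(r for r in manifest if r["name"] == name)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(row["internal_location"]) as fp:
            return json.load(fp)


sig_b = load_one(buf.getvalue(), manifest, "b")
```

The zip central directory gives random access to members, so nothing other than the requested entry has to be decompressed.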

@ctb
Contributor

ctb commented Aug 3, 2022

I'm going to close this based on the last comment:

it seems also like zip files are a better answer to this.


2 participants