Lazy loading LinearIndex? #1097
Comments
Random thought: is this a place where some kind of signature manifest would be useful, so that we can skip loading signatures that don't match a selector (ksize, moltype, etc.)?
Hmm, I don't know why I didn't mention this when I opened the issue, but
Right, that's the

location strings

This is maybe a bit off topic from this issue, but what about including an optionally-set location on each signature (ss.location below)? Then

    for ss in db.signatures():
        print(ss.location)

would give you the sourmash-loadable location of that signature, if available.

    locations = set()
    for ss in db.signatures():
        locations.add(ss.location)

would give you the (presumably lightweight :) set of strings to use to load those signatures, and

    sigs = []
    for loc in locations:
        ss = storage.load(loc)
        sigs.append(ss)

would load all those signatures. Perhaps obvious, but the location would be relative to the database storage location.

You could do something similar with ss.storage:

    for ss in db.signatures():
        assert ss.storage == db  # or db.storage, or something?

and then have a db.location to go with it. Then you could do:

    from urllib.parse import urljoin

    sig_locs = []
    for ss in db.signatures():
        # resolve the signature's relative location against the database's
        full_loc = urljoin(db.location, ss.location)
        sig_locs.append(full_loc)

intersection with picklists/shopping carts

I am thinking about defining manifests as "must be complete", i.e. they contain info on every signature in a collection; then we can have separate "pick lists" or "shopping carts" that are output by search, gather, prefetch, compare, etc. The idea is that these would support CLI and API access to (quickly) load subsets of signatures from collections, i.e.

    siglist = []
    for sigloc in manifest.locations:
        # collect each loaded signature instead of overwriting the list
        siglist.append(sourmash.load_signature_from_url(sigloc))

and

    sourmash signature extract --picklist xyz.txt genbank-k31.sbt.zip

intersection with manifests

Then, if you wanted to support lazy loading, you would have a manifest with ksize, moltype, scaled/num, abund, etc., and the select command would narrow those down to pick lists (see below :), and databases would do the obvious and hopefully not-overly-clever thing of only loading the signatures in the picklist.
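To make the manifest-to-picklist idea concrete, here is a minimal sketch in plain Python. It does not use the sourmash API: `ManifestRow`, `select`, `load_picked`, and the `fake_load` stand-in are all hypothetical names invented for illustration, and the manifest is assumed to hold only metadata plus a location string for each signature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestRow:
    """One row of a hypothetical manifest: metadata only, no sketch data."""
    location: str  # where the signature lives, relative to the database
    ksize: int
    moltype: str
    scaled: int


def select(manifest, *, ksize=None, moltype=None):
    """Narrow a complete manifest down to a picklist of locations."""
    picklist = []
    for row in manifest:
        if ksize is not None and row.ksize != ksize:
            continue
        if moltype is not None and row.moltype != moltype:
            continue
        picklist.append(row.location)
    return picklist


def load_picked(picklist, load_one):
    """Load only the signatures named in the picklist.

    `load_one` stands in for whatever the storage uses to load one location.
    """
    return [load_one(loc) for loc in picklist]


manifest = [
    ManifestRow("sigs/a.sig", ksize=21, moltype="DNA", scaled=1000),
    ManifestRow("sigs/b.sig", ksize=31, moltype="DNA", scaled=1000),
    ManifestRow("sigs/c.sig", ksize=31, moltype="protein", scaled=200),
]

loaded_from = []  # records which locations were actually touched


def fake_load(loc):
    # Hypothetical loader: a real one would read and parse the file.
    loaded_from.append(loc)
    return {"location": loc}


picklist = select(manifest, ksize=31, moltype="DNA")
sigs = load_picked(picklist, fake_load)
```

The point of the split is that `select` only ever looks at manifest rows, so the cost of narrowing a large collection does not depend on sketch size; only `load_picked` touches signature data.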
Before I answer the longer comment, an observation about how this is turning out to be a fractal problem =] SBTs are organized using two parts:
Signatures are organized as one (JSON, possibly compressed) file, and the content (minhash sketches) is present in this file. This is easy to distribute (because it is only one file), but requires reading the full file to extract metadata or a subset of the signature.

It seems that we are moving toward a more SBT-style organization for signatures, turning the current Signature JSON into something more like a manifest (a la SBT JSON?), and maybe storing signatures in a format that allows lazy loading (zip files? sqlite?). In this case, do we want to separate the minhash sketches from the signature JSON? Or support a mixed case where a sketch can be either inlined or available in external storage? Inlined is convenient, especially for small sketches, but external is VERY useful for gigantic signatures.
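The mixed inlined/external case could look roughly like the sketch below. This is an illustration only, not the sourmash file format: the `manifest.json` member, the `mins` and `mins_location` keys, and all function names are assumptions. It uses a zip archive so that a large sketch can be read on its own without parsing everything else.

```python
import io
import json
import zipfile


def write_collection(buf, records):
    """Write a manifest plus any external sketches into one zip archive."""
    manifest = []
    with zipfile.ZipFile(buf, "w") as zf:
        for name, hashes, inline in records:
            entry = {"name": name}
            if inline:
                entry["mins"] = hashes  # small sketch: keep it inline
            else:
                member = f"sketches/{name}.json"
                zf.writestr(member, json.dumps(hashes))
                entry["mins_location"] = member  # large sketch: store a pointer
            manifest.append(entry)
        zf.writestr("manifest.json", json.dumps(manifest))


def read_manifest(buf):
    """Read only the manifest member; external sketches stay untouched."""
    with zipfile.ZipFile(buf) as zf:
        return json.loads(zf.read("manifest.json"))


def get_hashes(buf, entry):
    """Return the hashes for one entry, reading the archive only if needed."""
    if "mins" in entry:
        return entry["mins"]
    with zipfile.ZipFile(buf) as zf:
        return json.loads(zf.read(entry["mins_location"]))


buf = io.BytesIO()
write_collection(buf, [
    ("small", [1, 2, 3], True),
    ("huge", list(range(10_000)), False),
])
manifest = read_manifest(buf)
by_name = {entry["name"]: entry for entry in manifest}
```

With this layout, metadata queries only read `manifest.json`, while `get_hashes` pulls in a large sketch only when someone asks for it.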
@bluegenes writes:
to which my answer was: no, there is not - yet :)
sourmash has evolved quite a bit in this area since this issue was created, most recently with #1370 (which is not yet merged, but probably will be). I think the class to look at would now be Note too that the |
See if we had a |
UPDATE Mar 25, 2022: yes, this works great with picklists and also pattern matching ( |
I think this issue resolves now back to the original question at the top: can we direct-index into on-disk JSON files to pull out just the signature we want? And I think the answer is yes, via manifests and their |
I'm going to close this based on the last comment:
Related to the use case in #1096: LinearIndex could potentially be lazy-loaded: during insertion, take note of the existing file location, and only load the data when required (during iteration, for example), and potentially .unload() it after use.

Problems:
- LinearIndex, because .insert() takes a Signature, which already lost the location for the existing file...
- Storage that supports linking to an existing file instead of trying to save it to a new place (FSStorage and ZipStorage sadly don't fit this usage...).
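As a rough illustration of the lazy-loading idea described above, the toy class below records file locations at insertion time, reads each file only when iteration reaches it, and can drop the loaded data again. It is not sourmash's LinearIndex: `LazyLinearIndex`, `insert_location`, and `loaded_count` are hypothetical names, and plain JSON files stand in for signature files.

```python
import json
import os
import tempfile


class LazyLinearIndex:
    """Toy index that stores file locations and loads data on demand."""

    def __init__(self):
        self._locations = []  # paths noted at insertion time
        self._cache = {}      # location -> loaded data

    def insert_location(self, location):
        # Takes a path rather than a loaded signature, so the location is kept.
        self._locations.append(location)

    def _load(self, location):
        if location not in self._cache:
            with open(location) as fp:
                self._cache[location] = json.load(fp)
        return self._cache[location]

    def signatures(self):
        # Data is read only when iteration actually reaches it.
        for location in self._locations:
            yield self._load(location)

    def unload(self):
        # Drop loaded data but keep the locations so it can be reloaded later.
        self._cache.clear()

    def loaded_count(self):
        return len(self._cache)


with tempfile.TemporaryDirectory() as tmpdir:
    index = LazyLinearIndex()
    for name in ("a", "b"):
        path = os.path.join(tmpdir, f"{name}.sig")
        with open(path, "w") as fp:
            json.dump({"name": name}, fp)
        index.insert_location(path)

    loaded_before = index.loaded_count()
    names = [sig["name"] for sig in index.signatures()]
    loaded_after = index.loaded_count()
    index.unload()
    loaded_final = index.loaded_count()
```

Inserting by location sidesteps the first problem listed above (a Signature passed to .insert() no longer knows where it came from), at the cost of a different insertion interface.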