Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: provide --internal-storage and --no-internal-storage for index #390

Merged
merged 21 commits into from
Aug 12, 2024

Conversation

ctb
Copy link
Collaborator

@ctb ctb commented Jul 12, 2024

Provides command-line access to the sourmash v0.15 internalize_storage() function added in sourmash-bio/sourmash#3250, via index --internal-storage (defaults to True).

Documentation is separately updated in #416.

This PR also:

  • expands tests to cover both internal and exteranl behaviors;
  • renames .rdb to .rocksdb internally.

Punted to other issues:

Behavior that is not tested:

  • relocatability of rocksdb w/internal storage: can we rename the index directory after --internal-storage is used, and will it still work?

@ctb
Copy link
Collaborator Author

ctb commented Jul 12, 2024

(this comment is now outdated, but I'm leaving it in for posterity and reference ;)

With the latest trace statements, I find the problem (as expected ;) here, in the storage::InnerStorage struct.

InnerStorage::from_spec: opening zip://list.sig.zip

Conveniently 😭 this is not a new bug or problem, this is the same problem that I discussed ad nauseum over in sourmash-bio/sourmash#3008, where we found it to be a problem for manifests - triggered initially by sourmash-bio/sourmash#3053.

In brief, the problem is this: when we refer to an external storage, how do we interpret non-absolute paths?

As I wrote in #3008, "I am slowly coming around to the idea that loading things relative to the manifest path is correct."

So I think the Right Fix would be to rejigger the path to the zip file to be interpreted relative to the RocksDB location.

However, there is another fun component to this, which is that I'm not sure it's documented anywhere that RocksDB indices created by this plugin store the sketches externally, which is needed for gather (but not for manysearch)... Per @luizirber on slack,

me:

do RocksDB disk rev indexes, once created, require access to the original signatures to function?

@luizirber:

Yes for gather, no for search
You can also build the index with internal sigs, but I don’t think it’s the default

So anyway I think maybe the default for branchwater plugin should be to build a self-contained index. I'm going to dig into the code a bit more to see if that's possible.

@ctb
Copy link
Collaborator Author

ctb commented Jul 12, 2024

There's another wrinkle that I'm having trouble sorting out - it's not clear to me that fastmultigather on the RocksDB index is actually using the external sketches? For the FS storage, I can remove the sketches and it still works 🤔 .

@ctb
Copy link
Collaborator Author

ctb commented Jul 12, 2024

There's another wrinkle that I'm having trouble sorting out - it's not clear to me that fastmultigather on the RocksDB index is actually using the external sketches? For the FS storage, I can remove the sketches and it still works 🤔 .

OK, I was wrong. It looks like fastmultigather appropriately errors out when the sketches are not available, either via FSStorage or ZipStorage.

The real difference is that ZipStorage errors out on load, while FSStorage errors out as soon as it tries to do anything. So sourmash scripts check fails on ZipStorage load, while FSStorage only fails on fastmultigather. I'll write a test.

@ctb
Copy link
Collaborator Author

ctb commented Jul 12, 2024

The real difference is that ZipStorage errors out on load, while FSStorage errors out as soon as it tries to do anything. So sourmash scripts check fails on ZipStorage load, while FSStorage only fails on fastmultigather. I'll write a test.

Even more fun -

  • when using rocksdb built from ZipStorage, if the zip file is not available, fastmultigather will fail to open the rocksdb.
  • when using rocksdb built from FSStorage, if the files are not available it looks like fastmultigather produces an empty CSV file.

See commit 310df3d for an example test.

@ctb
Copy link
Collaborator Author

ctb commented Jul 13, 2024

You can also build the index with internal sigs, but I don’t think it’s the default

from slack - luiz says:

hmm, there was a an internal/external divide in the initial implementation, it might not have survived future refactors and the internal (save sigs in rocksdb) is not working anymore. you can see a leftover here:
https://github.com/sourmash-bio/sourmash/blob/e3a4d4bcf150d2d58c36b046c35ed61cf9835686/src/core/src/index/revindex/disk_revindex.rs#L486
in any case, with collections the internal/external thing shouldn't happen anyway: the manifest has the info, and the SIGS column family from above can be used as the storage (with internal_location as key, and the sig as value).

@ctb
Copy link
Collaborator Author

ctb commented Jul 14, 2024

@ctb ctb changed the title EXP: explore RocksDB location breakages MRG: provide --internal-storage and --no-internal-storage for index Aug 12, 2024
@ctb
Copy link
Collaborator Author

ctb commented Aug 12, 2024

@bluegenes could you take a look at this and see if it all makes sense to you? It's ready for merge.

Copy link
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to me!

@ctb ctb merged commit a780440 into main Aug 12, 2024
1 check passed
@ctb ctb deleted the rocksdb_location branch August 12, 2024 19:48
@ctb ctb mentioned this pull request Aug 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants