Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: on-disk RevIndex based on RocksDB #2230

Merged
merged 30 commits into from
Dec 1, 2023
Merged

Feat: on-disk RevIndex based on RocksDB #2230

merged 30 commits into from
Dec 1, 2023

Conversation

luizirber
Copy link
Member

@luizirber luizirber commented Aug 21, 2022

On-disk RevIndex based on RocksDB, initially implemented in https://github.com/luizirber/2022-06-26-rocksdb-eval
This is the index/core data structure backing https://mastiff.sourmash.bio

There are many changes in the Rust code, so bumping the version to 0.12.0.

This is mostly not exposed thru the FFI yet. Tests from the from the in-memory RevIndex (greyhound) from #1238 were kept working, but it is not well supported (doesn't allow saving/loading from disk, for example), and should be wholly replaced by sourmash::index::revindex::disk_revindex (the on-disk RevIndex) in the future.
It is confusing to have these different RevIndex impls in Rust, and I started converging them, but the work is not completely done yet.

#2727 is a better starting point for how Index abc/trait should work acrosss Python/Rust, and I started moving the Rust indices to start from a LinearIndex and later specialize into a RevIndex, which will make easier to translate the work from #2727 for future indices across FFI.

A couple of new concepts introduced in this PR:

  • a Collection is a Manifest + Storage. So a zip file like the ones for GTDB databases fit this easily (storage = ZipStorage, manifest is read from the zipfile), but a file paths list does too (manifest built from the file paths, storage = FSStorage). This goes in a bit of different direction than thoughts on Storage, manifests, etc. #1901, which was extending Storage to support more functionality. I think Storage should stay pretty bare and mostly deal with loading/saving data, but not much knowledge of what data is there (this is covered with Manifest).

  • a CollectionSet is a consistent collection of signatures. Consistent here means: same k-size, downsample-compatible for scaled, same moltype. You can create a CollectionSet by running .select() on a Collection. CollectionSet is required for building indices (because we shouldn't be building indices mixing k-size/moltype), and greatly simplifies the logic in many places because it is not necessary to check for compatibility.

  • LinearIndex was rewritten based on Collection (and also the greyhound/branchwater parallelism), and this supports the "parallel search without an index" use case. There is no index construction per se here, pretty much just a thin layer on top of Collection implementing functionality expected from indices.

  • Manifest, Selection, and Picklist are still incomplete, but the relevant function definitions are in place, need to barrage it with tests (and potentially exposing to Python and reusing the ones there in Feature: implement FFI for manifest, picklist and selection in Rust #2726)

Feature

  • Initial implementation for Manifest, Selection, and Picklist following
    the Python API.
  • Collection is a new abstraction for working with a set of signatures. A
    collection needs a Storage for holding the signatures (on-disk, in-memory,
    or remotely), and a Manifest to describe the metadata for each signature.
  • Expose CSV parsing and RocksDB errors.
  • New module sourmash::index::revindex::disk_revindex with the on-disk
    RevIndex implementation based on RocksDB.
  • Add iter and iter_mut methods for Signature.
  • Add load_sig and save_sig methods to Storage trait for higher-level data
    manipulation and caching.
  • Add spec method to Storage to allow constructing a concrete Storage from
    a string description.
  • Add InnerStorage for synchronizing parallel access to Storage
    implementations.
  • Add MemStorage for keeping signatures in-memory (mostly for debugging and
    testing).

Refactor

  • Rename HashFunctions variants to follow camel-case, so Murmur64Protein instead of murmur64_protein
  • LinearIndex is now implemented as a thin layer on top of Collection.
  • Move GatherResult to sourmash::index module.
  • Move sourmash::index::revindex to sourmash::index::mem_revindex (this is
    the Greyhound version of revindex, in-memory only). It was also refactored
    internally to build a version of a LinearIndex that will be merged in the
    future with sourmash::index::LinearIndex
  • Move select method from Index trait into a separate Select trait,
    and implement it for Signature based on the new Selection API.
  • Move SigStore into sourmash::storage module, and remove the generic. Now
    it always stores Signature. Also implement Select for it.

Build

  • Add new branchwater feature (enabled by default), which can be disabled by downstream projects to limit bringing heavy dependencies like rocksdb
  • Add new rkyv feature (disabled by default), making MinHash serializable
    with the rkyv crate.
  • Add semver checks for CI (so we bump versions accordingly, or avoid breaking changes)
  • Reduce features combinations on Rust checks (takes much less time to run)
  • Disable musllinux wheels (need to figure out how to build rocksdb for it)

@codecov
Copy link

codecov bot commented Aug 21, 2022

Codecov Report

Attention: 318 lines in your changes are missing coverage. Please review.

Comparison is base (d6994a1) 86.23% compared to head (a748bb3) 86.30%.

Files Patch % Lines
src/core/src/storage.rs 51.63% 59 Missing ⚠️
src/core/src/index/revindex/mod.rs 50.84% 58 Missing ⚠️
src/core/src/index/revindex/mem_revindex.rs 48.91% 47 Missing ⚠️
src/core/src/index/linear.rs 65.16% 31 Missing ⚠️
src/core/src/manifest.rs 62.06% 22 Missing ⚠️
src/core/src/signature.rs 29.03% 22 Missing ⚠️
src/core/src/index/revindex/disk_revindex.rs 89.26% 19 Missing ⚠️
src/core/src/selection.rs 14.28% 18 Missing ⚠️
src/core/src/collection.rs 67.92% 17 Missing ⚠️
src/core/src/ffi/mod.rs 0.00% 8 Missing ⚠️
... and 7 more
Additional details and impacted files
@@            Coverage Diff             @@
##           latest    #2230      +/-   ##
==========================================
+ Coverage   86.23%   86.30%   +0.07%     
==========================================
  Files         130      135       +5     
  Lines       14756    15301     +545     
  Branches     2622     2622              
==========================================
+ Hits        12725    13206     +481     
- Misses       1731     1795      +64     
  Partials      300      300              
Flag Coverage Δ
hypothesis-py 25.80% <0.00%> (ø)
python 92.82% <50.00%> (ø)
rust 57.99% <61.57%> (+7.11%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@luizirber luizirber mentioned this pull request Aug 28, 2022
8 tasks
@luizirber luizirber changed the title [WIP] Brings mastiff into core [WIP] Bring mastiff into core Aug 29, 2022
@luizirber luizirber force-pushed the lirber/mastiff branch 4 times, most recently from 39ddd3a to e2c9e92 Compare September 25, 2022 21:11
luizirber added a commit to sourmash-bio/sra_search that referenced this pull request Oct 2, 2022
Fixes #4 #6
Closes #10 

Bring a few bits from sourmash-bio/sourmash#2230
to support the proper query config (scaled considering downsampling,
both of queries and DB sigs).
@luizirber luizirber force-pushed the lirber/mastiff branch 2 times, most recently from da587db to 2819307 Compare October 22, 2022 17:17
@luizirber luizirber force-pushed the lirber/mastiff branch 3 times, most recently from 28948a1 to 112b4f1 Compare March 31, 2023 02:03
@luizirber luizirber changed the title [WIP] Bring mastiff into core Feat: on-disk RevIndex based on RocksDB Nov 24, 2023
@luizirber luizirber requested a review from a team November 24, 2023 01:24
@ctb
Copy link
Contributor

ctb commented Nov 25, 2023

@bluegenes can you remind me of the relationship between this branch and the pyo3_dependency on commit ff1092f (which may no longer exist in this repo...?)

@luizirber my main request is that you update the description of the PR so that we can refer back to this in the future. Specific thoughts for things to add (or maybe questions) -

  • this bumps the sourmash-rs package to 0.12, right?
  • am I right in understanding that there is no Python API access to the RevIndex? Or does the code in src/sourmash/index/revindex.py make use of the updated class?

thanks!

@luizirber
Copy link
Member Author

luizirber commented Nov 26, 2023

@bluegenes can you remind me of the relationship between this branch and the pyo3_dependency on commit ff1092f (which may no longer exist in this repo...?)

I stopped rebasing this branch and latest and started merging instead, so commits should be available (even if the branch is deleted, but maybe keep it around for a bit).

sourmash-bio/sourmash_plugin_branchwater#134 uses this PR instead, and can flip to use 0.12.0 once this PR is merged and released. We can open smaller PRs to address shortcomings (and release 0.12.x versions, or 0.13.0 if we need to change the API), or add features as needed by pyo3_branchwater.

@luizirber my main request is that you update the description of the PR so that we can refer back to this in the future. Specific thoughts for things to add (or maybe questions) -

PR description updated!

this bumps the sourmash-rs package to 0.12, right?

👍

am I right in understanding that there is no Python API access to the RevIndex? Or does the code in src/sourmash/index/revindex.py make use of the updated class?

Well... there is a light FFI exposing what is now sourmash::index::revindex::mem_revindex and is the in-memory RevIndex from #1238. This is not well supported (doesn't allow saving/loading from disk, for example), and should be wholly replaced by sourmash::index::revindex::disk_revindex (the on-disk RevIndex).

It is confusing to have these different RevIndex impls in Rust, and I started converging them, but the work is not completely done yet.

Nonetheless, I kept the current tests for the Python RevIndex working, but there are not many of them. #2727 is a better starting point for how Index abc/trait should work acrosss Python/Rust, and I started moving the Rust indices to start from a LinearIndex and later specialize into a RevIndex, which will make easier to translate the work from #2727 for future indices across FFI.

@ctb
Copy link
Contributor

ctb commented Nov 26, 2023

thanks!

@bluegenes I leave this in your capable hands ;)

@ctb
Copy link
Contributor

ctb commented Dec 1, 2023

Per out of band conversation: merge now, release sourmash-rs 0.12.0, integrate/test/fix in sourmash-bio/sourmash_plugin_branchwater#134

@ctb ctb merged commit 22ec122 into latest Dec 1, 2023
33 of 34 checks passed
@ctb ctb deleted the lirber/mastiff branch December 1, 2023 00:56
luizirber added a commit to sourmash-bio/mastiff that referenced this pull request Jan 20, 2024
[rocksdb-eval](https://github.com/luizirber/2022-06-26-rocksdb-eval) is
the mastiff PoC, and since
sourmash-bio/sourmash#2230 inherited the
sourmash parts (the Index implementation) it makes sense to move some
operational tools (like `index`, `update`,`check`, and `convert`) here
and archive the repo.
ctb added a commit that referenced this pull request Aug 18, 2024
luizirber added a commit that referenced this pull request Aug 19, 2024
A few useful things from the branchwater plugin perspective:
- add getters to `num`, `scaled`, and `n_hashes` in `Record`
- add misc documentation, based in part on descriptions in #2230
- add `#[derive Clone]` to `Collection` and `CollectionSet`
- add `Collection::from_rocksdb(...)`

---------

Co-authored-by: Luiz Irber <[email protected]>
Co-authored-by: Luiz Irber <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants