Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

how do we get scaled from a RevIndex? #3386

Closed
ctb opened this issue Nov 11, 2024 · 4 comments
Closed

how do we get scaled from a RevIndex? #3386

ctb opened this issue Nov 11, 2024 · 4 comments
Labels

Comments

@ctb
Copy link
Contributor

ctb commented Nov 11, 2024

from slack -

me:

is there a standard or correct way to get the scaled value from a rocksdb revindex?
I know they’re made from a CollectionSet so they have to have a single scaled, I believe. But I don’t see any standard way to get at it.
(running into some …challenges with scaled and manysearch against a rocksdb)

luizirber:

Hmm, I guess you can grab the manifest and read a record from it, but that doesn’t mean that’s the scaled used for the DB, because it can be downscaled
So probably good to save it in the METADATA column, like the manifest is

me:

on reflection, I think we should require that a single scaled be used for the database, in which case either approach would work. Would prefer to avoid changing the METADATA, yah?
happy to be told otherwise. just a hot take.

here’s what I tried out - seems to resolve issues:
https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/504/files#diff-c9cd745a5c795bf1a3ca0d53653a6c2381d33aa49a04218343373e93fc82d830
what do you think about (1) fixing scaled for a database when constructing, and (2) provide a scaled() on the RevIndex?

code in manysearch_rocksdb.rs over in sourmash_plugin_branchwater:

  // grab scaled from the database.
    let max_db_scaled = db
        .collection()
        .manifest()
        .iter()
        .map(|r| r.scaled())
        .max()
        .expect("no records in db?!");

    let selection_scaled: u32 = match selection.scaled() {
        Some(scaled) => {
            if *max_db_scaled > scaled {
                return Err("Error: database scaled is higher than requested scaled".into());
            }
            scaled
        }
        None => {
            eprintln!("Setting scaled={} from the database", *max_db_scaled);
            *max_db_scaled
        }
    };
@ctb ctb added the rust label Nov 11, 2024
@ctb
Copy link
Contributor Author

ctb commented Nov 11, 2024

what do you think about (1) fixing scaled for a database when constructing, and (2) provide a scaled() on the RevIndex?

luiz:

Yes for both, and I’m OK stuffing more into METADATA because this is at the db level, not sig

@ctb
Copy link
Contributor Author

ctb commented Nov 13, 2024

feels like this gels a bit with #3387, which is updating Manifest records with scaled from select.

I think the idea would be to make sure that RocksDB-revindex are built only from CollectionSet (which I think is already the case) and then upgrade CollectionSet to make sure there is only one scaled (it already checks ksize, moltype, etc.)

Looking at the code, we might want to introduce an explicit select in the try_from, since with the updates from #3387 that would update the scaled.

impl TryFrom<Collection> for CollectionSet {
    type Error = crate::Error;

    fn try_from(collection: Collection) -> Result<Self> {
        let first = if let Some(first) = collection.manifest.first() {
            first
        } else {
            // empty collection is consistent ¯\_(ツ)_/¯                        
            return Ok(Self { collection });
        };

        collection
            .manifest
            .iter()
            .skip(1)
            .try_for_each(|c| first.check_compatible(c))?;

        Ok(Self { collection })
    }
}

impl CollectionSet {
    pub fn into_inner(self) -> Collection {
        self.collection
    }

    pub fn selection(&self) -> Selection {
        todo!("Extract selection from first sig")
    }
...
}

@ctb
Copy link
Contributor Author

ctb commented Nov 14, 2024

ok, I'm hesitating to run a select as that will change the function signature - we'd either need to do a clone or change the collection.

@ctb
Copy link
Contributor Author

ctb commented Nov 14, 2024

In #3397, I add min_max_scaled() and enforce that they are the same when converting a Collection into a CollectionSet.

ctb added a commit that referenced this issue Nov 15, 2024
Addresses #3386

#3387 changed `select` to
update `Record.scaled` to the desired scaled value. This PR changes
`CollectionSet` to require that all `scaled` values be the same, which
can now be achieved by running `select` ;).

It also adds a new method `Collection::min_max_scaled()` which makes it
easy to retrieve `scaled` for a `Collection`.
@ctb ctb closed this as completed Nov 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant