-
Notifications
You must be signed in to change notification settings - Fork 292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
foldhash::fast::RandomState
is very bad
#577
Comments
What kind of numbers do you get from ahash or fxhash? cc @orlp |
For such small sets and such a small input domain you are testing very specific relations between input bits and output bits (only the top 7 output bits for the tag and the bottom If an unlucky seed is chosen, you can hit an unlucky avalanche probability, as is documented in the readme for foldhash-f. In the initialization procedure of foldhash I already force some bits on in the global seed, perhaps I can extend this a bit more to force a few more bits on/off such that the small-input case always at least avalanches into the top 7 bits. |
1000 was chosen arbitrarily. I reran this with 0x3800 items for maximum density and averaged over 32768 runs. |
The motivating issue for me was actually small strings. Using to_string() before hashing doesn't change collision rates for most hash functions. For foldhash::fast it becomes consistent but still a bit worse (0.0715 at max density, 0.0349 at min density). For fxhash it becomes 0.133 at max density, 0.0503 at min density. |
Upon further investigation it appears that the issue isn't (just) simply that the 7-bit tag has a bad distribution, but also that for small number inputs it can be correlated with the bottom bits for some seeds. I think I can also resolve this problem by choosing seeds more carefully though. |
@orlp, I was just wondering if you have further thoughts on this edge case? There are various "low u32" keyed hashtables in https://github.com/arthurprs/canopydb that could be affected by this, and I'm not sure if I need to switch to the "quality" hash to prevent such degenerate cases. |
@arthurprs I'm working on analyzing it, at this point I don't have a concrete answer yet. I've collected 1GB worth of seeds along with a performance score on the average number of equality checks performed when inserting the numbers Foldhash-q is rock-solid regardless of seed, so if you want to be 100% certain performance is predictable, it's a good choice, especially in the meantime. |
@Greatness7 Does the problem resolve if you switch to the old hasher, |
Yes I have confirmed that (in v1.15) swapping out use hashbrown::DefaultHashBuilder; with type DefaultHashBuilder = core::hash::BuildHasherDefault<ahash::AHasher>; solves the performance issues that I'm seeing. For context this is a GUI application using the vizia framework. I am not the original author of the framework so I don't know exactly how it all works or what is being hashed, but |
@Greatness7 One more question, does the problem also resolve if you use the quality version of type DefaultHashBuilder = foldhash::quality::RandomState; |
@Greatness7 Actually looking at it a bit more closely I don't think this is the same issue as the one being raised here. This is simply incorrect: pub(crate) fn get_storeid<T: 'static + Hash>(t: &T) -> StoreId {
let mut s = DefaultHashBuilder::default().build_hasher();
TypeId::of::<T>().hash(&mut s);
t.hash(&mut s);
StoreId(s.finish())
} It assumes For I don't believe |
I tested I believe the function in question originally used DefaultHasher::new from std, and when later switching to |
I strongly suspect the program is bugged regardless then, and that the old hasher just didn't expose the bug. Does the program assume that
There is no equivalent in Note that if all you did was change back to type DefaultHashBuilder = core::hash::BuildHasherDefault<ahash::AHasher>; and this is truly only used in that |
You may be right that it could've been bugged before and it was just some property of the old I'll investigate further with the original author and try to figure out if it was incorrect usage all along. |
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=8d9004344a0ddd7799b90e24f5866e75
This counts collisions. The number fluctuates wildly from 1.5% to 7% and mostly depends on global seed (if one measurement is bad then all of them are). Both
foldhash::quality::RandomState
andahash::RandomState
remain at around 2.9% and would be acceptable as default.The text was updated successfully, but these errors were encountered: