`foldhash::fast::RandomState` is very bad #577

turalcar · 2024-10-14T13:55:57Z

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=8d9004344a0ddd7799b90e24f5866e75

This counts collisions. The number fluctuates wildly from 1.5% to 7% and mostly depends on global seed (if one measurement is bad then all of them are). Both foldhash::quality::RandomState and ahash::RandomState remain at around 2.9% and would be acceptable as default.

The text was updated successfully, but these errors were encountered:

Amanieu · 2024-10-14T14:07:28Z

What kind of numbers do you get from ahash or fxhash?

cc @orlp

orlp · 2024-10-14T14:38:28Z

For such small sets and such a small input domain you are testing very specific relations between input bits and output bits (only the top 7 output bits for the tag and the bottom log2(n / 16) output bits for the bucket, if I'm not mistaken). So only 13 of the output bits are being focused on in this test (even less since there's no pre-reserve call), with only 10 input bits.

If an unlucky seed is chosen, you can hit an unlucky avalanche probability, as is documented in the readme for foldhash-f.

In the initialization procedure of foldhash I already force some bits on in the global seed, perhaps I can extend this a bit more to force a few more bits on/off such that the small-input case always at least avalanches into the top 7 bits.

turalcar · 2024-10-14T14:59:08Z

1000 was chosen arbitrarily. I reran this with 0x3800 items for maximum density and averaged over 32768 runs.
fxhash is an interesting one - it seems to always use the same seed: 0.226 collisions per insertion.
ahash: 0.0637 collisions per insertion
std: 0.0637 collisions per insertion
foldhash::quality: 0.0637 collisions per insertion.
foldhash::fast: I ran this 10 times, avg: 0.0699, min: 0.0422, max: 0.213
The result is similar for 0x3801 (minimum density): all hashes provide 0.0273 collision rate, except fxhash which has exactly 0, and foldhash::fast which gave avg: 0.0292, min: 0.0201, max: 0.0458.

turalcar · 2024-10-14T15:20:47Z

The motivating issue for me was actually small strings. Using to_string() before hashing doesn't change collision rates for most hash functions. For foldhash::fast it becomes consistent but still a bit worse (0.0715 at max density, 0.0349 at min density). For fxhash it becomes 0.133 at max density, 0.0503 at min density.

orlp · 2024-10-14T18:05:50Z

Upon further investigation it appears that the issue isn't (just) simply that the 7-bit tag has a bad distribution, but also that for small number inputs it can be correlated with the bottom bits for some seeds. I think I can also resolve this problem by choosing seeds more carefully though.

arthurprs · 2024-10-26T13:41:13Z

@orlp, I was just wondering if you have further thoughts on this edge case? There are various "low u32" keyed hashtables in https://github.com/arthurprs/canopydb that could be affected by this, and I'm not sure if I need to switch to the "quality" hash to prevent such degenerate cases.

orlp · 2024-10-26T16:30:41Z

@arthurprs I'm working on analyzing it, at this point I don't have a concrete answer yet. I've collected 1GB worth of seeds along with a performance score on the average number of equality checks performed when inserting the numbers 0..1<<16 into an hashbrown set, and I'm trying to establish a concrete pattern for what determines whether a seed is good or not, and why. Then I can hopefully update the seed selection to only generate good seeds.

Foldhash-q is rock-solid regardless of seed, so if you want to be 100% certain performance is predictable, it's a good choice, especially in the meantime.

Greatness7 · 2024-12-05T10:28:02Z

I'm seeing a massive decrease in performance when upgrading hashbrown from 0.14 to 0.15 and I think the issue describe here may be the problem.

My application went from well over 60fps to below 10fps after the update. I only realized hashbrown was the cause after attaching a profiler and seeing an outsized impact in the flame graph.

Flame graph when using hashbrown v1.14:

Flame graph when using hashbrown v1.15:

With v1.14 hashbrowns functions aren't significant enough to be noticed in the graph, but with v1.15 they dominate most of the cpu time.

orlp · 2024-12-05T16:42:11Z

@Greatness7 Does the problem resolve if you switch to the old hasher, ahash, with v1.15? If so I'm very curious as to what you're hashing.

Greatness7 · 2024-12-05T23:08:44Z

@Greatness7 Does the problem resolve if you switch to the old hasher, ahash, with v1.15? If so I'm very curious as to what you're hashing.

Yes I have confirmed that (in v1.15) swapping out

use hashbrown::DefaultHashBuilder;

with

type DefaultHashBuilder = core::hash::BuildHasherDefault<ahash::AHasher>;

solves the performance issues that I'm seeing.

For context this is a GUI application using the vizia framework. I am not the original author of the framework so I don't know exactly how it all works or what is being hashed, but DefaultHashBuilder is only used in one place here. Which I think means it's hashing TypeIds.

orlp · 2024-12-06T02:09:50Z

@Greatness7 One more question, does the problem also resolve if you use the quality version of foldhash? So

type DefaultHashBuilder = foldhash::quality::RandomState;

orlp · 2024-12-06T02:46:27Z

@Greatness7 Actually looking at it a bit more closely I don't think this is the same issue as the one being raised here. This is simply incorrect:

pub(crate) fn get_storeid<T: 'static + Hash>(t: &T) -> StoreId {
    let mut s = DefaultHashBuilder::default().build_hasher();
    TypeId::of::<T>().hash(&mut s);
    t.hash(&mut s);
    StoreId(s.finish())
}

It assumes DefaultHashBuilder::default() is always the same, but that's not true for foldhash, it randomly initializes. I think the performance issues are due to having, essentially, random StoreIds.

For foldhash you would need to use foldhash::fast::FixedState, rather than foldhash::fast::RandomState which is what hashbrown defaults to.

I don't believe hashbrown has ever guaranteed that hashbrown::DefaultHashBuilder::default always returns the same hasher, even though it did in the past.

Greatness7 · 2024-12-06T02:59:16Z

@orlp

I tested foldhash::quality::RandomState and it did not resolve the performance issue. Trying with foldhash::fast::FixedState causes the program to no longer work correctly so I cannot easily judge performance there.

I believe the function in question originally used DefaultHasher::new from std, and when later switching to hashbrown to get some performance gains started using DefaultHashBuilder ( link to the diff ). I guess it may have behaved similarly by coincidence? Do you know what the hashbrown equivalent of this std function would be?

orlp · 2024-12-06T13:21:47Z

@Greatness7

Trying with foldhash::fast::FixedState causes the program to no longer work correctly so I cannot easily judge performance there.

I strongly suspect the program is bugged regardless then, and that the old hasher just didn't expose the bug. Does the program assume that get_storeid returns unique ids without any collisions? Because no hash guarantees that.

Do you know what the hashbrown equivalent of this std function would be?

There is no equivalent in hashbrown, hashbrown isn't a hashing library, and that get_storeid function is just a hash.

Note that if all you did was change back to

type DefaultHashBuilder = core::hash::BuildHasherDefault<ahash::AHasher>;

and this is truly only used in that get_storeid function, the performance issues aren't due to the hasher used in hashbrown, because in that scenario hashbrown is stilll using foldhash.

Greatness7 · 2024-12-07T01:05:04Z

@Greatness7

Trying with foldhash::fast::FixedState causes the program to no longer work correctly so I cannot easily judge performance there.

I strongly suspect the program is bugged regardless then, and that the old hasher just didn't expose the bug. Does the program assume that get_storeid returns unique ids without any collisions? Because no hash guarantees that.

Do you know what the hashbrown equivalent of this std function would be?

There is no equivalent in hashbrown, hashbrown isn't a hashing library, and that get_storeid function is just a hash.

Note that if all you did was change back to

type DefaultHashBuilder = core::hash::BuildHasherDefaultahash::AHasher;

and this is truly only used in that get_storeid function, the performance issues aren't due to the hasher used in hashbrown, because in that scenario hashbrown is stilll using foldhash.

You may be right that it could've been bugged before and it was just some property of the old DefaultHashBuilder hiding the problems. Interesting though that it was like that for years without showing any signs and the v1.15 update caused things to degrade immediately.

I'll investigate further with the original author and try to figure out if it was incorrect usage all along.

This was referenced Nov 9, 2024

Trying to redo workspace deps AFLplusplus/LibAFL#2672

Closed

Update hashbrown requirement from 0.14.5 to 0.15.1 AFLplusplus/LibAFL#2674

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`foldhash::fast::RandomState` is very bad #577

`foldhash::fast::RandomState` is very bad #577

turalcar commented Oct 14, 2024

Amanieu commented Oct 14, 2024

orlp commented Oct 14, 2024 •

edited

Loading

turalcar commented Oct 14, 2024

turalcar commented Oct 14, 2024

orlp commented Oct 14, 2024 •

edited

Loading

arthurprs commented Oct 26, 2024 •

edited

Loading

orlp commented Oct 26, 2024

Greatness7 commented Dec 5, 2024

orlp commented Dec 5, 2024

Greatness7 commented Dec 5, 2024 •

edited

Loading

orlp commented Dec 6, 2024

orlp commented Dec 6, 2024 •

edited

Loading

Greatness7 commented Dec 6, 2024

orlp commented Dec 6, 2024 •

edited

Loading

Greatness7 commented Dec 7, 2024 •

edited

Loading

foldhash::fast::RandomState is very bad #577

foldhash::fast::RandomState is very bad #577

Comments

turalcar commented Oct 14, 2024

Amanieu commented Oct 14, 2024

orlp commented Oct 14, 2024 • edited Loading

turalcar commented Oct 14, 2024

turalcar commented Oct 14, 2024

orlp commented Oct 14, 2024 • edited Loading

arthurprs commented Oct 26, 2024 • edited Loading

orlp commented Oct 26, 2024

Greatness7 commented Dec 5, 2024

orlp commented Dec 5, 2024

Greatness7 commented Dec 5, 2024 • edited Loading

orlp commented Dec 6, 2024

orlp commented Dec 6, 2024 • edited Loading

Greatness7 commented Dec 6, 2024

orlp commented Dec 6, 2024 • edited Loading

Greatness7 commented Dec 7, 2024 • edited Loading

`foldhash::fast::RandomState` is very bad #577

`foldhash::fast::RandomState` is very bad #577

orlp commented Oct 14, 2024 •

edited

Loading

orlp commented Oct 14, 2024 •

edited

Loading

arthurprs commented Oct 26, 2024 •

edited

Loading

Greatness7 commented Dec 5, 2024 •

edited

Loading

orlp commented Dec 6, 2024 •

edited

Loading

orlp commented Dec 6, 2024 •

edited

Loading

Greatness7 commented Dec 7, 2024 •

edited

Loading