feat: recognize and use over sized allocations #523

Open
wants to merge 3 commits into
base: master

@morrisonlevi morrisonlevi commented May 4, 2024

Allocators are allowed to return a larger memory chunk than was asked for. If the extra amount is large enough, the hash map can use the extra space. The Global allocator will not hit this path, because it won't over-size enough to matter, but custom allocators may. An example of an allocator which allocates full system pages is included in the test suite (Unix only, because it uses mmap).
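
To make the premise concrete, here is a minimal sketch of how much memory a page-granular allocator would hand back for a given request. The helper name and the 4096-byte page size are assumptions for illustration only; this is not the mmap-based allocator from the test suite.

```rust
/// Hypothetical helper: the number of bytes a page-granular allocator would
/// return for a request of `requested` bytes.
/// `page_size` must be a non-zero power of two (4096 is only an assumed value).
fn page_rounded_size(requested: usize, page_size: usize) -> Option<usize> {
    debug_assert!(page_size.is_power_of_two());
    // Round up to the next multiple of `page_size`; overflow is reported as None.
    let mask = page_size - 1;
    requested.checked_add(mask).map(|n| n & !mask)
}
```

Under these assumptions, an 80-byte request comes back as a full 4096-byte page, which is the kind of surplus this PR tries to put to use.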

This implements #489. This relies on PR #524 to increase the minimum number of buckets for certain small types, which in turn constrains the domain of maximum_buckets_in so that the alignment can be ignored.

I haven't done any performance testing. Since this is on the slow path of making a new allocation, the feature should be doable without too much concern about overhead.

This is my first contribution to the project, and I am definitely not an expert in swiss tables. Feedback is very welcome, even nitpicking.


morrisonlevi commented May 5, 2024

Edit: this comment is out-of-date with the current implementation, left because it's interesting.


I've figured out the rough equation to do it without a loop:

fn maximum_buckets_in(allocation_size: usize, table_layout: TableLayout) -> usize {
    // Given an equation like:
    //   z >= x * y + x
    // x can be maximized by doing:
    //   x = z / (y + 1)
    // If you squint:
    //   z is the size of the allocation
    //   y is the table_layout.size
    //   x is the number of buckets
    // But there are details like x needing to be a power of 2,
    // and there are some extra bytes mixed in (a possible
    // rounding up for table_layout.align, and Group::WIDTH).
    // TODO: how do I factor in the ctrl_align?
    let z = allocation_size - Group::WIDTH;
    let y_plus_1 = table_layout.size + 1;
    prev_pow2(z / y_plus_1)
}

I'm not quite sure about table_layout.ctrl_align and need to think about it more. It seems like it can be ignored, but I haven't yet worked out why, or found a proof.
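
For readers who want to try the closed form, here is a self-contained sketch. `GROUP_WIDTH = 16`, the `prev_pow2` body, and the `elem_size` parameter (standing in for `table_layout.size`) are assumptions for illustration; like the snippet above, it ignores any padding for `ctrl_align`, and it is not the PR's current implementation.

```rust
/// Assumed control-group width in bytes (16 matches the x86_64 SSE figures
/// quoted in this thread; other targets differ).
const GROUP_WIDTH: usize = 16;

/// Largest power of two that is <= n, or 0 when n == 0.
fn prev_pow2(n: usize) -> usize {
    if n == 0 {
        0
    } else {
        1 << (usize::BITS - 1 - n.leading_zeros())
    }
}

/// Closed form from the comment above: maximise x in z >= x * y + x,
/// then round x down to a power of two.
/// Padding for `ctrl_align` is deliberately ignored, as in the original.
fn maximum_buckets_in(allocation_size: usize, elem_size: usize) -> usize {
    // saturating_sub avoids underflow for allocations smaller than one group.
    let z = allocation_size.saturating_sub(GROUP_WIDTH);
    prev_pow2(z / (elem_size + 1))
}
```

With these assumptions, a 4096-byte allocation holds 1024 buckets of a 1-byte element (1024 + 1024 + 16 = 2064 bytes; 2048 buckets would need 4112) and 256 buckets of an 8-byte element.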

Edit: I tried to find, programmatically, a case where ignoring ctrl_align causes a problem:

    type T = (bool, ());
    let table_layout = TableLayout::new::<T>();

    let begin = {
        // there are never fewer than 4 buckets
        let (layout, _) = table_layout.calculate_layout_for(4).unwrap();
        layout.size()
    };

    use rayon::prelude::*;
    (begin..=(1 << 47))
        .into_par_iter()
        .for_each(|allocation_size| {
            let buckets = maximum_buckets_in(allocation_size, table_layout).unwrap();
            let (layout, _) = table_layout.calculate_layout_for(buckets).unwrap();
            let size = layout.size();
            assert!(
                size <= allocation_size,
                "failed {size} <= {allocation_size}"
            );
        });

I ran it for quite some time with different TableLayouts and saw no issues with any of them. I think it has to do with the relationship between rounding down to the previous power of 2 and the fact that the rounding for align is always very small, in the range 0..ctrl_align.

morrisonlevi added a commit to morrisonlevi/hashbrown that referenced this pull request May 7, 2024
Consider `HashSet<u8>` on x86_64 with SSE:

| buckets | capacity | allocated bytes |
| ------- | -------- | --------------- |
|       4 |        3 |              36 |
|       8 |        7 |              40 |
|      16 |       14 |              48 |

Quadrupling the number of buckets from 4 to 16 does not even increase
the final allocation size by 50% (48/36=1.333). This is an edge case
due to the padding of the control bytes.

This platform isn't the only one with edges. Here's aarch64 on an M1
for the same `HashSet<u8>`:

| buckets | capacity | allocated bytes |
| ------- | -------- | --------------- |
|       4 |        3 |              20 |
|       8 |        7 |              24 |
|      16 |       14 |              40 |

Notice 4 -> 8 buckets leading to only 4 more bytes (20 -> 24) instead
of roughly doubling.
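
The figures in both tables can be reproduced with a small sketch. The formula (data area padded up to `ctrl_align`, then one control byte per bucket, then one trailing group) and the parameter values (16/16 for x86_64 with SSE, 8/8 for aarch64) are inferred from the numbers above rather than copied from the source, so treat the function names and constants as assumptions.

```rust
/// Hypothetical re-derivation of the allocation size for `buckets` buckets.
/// `elem_size` is the size of one element, `ctrl_align` the alignment of the
/// control bytes (a power of two), `group_width` the trailing control bytes.
fn allocation_size(
    buckets: usize,
    elem_size: usize,
    ctrl_align: usize,
    group_width: usize,
) -> usize {
    // Data area, padded so the control bytes start on a `ctrl_align` boundary.
    let ctrl_offset = (buckets * elem_size + ctrl_align - 1) & !(ctrl_align - 1);
    // One control byte per bucket plus one trailing group.
    ctrl_offset + buckets + group_width
}

/// Hypothetical capacity rule matching the "capacity" column:
/// small tables keep one bucket free, larger ones use 7/8 of the buckets.
fn capacity(buckets: usize) -> usize {
    if buckets <= 8 {
        buckets - 1
    } else {
        buckets / 8 * 7
    }
}
```

For a 1-byte element, `allocation_size(4, 1, 16, 16)` is 16 + 4 + 16 = 36, where 12 of the 16 data bytes are padding; that padding is why going to 8 buckets only adds 4 bytes.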

Generalized, `buckets * table_layout.size` needs to be at least as big
as `table_layout.ctrl_align`. For the cases I listed above, we'd get
these new minimum bucket sizes:

 - x86_64 with SSE: 16
 - aarch64: 8
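
A minimal sketch of that rule, assuming the existing floor of 4 buckets and treating zero-sized elements as a special case that keeps the floor. The function name and the zero-size handling are assumptions, not the code from the commit.

```rust
/// Hypothetical: smallest power-of-two bucket count, never below 4, such that
/// `buckets * elem_size >= ctrl_align`.
fn min_buckets(elem_size: usize, ctrl_align: usize) -> usize {
    const BASE_MIN: usize = 4;
    if elem_size == 0 {
        // Assumption: zero-sized elements can never reach ctrl_align,
        // so they simply keep the base minimum.
        return BASE_MIN;
    }
    // Ceiling division: how many elements are needed to cover ctrl_align.
    let needed = (ctrl_align + elem_size - 1) / elem_size;
    needed.next_power_of_two().max(BASE_MIN)
}
```

With a 1-byte element this gives 16 for a `ctrl_align` of 16 and 8 for a `ctrl_align` of 8, matching the two values listed above.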

This is a niche optimization. However, it also removes a possible
undefined-behavior edge case in resize operations. In addition, it
may be a useful property for utilizing over-sized allocations (see
rust-lang#523).
morrisonlevi added a commit to morrisonlevi/hashbrown that referenced this pull request May 8, 2024
Consider `HashSet<u8>` on x86_64 with SSE with various bucket sizes and
how many bytes the allocation ends up being:

| buckets | capacity | allocated bytes |
| ------- | -------- | --------------- |
|       4 |        3 |              36 |
|       8 |        7 |              40 |
|      16 |       14 |              48 |
|      32 |       28 |              80 |

In general, doubling the number of buckets should roughly double the
number of bytes used. However, for small bucket counts with these small
TableLayouts (4 -> 8, 8 -> 16), that doesn't happen. This edge case
arises from the padding of the control bytes and the added
Group::WIDTH. Taking the buckets from 4 to 16 (4x) only takes the
allocated bytes from 36 to 48 (~1.3x).

This platform isn't the only one with edges. Here's aarch64 on an M1
for the same `HashSet<u8>`:

| buckets | capacity | allocated bytes |
| ------- | -------- | --------------- |
|       4 |        3 |              20 |
|       8 |        7 |              24 |
|      16 |       14 |              40 |

Notice 4 -> 8 buckets leading to only 4 more bytes (20 -> 24) instead
of roughly doubling.

Generalized, `buckets * table_layout.size` needs to be at least as big
as `table_layout.ctrl_align`. For the cases I listed above, we'd get
these new minimum bucket sizes:

 - x86_64 with SSE: 16
 - aarch64: 8

This is a niche optimization. However, it also removes a possible
undefined-behavior edge case in resize operations. In addition, it
may be a useful property for utilizing over-sized allocations (see
rust-lang#523).
morrisonlevi added a commit to morrisonlevi/hashbrown that referenced this pull request May 8, 2024
morrisonlevi added a commit to morrisonlevi/hashbrown that referenced this pull request May 8, 2024

morrisonlevi commented May 8, 2024

I found an issue with the previous implementation for very small TableLayout sizes combined with large ctrl_align values. The optimization in #524 will correct the domain of maximum_buckets_in: buckets * table_layout.size (or x * y above) will now be at least ctrl_align, so ctrl_align can safely be ignored in the equation.
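
The effect of the minimum bucket count can be illustrated with a brute-force sketch over a small range. It uses the closed form from the earlier comment rather than the PR's current implementation, and `GROUP_WIDTH = 16`, `CTRL_ALIGN = 16`, the layout formula, and all function names are assumptions. It only checks a 1-byte element, so it is an illustration, not a proof.

```rust
const GROUP_WIDTH: usize = 16; // assumed (x86_64 with SSE)
const CTRL_ALIGN: usize = 16; // assumed control-byte alignment

/// Assumed layout: padded data area + one control byte per bucket + one group.
fn layout_size(buckets: usize, elem_size: usize) -> usize {
    let ctrl_offset = (buckets * elem_size + CTRL_ALIGN - 1) & !(CTRL_ALIGN - 1);
    ctrl_offset + buckets + GROUP_WIDTH
}

/// Largest power of two that is <= n, or 0 when n == 0.
fn prev_pow2(n: usize) -> usize {
    if n == 0 { 0 } else { 1 << (usize::BITS - 1 - n.leading_zeros()) }
}

/// Closed form that ignores ctrl_align padding.
fn maximum_buckets_in(allocation_size: usize, elem_size: usize) -> usize {
    let z = allocation_size.saturating_sub(GROUP_WIDTH);
    prev_pow2(z / (elem_size + 1))
}

/// First allocation size in `layout_size(min_buckets)..=limit` for which the
/// closed form picks a bucket count whose layout does not fit, if any.
fn first_overflow(elem_size: usize, min_buckets: usize, limit: usize) -> Option<usize> {
    let begin = layout_size(min_buckets, elem_size);
    (begin..=limit).find(|&alloc| {
        let buckets = maximum_buckets_in(alloc, elem_size);
        layout_size(buckets, elem_size) > alloc
    })
}
```

With a minimum of 4 buckets, a 36-byte allocation already fails: the closed form picks 8 buckets, which need 40 bytes. Starting from a minimum of 16 buckets, every bucket count is a multiple of `CTRL_ALIGN`, so there is no padding and no failure shows up in the range checked.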

@morrisonlevi morrisonlevi force-pushed the oversized-allocations branch from 602a381 to d93c955 Compare May 8, 2024 15:37
@morrisonlevi morrisonlevi force-pushed the oversized-allocations branch from d93c955 to 89f6d1f Compare May 8, 2024 15:38
@morrisonlevi morrisonlevi marked this pull request as draft May 8, 2024 15:42
@morrisonlevi morrisonlevi marked this pull request as ready for review May 8, 2024 15:43
morrisonlevi added a commit to morrisonlevi/hashbrown that referenced this pull request Jun 21, 2024