optimize to_byte_idx #17

Merged
merged 6 commits into from
Oct 15, 2023

Contributor

@CeleritasCelery CeleritasCelery commented Mar 2, 2023

I got a chance to optimize to_byte_idx. There are a few things I changed here:

small string fast path

Added a fast path for small strings. We already had this in chars::count. This resulted in about a 10% speedup on the trivial strings.
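
As a rough illustration of the idea (not the code from this PR), a small-string fast path can look like the sketch below. The cutoff `SMALL_STR_LIMIT`, the helper names, and the stand-in for the chunked path are all assumptions made for the example.

```rust
/// Hypothetical cutoff below which the plain byte loop is used. The real
/// threshold in the crate may differ.
const SMALL_STR_LIMIT: usize = 16;

/// True if `byte` starts a UTF-8 code point, i.e. it is not a
/// continuation byte of the form 0b10xxxxxx.
fn is_char_start(byte: u8) -> bool {
    (byte & 0xC0) != 0x80
}

/// Simple scalar conversion from a char index to a byte index.
/// Returns `text.len()` when `char_idx` is past the end of the string.
fn to_byte_idx_scalar(text: &str, char_idx: usize) -> usize {
    let mut chars_seen = 0;
    for (i, &b) in text.as_bytes().iter().enumerate() {
        if is_char_start(b) {
            if chars_seen == char_idx {
                return i;
            }
            chars_seen += 1;
        }
    }
    text.len()
}

/// Stand-in for the chunked/SIMD path. In this sketch it simply reuses
/// the scalar loop so the example stays self-contained.
fn to_byte_idx_chunked(text: &str, char_idx: usize) -> usize {
    to_byte_idx_scalar(text, char_idx)
}

/// Dispatches short inputs to the scalar loop, skipping the setup cost
/// of the chunked path.
pub fn to_byte_idx(text: &str, char_idx: usize) -> usize {
    if text.len() < SMALL_STR_LIMIT {
        return to_byte_idx_scalar(text, char_idx);
    }
    to_byte_idx_chunked(text, char_idx)
}
```

For example, `to_byte_idx("héllo", 2)` yields 3, because `é` occupies two bytes.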

fixed chunking bug

I found a performance bug where max_round_length was being divided by MAX_ACC instead of SIZE. This made the fast path way shorter than needed. For example, if your char_idx was 1000, the old logic would only use the fast path for the first 3 chunks (48 bytes). And if char_idx was below 255, it would skip the fast path entirely. Fixing this one line resulted in up to a 40% speedup on some benchmarks.
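
The arithmetic behind that example can be sketched as follows. The constants are assumptions inferred from the description: MAX_ACC is stated to be 255 for the SIMD types, and "3 chunks (48 bytes)" implies a 16-byte chunk. The function names are hypothetical, and this shows only the divisor, not the surrounding loop. Presumably dividing by SIZE is safe because a chunk of SIZE bytes can never contain more than SIZE chars.

```rust
/// Assumed values, inferred from the PR text rather than read from the code.
const MAX_ACC: usize = 255;
const SIZE: usize = 16;

/// Number of chunks the fast path would cover with the old divisor.
fn fast_path_chunks_old(char_idx: usize) -> usize {
    char_idx / MAX_ACC
}

/// Number of chunks the fast path covers with the corrected divisor.
fn fast_path_chunks_new(char_idx: usize) -> usize {
    char_idx / SIZE
}

/// Bytes covered by a given number of chunks.
fn chunk_bytes(chunks: usize) -> usize {
    chunks * SIZE
}
```

With char_idx = 1000, the old divisor gives 1000 / 255 = 3 chunks (48 bytes), while the corrected one gives 1000 / 16 = 62 chunks (992 bytes). With char_idx = 254, the old divisor gives 0 chunks, so the fast path is skipped.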

loop unrolling

This is the same approach that was used last time: unrolling the loop for faster performance. Now the routine goes brrrr and I am getting over 50 GiB/s on my arm machine.
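
For readers unfamiliar with the technique, here is a generic sketch of manual loop unrolling applied to counting UTF-8 char starts. It is not the routine from this PR (which works on wider chunks); the function names and the unroll factor of 4 are chosen only for illustration.

```rust
/// True if `byte` starts a UTF-8 code point (not a continuation byte).
fn is_char_start(byte: u8) -> bool {
    (byte & 0xC0) != 0x80
}

/// Counts code points, handling four bytes per loop iteration.
/// The four accumulators are independent of each other, which gives the
/// CPU more work it can overlap within one iteration.
fn count_chars_unrolled(bytes: &[u8]) -> usize {
    let (mut a, mut b, mut c, mut d) = (0usize, 0usize, 0usize, 0usize);

    let mut blocks = bytes.chunks_exact(4);
    for blk in blocks.by_ref() {
        a += is_char_start(blk[0]) as usize;
        b += is_char_start(blk[1]) as usize;
        c += is_char_start(blk[2]) as usize;
        d += is_char_start(blk[3]) as usize;
    }

    // Handle the 0..=3 bytes left over after the unrolled loop.
    let mut total = a + b + c + d;
    for &byte in blocks.remainder() {
        total += is_char_start(byte) as usize;
    }
    total
}
```

The result should always match `str::chars().count()` for valid UTF-8 input.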

benchmark text file path

Lastly, I updated the benchmarks so that they can be run from any directory instead of just the crate root.
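
One common way to make file paths independent of the working directory is to anchor them at the crate root via `CARGO_MANIFEST_DIR`. The sketch below shows that approach; whether the PR does it this way is an assumption, and the `benches/text` directory and the function name are hypothetical.

```rust
use std::path::PathBuf;

/// Builds the path to a benchmark text file relative to the crate root
/// rather than the current working directory.
fn bench_text_path(file_name: &str) -> PathBuf {
    // Cargo sets CARGO_MANIFEST_DIR at compile time. Fall back to "."
    // when the file is compiled without Cargo.
    let root = option_env!("CARGO_MANIFEST_DIR").unwrap_or(".");
    PathBuf::from(root)
        .join("benches")
        .join("text")
        .join(file_name)
}
```

`option_env!` is used instead of `env!` here only so the sketch still compiles outside Cargo.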

benchmarks

Everything looks like it improved except for jp_0102, which shows a performance regression of about 8% on arm and about 5% on x86. I can't seem to explain the regression.

arm benchmarks

chars::to_byte_idx/en_0001
                        time:   [931.41 ps 931.82 ps 932.45 ps]
                        thrpt:  [1022.8 MiB/s 1023.5 MiB/s 1023.9 MiB/s]
                 change:
                        time:   [-40.121% -40.032% -39.932%] (p = 0.00 < 0.05)
                        thrpt:  [+66.477% +66.755% +67.003%]
                        Performance has improved.
chars::to_byte_idx/en_0010
                        time:   [4.4668 ns 4.4751 ns 4.4839 ns]
                        thrpt:  [2.0771 GiB/s 2.0811 GiB/s 2.0850 GiB/s]
                 change:
                        time:   [-15.638% -15.402% -15.148%] (p = 0.00 < 0.05)
                        thrpt:  [+17.853% +18.206% +18.536%]
                        Performance has improved.
chars::to_byte_idx/en_0100
                        time:   [4.8657 ns 4.8773 ns 4.8899 ns]
                        thrpt:  [19.046 GiB/s 19.095 GiB/s 19.140 GiB/s]
                 change:
                        time:   [-20.208% -20.005% -19.790%] (p = 0.00 < 0.05)
                        thrpt:  [+24.673% +25.007% +25.326%]
                        Performance has improved.
chars::to_byte_idx/en_1000
                        time:   [21.431 ns 21.468 ns 21.512 ns]
                        thrpt:  [43.292 GiB/s 43.383 GiB/s 43.457 GiB/s]
                 change:
                        time:   [-53.628% -53.540% -53.445%] (p = 0.00 < 0.05)
                        thrpt:  [+114.80% +115.24% +115.65%]
                        Performance has improved.
chars::to_byte_idx/en_10000
                        time:   [182.21 ns 182.28 ns 182.38 ns]
                        thrpt:  [51.065 GiB/s 51.093 GiB/s 51.114 GiB/s]
                 change:
                        time:   [-59.001% -58.937% -58.866%] (p = 0.00 < 0.05)
                        thrpt:  [+143.11% +143.53% +143.91%]
                        Performance has improved.
chars::to_byte_idx/jp_0003
                        time:   [2.1751 ns 2.1762 ns 2.1777 ns]
                        thrpt:  [1.2830 GiB/s 1.2839 GiB/s 1.2845 GiB/s]
                 change:
                        time:   [-12.671% -12.526% -12.387%] (p = 0.00 < 0.05)
                        thrpt:  [+14.138% +14.319% +14.509%]
                        Performance has improved.
chars::to_byte_idx/jp_0102
                        time:   [7.3852 ns 7.4068 ns 7.4302 ns]
                        thrpt:  [12.785 GiB/s 12.825 GiB/s 12.863 GiB/s]
                 change:
                        time:   [+7.8921% +8.2350% +8.5959%] (p = 0.00 < 0.05)
                        thrpt:  [-7.9155% -7.6084% -7.3148%]
                        Performance has regressed.
chars::to_byte_idx/jp_1001
                        time:   [36.486 ns 36.526 ns 36.573 ns]
                        thrpt:  [25.490 GiB/s 25.523 GiB/s 25.551 GiB/s]
                 change:
                        time:   [-21.254% -21.135% -21.003%] (p = 0.00 < 0.05)
                        thrpt:  [+26.587% +26.799% +26.990%]
                        Performance has improved.
chars::to_byte_idx/jp_10000
                        time:   [348.39 ns 348.75 ns 349.11 ns]
                        thrpt:  [26.704 GiB/s 26.732 GiB/s 26.759 GiB/s]
                 change:
                        time:   [-20.138% -20.019% -19.894%] (p = 0.00 < 0.05)
                        thrpt:  [+24.834% +25.029% +25.216%]
                        Performance has improved.

x86 benchmarks

Note that this is on a really old Intel Core 2 Duo E8500 running Linux.

chars::to_byte_idx/en_0001
                        time:   [2.2377 ns 2.2380 ns 2.2384 ns]
                        thrpt:  [426.06 MiB/s 426.13 MiB/s 426.18 MiB/s]
                 change:
                        time:   [-45.754% -45.708% -45.643%] (p = 0.00 < 0.05)
                        thrpt:  [+83.968% +84.188% +84.347%]
                        Performance has improved.
chars::to_byte_idx/en_0010
                        time:   [19.928 ns 19.954 ns 19.974 ns]
                        thrpt:  [477.45 MiB/s 477.93 MiB/s 478.56 MiB/s]
                 change:
                        time:   [-9.9976% -9.8883% -9.7874%] (p = 0.00 < 0.05)
                        thrpt:  [+10.849% +10.973% +11.108%]
                        Performance has improved.
chars::to_byte_idx/en_0100
                        time:   [18.047 ns 18.057 ns 18.068 ns]
                        thrpt:  [5.1545 GiB/s 5.1576 GiB/s 5.1606 GiB/s]
                 change:
                        time:   [-23.535% -23.478% -23.407%] (p = 0.00 < 0.05)
                        thrpt:  [+30.560% +30.681% +30.779%]
                        Performance has improved.
chars::to_byte_idx/en_1000
                        time:   [72.238 ns 72.262 ns 72.285 ns]
                        thrpt:  [12.884 GiB/s 12.888 GiB/s 12.892 GiB/s]
                 change:
                        time:   [-57.904% -57.856% -57.805%] (p = 0.00 < 0.05)
                        thrpt:  [+136.99% +137.28% +137.55%]
                        Performance has improved.
chars::to_byte_idx/en_10000
                        time:   [578.62 ns 578.99 ns 579.39 ns]
                        thrpt:  [16.074 GiB/s 16.085 GiB/s 16.096 GiB/s]
                 change:
                        time:   [-62.480% -62.423% -62.367%] (p = 0.00 < 0.05)
                        thrpt:  [+165.73% +166.12% +166.52%]
                        Performance has improved.
chars::to_byte_idx/jp_0003
                        time:   [4.8644 ns 4.8660 ns 4.8677 ns]
                        thrpt:  [587.75 MiB/s 587.96 MiB/s 588.16 MiB/s]
                 change:
                        time:   [-26.968% -26.915% -26.850%] (p = 0.00 < 0.05)
                        thrpt:  [+36.705% +36.828% +36.926%]
                        Performance has improved.
chars::to_byte_idx/jp_0102
                        time:   [27.801 ns 27.806 ns 27.812 ns]
                        thrpt:  [3.4156 GiB/s 3.4163 GiB/s 3.4170 GiB/s]
                 change:
                        time:   [+5.3434% +5.4355% +5.5065%] (p = 0.00 < 0.05)
                        thrpt:  [-5.2191% -5.1553% -5.0723%]
                        Performance has regressed.
chars::to_byte_idx/jp_1001
                        time:   [139.85 ns 139.88 ns 139.91 ns]
                        thrpt:  [6.6634 GiB/s 6.6648 GiB/s 6.6662 GiB/s]
                 change:
                        time:   [-20.610% -20.588% -20.561%] (p = 0.00 < 0.05)
                        thrpt:  [+25.883% +25.925% +25.961%]
                        Performance has improved.
chars::to_byte_idx/jp_10000
                        time:   [1.2295 µs 1.2297 µs 1.2299 µs]
                        thrpt:  [7.5799 GiB/s 7.5812 GiB/s 7.5823 GiB/s]
                 change:
                        time:   [-22.340% -22.256% -22.178%] (p = 0.00 < 0.05)
                        thrpt:  [+28.498% +28.627% +28.767%]
                        Performance has improved.
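
The `time` and `thrpt` change lines in the outputs above are two views of the same measurement: for a fixed input size, throughput is inversely proportional to time. A small sketch of the conversion (the function name is hypothetical):

```rust
/// Converts a relative change in time (e.g. -0.40 for "40% faster")
/// into the corresponding relative change in throughput.
fn throughput_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}
```

For example, the arm en_0001 time change of -40.032% corresponds to a throughput change of about +66.76%, and the arm jp_0102 time change of +8.2350% corresponds to about -7.61%. This is why a 40% reduction in time shows up as a roughly 67% gain in throughput.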

Previously it was dividing the char count by MAX_ACC, which is 255 for SIMD
functions. This meant that a char index of 1000 would only use 3 chunks at a
time and anything below 255 would skip the fast path completely. Fixing this
led to a 40% improvement on all non-trivial benchmarks.
@cessen
Owner

cessen commented Mar 6, 2023

Thanks for this! I'm currently on an extended vacation, so unless this is urgent I'll get to it some time in April. Sorry for the delay!

Owner

@cessen cessen left a comment

Sorry for the delay on this! As I mentioned in the other PR, I ended up moving, etc., and it's taken me a while to settle in and get back into rhythm.

Over-all this looks great. Just a few nits.

@CeleritasCelery
Contributor Author

Done. Glad you are back!

@cessen
Owner

cessen commented Oct 15, 2023

Thanks!

@cessen cessen merged commit 354a1c2 into cessen:master Oct 15, 2023
1 check passed