optimize to_byte_idx #17
Conversation
Previously it was dividing the char count by `MAX_ACC`, which is 255 for the SIMD functions. This meant that a char index of 1000 would only run the fast path for 3 chunks, and anything below 255 would skip the fast path completely. Fixing this led to a 40% improvement on all non-trivial benchmarks.
Thanks for this! I'm currently on an extended vacation, so unless this is urgent I'll get to it some time in April. Sorry for the delay!
Sorry for the delay on this! As I mentioned in the other PR, I ended up moving, etc., and it's taken me a while to settle in and get back into rhythm.
Overall this looks great. Just a few nits.
Done. Glad you are back!
Thanks!
I got a chance to optimize `to_byte_idx`. There are a few things I changed here:

**small string fast path**
Added a fast path for small strings. We already had this for `chars::count`. This resulted in about a 10% speedup on the trivial strings.
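As a rough sketch of the idea (assuming a 16-byte chunk width; `CHUNK_SIZE` and the scalar loop are illustrative, not the crate's actual internals), the fast path simply bypasses the SIMD setup for inputs smaller than one chunk:

```rust
const CHUNK_SIZE: usize = 16; // assumed chunk width

fn to_byte_idx(text: &str, char_idx: usize) -> usize {
    let bytes = text.as_bytes();

    // Small-string fast path: for inputs shorter than one chunk, the SIMD
    // setup cost outweighs its benefit, so count chars with a scalar scan.
    if bytes.len() < CHUNK_SIZE {
        let mut chars_seen = 0;
        for (byte_idx, &b) in bytes.iter().enumerate() {
            // A byte starts a new char unless it's a UTF-8 continuation
            // byte (0b10xx_xxxx).
            if (b & 0xC0) != 0x80 {
                if chars_seen == char_idx {
                    return byte_idx;
                }
                chars_seen += 1;
            }
        }
        return bytes.len();
    }

    // ...otherwise fall through to the SIMD path (omitted here).
    unimplemented!()
}
```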
**fixed chunking bug**

I found a performance bug where `max_round_length` was being divided by `MAX_ACC` instead of `SIZE`. This made the fast path much shorter than it needed to be. For example, if your `char_idx` was 1000, the old logic would only use the fast path for the first 3 chunks (48 bytes), and if `char_idx` was below 255 it would skip the fast path entirely. Fixing this one line resulted in up to a 40% speedup on some benchmarks.
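To make the arithmetic concrete, here is a reconstruction of the two bounds (not the crate's literal code; `SIZE` = 16 follows from "3 chunks (48 bytes)" above, and `MAX_ACC` = 255 is given):

```rust
// SIZE is the chunk width in bytes; MAX_ACC is how many rounds the
// saturating SIMD accumulator can run before it must be flushed.
const SIZE: usize = 16;
const MAX_ACC: usize = 255;

// Buggy bound: dividing by MAX_ACC made the fast path far too short.
fn max_round_len_buggy(char_idx: usize) -> usize {
    char_idx / MAX_ACC
}

// Fixed bound: a chunk of SIZE bytes can contain at most SIZE chars, so
// processing char_idx / SIZE chunks can never count past the target index.
fn max_round_len_fixed(char_idx: usize) -> usize {
    char_idx / SIZE
}

fn main() {
    // char_idx = 1000: the bug limited the fast path to 3 chunks (48 bytes);
    // the fix allows 62 chunks (992 bytes).
    assert_eq!(max_round_len_buggy(1000), 3);
    assert_eq!(max_round_len_fixed(1000), 62);
    // char_idx < 255: the buggy bound is 0, skipping the fast path entirely.
    assert_eq!(max_round_len_buggy(200), 0);
}
```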
**loop unrolling**

This is the same approach that was used last time: unrolling the loop for faster performance. Now the routine goes brrrr and I am getting over 50 GB/s on my ARM machine.
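As a rough scalar illustration of the unrolling pattern (the real hot loop is SIMD, and the block width here is an assumption), each iteration handles several independent blocks, which cuts loop overhead and gives the CPU independent work to run in parallel:

```rust
fn count_chars_unrolled(bytes: &[u8]) -> usize {
    // A byte starts a char unless it's a UTF-8 continuation byte.
    #[inline(always)]
    fn char_starts(block: &[u8]) -> usize {
        block.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    }

    const BLOCK: usize = 16;
    let mut count = 0;
    let mut chunks = bytes.chunks_exact(BLOCK * 4);

    // Unrolled by 4: each iteration processes four independent blocks.
    for quad in &mut chunks {
        count += char_starts(&quad[..BLOCK]);
        count += char_starts(&quad[BLOCK..BLOCK * 2]);
        count += char_starts(&quad[BLOCK * 2..BLOCK * 3]);
        count += char_starts(&quad[BLOCK * 3..]);
    }

    // Tail that didn't fill a full group of four blocks.
    count + char_starts(chunks.remainder())
}

fn main() {
    let s = "こんにちは".repeat(100);
    assert_eq!(count_chars_unrolled(s.as_bytes()), s.chars().count());
}
```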
**benchmark text file path**
Lastly, I updated the benchmarks so that they can be run from any directory instead of just the crate root.
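One common way to do this in a Rust crate, presumably along these lines (the `benches/text` layout and helper name are assumptions), is to resolve data files against the crate's manifest directory rather than the working directory:

```rust
use std::path::PathBuf;

// CARGO_MANIFEST_DIR is set by Cargo at compile time, so this resolves
// relative to the crate root no matter where `cargo bench` is invoked.
fn bench_data_path(file: &str) -> PathBuf {
    PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("benches")
        .join("text")
        .join(file)
}
```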
benchmarks
Everything looks like it improved except for `jp_0102`, which shows an 8% performance regression on both ARM and x86. I haven't been able to explain the regression.

**arm benchmarks**

**x86 benchmarks**
Note that the x86 results are from a really old Intel Core 2 Duo E8500 running Linux.