Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize lines_crlf #21

Merged
merged 2 commits into from
Oct 31, 2024
Merged

Optimize lines_crlf #21

merged 2 commits into from
Oct 31, 2024

Conversation

CeleritasCelery
Copy link
Contributor

@CeleritasCelery CeleritasCelery commented Apr 22, 2023

This change is not ready to be merged because it does not yet support x86. I don't know how to do the operation we need with SSE intrinsics (see #18).

This new approach adds a another method to ByteChunk that lets you shift a value across two vectors. This let's us turn a 2-pass counting algorithm into a single pass. Making this change alone resulted in a 40% speedup.

There is also some loop unrolling, but there was not a large difference in unrolling over a factor of 2.

benchmarks

aarch64 benchmarks
lines_crlf::count_breaks/lines_100_LF
                        time:   [6.5590 ns 6.5608 ns 6.5632 ns]
                        thrpt:  [16.035 GiB/s 16.041 GiB/s 16.045 GiB/s]
                 change:
                        time:   [-29.627% -29.504% -29.396%] (p = 0.00 < 0.05)
                        thrpt:  [+41.634% +41.851% +42.100%]
                        Performance has improved.
lines_crlf::count_breaks/lines_100_CR
                        time:   [6.5589 ns 6.5658 ns 6.5773 ns]
                        thrpt:  [16.001 GiB/s 16.028 GiB/s 16.045 GiB/s]
                 change:
                        time:   [-29.667% -29.542% -29.417%] (p = 0.00 < 0.05)
                        thrpt:  [+41.677% +41.928% +42.181%]
                        Performance has improved.
lines_crlf::count_breaks/lines_100_CRLF
                        time:   [10.238 ns 10.287 ns 10.389 ns]
                        thrpt:  [10.578 GiB/s 10.683 GiB/s 10.734 GiB/s]
                 change:
                        time:   [-17.826% -17.581% -17.271%] (p = 0.00 < 0.05)
                        thrpt:  [+20.877% +21.332% +21.692%]
                        Performance has improved.
lines_crlf::count_breaks/lines_1000_LF
                        time:   [55.172 ns 55.218 ns 55.277 ns]
                        thrpt:  [19.039 GiB/s 19.059 GiB/s 19.075 GiB/s]
                 change:
                        time:   [-58.229% -58.153% -58.074%] (p = 0.00 < 0.05)
                        thrpt:  [+138.51% +138.97% +139.40%]
                        Performance has improved.
lines_crlf::count_breaks/lines_1000_CR
                        time:   [55.204 ns 55.294 ns 55.397 ns]
                        thrpt:  [18.997 GiB/s 19.033 GiB/s 19.064 GiB/s]
                 change:
                        time:   [-57.971% -57.894% -57.818%] (p = 0.00 < 0.05)
                        thrpt:  [+137.07% +137.50% +137.93%]
                        Performance has improved.
lines_crlf::count_breaks/lines_1000_CRLF
                        time:   [58.350 ns 58.377 ns 58.411 ns]
                        thrpt:  [18.814 GiB/s 18.825 GiB/s 18.834 GiB/s]
                 change:
                        time:   [-57.676% -57.594% -57.517%] (p = 0.00 < 0.05)
                        thrpt:  [+135.39% +135.82% +136.27%]
                        Performance has improved.
lines_crlf::count_breaks/lines_10000_LF
                        time:   [474.85 ns 475.57 ns 476.47 ns]
                        thrpt:  [22.087 GiB/s 22.129 GiB/s 22.163 GiB/s]
                 change:
                        time:   [-60.927% -60.850% -60.763%] (p = 0.00 < 0.05)
                        thrpt:  [+154.86% +155.43% +155.93%]
                        Performance has improved.
lines_crlf::count_breaks/lines_10000_CR
                        time:   [474.14 ns 474.34 ns 474.60 ns]
                        thrpt:  [22.174 GiB/s 22.187 GiB/s 22.196 GiB/s]
                 change:
                        time:   [-60.732% -60.666% -60.600%] (p = 0.00 < 0.05)
                        thrpt:  [+153.81% +154.23% +154.66%]
                        Performance has improved.
lines_crlf::count_breaks/lines_10000_CRLF
                        time:   [496.71 ns 496.91 ns 497.17 ns]
                        thrpt:  [22.104 GiB/s 22.116 GiB/s 22.125 GiB/s]
                 change:
                        time:   [-60.879% -60.813% -60.736%] (p = 0.00 < 0.05)
                        thrpt:  [+154.68% +155.19% +155.62%]
                        Performance has improved.

lines_crlf::from_byte_idx/lines_100_LF
                        time:   [6.6764 ns 6.6804 ns 6.6849 ns]
                        thrpt:  [15.743 GiB/s 15.754 GiB/s 15.763 GiB/s]
                 change:
                        time:   [-28.329% -28.193% -28.053%] (p = 0.00 < 0.05)
                        thrpt:  [+38.991% +39.263% +39.527%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_100_CR
                        time:   [6.6766 ns 6.6823 ns 6.6894 ns]
                        thrpt:  [15.732 GiB/s 15.749 GiB/s 15.762 GiB/s]
                 change:
                        time:   [-28.438% -28.289% -28.135%] (p = 0.00 < 0.05)
                        thrpt:  [+39.150% +39.448% +39.739%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_100_CRLF
                        time:   [10.398 ns 10.403 ns 10.408 ns]
                        thrpt:  [10.559 GiB/s 10.564 GiB/s 10.569 GiB/s]
                 change:
                        time:   [-16.452% -16.284% -16.133%] (p = 0.00 < 0.05)
                        thrpt:  [+19.237% +19.451% +19.692%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_1000_LF
                        time:   [54.964 ns 55.003 ns 55.054 ns]
                        thrpt:  [19.116 GiB/s 19.133 GiB/s 19.147 GiB/s]
                 change:
                        time:   [-58.536% -58.452% -58.373%] (p = 0.00 < 0.05)
                        thrpt:  [+140.23% +140.68% +141.17%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_1000_CR
                        time:   [54.934 ns 54.948 ns 54.966 ns]
                        thrpt:  [19.146 GiB/s 19.153 GiB/s 19.157 GiB/s]
                 change:
                        time:   [-58.211% -58.139% -58.062%] (p = 0.00 < 0.05)
                        thrpt:  [+138.45% +138.89% +139.30%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_1000_CRLF
                        time:   [58.496 ns 58.511 ns 58.531 ns]
                        thrpt:  [18.776 GiB/s 18.782 GiB/s 18.787 GiB/s]
                 change:
                        time:   [-57.525% -57.453% -57.373%] (p = 0.00 < 0.05)
                        thrpt:  [+134.59% +135.04% +135.43%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_10000_LF
                        time:   [474.25 ns 474.40 ns 474.63 ns]
                        thrpt:  [22.173 GiB/s 22.184 GiB/s 22.191 GiB/s]
                 change:
                        time:   [-60.981% -60.917% -60.851%] (p = 0.00 < 0.05)
                        thrpt:  [+155.43% +155.86% +156.29%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_10000_CR
                        time:   [474.25 ns 474.41 ns 474.67 ns]
                        thrpt:  [22.171 GiB/s 22.183 GiB/s 22.190 GiB/s]
                 change:
                        time:   [-60.614% -60.546% -60.479%] (p = 0.00 < 0.05)
                        thrpt:  [+153.03% +153.46% +153.90%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_10000_CRLF
                        time:   [498.51 ns 499.28 ns 500.21 ns]
                        thrpt:  [21.970 GiB/s 22.011 GiB/s 22.045 GiB/s]
                 change:
                        time:   [-60.672% -60.594% -60.520%] (p = 0.00 < 0.05)
                        thrpt:  [+153.29% +153.77% +154.27%]
                        Performance has improved.

lines_crlf::to_byte_idx/lines_100_LF
                        time:   [6.3709 ns 6.3828 ns 6.4011 ns]
                        thrpt:  [16.441 GiB/s 16.488 GiB/s 16.519 GiB/s]
                 change:
                        time:   [-33.821% -33.678% -33.516%] (p = 0.00 < 0.05)
                        thrpt:  [+50.412% +50.779% +51.106%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_100_CR
                        time:   [6.3680 ns 6.3697 ns 6.3720 ns]
                        thrpt:  [16.516 GiB/s 16.522 GiB/s 16.526 GiB/s]
                 change:
                        time:   [-35.982% -35.870% -35.759%] (p = 0.00 < 0.05)
                        thrpt:  [+55.664% +55.933% +56.205%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_100_CRLF
                        time:   [9.9426 ns 9.9496 ns 9.9579 ns]
                        thrpt:  [11.036 GiB/s 11.045 GiB/s 11.053 GiB/s]
                 change:
                        time:   [-23.824% -23.695% -23.559%] (p = 0.00 < 0.05)
                        thrpt:  [+30.820% +31.054% +31.275%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_1000_LF
                        time:   [54.693 ns 54.740 ns 54.800 ns]
                        thrpt:  [19.204 GiB/s 19.225 GiB/s 19.242 GiB/s]
                 change:
                        time:   [-34.536% -34.394% -34.244%] (p = 0.00 < 0.05)
                        thrpt:  [+52.077% +52.424% +52.755%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_1000_CR
                        time:   [54.769 ns 54.804 ns 54.846 ns]
                        thrpt:  [19.188 GiB/s 19.203 GiB/s 19.215 GiB/s]
                 change:
                        time:   [-36.013% -35.896% -35.787%] (p = 0.00 < 0.05)
                        thrpt:  [+55.731% +55.997% +56.283%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_1000_CRLF
                        time:   [58.142 ns 58.164 ns 58.191 ns]
                        thrpt:  [18.885 GiB/s 18.894 GiB/s 18.901 GiB/s]
                 change:
                        time:   [-35.565% -35.455% -35.343%] (p = 0.00 < 0.05)
                        thrpt:  [+54.662% +54.930% +55.194%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_10000_LF
                        time:   [473.48 ns 473.64 ns 473.92 ns]
                        thrpt:  [22.206 GiB/s 22.219 GiB/s 22.227 GiB/s]
                 change:
                        time:   [-39.204% -39.100% -38.995%] (p = 0.00 < 0.05)
                        thrpt:  [+63.922% +64.203% +64.484%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_10000_CR
                        time:   [473.50 ns 473.67 ns 473.95 ns]
                        thrpt:  [22.205 GiB/s 22.218 GiB/s 22.226 GiB/s]
                 change:
                        time:   [-39.481% -39.384% -39.290%] (p = 0.00 < 0.05)
                        thrpt:  [+64.718% +64.974% +65.238%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_10000_CRLF
                        time:   [497.40 ns 497.47 ns 497.55 ns]
                        thrpt:  [22.087 GiB/s 22.091 GiB/s 22.094 GiB/s]
                 change:
                        time:   [-39.299% -39.184% -39.061%] (p = 0.00 < 0.05)
                        thrpt:  [+64.100% +64.430% +64.743%]
                        Performance has improved.

@cessen
Copy link
Owner

cessen commented Apr 23, 2023

Thanks! The general idea seems good (though I haven't reviewed the code closely). I do want to keep the algorithm the same across architectures, however, just to keep the maintenance burden reasonable. So if we can't find a performant way to implement shift_across() on x86_64, we might need to leave this on the table.

@CeleritasCelery
Copy link
Contributor Author

I tried a basic approach to define it like this

    #[inline(always)]
    fn shift_across(&self, n: Self) -> Self {
        unsafe {
            let n = x86_64::_mm_slli_si128(n, 1);
            let upper = core::mem::transmute::<Self, [u8; Self::SIZE]>(*self);
            let mut lower = core::mem::transmute::<Self, [u8; Self::SIZE]>(n);
            lower[0] = upper[Self::SIZE - 1];
            core::mem::transmute::<[u8; Self::SIZE], Self>(lower)
        }
    }

but that resulted in a major 4x slowdown across all benchmarks. So it is going to need to be a different approach.

@CeleritasCelery
Copy link
Contributor Author

I no longer have time to work on this and I won't be submitting any more perf PR's for the immediate future. Feel free to close this if you don't want try and make this approach work.

@CeleritasCelery
Copy link
Contributor Author

CeleritasCelery commented Jun 10, 2024

@cessen I happened to come across an efficient way to do this on x86. I implemented it and it works well. Benchmarks are below. Throughput has increase 20-80% depending on the benchmark with this single pass approach.

x86_64 benchmarks
lines_crlf::count_breaks/lines_100_LF
                        time:   [23.226 ns 23.232 ns 23.239 ns]
                        thrpt:  [4.5286 GiB/s 4.5299 GiB/s 4.5311 GiB/s]
                 change:
                        time:   [-16.945% -16.888% -16.841%] (p = 0.00 < 0.05)
                        thrpt:  [+20.252% +20.319% +20.401%]
                        Performance has improved.
lines_crlf::count_breaks/lines_100_CR
                        time:   [23.226 ns 23.231 ns 23.236 ns]
                        thrpt:  [4.5291 GiB/s 4.5301 GiB/s 4.5311 GiB/s]
                 change:
                        time:   [-16.881% -16.828% -16.744%] (p = 0.00 < 0.05)
                        thrpt:  [+20.111% +20.233% +20.310%]
                        Performance has improved.
lines_crlf::count_breaks/lines_100_CRLF
                        time:   [30.983 ns 30.990 ns 30.996 ns]
                        thrpt:  [3.5455 GiB/s 3.5462 GiB/s 3.5469 GiB/s]
                 change:
                        time:   [-9.7291% -9.6790% -9.6373%] (p = 0.00 < 0.05)
                        thrpt:  [+10.665% +10.716% +10.778%]
                        Performance has improved.
lines_crlf::count_breaks/lines_1000_LF
                        time:   [177.77 ns 177.79 ns 177.81 ns]
                        thrpt:  [5.9187 GiB/s 5.9194 GiB/s 5.9201 GiB/s]
                 change:
                        time:   [-32.601% -32.577% -32.560%] (p = 0.00 < 0.05)
                        thrpt:  [+48.281% +48.318% +48.370%]
                        Performance has improved.
lines_crlf::count_breaks/lines_1000_CR
                        time:   [177.83 ns 177.86 ns 177.89 ns]
                        thrpt:  [5.9160 GiB/s 5.9170 GiB/s 5.9179 GiB/s]
                 change:
                        time:   [-24.170% -24.130% -24.097%] (p = 0.00 < 0.05)
                        thrpt:  [+31.747% +31.805% +31.874%]
                        Performance has improved.
lines_crlf::count_breaks/lines_1000_CRLF
                        time:   [187.96 ns 187.99 ns 188.02 ns]
                        thrpt:  [5.8450 GiB/s 5.8459 GiB/s 5.8469 GiB/s]
                 change:
                        time:   [-25.359% -25.333% -25.313%] (p = 0.00 < 0.05)
                        thrpt:  [+33.892% +33.929% +33.974%]
                        Performance has improved.
lines_crlf::count_breaks/lines_10000_LF
                        time:   [1.6073 µs 1.6073 µs 1.6073 µs]
                        thrpt:  [6.5475 GiB/s 6.5477 GiB/s 6.5478 GiB/s]
                 change:
                        time:   [-32.045% -31.967% -31.889%] (p = 0.00 < 0.05)
                        thrpt:  [+46.819% +46.987% +47.156%]
                        Performance has improved.
lines_crlf::count_breaks/lines_10000_CR
                        time:   [1.6073 µs 1.6075 µs 1.6077 µs]
                        thrpt:  [6.5460 GiB/s 6.5469 GiB/s 6.5474 GiB/s]
                 change:
                        time:   [-22.743% -22.714% -22.677%] (p = 0.00 < 0.05)
                        thrpt:  [+29.327% +29.389% +29.438%]
                        Performance has improved.
lines_crlf::count_breaks/lines_10000_CRLF
                        time:   [1.6836 µs 1.6837 µs 1.6839 µs]
                        thrpt:  [6.5264 GiB/s 6.5270 GiB/s 6.5274 GiB/s]
                 change:
                        time:   [-23.364% -23.330% -23.293%] (p = 0.00 < 0.05)
                        thrpt:  [+30.367% +30.430% +30.486%]
                        Performance has improved.

lines_crlf::from_byte_idx/lines_100_LF
                        time:   [23.461 ns 23.465 ns 23.468 ns]
                        thrpt:  [4.4843 GiB/s 4.4850 GiB/s 4.4856 GiB/s]
                 change:
                        time:   [-17.056% -17.001% -16.947%] (p = 0.00 < 0.05)
                        thrpt:  [+20.406% +20.484% +20.564%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_100_CR
                        time:   [23.460 ns 23.463 ns 23.466 ns]
                        thrpt:  [4.4848 GiB/s 4.4854 GiB/s 4.4860 GiB/s]
                 change:
                        time:   [-17.053% -16.992% -16.926%] (p = 0.00 < 0.05)
                        thrpt:  [+20.375% +20.470% +20.559%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_100_CRLF
                        time:   [31.122 ns 31.128 ns 31.135 ns]
                        thrpt:  [3.5297 GiB/s 3.5305 GiB/s 3.5312 GiB/s]
                 change:
                        time:   [-9.9060% -9.7928% -9.7091%] (p = 0.00 < 0.05)
                        thrpt:  [+10.753% +10.856% +10.995%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_1000_LF
                        time:   [178.06 ns 178.09 ns 178.13 ns]
                        thrpt:  [5.9081 GiB/s 5.9093 GiB/s 5.9104 GiB/s]
                 change:
                        time:   [-32.661% -32.630% -32.605%] (p = 0.00 < 0.05)
                        thrpt:  [+48.379% +48.433% +48.503%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_1000_CR
                        time:   [178.09 ns 178.13 ns 178.16 ns]
                        thrpt:  [5.9069 GiB/s 5.9081 GiB/s 5.9092 GiB/s]
                 change:
                        time:   [-24.232% -24.208% -24.189%] (p = 0.00 < 0.05)
                        thrpt:  [+31.907% +31.940% +31.981%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_1000_CRLF
                        time:   [188.19 ns 188.24 ns 188.30 ns]
                        thrpt:  [5.8363 GiB/s 5.8382 GiB/s 5.8395 GiB/s]
                 change:
                        time:   [-25.091% -25.054% -24.999%] (p = 0.00 < 0.05)
                        thrpt:  [+33.331% +33.429% +33.496%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_10000_LF
                        time:   [1.6065 µs 1.6066 µs 1.6069 µs]
                        thrpt:  [6.5494 GiB/s 6.5503 GiB/s 6.5509 GiB/s]
                 change:
                        time:   [-32.121% -32.038% -31.949%] (p = 0.00 < 0.05)
                        thrpt:  [+46.948% +47.141% +47.322%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_10000_CR
                        time:   [1.6066 µs 1.6068 µs 1.6071 µs]
                        thrpt:  [6.5482 GiB/s 6.5497 GiB/s 6.5506 GiB/s]
                 change:
                        time:   [-22.507% -22.465% -22.422%] (p = 0.00 < 0.05)
                        thrpt:  [+28.902% +28.974% +29.044%]
                        Performance has improved.
lines_crlf::from_byte_idx/lines_10000_CRLF
                        time:   [1.6709 µs 1.6710 µs 1.6711 µs]
                        thrpt:  [6.5761 GiB/s 6.5767 GiB/s 6.5770 GiB/s]
                 change:
                        time:   [-24.019% -23.972% -23.934%] (p = 0.00 < 0.05)
                        thrpt:  [+31.464% +31.530% +31.611%]
                        Performance has improved.

lines_crlf::to_byte_idx/lines_100_LF
                        time:   [27.695 ns 27.703 ns 27.712 ns]
                        thrpt:  [3.7977 GiB/s 3.7989 GiB/s 3.8000 GiB/s]
                 change:
                        time:   [-22.503% -22.464% -22.430%] (p = 0.00 < 0.05)
                        thrpt:  [+28.915% +28.972% +29.038%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_100_CR
                        time:   [27.703 ns 27.714 ns 27.728 ns]
                        thrpt:  [3.7954 GiB/s 3.7973 GiB/s 3.7988 GiB/s]
                 change:
                        time:   [-24.568% -24.497% -24.424%] (p = 0.00 < 0.05)
                        thrpt:  [+32.316% +32.445% +32.570%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_100_CRLF
                        time:   [38.603 ns 38.621 ns 38.639 ns]
                        thrpt:  [2.8442 GiB/s 2.8455 GiB/s 2.8468 GiB/s]
                 change:
                        time:   [-28.157% -28.091% -28.024%] (p = 0.00 < 0.05)
                        thrpt:  [+38.934% +39.065% +39.192%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_1000_LF
                        time:   [220.19 ns 220.27 ns 220.37 ns]
                        thrpt:  [4.7757 GiB/s 4.7777 GiB/s 4.7794 GiB/s]
                 change:
                        time:   [-37.749% -37.677% -37.608%] (p = 0.00 < 0.05)
                        thrpt:  [+60.276% +60.454% +60.640%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_1000_CR
                        time:   [220.16 ns 220.21 ns 220.26 ns]
                        thrpt:  [4.7779 GiB/s 4.7790 GiB/s 4.7800 GiB/s]
                 change:
                        time:   [-41.670% -41.648% -41.626%] (p = 0.00 < 0.05)
                        thrpt:  [+71.310% +71.375% +71.437%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_1000_CRLF
                        time:   [231.20 ns 231.32 ns 231.46 ns]
                        thrpt:  [4.7480 GiB/s 4.7507 GiB/s 4.7534 GiB/s]
                 change:
                        time:   [-39.793% -39.715% -39.640%] (p = 0.00 < 0.05)
                        thrpt:  [+65.672% +65.877% +66.094%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_10000_LF
                        time:   [1.9426 µs 1.9427 µs 1.9427 µs]
                        thrpt:  [5.4171 GiB/s 5.4173 GiB/s 5.4174 GiB/s]
                 change:
                        time:   [-38.454% -38.407% -38.375%] (p = 0.00 < 0.05)
                        thrpt:  [+62.272% +62.357% +62.480%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_10000_CR
                        time:   [1.9427 µs 1.9428 µs 1.9430 µs]
                        thrpt:  [5.4164 GiB/s 5.4169 GiB/s 5.4172 GiB/s]
                 change:
                        time:   [-44.405% -44.392% -44.372%] (p = 0.00 < 0.05)
                        thrpt:  [+79.766% +79.829% +79.872%]
                        Performance has improved.
lines_crlf::to_byte_idx/lines_10000_CRLF
                        time:   [2.0376 µs 2.0381 µs 2.0386 µs]
                        thrpt:  [5.3907 GiB/s 5.3922 GiB/s 5.3933 GiB/s]
                 change:
                        time:   [-39.446% -39.427% -39.385%] (p = 0.00 < 0.05)
                        thrpt:  [+64.977% +65.091% +65.141%]
                        Performance has improved.

@CeleritasCelery CeleritasCelery changed the title WIP: Optimize lines_crlf Optimize lines_crlf Jun 10, 2024
This change is not ready to be merged because it does not yet support x86. I
don't know how to do the operation we need with SSE intrinsics.

This new approach adds a another method to ByteChunk that lets you shift a value
across two vectors. This let's us turn a 2-pass counting algorithim into a
single pass. Making this change alone resulted in a 40% speedup.

There is also some loop unrolling, but there was not a large difference in
unrolling over a factor of 2.
@cessen
Copy link
Owner

cessen commented Jun 23, 2024

Awesome, thanks! I'm a little busy with other things right now, but I'll try to get this reviewed and merged soon.

Copy link
Owner

@cessen cessen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it took me so long to get to this! As I mentioned before, I was busy... but then I totally forgot about this. :-(

Anyway, the PR looks great, and gives impressive speedups. Awesome work! Mostly just some minor nits and then I think we can land this.

The only functional question I have is in regards to processing two chunks at a time, which adds quite a bit of code complexity. I'd like to see the numbers when processing just one chunk at a time, to make a determination on whether the added code complexity is worth it. (I would test myself, but I don't have an Apple system.) (See my new comment further below.)

src/byte_chunk.rs Outdated Show resolved Hide resolved
src/lines_crlf.rs Show resolved Hide resolved
src/lines_crlf.rs Show resolved Hide resolved
src/lines_crlf.rs Outdated Show resolved Hide resolved
src/lines_crlf.rs Outdated Show resolved Hide resolved
src/lines_crlf.rs Show resolved Hide resolved
src/lines_crlf.rs Show resolved Hide resolved
src/lines_crlf.rs Outdated Show resolved Hide resolved
@cessen
Copy link
Owner

cessen commented Oct 31, 2024

I'd like to see the numbers when processing just one chunk at a time, to make a determination on whether the added code complexity is worth it.

After writing that, I realized I could at least test on x86_64 to see what the impact is there. So to answer my own question: on x86_64 the impact is roughly a 20% performance hit across the board. And that does seem pretty significant. So let's leave the loop-unrolling in.

@CeleritasCelery
Copy link
Contributor Author

I have applied your feedback.

@cessen
Copy link
Owner

cessen commented Oct 31, 2024

Awesome, thanks!

The CI failure is unrelated to this PR, and is rather because a transitive dev-dependency used in test running requires > Rust 1.65 (which we still support and therefore have in the CI). I'll need to address that separately. This PR is good to land.

@cessen cessen merged commit e1299e9 into cessen:master Oct 31, 2024
1 of 2 checks passed
@cessen
Copy link
Owner

cessen commented Oct 31, 2024

Also just cut a release (v0.4.4) with the new optimized code.

Thanks a bunch for putting in the time to figure this out and implement it!

@CeleritasCelery CeleritasCelery deleted the lines_crlf branch October 31, 2024 19:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants