Optimize lines_crlf #21
Conversation
Thanks! The general idea seems good (though I haven't reviewed the code closely). I do want to keep the algorithm the same across architectures, however, just to keep the maintenance burden reasonable. So if we can't find a performant way to implement…
I tried a basic approach to define it like this:

```rust
// Assumes `use core::arch::x86_64;` and that `Self::SIZE` is the vector
// width in bytes (16 for __m128i).
#[inline(always)]
fn shift_across(&self, n: Self) -> Self {
    unsafe {
        // Shift `n` left by one byte (toward higher byte indices),
        // zeroing its low byte.
        let n = x86_64::_mm_slli_si128(n, 1);
        // Round-trip through byte arrays to move the last byte of `self`
        // into the vacated low byte of `n`.
        let upper = core::mem::transmute::<Self, [u8; Self::SIZE]>(*self);
        let mut lower = core::mem::transmute::<Self, [u8; Self::SIZE]>(n);
        lower[0] = upper[Self::SIZE - 1];
        core::mem::transmute::<[u8; Self::SIZE], Self>(lower)
    }
}
```

but that resulted in a major 4x slowdown across all benchmarks, so it's going to need a different approach.
I no longer have time to work on this, and I won't be submitting any more perf PRs for the immediate future. Feel free to close this if you don't want to try to make this approach work.
@cessen I happened to come across an efficient way to do this on x86. I implemented it, and it works well. Benchmarks are below: throughput has increased 20-80% depending on the benchmark with this single-pass approach.

x86_64 benchmarks
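For reference, one efficient way to express this cross-vector shift on x86 is the SSSE3 `_mm_alignr_epi8` intrinsic, which concatenates two vectors and extracts a shifted 16-byte window. A minimal sketch of the idea (an illustration, not necessarily the exact code that landed in this PR):

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m128i, _mm_alignr_epi8};

/// Shift `n` left by one byte, pulling the last byte of `prev` into the
/// vacated low byte: result[0] = prev[15], result[i] = n[i - 1].
/// `_mm_alignr_epi8` takes bytes 15..31 of the 32-byte concatenation
/// (n:prev), so the whole operation is a single SSSE3 instruction.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn shift_across(prev: __m128i, n: __m128i) -> __m128i {
    _mm_alignr_epi8(n, prev, 15)
}
```

Staying in vector registers is what makes this fast; the transmute-based attempt above forces the data through byte arrays, which likely explains its 4x slowdown.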
Force-pushed from a9cdd6e to a8de2c8.
This change is not ready to be merged because it does not yet support x86. I don't know how to do the operation we need with SSE intrinsics (see #18). This new approach adds another method to ByteChunk that lets you shift a value across two vectors, which lets us turn a two-pass counting algorithm into a single pass. Making this change alone resulted in a 40% speedup. There is also some loop unrolling, but there was not a large difference in unrolling beyond a factor of 2.
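To sketch the single-pass counting described above (hypothetical code in raw SSE intrinsics; the actual PR goes through the `ByteChunk` trait, so the names and structure here are illustrative): a `\r\n` pair must count as one line break, and the cross-vector shift lets each chunk see whether its first `\n` was preceded by a `\r` at the end of the previous chunk.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// Count line breaks in one 16-byte chunk, treating "\r\n" as a single
/// break. `prev_cr` is the CR mask of the previous chunk, which handles
/// pairs that straddle the chunk boundary. (Illustrative sketch only.)
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn breaks_in_chunk(prev_cr: __m128i, chunk: __m128i) -> (u32, __m128i) {
    let cr = _mm_cmpeq_epi8(chunk, _mm_set1_epi8(b'\r' as i8));
    let lf = _mm_cmpeq_epi8(chunk, _mm_set1_epi8(b'\n' as i8));
    // Align each CR with the byte after it, pulling the previous chunk's
    // last CR byte across the boundary (the alignr trick shown earlier).
    let cr_shifted = _mm_alignr_epi8(cr, prev_cr, 15);
    // An LF preceded by a CR is the tail of a "\r\n" pair; subtract it so
    // the pair is counted once instead of twice.
    let crlf = _mm_and_si128(cr_shifted, lf);
    let breaks = _mm_movemask_epi8(cr).count_ones()
        + _mm_movemask_epi8(lf).count_ones()
        - _mm_movemask_epi8(crlf).count_ones();
    (breaks, cr)
}
```

A two-pass version would instead count the `\r` and `\n` bytes in one sweep and locate `\r\n` pairs in a second sweep over the same data; carrying the CR mask forward removes that second pass.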
Force-pushed from 423410f to b31888a.
Awesome, thanks! I'm a little busy with other things right now, but I'll try to get this reviewed and merged soon.
Sorry it took me so long to get to this! As I mentioned before, I was busy... but then I totally forgot about this. :-(
Anyway, the PR looks great, and gives impressive speedups. Awesome work! Mostly just some minor nits and then I think we can land this.
The only functional question I have is in regards to processing two chunks at a time, which adds quite a bit of code complexity. I'd like to see the numbers when processing just one chunk at a time, to make a determination on whether the added code complexity is worth it. (I would test myself, but I don't have an Apple system.) (See my new comment further below.)
After writing that, I realized I could at least test on x86_64 to see what the impact is there. So to answer my own question: on x86_64 the impact is roughly a 20% performance hit across the board. And that does seem pretty significant. So let's leave the loop-unrolling in.
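For readers following along, the unrolled loop has roughly this shape (a hypothetical sketch reusing `breaks_in_chunk` from the sketch above; the real code is generic over `ByteChunk` and architecture):

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn count_breaks(text: &[u8]) -> u32 {
    use core::arch::x86_64::*;
    let mut prev_cr = _mm_setzero_si128();
    let mut total = 0u32;
    // Unroll factor 2: two independent 16-byte chunks per iteration give
    // the CPU more instruction-level parallelism to exploit.
    let mut chunks = text.chunks_exact(32);
    for pair in &mut chunks {
        let a = _mm_loadu_si128(pair.as_ptr() as *const __m128i);
        let b = _mm_loadu_si128(pair.as_ptr().add(16) as *const __m128i);
        let (n_a, cr_a) = breaks_in_chunk(prev_cr, a);
        let (n_b, cr_b) = breaks_in_chunk(cr_a, b);
        total += n_a + n_b;
        prev_cr = cr_b;
    }
    // The remainder (fewer than 32 bytes) would be handled by a scalar
    // loop that carries the boundary CR state; omitted in this sketch.
    total
}
```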
Force-pushed from 93368f1 to 1dbb919.
I have applied your feedback.
Awesome, thanks! The CI failure is unrelated to this PR; it's because a transitive dev-dependency used in test running requires a newer Rust than 1.65 (which we still support and therefore have in the CI). I'll need to address that separately. This PR is good to land.
Also just cut a release (v0.4.4) with the new optimized code. Thanks a bunch for putting in the time to figure this out and implement it!