Add Aarch64 SIMD support #12
Conversation
See rust-lang/rust#107617 for a discussion on the inline issue. Another commenter reported seeing performance regressions on x86 with `#[inline]` as well.

I also have a question about correctness. In the SSE implementation of shift right (lines 194 to 203 in e92188e), the shift will cross byte boundaries everywhere except between the two u64 halves (since it is a u64 shift). If the code could ever shift across byte boundaries, then this seems like a bug. But it would only show up if you happened to be right in the middle of a chunk.
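To make the concern concrete, here is a minimal, self-contained scalar sketch (not code from this crate) of what "shifting across byte boundaries" means for a u64 lane:

```rust
// Illustration only: a plain u64 shift moves bits across the byte boundaries
// *within* that u64, which is the behavior the comment above is asking about.
fn main() {
    // Byte 1 is 0x01; shifting right by 1 bit moves its low bit into the
    // high bit of byte 0.
    let x: u64 = 0x0000_0000_0000_0100;
    let shifted = x >> 1;
    assert_eq!(shifted, 0x0000_0000_0000_0080);
    println!("{:#018x} >> 1 = {:#018x}", x, shifted);
    // An SSE shift like _mm_srli_epi64 does this independently for each of
    // the two u64 lanes of the 128-bit register: bits never cross between
    // the two halves, but they do cross byte boundaries inside each lane.
}
```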
Ah, thanks! Yeah, benchmarking with various different line breaks is a good idea. In practice, I don't think we need to benchmark VT and FF specifically, because they use the same code path as LF, which is already tested. But adding benchmarks for CR, CRLF, NL, and one of LS/PS would be good. However, maybe that's better handled as a separate PR? Or I can add it myself. Unless you noticed any particular performance implications relevant to this PR.
Those are actually important, and we want to keep them. You can see the discussion from when they were added here: #9 (comment)
As long as the benchmarks are well-named so they can be easily filtered, I'm not concerned about the total benchmark time of the whole suite.
Ah! Good catch. For my use case (Ropey) you never hit cases much longer than 1000 bytes anyway, so I'm not personally invested in having them specifically optimized for, and sub-1000-ish byte strings are always going to be the priority. But I definitely agree it's worth covering longer inputs too. I wonder, though, if it makes sense to just build those benchmark strings at run time by repeating the 1000-byte texts ten times. I realize that in the grand scheme of things adding a handful of 10 KB files isn't a big deal. But I'd prefer to keep the repo pretty lean if possible, and I doubt it matters from a benchmarking perspective. (I also wonder if it wouldn't make sense to do the same for the line break variations: just search-and-replace the line endings at run time to test the different cases we care about.)
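For illustration, a minimal sketch of that run-time approach; the helper names and the base text are made up for this example, not taken from the repo's benchmarks:

```rust
// Hypothetical helpers for building benchmark inputs at run time.

/// Repeat a ~1000-byte base text to get a larger (~10 KB) input.
fn repeat_text(base: &str, times: usize) -> String {
    base.repeat(times)
}

/// Produce a variant of `text` with a different line ending, so one source
/// text can cover LF, CR, CRLF, NEL, etc.
fn with_line_ending(text: &str, ending: &str) -> String {
    text.replace('\n', ending)
}

fn main() {
    // Stand-in for a 1000-byte benchmark text.
    let base = "Lorem ipsum dolor sit amet.\nConsectetur adipiscing elit.\n";
    let big_lf = repeat_text(base, 10);
    let crlf = with_line_ending(&big_lf, "\r\n");
    let nel = with_line_ending(&big_lf, "\u{0085}"); // U+0085 NEL
    println!("{} {} {}", big_lf.len(), crlf.len(), nel.len());
}
```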
Ah, this should probably be better documented for contributors (I'll do that myself, no need to do it in this PR). Having said that, it's always possible there are bugs! But in this case, if there are bugs, they're in the algorithms, not the SSE shift implementation itself.
That's definitely interesting! And yeah, a lot of these functions are a little on the heavy side, for sure. I'm not against removing the `#[inline]` annotations. (Benchmarking is hard.)
(Force-pushed from fbcea08 to ae4a477.)
I am splitting this up into multiple PRs. I know better, but I got a little overzealous 😄. This one will only have the ARM SIMD support. I was able to find a solution to the benchmarking issue with some help from other Rust folks: if we enable thin LTO for the benchmark builds, the inlining problem goes away.
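For reference, a minimal sketch of what enabling thin LTO for benchmark builds can look like in Cargo.toml (this uses the standard Cargo profile mechanism; the exact configuration in the PR may differ):

```toml
# Sketch: opt the bench profile into thin LTO so cross-crate inlining can
# happen even without #[inline] annotations on the public functions.
[profile.bench]
lto = "thin"
```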
src/chars.rs (outdated diff)
```rust
let char_boundaries = |val: &T| val.bitand(T::splat(0xc0)).cmp_eq_byte(0x80);

// Take care of the middle bytes in big chunks. Loop unrolled.
for chunks in middle.chunks_exact(4) {
    let mut iter = chunks.iter();
    while let Some(chunk) = iter.next() {
        let val1 = char_boundaries(chunk);
        let val2 = char_boundaries(iter.next().unwrap());
        let val3 = char_boundaries(iter.next().unwrap());
        let val4 = char_boundaries(iter.next().unwrap());
        let val1_2 = val1.add(val2);
        let val3_4 = val3.add(val4);
        inv_count += val1_2.add(val3_4).sum_bytes();
    }
}

// Take care of the rest of the chunk.
let mut acc = T::zero();
for chunk in middle.chunks_exact(4).remainder() {
    acc = acc.add(char_boundaries(chunk));
}
inv_count += acc.sum_bytes();
```
This is the only change that I'm feeling a little stuck on. I follow what's going on here, but I don't think it's clear why this is faster on ARM:

```rust
for chunk in middle {
    inv_count += chunk.bitand(T::splat(0xc0)).cmp_eq_byte(0x80).sum_bytes();
}
```

...than the existing code, because this new version requires doing `sum_bytes()` more often, which (in theory) should make it slower.

I'm not doubting that the new version is faster on ARM, but I think figuring out why that's the case deserves some investigation before committing to the code change. Especially since it results in an expected regression on x86 (presumably due to the additional `sum_bytes()` calls), which needs to be mitigated with explicit loop unrolling.
On ARM, `sum_bytes()` compiles down to essentially a single reduction instruction:

```rust
#[inline(always)]
fn sum_bytes(&self) -> usize {
    unsafe { aarch64::vaddlvq_u8(*self).into() }
}
```

However, the x86 version is more instructions and requires copying the value from the SIMD register to the scalar registers, which is expensive:

```rust
#[inline(always)]
fn sum_bytes(&self) -> usize {
    let half_sum = unsafe { x86_64::_mm_sad_epu8(*self, x86_64::_mm_setzero_si128()) };
    let (low, high) = unsafe { core::mem::transmute::<Self, (u64, u64)>(half_sum) };
    (low + high) as usize
}
```

If you don't want to include the loop unrolling, I think we could back out that change. It is much slower, but still faster than the scalar version (with thin LTO). Here is the comparison of the slowdown if we remove the loop unrolling.

However, I eventually want to add loop unrolling to all the algorithms. Because CPUs have multiple SIMD execution units, loop unrolling will give a sizeable speedup. That is what memchr does. It even provides about a 10-20% speedup on my x86 machine.
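To illustrate the "multiple execution units" point, here is a self-contained scalar sketch (not code from this crate) of unrolling with independent accumulators; the same idea carries over to SIMD accumulators:

```rust
// The four counters below have no data dependency on each other, so the CPU
// can keep several execution units busy at once instead of serializing on a
// single accumulator.
fn count_continuation_bytes_unrolled(bytes: &[u8]) -> usize {
    let mut c0 = 0usize;
    let mut c1 = 0usize;
    let mut c2 = 0usize;
    let mut c3 = 0usize;
    let mut chunks = bytes.chunks_exact(4);
    for chunk in &mut chunks {
        c0 += ((chunk[0] & 0xC0) == 0x80) as usize;
        c1 += ((chunk[1] & 0xC0) == 0x80) as usize;
        c2 += ((chunk[2] & 0xC0) == 0x80) as usize;
        c3 += ((chunk[3] & 0xC0) == 0x80) as usize;
    }
    // Handle the leftover bytes that didn't fill a full chunk of 4.
    let mut rest = 0usize;
    for &b in chunks.remainder() {
        rest += ((b & 0xC0) == 0x80) as usize;
    }
    c0 + c1 + c2 + c3 + rest
}

fn main() {
    // "é" and "ö" each contribute one UTF-8 continuation byte.
    assert_eq!(count_continuation_bytes_unrolled("héllo wörld".as_bytes()), 2);
    println!("ok");
}
```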
I think I might have given the wrong impression: I'm not against the loop unrolling. It's a fine alternative to the current code structure, as long as the scalar version still optimizes roughly as well. I'm not trying to block that. It's just that the situation is counter-intuitive to me, and I'd like to understand what's going on.
Yeah, that makes sense. Additionally, I think there's a good chance that additional instruction gets "pipelined away" in terms of execution time. In retrospect, what's actually throwing me off is why the explicit SIMD on ARM chokes with the old structure, to the extent that it's outperformed by the scalar code. I suppose that can just be chalked up to LLVM not being able to apply the same optimizations for some reason, but that feels unsatisfying without verifying it. In any case, this PR shouldn't be blocked on investigating that. I have some additional comments (which I'll add momentarily), but otherwise I think this is good to land.
(Force-pushed from ae4a477 to f37a6d6.)
This is awesome. Thanks for putting the work in on this!
I thought that was weird as well. I get the impression from talking to the Rust lang folks that the ARM codegen is not as robust as x86. Either way, I have applied your feedback and the loop looks cleaner. Performance was unchanged as well.
Were you planning to get to this some time soon-ish? It's totally fine if not. Just want to know because if you are, then I'll hold off on making the next release. Otherwise I'm thinking of making the next release pretty soon (assuming I have the time/energy).
It will be a few weeks before I work on that. I am waiting to get my x86 machine so that I can reliably benchmark both versions.
Got it. A few weeks or so is fine. I'm actually pretty busy right now, so putting off the release a bit is probably good for me. :-)
closes #10
I finally found out what the issue was with the benchmarks. For some reason, adding the `#[inline]` annotation to the public functions was causing a massive slowdown in the benchmarks. I don't know if this is a criterion issue or a Rust issue. I opened an issue over at criterion (bheisler/criterion.rs#649), but I think it will be a while before they get to it.
However, most of these functions are probably not great inline candidates anyway, because they tend to be pretty heavy. So in this PR I have removed the annotations.
Overall I am pretty happy with the initial results. There are 87 performance improvements and 14 regressions.
I will continue to look at the regressions. One thing that surprised me is that the char_count algorithm was about 10-15% slower with ARM SIMD. I reworked the algorithm to make it faster using the code shown in the src/chars.rs diff above, and it improved significantly.
But this approach made the performance worse on x86 platforms (I think that is because SSE doesn't have a reducing sum instruction). So I did some loop unrolling instead, which improves x86 as well. All the algorithms would benefit greatly from loop unrolling, but we can add that later.
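For readers who want the gist of what the reworked char_count loop computes, here is a scalar sketch of the underlying trick (an illustration, not the crate's implementation):

```rust
// A UTF-8 continuation byte always has the bit pattern 10xxxxxx, so
// (byte & 0xC0) == 0x80 identifies it. The char count is then the total
// byte length minus the number of continuation bytes, which is what the
// SIMD code above counts (as `inv_count`) sixteen bytes at a time.
fn char_count_scalar(text: &str) -> usize {
    let continuation_bytes = text
        .as_bytes()
        .iter()
        .filter(|&&b| (b & 0xC0) == 0x80)
        .count();
    text.len() - continuation_bytes
}

fn main() {
    let s = "日本語 and ASCII";
    assert_eq!(char_count_scalar(s), s.chars().count());
    println!("{}", char_count_scalar(s));
}
```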