
Add Aarch64 SIMD support #12

Merged (1 commit) on Feb 6, 2023

Conversation

@CeleritasCelery (Contributor) commented on Feb 2, 2023

closes #10

I finally found out what the issue was with the benchmarks. For some reason, adding the #[inline] annotation to the public functions was causing a massive slowdown on the benchmarks. I don't know if this is a criterion issue or a Rust issue. I opened an issue over at criterion, but I think it will be a while before they get to it.

bheisler/criterion.rs#649

[attached image: performance]

However, most of these functions are probably not great inline candidates anyway, because they tend to be pretty heavy. In this PR I have removed the annotations.

Overall I am pretty happy with the initial results. There are 87 performance improvements and 14 regressions.

I will continue to look at the regressions. One thing that surprised me is that the char_count algorithm was about 10-15% slower with ARM SIMD than the scalar version. I reworked the algorithm using the code below, and it improved significantly.

        // Take care of the middle bytes in big chunks
        for chunk in middle {
            inv_count += chunk.bitand(T::splat(0xc0)).cmp_eq_byte(0x80).sum_bytes();
        }

But this approach made performance worse on x86 (I think because SSE doesn't have a reducing-sum instruction), so I did some loop unrolling instead, which improves x86 as well. All the algorithms would benefit greatly from loop unrolling, but we can add that later.

@CeleritasCelery (Contributor, Author) commented on Feb 3, 2023

See rust-lang/rust#107617 for a discussion on the inline issue. Another commenter reported seeing performance regressions on x86 with #[inline] so leaving that out may be the right call.

I also have a question about correctness.

In the SSE implementation of shift right, the _mm_srli_epi64 intrinsic is used:

fn shr(&self, n: usize) -> Self {
    match n {
        0 => *self,
        1 => unsafe { x86_64::_mm_srli_epi64(*self, 1) },
        2 => unsafe { x86_64::_mm_srli_epi64(*self, 2) },
        3 => unsafe { x86_64::_mm_srli_epi64(*self, 3) },
        4 => unsafe { x86_64::_mm_srli_epi64(*self, 4) },
        _ => unreachable!(),
    }
}

This will shift across byte boundaries everywhere except between the two u64 lanes (since it is a 64-bit shift). If shifting across byte boundaries is ever a problem for the algorithms, then this seems like a bug, but it would only show up if the data you care about happened to sit right in the middle of a chunk.
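For concreteness, here is a tiny standalone illustration (not code from the crate) of the behavior I mean:

    fn main() {
        // A 64-bit logical shift moves bits across byte boundaries: the low
        // bit of byte 1 becomes the high bit of byte 0.
        let x: u64 = 0x0000_0000_0000_0100;
        assert_eq!(x >> 1, 0x0000_0000_0000_0080);

        // _mm_srli_epi64 applies the same 64-bit shift independently to each
        // of the two u64 lanes of a __m128i, so the lane boundary is the only
        // place where bits never cross.
    }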

@cessen (Owner) commented on Feb 3, 2023

> First is that I added carriage returns, vertical tabs, and form feeds.

Ah, thanks! Yeah, benchmarking with various different line breaks is a good idea. In practice, I don't think we need to benchmark VT and FF specifically, because they use the same code path as LF, which is already tested. But adding benchmarks for CR, CRLF, NL, and one of LS/PS would be good.

However, maybe that's better handled as a separate PR? Or I can add it myself. Unless you noticed any particular performance implications relevant to this PR.

> Second is I removed the single character benchmarks

Those are actually important, and we want to keep them. You can see the discussion when they were added here: #9 (comment)

One of str_indices' primary use cases (and the reason it was created) is text editors, and in that context you're frequently counting single-character strings, e.g. when the user types text. So although we don't currently have any single-character-specific code paths, we nevertheless want to track any regressions there.

> and are not worth the benchmark time.

As long as the benchmarks are well-named so they can be easily filtered, I'm not concerned about the total benchmark time of the whole suite.

> added a 10,000 character benchmark. Most of the algorithms work in chunks of 255 * 16 (4080) bytes.

Ah! Good catch. For my use case (Ropey) you never hit cases much longer than 1000 bytes anyway, so I'm not personally invested in them being specifically optimized for. And sub-1000-ish-byte strings are always going to be the priority. But I definitely agree that str_indices as a crate should care about longer strings as well.

I wonder, though, if it makes sense to just build those benchmark strings at run time by repeating the 1000-byte texts ten times. I realize that in the grand scheme of things adding a handful of 10 KB files isn't a big deal. But I'd prefer to keep the repo pretty lean if possible, and I doubt it matters from a benchmarking perspective.

(I also wonder if it wouldn't make sense to do the same for the line break variations: just search-and-replace the line endings at run time to test the different cases we care about.)
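Something along these lines is what I have in mind (just a sketch; the helpers and the `text_1000` parameter are made-up stand-ins for the existing benchmark texts, not actual bench code):

    // Build a ~10 KB input at run time by repeating a ~1000-byte text.
    fn make_10k(text_1000: &str) -> String {
        text_1000.repeat(10)
    }

    // Swap line endings to cover a different line-break variant.
    fn with_crlf(text: &str) -> String {
        text.replace('\n', "\r\n")
    }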

> I also have a question about correctness.

> In the SSE implementation of shift right, the _mm_srli_epi64 intrinsic is used.

Ah, this should probably be better documented for contributors (I'll do that myself, no need to do it in this PR). The algorithms used in str_indices are mostly based on bit manipulation tricks to effectively do "SIMD" on multiple bytes at once, and all of those that involve shr() are correct regardless of whether shr() shifts across byte boundaries or not. Even the scalar shr() implementation is literally just a >> operator on a usize.
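For reference, the scalar path boils down to something like this (a trimmed-down sketch with a hypothetical trait name, not the crate's exact definitions):

    // Hypothetical, simplified trait for illustration only.
    trait Chunk {
        fn shr(&self, n: usize) -> Self;
    }

    impl Chunk for usize {
        fn shr(&self, n: usize) -> Self {
            // A plain integer shift: bits cross byte boundaries here too,
            // and the bit-trick algorithms are written to tolerate that.
            *self >> n
        }
    }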

Having said that, it's always possible there are bugs! But in this case, if there are bugs they're in the algorithms, not the shr() implementation.

> For some reason adding the #[inline] annotation to the public functions was causing a massive slowdown on the benchmarks.

That's definitely interesting! And yeah, a lot of these functions are a little on the heavy side, for sure.

I'm not against removing #[inline] from them if it improves performance. But I'd like to see that proved out in "real" uses rather than just str_indices benchmarks. In previous optimization passes we ran into interesting cases where str_indices benchmarks would be improved, but benchmarks of other crates that used str_indices would regress.

(Benchmarking is hard.)

@CeleritasCelery (Contributor, Author) commented:

I am splitting this up into multiple PRs. I know better, but I got a little overzealous 😄. This one will only have the ARM SIMD support. I was able to find a solution to the benchmarking issue with some help from other Rust folks. If we enable lto = "thin" then the issue goes away, even with inlining. That means we don't lose the ability to inline across crates.
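Concretely, that means a profile setting along these lines in the consuming crate's Cargo.toml (which profiles to apply it to is just an example):

    # Example only: enable thin LTO so cross-crate inlining still happens
    # without per-function #[inline] annotations.
    [profile.bench]
    lto = "thin"

    [profile.release]
    lto = "thin"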

src/chars.rs (outdated), review comment on lines 131 to 151:
let char_boundaries = |val: &T| val.bitand(T::splat(0xc0)).cmp_eq_byte(0x80);

// Take care of the middle bytes in big chunks. Loop unrolled.
for chunks in middle.chunks_exact(4) {
    let mut iter = chunks.iter();
    while let Some(chunk) = iter.next() {
        let val1 = char_boundaries(chunk);
        let val2 = char_boundaries(iter.next().unwrap());
        let val3 = char_boundaries(iter.next().unwrap());
        let val4 = char_boundaries(iter.next().unwrap());
        let val1_2 = val1.add(val2);
        let val3_4 = val3.add(val4);
        inv_count += val1_2.add(val3_4).sum_bytes();
    }
}

// Take care of the rest of the chunks.
let mut acc = T::zero();
for chunk in middle.chunks_exact(4).remainder() {
    acc = acc.add(char_boundaries(chunk));
}
inv_count += acc.sum_bytes();

@cessen (Owner) commented on this code:


This is the only change that I'm feeling a little stuck on. I follow what's going on here, but I don't think it's clear why this is faster on ARM:

for chunk in middle {
    inv_count += chunk.bitand(T::splat(0xc0)).cmp_eq_byte(0x80).sum_bytes();
}

...than the existing code. Because this new version requires doing sum_bytes() more often, which (in theory) should make it slower.

I'm not doubting that the new version is faster on ARM, but I think figuring out why that's the case deserves some investigation before committing to the code change. Especially since it results in an expected regression on x86 (presumably due to the additional sum_bytes() calls) which needs to be mitigated with explicit loop unrolling.

@CeleritasCelery (Contributor, Author) commented on Feb 5, 2023

> ...than the existing code. Because this new version requires doing sum_bytes() more often, which (in theory) should make it slower.

sum_bytes is only a single instruction on ARM because it has a reduction add (and that instruction has the same throughput as a lane-wise add). That is why I think it is faster: we remove the extra loop overhead and the additional adds.

    #[inline(always)]
    fn sum_bytes(&self) -> usize {
        unsafe { aarch64::vaddlvq_u8(*self).into() }
    }

However, the x86 version takes more instructions and requires copying the value from the SIMD register to a scalar register, which is expensive.

    #[inline(always)]
    fn sum_bytes(&self) -> usize {
        let half_sum = unsafe { x86_64::_mm_sad_epu8(*self, x86_64::_mm_setzero_si128()) };
        let (low, high) = unsafe { core::mem::transmute::<Self, (u64, u64)>(half_sum) };
        (low + high) as usize
    }

If you don't want to include the loop unrolling, I think we could back out that change. It is much slower, but still faster than the scalar version (with thin LTO). Here is the slowdown if we remove the loop unrolling:

chars::count/en_0001    time:   [584.59 ps 585.41 ps 586.30 ps]
                        thrpt:  [1.5885 GiB/s 1.5909 GiB/s 1.5931 GiB/s]
                 change:
                        time:   [+0.1944% +0.4395% +0.7303%] (p = 0.00 < 0.05)
                        thrpt:  [-0.7250% -0.4375% -0.1941%]
                        Change within noise threshold.
chars::count/en_0010    time:   [3.7321 ns 3.7364 ns 3.7409 ns]
                        thrpt:  [2.4896 GiB/s 2.4926 GiB/s 2.4954 GiB/s]
                 change:
                        time:   [-0.3371% -0.1082% +0.1058%] (p = 0.36 > 0.05)
                        thrpt:  [-0.1057% +0.1084% +0.3383%]
                        No change in performance detected.
chars::count/en_0100    time:   [4.3493 ns 4.3528 ns 4.3569 ns]
                        thrpt:  [23.941 GiB/s 23.963 GiB/s 23.983 GiB/s]
                 change:
                        time:   [+63.929% +64.285% +64.636%] (p = 0.00 < 0.05)
                        thrpt:  [-39.260% -39.130% -38.998%]
                        Performance has regressed.
chars::count/en_1000    time:   [29.750 ns 29.816 ns 29.903 ns]
                        thrpt:  [34.259 GiB/s 34.359 GiB/s 34.436 GiB/s]
                 change:
                        time:   [+29.554% +29.884% +30.220%] (p = 0.00 < 0.05)
                        thrpt:  [-23.207% -23.008% -22.812%]
                        Performance has regressed.
chars::count/en_10000   time:   [334.29 ns 334.67 ns 335.11 ns]
                        thrpt:  [30.571 GiB/s 30.611 GiB/s 30.646 GiB/s]
                 change:
                        time:   [+71.938% +72.288% +72.627%] (p = 0.00 < 0.05)
                        thrpt:  [-42.072% -41.958% -41.840%]
                        Performance has regressed.
chars::count/jp_0003    time:   [1.5555 ns 1.5569 ns 1.5585 ns]
                        thrpt:  [1.7927 GiB/s 1.7946 GiB/s 1.7962 GiB/s]
                 change:
                        time:   [+0.0357% +0.2184% +0.4241%] (p = 0.02 < 0.05)
                        thrpt:  [-0.4223% -0.2179% -0.0357%]
                        Change within noise threshold.
chars::count/jp_0102    time:   [5.5987 ns 5.6029 ns 5.6076 ns]
                        thrpt:  [19.099 GiB/s 19.116 GiB/s 19.130 GiB/s]
                 change:
                        time:   [+59.887% +60.255% +60.570%] (p = 0.00 < 0.05)
                        thrpt:  [-37.722% -37.599% -37.456%]
                        Performance has regressed.
chars::count/jp_1001    time:   [25.527 ns 25.547 ns 25.572 ns]
                        thrpt:  [39.187 GiB/s 39.226 GiB/s 39.257 GiB/s]
                 change:
                        time:   [+28.287% +28.535% +28.771%] (p = 0.00 < 0.05)
                        thrpt:  [-22.343% -22.200% -22.050%]
                        Performance has regressed.
chars::count/jp_10000   time:   [327.28 ns 327.58 ns 327.96 ns]
                        thrpt:  [30.556 GiB/s 30.591 GiB/s 30.620 GiB/s]
                 change:
                        time:   [+71.743% +72.059% +72.350%] (p = 0.00 < 0.05)
                        thrpt:  [-41.979% -41.880% -41.774%]
                        Performance has regressed.

However, I eventually want to add loop unrolling to all the algorithms. Because CPUs have multiple SIMD execution units, loop unrolling will give a sizeable speedup. That is what memchr does. It even provides about a 10-20% speedup on my x86 machine.

@cessen (Owner) commented on Feb 6, 2023

I think I might have given the wrong impression: I'm not against the loop unrolling. It's a fine alternative to the current code structure, as long as the scalar version still optimizes roughly as well. I'm not trying to block that.

It's just that the situation is counter-intuitive to me, and I'd like to understand what's going on.

> sum_bytes is only a single instruction on ARM because it has a reduction add (and that instruction has the same throughput as a lane-wise add).

Yeah, that makes sense. Additionally, I think there's a good chance that additional instruction gets "pipelined away" in terms of execution time.

In retrospect, what's actually throwing me off is why the explicit SIMD on ARM chokes with the old structure, to the extent that it's outperformed by the scalar code. I suppose that can just be chalked up to LLVM somehow not being able to apply the same optimizations for some reason. But that feels unsatisfying without verifying it.

But yeah, this PR shouldn't be blocked on investigating that.

I have some additional comments (which I'll add momentarily), but otherwise I think this is good to land.

@cessen (Owner) commented on Feb 6, 2023

This is awesome. Thanks for putting the work in on this!

cessen merged commit b178417 into cessen:master on Feb 6, 2023.
@CeleritasCelery (Contributor, Author) commented:

> In retrospect, what's actually throwing me off is why the explicit SIMD on ARM chokes with the old structure, to the extent that it's outperformed by the scalar code. I suppose that can just be chalked up to LLVM somehow not being able to apply the same optimizations for some reason.

I thought that was weird as well. I get the impression from talking to the Rust lang folks that the ARM code gen is not as robust as x86's.

Either way I have applied your feedback and the loop looks cleaner. Performance was unchanged as well.

@cessen (Owner) commented on Feb 20, 2023

@CeleritasCelery

> However I eventually want to add loop unrolling to all algorithms.

Were you planning to get to this some time soon-ish? It's totally fine if not. Just want to know because if you are, then I'll hold off on making the next release. Otherwise I'm thinking of making the next release pretty soon (assuming I have the time/energy).

@CeleritasCelery (Contributor, Author) commented:

It will be a few weeks before I work on that. I am waiting to get my x86 machine so that I can reliably benchmark both versions.

@cessen (Owner) commented on Feb 20, 2023

Got it. A few weeks or so is fine. I'm actually pretty busy right now, so putting off the release a bit is probably good for me. :-)

Linked issue closed by this pull request: Add explicit SIMD for aarch64 platforms.