Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEON intrinsics are broken on big-endian #1484

Open
Amanieu opened this issue Oct 19, 2023 · 20 comments
Open

NEON intrinsics are broken on big-endian #1484

Amanieu opened this issue Oct 19, 2023 · 20 comments

Comments

@Amanieu
Copy link
Member

Amanieu commented Oct 19, 2023

These are currently broken because the order of elements inside vectors is reversed on big-endian systems: the ARM ABI requires that element 0 is located at the highest address of the vector type. However LLVM intrinsics expect element 0 to be located at the lowest address.

See https://llvm.org/docs/BigEndianNEON.html and arm_neon.h in Clang for more details.

@RalfJung
Copy link
Member

the ARM ABI requires that element 0 is located at the highest address of the vector type. However LLVM intrinsics expect element 0 to be located at the lowest address.

What exactly does this mean? Is there a bug in LLVM? If so, where is it tracked?

Or is the problem that Rust stdarch wants to expose the intrinsics the way they work on hardware, but LLVM doesn't provide those semantics? If so, could that be fixed by doing appropriate translation of indices before calling the intrinsics?

@Amanieu
Copy link
Member Author

Amanieu commented Feb 14, 2024

The short answer is that, on big-endian, LLVM portable vectors have a different element ordering than the one in the vector types used by the NEON intrinsics.

The C intrinsics work around this by reversing the element ordering in vectors before & after each intrinsic. We need to do the same in stdarch.

@RalfJung
Copy link
Member

Oh I see, so this is a mismatch about the simd_x intrinsics vs vendor-specific intrinsics? Okay makes sense.

OTOH this is good news for portable-simd, seems like there we'll be getting consistent behavior across platforms without extra work then.

calebzulawski pushed a commit to rust-lang/portable-simd that referenced this issue Feb 17, 2024
calebzulawski pushed a commit to rust-lang/portable-simd that referenced this issue Feb 17, 2024
calebzulawski pushed a commit to rust-lang/portable-simd that referenced this issue Feb 17, 2024
calebzulawski pushed a commit to rust-lang/portable-simd that referenced this issue Feb 17, 2024
calebzulawski pushed a commit to rust-lang/portable-simd that referenced this issue Apr 9, 2024
calebzulawski pushed a commit to rust-lang/portable-simd that referenced this issue Apr 9, 2024
he32 added a commit to he32/memchr that referenced this issue Sep 29, 2024
As noted in rust-lang/stdarch#1484,
the NEON intrinsics are broken on big-endian aarch64.

This is part of fixing rust to build for & on big-endian aarch64,
following up rust-lang/rust#129819.
he32 added a commit to he32/zerocopy that referenced this issue Oct 2, 2024
Neon / SIMD is known to be problematical in rust, ref.
rust-lang/stdarch#1484, even
though the CPU itself supports it.
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Oct 2, 2024
This is done by avoiding attempts at using neon / SIMD in
big-endian mode by patching some of the vendored crates.
Neon / SIMD is known to be problematical in rust, ref.
rust-lang/stdarch#1484, even
though the CPU itself supports it.

I've also tried reporting the memchr fixes upstream, ref.
BurntSushi/memchr#162
So far not yet adopted.

Zerocopy has also received a pull request:
google/zerocopy#1795
he32 added a commit to he32/bytecount that referenced this issue Oct 2, 2024
Do this by avoiding trying to use neon / SIMD on big-endian aarch64.
Neon intrinsics are problematical on big-endian targets, ref.
rust-lang/stdarch#1484
@workingjubilee
Copy link
Member

@he32

I currently appear unable to find the actual connecting tissue between library/stdarch/crates/core_arch/src/aarch64/neon/mod.rs and LLVM

The actual place that intrinsics themselves are handled is in two places: if it's an architecture-specific intrinsic, it uses link_llvm_intrinsics, which effectively specifies a lowering directly to LLVM textual IR. Otherwise, if it's one of rustc's "portable" intrinsics (simd_add and the like), the primary definition is in rustc_codegen_llvm: https://github.com/rust-lang/rust/blob/master/compiler/rustc_codegen_llvm/src/intrinsic.rs

@he32
Copy link

he32 commented Nov 10, 2024

As stated elsewhere, I reported the issue rust-lang/rust#129819 earlier, and now have workarounds in place so that I'm able to produce a working rust compiler on big-endian NetBSD/aarch64 by avoiding attempts to use the NEON extensions in that mode.

However, since those extensions are available in the CPU, a better solution would be to fix this issue and then probably to revert the workarounds.

Since I'm a relative rust newbie, I have been thinking about what it would take to get some forward motion on that underlying issue. I have so far come to the conclusion that it would be helpful to have a test program (in rust, of course), which excercises / validates all the NEON SIMD extensions, runnable on 64-bit little-endian aarch64 system. Perhaps such a program already exists, and it's just a matter of pointing to it? My newbie status would make it difficult for me to come up with such a program, and I am hoping that it would be helpful in exploring a fix to this underlying issue. Since I do this in my copious spare time (as the expression goes), I can't make any firm commitments, but I think this will be a useful starting point for anyone wanting to tackle this issue properly.

The second worry I have is whether adding swizzling / byte-swapping of arguments and results before/after using NEON intrinsics will tend to negate the gains otherwise achieved by the NEON extensions compared to little-endian mode. I don't have a good intuition for that -- anyone have a better suggestion for that? Ideally, the test program could also do some measurement / validation of that? (Or would that be asking too much? It would not be required to act as the initial stepping stone, at least.)

@RalfJung
Copy link
Member

The second worry I have is whether adding swizzling / byte-swapping of arguments and results before/after using NEON intrinsics will tend to negate the gains otherwise achieved by the NEON extensions compared to little-endian mode.

I assume LLVM has to insert its own byte swapping when lowering the portable SIMD operations on big-endian NEON... so hopefully the codegen backend is good enough to realize that the two swaps cancel each other out, and remove both of them? If not, that seems worth reporting as an LLVM bug.

have a test program (in rust, of course), which excercises / validates all the NEON SIMD extensions, runnable on 64-bit little-endian aarch64 system

The stdarch test suite should be able to serve that purpose. The tricky part probably is that we don't have a way to run it on CI. Miri can run some of it (when it only needs generic SIMD intrinsics), not sure if that is good enough to gain confidence for re-landing the intrinsics.

@RalfJung
Copy link
Member

RalfJung commented Nov 10, 2024

I also found this gem in the LLVM docs mentioned above:

Make sure appropriate bitconverts are created so that vector values get passed over call boundaries as 1-element vectors (which is the same as if they were loaded with LDR).

Is that something the frontend has to do? That would mean we need a special case in our ABI handling code to use PassMode::Cast with a 1-element vector for all by-val vector passing on these targets. @workingjubilee is going to love this. ;)

@workingjubilee
Copy link
Member

...whaaa? so a <16 x i8> becomes <1 x i128>?

@RalfJung
Copy link
Member

That's how I understand this, yes.

@he32
Copy link

he32 commented Nov 10, 2024

have a test program (in rust, of course), which excercises / validates all the NEON SIMD extensions, runnable on 64-bit little-endian aarch64 system

The stdarch test suite should be able to serve that purpose.

Hmm, then I need to go look there (pointer to directory?), and see if the NEON stuff is easily identifiable / isolateable.

The tricky part probably is that we don't have a way to run it on CI. Miri can run some of it (when it only needs generic SIMD intrinsics), not sure if that is good enough to gain confidence for re-landing the intrinsics.

I am "old school", so was thinking foremost of doing the development without any CI support. Getting CI in place for this would then be a separate issue to be tackled separately.

@RalfJung
Copy link
Member

We shouldn't land anything without CI support, but ofc you can develop in whatever order suits you best. :)

@he32
Copy link

he32 commented Nov 10, 2024

We shouldn't land anything without CI support, but ofc you can develop in whatever order suits you best. :)

I completely understand, and probably agree, and recall having seen hints which might help in that direction. We'll see. First things first.

@Amanieu
Copy link
Member Author

Amanieu commented Nov 10, 2024

The easiest way to support big-endian would be to migrate all NEON intrinsics to use the stdarch-gen code generation framework and then have that automatically insert the needed swizzles on big-endian. However this is a huge amount of work and not a trivial undertaking.

Another approach would be to adapt the new code generator used for SVE intrinsics in #1509 to also generate NEON intriniscs, which may be easier to work with than the current code generator. cc @JamieCunliffe

@he32
Copy link

he32 commented Nov 10, 2024

The second worry I have is whether adding swizzling / byte-swapping of arguments and results before/after using NEON intrinsics will tend to negate the gains otherwise achieved by the NEON extensions compared to little-endian mode.

I assume LLVM has to insert its own byte swapping when lowering the portable SIMD operations on big-endian NEON... so hopefully the codegen backend is good enough to realize that the two swaps cancel each other out, and remove both of them? If not, that seems worth reporting as an LLVM bug.

Then I do not understand. As I understand it, doing swaps both pre- and post-SIMD operations are a necessary part of making the SIMD operations work as intended. They are therefore not "cancellable".

have a test program (in rust, of course), which excercises / validates all the NEON SIMD extensions, runnable on 64-bit little-endian aarch64 system

The stdarch test suite should be able to serve that purpose.

Sadly, this fails the "here is a concrete set of operations to test & validate" test. It's like saying to me "it is in there, somewhere, in the rust compiler sources -- you go figure out where and what by yourself".

@RalfJung
Copy link
Member

Yeah the test suite is complicated I am afraid -- I was just giving you some pointers that hopefully lead into the right direction. That's all I can offer, sorry. I don't know how the stdarch test suite is set up. I'm afraid I don't think there is such a thing as a simple test program; stdarch offers many thousand operations and has some fancy setup to test them all.

@Amanieu
Copy link
Member Author

Amanieu commented Nov 10, 2024

You can look at the intrinsic-test crate in this repo which checks that the intrinsics match the behavior of those same intrinsics in Clang. See our CI script for an example of how to run it.

@he32
Copy link

he32 commented Nov 10, 2024

Yeah the test suite is complicated I am afraid -- I was just giving you some pointers that hopefully lead into the right direction. That's all I can offer, sorry. I don't know how the stdarch test suite is set up. I'm afraid I don't think there is such a thing as a simple test program; stdarch offers many thousand operations and has some fancy setup to test them all.

OK, I understand. I have found at least some of what needs to be looked at, and I'm trying to follow the various suggestions here. Next will need to be some experimentation etc. We'll see how that goes. Thanks anyway!

@jswrenn
Copy link
Member

jswrenn commented Nov 12, 2024

Did something recently happen in this space? Zerocopy's nightly CI started failing this past weekend due to SIMD intrinsic name resolution errors on aarch64_be-unknown-linux-gnu while building memchr. It doesn't look like memchr has changed; was rustc changed?

@taiki-e
Copy link
Member

taiki-e commented Nov 12, 2024

@jswrenn That is because rust-lang/rust#132714 bumped memchr without considering aarch64_be is not supported in the latest memchr (BurntSushi/memchr#162).

(According to the author of that rust-lang/rust PR, the memchr bump might be able to be reverted: taiki-e/atomic-maybe-uninit@6ea62cc#commitcomment-148891566)

@Jamesbarford
Copy link
Contributor

FWIW I was going to look at getting this working once the new ARM generator PR has gone in

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants