Simd-using functions sometimes scalarize after inlining, even if they use vector ops on their own #321
Comments
Sometimes we may need to change the algorithm according to specific target features.
#[cfg(target_arch = "aarch64")]
pub fn all_ascii_chunk(s: &[u8; CHUNK]) -> bool {
    use std::simd::*;
    let x = Simd::<u8, CHUNK>::from_array(*s);
    x.reduce_max() < 0x80
}
It's annoying that portable-simd is not so "portable" when we try to optimize the codegen of some functions. |
@Nugine that's not the case here. These operations are fully supported, and LLVM even knows that. Please look more closely at the example (and read the body of the issue again): the issue is that the small function vectorizes fine on its own, but when it is called from a loop it becomes scalarized. I of course know that NEON does not support the operations as I wrote them, but they vectorize fine in the small example (and for u8x16). Writing it this way is meant to give good codegen on other targets, where a check on the maximum would not be efficient (movmsk is much faster than a horizontal max), without target conditionals. Anyway, your comment is basically off topic for this issue. |
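To make the two formulations discussed above concrete, here is a minimal scalar sketch in stable Rust. It is not code from the issue: the function names are hypothetical, and `CHUNK = 32` is an assumed value (the snippet above leaves `CHUNK` undefined). It only illustrates that "maximum byte is below 0x80" and "no byte has its top bit set" give the same answer, so either can be chosen for codegen reasons.

```rust
/// Chunk size assumed for this sketch only; the thread does not define it here.
const CHUNK: usize = 32;

/// Formulation A: the largest byte must be below 0x80 (a horizontal max).
pub fn all_ascii_max(s: &[u8; CHUNK]) -> bool {
    s.iter().copied().max().unwrap_or(0) < 0x80
}

/// Formulation B: no byte may have its top bit set. OR all bytes together
/// and test bit 7 of the result.
pub fn all_ascii_high_bit(s: &[u8; CHUNK]) -> bool {
    (s.iter().fold(0u8, |acc, &b| acc | b) & 0x80) == 0
}
```

Both functions return `true` for a chunk of ASCII bytes and `false` as soon as any byte is 0x80 or above. Which form a given backend lowers more efficiently is target-dependent.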
The weird thing is that this appears to be not an optimization problem but a codegen/ISel (instruction selection) problem. Prior to codegen, the optimized LLVM IR is perfectly vectorized, the code gets inlined as expected, and everything is good:
define noundef zeroext i1 @_ZN7example9all_ascii17h6a179b75d513aa15E(ptr noalias noundef nonnull readonly align 1 %s.0, i64 %s.1) unnamed_addr #0 personality ptr @rust_eh_personality !dbg !5 {
%0 = getelementptr inbounds [32 x i8], ptr %s.0, i64 %s.1, !dbg !10
br label %bb1.i, !dbg !34
bb1.i: ; preds = %bb3.i, %start
%1 = phi ptr [ %2, %bb3.i ], [ %s.0, %start ]
%_10.i.i = icmp eq ptr %1, %0, !dbg !39
br i1 %_10.i.i, label %"_ZN91_$LT$core..slice..iter..Iter$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$3all17h8d243705e3f9eef8E.exit", label %bb3.i, !dbg !39
bb3.i: ; preds = %bb1.i
%2 = getelementptr inbounds [32 x i8], ptr %1, i64 1, !dbg !44
%.val.i = load <32 x i8>, ptr %1, align 1, !dbg !58, !alias.scope !59, !noalias !62
%3 = icmp slt <32 x i8> %.val.i, zeroinitializer, !dbg !65
%4 = bitcast <32 x i1> %3 to i32, !dbg !79
%5 = icmp eq i32 %4, 0, !dbg !79
br i1 %5, label %bb1.i, label %"_ZN91_$LT$core..slice..iter..Iter$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$3all17h8d243705e3f9eef8E.exit", !dbg !90
"_ZN91_$LT$core..slice..iter..Iter$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$3all17h8d243705e3f9eef8E.exit": ; preds = %bb1.i, %bb3.i
ret i1 %_10.i.i, !dbg !91
}
However, during the first and most important pass of codegen (ISel), LLVM decides for some reason that the "pseudo-instructions" (the intrinsics) should be unrolled into scalar ops instead of becoming vector ops: (the vector instructions are below the first red block, but the right side is too large to show that too). |
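As a reading aid for the IR above, the following is a scalar model in stable Rust of what the loop body computes. It is a sketch under assumptions, not the author's source: the names `sign_mask` and `all_ascii_model` are hypothetical, and the bit order (lane 0 in the least significant bit) assumes a little-endian target.

```rust
/// Models `icmp slt <32 x i8> %v, zeroinitializer` followed by
/// `bitcast <32 x i1> ... to i32`: bit i is set when byte i, read as i8,
/// is negative (i.e. the byte is >= 0x80).
pub fn sign_mask(chunk: &[u8; 32]) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in chunk.iter().enumerate() {
        if (b as i8) < 0 {
            mask |= 1u32 << i;
        }
    }
    mask
}

/// Models the loop: walk the chunks and stop at the first one whose mask is
/// non-zero. Returns true only if the end of the slice is reached.
pub fn all_ascii_model(s: &[[u8; 32]]) -> bool {
    s.iter().all(|chunk| sign_mask(chunk) == 0)
}
```

The point of the model is that each iteration is one vector compare plus one mask test, which is why scalarizing it during instruction selection is so costly.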
That makes me think this is another instance of #146. In the simpler case, there's probably an optimization pass that's vectorizing the scalar output from ISel. The more complicated case probably disrupts that optimization. Just a guess. |
It's interesting that the "fold" version generates vectorized instructions.
#![feature(portable_simd)]
extern crate core;
use core::simd::*;
const N: usize = 32;
pub fn check(chunk: &[u8; N]) -> bool {
    let x = Simd::<u8, N>::from_array(*chunk);
    let h = Simd::<u8, N>::splat(0x80);
    x.simd_lt(h).all()
}
pub fn all_ascii_v1(s: &[[u8; N]]) -> bool {
    s.iter().fold(true, |acc, x| acc & check(x))
}
pub fn all_ascii_v2(s: &[[u8; N]]) -> bool {
    s.iter().all(check)
}
example::check:
ldp q1, q0, [x0]
orr.16b v0, v1, v0
cmlt.16b v0, v0, #0
umaxv.16b b0, v0
fmov w8, s0
mvn w8, w8
and w0, w8, #0x1
ret
example::all_ascii_v1:
cbz x1, LBB1_4
mov x8, x0
lsl x9, x1, #5
mov w0, #1
LBB1_2:
ldp q1, q0, [x8], #32
orr.16b v0, v1, v0
cmlt.16b v0, v0, #0
umaxv.16b b0, v0
fmov w10, s0
bic w10, w0, w10
and w0, w10, #0x1
subs x9, x9, #32
b.ne LBB1_2
ret
LBB1_4:
mov w0, #1
ret |
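The `fold` and `all` versions return the same result but have different control flow: `fold` with `&` inspects every chunk, while `all` returns at the first failing chunk. The stable-Rust sketch below makes that difference observable by counting calls. The helper `check_counted` and the functions `fold_form` and `all_form` are hypothetical stand-ins, and the per-chunk check is scalar. Whether the early exit is what changes instruction selection is not established in this thread.

```rust
use std::cell::Cell;

const N: usize = 32;

/// Scalar stand-in for `check`; also counts how many times it is called.
fn check_counted(chunk: &[u8; N], calls: &Cell<usize>) -> bool {
    calls.set(calls.get() + 1);
    chunk.iter().all(|&b| b < 0x80)
}

/// `fold` form: returns (result, number of chunks inspected).
/// `&` on bools does not short-circuit, so every chunk is visited.
pub fn fold_form(s: &[[u8; N]]) -> (bool, usize) {
    let calls = Cell::new(0);
    let r = s.iter().fold(true, |acc, x| acc & check_counted(x, &calls));
    (r, calls.get())
}

/// `all` form: returns (result, number of chunks inspected).
/// `Iterator::all` stops at the first chunk that fails the check.
pub fn all_form(s: &[[u8; N]]) -> (bool, usize) {
    let calls = Cell::new(0);
    let r = s.iter().all(|x| check_counted(x, &calls));
    (r, calls.get())
}
```

For four chunks where only the second contains a non-ASCII byte, `fold_form` inspects all four chunks and `all_form` inspects two. The `all` loop therefore needs a branch on each chunk's result, whereas the `fold` loop can keep accumulating.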
I think the root cause of this is that |
I don't think the problem is limited to bitcasts, see llvm/llvm-project#50466 There is no bitcast, only truncate and reduce: https://github.com/rust-lang/rust/blob/e187f8871e3d553181c9d2d4ac111197a139ca0d/compiler/rustc_codegen_llvm/src/intrinsic.rs#L1724 |
The intrinsic does initially get lowered to a |
Godbolt: https://rust.godbolt.org/z/hhMWb6Eja
On aarch64 I have this code:
Wonderfully, all_ascii_chunk compiles to essentially what I want (I mean, it's not perfect, but I certainly wouldn't file a bug about it):
Unfortunately, when it gets called in a loop from all_ascii, we... seem to completely lose our ability to do something reasonable, and get this monstrosity:
And performance takes a nosedive. This is very annoying, because this kind of issue means that I can't rely on functions that appear to codegen well continuing to do so when called :(. I know we have little control over this, but it's... kind of a huge issue for using std::simd to optimize functions if we can't rely on it behaving consistently.
This is possibly related to the bad aarch64 scalar reductions I saw before, although it doesn't seem like it because all_ascii_chunk is fine.