Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WebAssembly: Suboptimal codegen of combined SIMD operations #57182

Closed
alexcrichton opened this issue Aug 16, 2022 · 4 comments
Closed

WebAssembly: Suboptimal codegen of combined SIMD operations #57182

alexcrichton opened this issue Aug 16, 2022 · 4 comments

Comments

@alexcrichton
Copy link
Contributor

Originally reported at rust-lang/stdarch#1322 it appears that combining the codegen for these three wasm simd instructions results in a scalarized lowering rather than using the instructions themselves:

  • i16x8.extend_low_i8x16_u
  • i32x4.extend_low_i16x8_u
  • f32x4.convert_i32x4_u

This godbolt example has the Rust source code, the generated WebAssembly today, and the corresponding optimized LLVM IR that is being lowered.

cc @tlively

@alexcrichton
Copy link
Contributor Author

Er sorry should include the actual sources here. Input LLVM is:

target datalayout = "e-m:e-p:32:32-p10:8:8-p20:8:8-i64:64-n32:64-S128-ni:1:10:20"
target triple = "wasm32-unknown-unknown"

define dso_local <4 x i32> @convert(<4 x i32> %x) unnamed_addr #0 {
  %0 = bitcast <4 x i32> %x to <16 x i8>
  %1 = shufflevector <16 x i8> %0, <16 x i8> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %2 = zext <8 x i8> %1 to <8 x i16>
  %3 = shufflevector <8 x i16> %2, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %4 = uitofp <4 x i16> %3 to <4 x float>
  %5 = bitcast <4 x float> %4 to <4 x i32>
  ret <4 x i32> %5
}

attributes #0 = { mustprogress nofree nosync nounwind willreturn "target-cpu"="generic" "target-features"="+simd128" }

and the output wasm is:

convert:
        local.get       0
        i16x8.extend_low_i8x16_u
        local.tee       0
        i16x8.extract_lane_u    0
        f32.convert_i32_u
        f32x4.splat
        local.get       0
        i16x8.extract_lane_u    1
        f32.convert_i32_u
        f32x4.replace_lane      1
        local.get       0
        i16x8.extract_lane_u    2
        f32.convert_i32_u
        f32x4.replace_lane      2
        local.get       0
        i16x8.extract_lane_u    3
        f32.convert_i32_u
        f32x4.replace_lane      3
        end_function

@llvmbot
Copy link
Member

llvmbot commented Aug 16, 2022

@llvm/issue-subscribers-backend-webassembly

@tlively
Copy link
Collaborator

tlively commented Aug 16, 2022

Also cc @dtig, since I know she was tracking SIMD optimization issues.

@lukel97
Copy link
Contributor

lukel97 commented Jan 3, 2023

I think the issue here is that it's not selecting TxN.extend_low.SxM_u instructions whenever there's no explicit zero_extend SDNode:

define <4 x i32> @bad(<4 x i32> %x) {
  %1 = bitcast <4 x i32> %x to <8 x i16>
  %2 = shufflevector <8 x i16> %1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %3 = uitofp <4 x i16> %2 to <4 x float>
  %4 = bitcast <4 x float> %3 to <4 x i32>
  ret <4 x i32> %4
}

define <4 x i32> @good(<4 x i32> %x) {
  %1 = bitcast <4 x i32> %x to <8 x i16>
  %2 = shufflevector <8 x i16> %1, <8 x i16> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %3 = zext <4 x i16> %2 to <4 x i32>
  %4 = uitofp <4 x i32> %3 to <4 x float>
  %5 = bitcast <4 x float> %4 to <4 x i32>
  ret <4 x i32> %5
}

lukel97 added a commit that referenced this issue Jan 3, 2023
These test cases should be updated in a following patch once fixed
Part of #57182
CarlosAlbertoEnciso pushed a commit to SNSystems/llvm-debuginfo-analyzer that referenced this issue Jan 4, 2023
These test cases should be updated in a following patch once fixed
Part of llvm/llvm-project#57182
@lukel97 lukel97 closed this as completed in fb66026 Jan 6, 2023
CarlosAlbertoEnciso pushed a commit to SNSystems/llvm-debuginfo-analyzer that referenced this issue Jan 9, 2023
During DAG legalization, {u,s}itofp instructions on v2i8, v2i16, v4i8
and v4i16 types ended up being legalized into scalar instructions, when
they could just be extended to v2i32/v4i32 instead.

Fixes llvm/llvm-project#57182

Differential Revision: https://reviews.llvm.org/D140916
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants