
Saturating truncation produces extra instructions #68466

Closed
calebzulawski opened this issue Oct 7, 2023 · 2 comments
Comments

@calebzulawski
Contributor

See: https://llvm.godbolt.org/z/4KdejfEsG

The following two functions:

declare <4 x i16> @llvm.smax.v4i16(<4 x i16>, <4 x i16>)
declare <4 x i16> @llvm.smin.v4i16(<4 x i16>, <4 x i16>)
declare <8 x i16> @llvm.smax.v8i16(<8 x i16>, <8 x i16>)
declare <8 x i16> @llvm.smin.v8i16(<8 x i16>, <8 x i16>)

define <4 x i8> @saturate4(<4 x i16> %x) {
  %1 = tail call <4 x i16> @llvm.smax.v4i16(<4 x i16> %x, <4 x i16> zeroinitializer)
  %2 = tail call <4 x i16> @llvm.smin.v4i16(<4 x i16> %1, <4 x i16> <i16 255, i16 255, i16 255, i16 255>)
  %3 = trunc <4 x i16> %2 to <4 x i8>
  ret <4 x i8> %3
}

define <8 x i8> @saturate8(<8 x i16> %x) {
  %1 = tail call <8 x i16> @llvm.smax.v8i16(<8 x i16> %x, <8 x i16> zeroinitializer)
  %2 = tail call <8 x i16> @llvm.smin.v8i16(<8 x i16> %1, <8 x i16> <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>)
  %3 = trunc <8 x i16> %2 to <8 x i8>
  ret <8 x i8> %3
}

produce the following:

.LCPI0_0:
        .short  255                             # 0xff
        .short  255                             # 0xff
        .short  255                             # 0xff
        .short  255                             # 0xff
        .zero   2
        .zero   2
        .zero   2
        .zero   2
saturate4:                              # @saturate4
        pxor    xmm1, xmm1
        pmaxsw  xmm0, xmm1
        pminsw  xmm0, xmmword ptr [rip + .LCPI0_0]
        packuswb        xmm0, xmm0
        ret
saturate8:                              # @saturate8
        packuswb        xmm0, xmm0
        ret

The saturate4 function emits extra min/max instructions before the pack, while saturate8 lowers to a single packuswb. I believe the trunc followed by shufflevector is being optimized before the saturating truncation can be detected.

Discovered in rust-lang/portable-simd#369 (comment)

@llvmbot
Member

llvmbot commented Oct 7, 2023

@llvm/issue-subscribers-backend-x86


@okaneco

okaneco commented Oct 8, 2023

Another example of extra instructions generated: https://llvm.godbolt.org/z/bEW6j4asW
Original Rust code: https://rust.godbolt.org/z/hjM3G4qxY

declare <4 x i32> @llvm.smax.v4i32(<4 x i32>, <4 x i32>)
declare <4 x i32> @llvm.smin.v4i32(<4 x i32>, <4 x i32>)
declare <8 x i32> @llvm.smax.v8i32(<8 x i32>, <8 x i32>)
declare <8 x i32> @llvm.smin.v8i32(<8 x i32>, <8 x i32>)

define <4 x i8> @saturate4(<4 x i32> %x) {
  %1 = tail call <4 x i32> @llvm.smax.v4i32(<4 x i32> %x, <4 x i32> zeroinitializer)
  %2 = tail call <4 x i32> @llvm.smin.v4i32(<4 x i32> %1, <4 x i32> <i32 255, i32 255, i32 255, i32 255>)
  %3 = trunc <4 x i32> %2 to <4 x i8>
  ret <4 x i8> %3
}

define <8 x i8> @saturate8(<8 x i32> %x) {
  %1 = tail call <8 x i32> @llvm.smax.v8i32(<8 x i32> %x, <8 x i32> zeroinitializer)
  %2 = tail call <8 x i32> @llvm.smin.v8i32(<8 x i32> %1, <8 x i32> <i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255>)
  %3 = trunc <8 x i32> %2 to <8 x i8>
  ret <8 x i8> %3
}
.LCPI0_0:
        .long   255                             # 0xff
        .long   255                             # 0xff
        .long   255                             # 0xff
        .long   255                             # 0xff
saturate4:                              # @saturate4
        pxor    xmm1, xmm1
        movdqa  xmm2, xmm0
        pcmpgtd xmm2, xmm1
        pand    xmm0, xmm2
        movdqa  xmm1, xmmword ptr [rip + .LCPI0_0] # xmm1 = [255,255,255,255]
        movdqa  xmm2, xmm1
        pcmpgtd xmm2, xmm0
        pand    xmm0, xmm2
        pandn   xmm2, xmm1
        por     xmm0, xmm2
        packuswb        xmm0, xmm0
        packuswb        xmm0, xmm0
        ret
saturate8:                              # @saturate8
        packssdw        xmm0, xmm1
        packuswb        xmm0, xmm0
        ret

@RKSimon RKSimon self-assigned this Oct 8, 2023
RKSimon added a commit that referenced this issue Nov 1, 2023
Adds additional test coverage for Issue #68466
@RKSimon RKSimon closed this as completed in f471f6f Nov 1, 2023
ChunyuLiao pushed a commit that referenced this issue Dec 29, 2023
This patch closes #73424, a missed-optimization case on RISC-V similar to the X86 case in #68466.

## Source Code
```
define void @trunc_sat_i8i16(ptr %x, ptr %y) {
  %1 = load <8 x i16>, ptr %x, align 16
  %2 = tail call <8 x i16> @llvm.smax.v8i16(<8 x i16> %1, <8 x i16> <i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128>)
  %3 = tail call <8 x i16> @llvm.smin.v8i16(<8 x i16> %2, <8 x i16> <i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127>)
  %4 = trunc <8 x i16> %3 to <8 x i8>
  store <8 x i8> %4, ptr %y, align 8
  ret void
}
```
## Before this patch: 
```
trunc_sat_i8i16:                  # @trunc_maxmin_id_i8i16
        vsetivli        zero, 8, e16, m1, ta, ma
        vle16.v v8, (a0)
        li      a0, -128
        vmax.vx v8, v8, a0
        li      a0, 127
        vmin.vx v8, v8, a0
        vsetvli zero, zero, e8, mf2, ta, ma
        vnsrl.wi        v8, v8, 0
        vse8.v  v8, (a1)
        ret
```

## After this patch: 
```
trunc_sat_i8i16:                  # @trunc_maxmin_id_i8i16
	vsetivli	zero, 8, e8, mf2, ta, ma
	vle16.v	v8, (a0)
	csrwi	vxrm, 0
	vnclip.wi	v8, v8, 0
	vse8.v	v8, (a1)
	ret
```