[llvm][AMDGPU] Fold llvm.amdgcn.wavefrontsize early #114481

Open · wants to merge 20 commits into base: main

Changes from all commits (20 commits):
3ba88ce
Add pass to handle AMDGCN pseudo-intrinsics (abstract placeholders fo…
AlexVlx Oct 31, 2024
1376596
Merge branch 'main' of https://github.com/llvm/llvm-project into hand…
AlexVlx Oct 31, 2024
826c291
Implement review feedback.
AlexVlx Nov 1, 2024
ab6f5a2
Do not fold early for `generic` mcpu.
AlexVlx Nov 1, 2024
f8705fb
Fix formatting (again).
AlexVlx Nov 1, 2024
ed870a8
Merge branch 'main' of https://github.com/llvm/llvm-project into hand…
AlexVlx Nov 1, 2024
f5751a5
Merge branch 'main' of https://github.com/llvm/llvm-project into hand…
AlexVlx Nov 4, 2024
026ed00
Remove pass, fold in InstCombine.
AlexVlx Nov 4, 2024
195decc
Remove leftovers.
AlexVlx Nov 4, 2024
1a7abaf
Remove pass.
AlexVlx Nov 4, 2024
9aed76c
Fix formatting.
AlexVlx Nov 4, 2024
246c22f
Really fix formatting.
AlexVlx Nov 4, 2024
5a11720
Merge branch 'main' of https://github.com/llvm/llvm-project into hand…
AlexVlx Nov 6, 2024
7cf7558
Split tests.
AlexVlx Nov 6, 2024
6a77b8a
Merge branch 'main' of https://github.com/llvm/llvm-project into hand…
AlexVlx Nov 6, 2024
be414a8
Merge branch 'main' of https://github.com/llvm/llvm-project into hand…
AlexVlx Nov 7, 2024
dedc593
Merge branch 'main' of https://github.com/llvm/llvm-project into hand…
AlexVlx Nov 7, 2024
c634b4e
Merge branch 'main' of https://github.com/llvm/llvm-project into hand…
AlexVlx Nov 18, 2024
c7be46f
Merge branch 'handle_wavefrontsize_early' of https://github.com/AlexV…
AlexVlx Nov 18, 2024
ed9f19f
Tweak `generic` mcpu handling.
AlexVlx Nov 18, 2024
5 changes: 3 additions & 2 deletions clang/test/CodeGenOpenCL/builtins-amdgcn.cl
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
// REQUIRES: amdgpu-registered-target
// RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-unknown-unknown -target-cpu tahiti -emit-llvm -o - %s | FileCheck -enable-var-scope --check-prefixes=CHECK-AMDGCN,CHECK %s
// RUN: %clang_cc1 -cl-std=CL2.0 -triple spirv64-amd-amdhsa -emit-llvm -o - %s | FileCheck -enable-var-scope --check-prefix=CHECK %s
// RUN: %clang_cc1 -cl-std=CL2.0 -triple spirv64-amd-amdhsa -emit-llvm -o - %s | FileCheck -enable-var-scope --check-prefixes=CHECK,CHECK-SPIRV %s


#pragma OPENCL EXTENSION cl_khr_fp64 : enable
@@ -866,7 +866,8 @@ void test_atomic_inc_dec(__attribute__((address_space(3))) uint *lptr, __attribu
// CHECK-LABEL test_wavefrontsize(
unsigned test_wavefrontsize() {

// CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wavefrontsize()
// CHECK-AMDGCN: ret i32 {{[0-9]+}}
// CHECK-SPIRV: {{.*}}call{{.*}} i32 @llvm.amdgcn.wavefrontsize()
return __builtin_amdgcn_wavefrontsize();
}

9 changes: 9 additions & 0 deletions llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -1024,6 +1024,15 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
}
break;
}
case Intrinsic::amdgcn_wavefrontsize: {
// TODO: this is a workaround for the pseudo-generic target one gets with no
// specified mcpu, which spoofs its wave size to 64; it should be removed.
jhuber6 (Contributor) commented on Nov 4, 2024:

A real solution would be two builds, but spoofing it as 64 works (likely unintentionally) because we don't do any w64-specific changes yet, and w64 can always be narrowed to w32 but not the other way around.

AlexVlx (Author) replied:

I don't think that this interpretation is actually correct: if you rely on lockstep across a full wave and you optimise around the wave size, this will break in bad ways on wave32. The current generic is not particularly good, but we have to live with it for now, I guess.

Contributor replied:

We already do some light 64->32 folds that are only sort of correct.

Technically we could make exec_hi an allocatable scratch register in wave32, but what we do now bakes in an assumption that exec_hi must always be 0.

But yes, the only way to really avoid any possible edge cases (and support a future of machine-linked libraries) is to just have totally separate builds.

if ((ST->getCPU().empty() || ST->getCPU().starts_with("generic")) &&
jhuber6 (Contributor) commented on Nov 23, 2024:

Still don't know if the `.empty()` case is even worth checking, but I suppose we could codify that -mcpu="" and -mcpu="generic" are the same thing? My understanding is that generic was some testing thing.

!ST->getFeatureString().contains("+wavefrontsize"))
Contributor commented:

The feature string may also contain a -wavefrontsize.

It's probably safest to ignore the target-features. If we're really going to rely on this target-cpu hack for the library uses, rocm-device-libs is not using an explicit wavefrontsize feature anymore (all the uses were converted to the ballot wave64->wave32 hack).
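The concern about negative features can be made concrete: a target-features string is a comma-separated list where each entry carries a `+` or `-` prefix, and the last mention of a feature wins, so a plain substring check for `+wavefrontsize` can misjudge a string such as `+wavefrontsize64,-wavefrontsize64`. A minimal last-mention-wins query might look like this (a hypothetical standalone helper for illustration, not LLVM's actual API):

```cpp
#include <sstream>
#include <string>

enum class FeatureState { Unset, Enabled, Disabled };

// Scan a comma-separated target-features string for "+Name"/"-Name";
// the last mention wins, mirroring how LLVM applies a feature list
// left to right.
FeatureState queryFeature(const std::string &Features,
                          const std::string &Name) {
  FeatureState State = FeatureState::Unset;
  std::stringstream SS(Features);
  std::string Tok;
  while (std::getline(SS, Tok, ',')) {
    if (Tok == "+" + Name)
      State = FeatureState::Enabled;
    else if (Tok == "-" + Name)
      State = FeatureState::Disabled;
  }
  return State;
}
```

With such a helper, `-wavefrontsize64` in the feature string is recognized as an explicit disable rather than being lumped in with (or missed by) a substring match.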

Contributor commented:

You also already have the subtarget. The logic should probably move in there, instead of spreading the default-CPU parsing logic into a new place.

AlexVlx (Author) replied:

Since this is a temporary hack, what's the point of putting it in the subtarget so that people get ideas, start using it, and then there's even more technical debt? The parsing might have to change anyway once we have a proper generic target (which the current hack is not).

Contributor replied:

To keep the hack isolated to one place, instead of spreading it around. You've already missed "generic-hsa" for example. The wavesize target parsing is also hacky, and we already have other hacky parsing in the subtarget constructor.

We could also implement this by making the generic target actually have 0 wavesize, and replacing the isWave64 predicates with wavesize != 64

Contributor replied:

That seems reasonable. There's also an argument that the backend likely can't do anything useful without -mcpu= set, so we could make that an error; the NVIDIA tools basically do that themselves.

AlexVlx (Author) replied:

Well, I don't think we should be doing live design of generic (which is part of what got us here anyway), so I'd rather not build even more technical debt around its current form, which was meant to be a test-only kludge:

// The code produced for "generic" is only useful for tests and cannot
// reasonably be expected to execute on any particular target.

Which is to say, I don't want to extend what is there now; I want it to not break. I've adjusted the check to cover generic-hsa and corrected the commit message.

Contributor replied:

I'd be fine with multiple builds, but right now the AMDGCN infra doesn't support them very well, since we'd need to port the ROCm Device Libs to use my build system. Beyond that, it'd be pretty easy to just default the triple depending on -mcpu and -mwavefrontsize.

Contributor replied:

Could we make a helper along the lines of ST.hasCPU()? The comments for 'generic-hsa' are different because the 'generic' IR in the RTlibs never touches the backend without having a CPU set.

AlexVlx (Author) replied:

Perhaps we can simply live with starts_with for now, as the check won't be infectious and we'll remove it in the future anyway.

Contributor replied:

Unsure if we even need to bother checking for 'generic', since that's not what any of the existing targets use for generic code AFAICT. It's just not setting -mcpu when you compile, which I think should probably be an error in the backend.

break;
return IC.replaceInstUsesWith(
II, ConstantInt::get(II.getType(), ST->getWavefrontSize()));
}
case Intrinsic::amdgcn_wqm_vote: {
// wqm_vote is identity when the argument is constant.
if (!isa<Constant>(II.getArgOperand(0)))
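As a rough model of the guard added in this hunk: the fold is skipped when the subtarget looks like the pseudo-generic one (an empty or `generic*` CPU, which also covers `generic-hsa`) and no explicit wavefrontsize feature was requested; otherwise the intrinsic is replaced with the subtarget's wave size. The following is an illustrative standalone sketch of that decision, not the actual LLVM code:

```cpp
#include <string>

// Illustrative model of the InstCombine guard: refuse to fold
// llvm.amdgcn.wavefrontsize when the CPU is the pseudo-generic one
// (empty or starting with "generic") and no explicit wavefrontsize
// feature is present, because the generic target spoofs its wave size.
bool canFoldWavefrontSize(const std::string &CPU,
                          const std::string &Features) {
  bool PseudoGeneric =
      CPU.empty() || CPU.rfind("generic", 0) == 0; // starts_with
  bool ExplicitWaveSize =
      Features.find("+wavefrontsize") != std::string::npos;
  return !PseudoGeneric || ExplicitWaveSize;
}
```

Under this model, `gfx1100` always folds, a fully unspecified target never folds, and a generic target folds only when something like `+wavefrontsize32` was passed explicitly.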
29 changes: 1 addition & 28 deletions llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll
@@ -4,48 +4,28 @@
; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+wavefrontsize32 -verify-machineinstrs -amdgpu-enable-vopd=0 < %s | FileCheck -check-prefixes=GCN,W32 %s
; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+wavefrontsize64 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W64 %s

; RUN: opt -O3 -S < %s | FileCheck -check-prefix=OPT %s
; RUN: opt -mtriple=amdgcn-- -O3 -S < %s | FileCheck -check-prefix=OPT %s
; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
; RUN: opt -mtriple=amdgcn-- -passes='default<O3>' -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s
jhuber6 marked this conversation as resolved.
; RUN: opt -mtriple=amdgcn-- -mcpu=tonga -O3 -S < %s | FileCheck -check-prefix=OPT %s
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s

; GCN-LABEL: {{^}}fold_wavefrontsize:
; OPT-LABEL: define amdgpu_kernel void @fold_wavefrontsize(

; W32: v_mov_b32_e32 [[V:v[0-9]+]], 32
; W64: v_mov_b32_e32 [[V:v[0-9]+]], 64
; GCN: store_{{dword|b32}} v{{.+}}, [[V]]

; OPT: %tmp = tail call i32 @llvm.amdgcn.wavefrontsize()
; OPT: store i32 %tmp, ptr addrspace(1) %arg, align 4
; OPT-NEXT: ret void

define amdgpu_kernel void @fold_wavefrontsize(ptr addrspace(1) nocapture %arg) {

bb:
%tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0
store i32 %tmp, ptr addrspace(1) %arg, align 4
ret void
}

; GCN-LABEL: {{^}}fold_and_optimize_wavefrontsize:
; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(

; W32: v_mov_b32_e32 [[V:v[0-9]+]], 1{{$}}
; W64: v_mov_b32_e32 [[V:v[0-9]+]], 2{{$}}
; GCN-NOT: cndmask
; GCN: store_{{dword|b32}} v{{.+}}, [[V]]

; OPT: %tmp = tail call i32 @llvm.amdgcn.wavefrontsize()
; OPT: %tmp1 = icmp ugt i32 %tmp, 32
; OPT: %tmp2 = select i1 %tmp1, i32 2, i32 1
; OPT: store i32 %tmp2, ptr addrspace(1) %arg
; OPT-NEXT: ret void

define amdgpu_kernel void @fold_and_optimize_wavefrontsize(ptr addrspace(1) nocapture %arg) {
bb:
@@ -57,13 +37,6 @@ bb:
}

; GCN-LABEL: {{^}}fold_and_optimize_if_wavefrontsize:
; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(

; OPT: bb:
; OPT: %tmp = tail call i32 @llvm.amdgcn.wavefrontsize()
; OPT: %tmp1 = icmp ugt i32 %tmp, 32
; OPT: bb3:
; OPT-NEXT: ret void

define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(ptr addrspace(1) nocapture %arg) {
bb:
114 changes: 114 additions & 0 deletions llvm/test/Transforms/InstCombine/AMDGPU/llvm.amdgcn.wavefrontsize.ll
@@ -0,0 +1,114 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt -mtriple=amdgcn-- -passes=instcombine -S < %s | FileCheck -check-prefix=OPT %s
; RUN: opt -mtriple=amdgcn-- -mattr=+wavefrontsize32 -passes=instcombine -S < %s | FileCheck -check-prefix=OPT-W32 %s
; RUN: opt -mtriple=amdgcn-- -mattr=+wavefrontsize64 -passes=instcombine -S < %s | FileCheck -check-prefix=OPT-W64 %s
; RUN: opt -mtriple=amdgcn-- -mcpu=tonga -passes=instcombine -S < %s | FileCheck -check-prefix=OPT-W64 %s
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=+wavefrontsize32 -passes=instcombine -S < %s | FileCheck -check-prefix=OPT-W32 %s
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=+wavefrontsize64 -passes=instcombine -S < %s | FileCheck -check-prefix=OPT-W64 %s
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -mattr=+wavefrontsize32 -passes=instcombine -S < %s | FileCheck -check-prefix=OPT-W32 %s
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -mattr=+wavefrontsize64 -passes=instcombine -S < %s | FileCheck -check-prefix=OPT-W64 %s

define amdgpu_kernel void @fold_wavefrontsize(ptr addrspace(1) nocapture %arg) {
; OPT-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
; OPT-SAME: ptr addrspace(1) nocapture [[ARG:%.*]]) {
; OPT-NEXT: [[BB:.*:]]
; OPT-NEXT: [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR1:[0-9]+]]
; OPT-NEXT: store i32 [[TMP]], ptr addrspace(1) [[ARG]], align 4
; OPT-NEXT: ret void
;
; OPT-W32-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
; OPT-W32-SAME: ptr addrspace(1) nocapture [[ARG:%.*]]) #[[ATTR0:[0-9]+]] {
; OPT-W32-NEXT: [[BB:.*:]]
; OPT-W32-NEXT: store i32 32, ptr addrspace(1) [[ARG]], align 4
; OPT-W32-NEXT: ret void
;
; OPT-W64-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
; OPT-W64-SAME: ptr addrspace(1) nocapture [[ARG:%.*]]) #[[ATTR0:[0-9]+]] {
; OPT-W64-NEXT: [[BB:.*:]]
; OPT-W64-NEXT: store i32 64, ptr addrspace(1) [[ARG]], align 4
; OPT-W64-NEXT: ret void
;
bb:
%tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0
store i32 %tmp, ptr addrspace(1) %arg, align 4
ret void
}

define amdgpu_kernel void @fold_and_optimize_wavefrontsize(ptr addrspace(1) nocapture %arg) {
; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
; OPT-SAME: ptr addrspace(1) nocapture [[ARG:%.*]]) {
; OPT-NEXT: [[BB:.*:]]
; OPT-NEXT: [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR1]]
; OPT-NEXT: [[TMP1:%.*]] = icmp ugt i32 [[TMP]], 32
; OPT-NEXT: [[TMP2:%.*]] = select i1 [[TMP1]], i32 2, i32 1
; OPT-NEXT: store i32 [[TMP2]], ptr addrspace(1) [[ARG]], align 4
; OPT-NEXT: ret void
;
; OPT-W32-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
; OPT-W32-SAME: ptr addrspace(1) nocapture [[ARG:%.*]]) #[[ATTR0]] {
; OPT-W32-NEXT: [[BB:.*:]]
; OPT-W32-NEXT: store i32 1, ptr addrspace(1) [[ARG]], align 4
; OPT-W32-NEXT: ret void
;
; OPT-W64-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
; OPT-W64-SAME: ptr addrspace(1) nocapture [[ARG:%.*]]) #[[ATTR0]] {
; OPT-W64-NEXT: [[BB:.*:]]
; OPT-W64-NEXT: store i32 2, ptr addrspace(1) [[ARG]], align 4
; OPT-W64-NEXT: ret void
;
bb:
%tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0
%tmp1 = icmp ugt i32 %tmp, 32
%tmp2 = select i1 %tmp1, i32 2, i32 1
store i32 %tmp2, ptr addrspace(1) %arg
ret void
}

define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(ptr addrspace(1) nocapture %arg) {
; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
; OPT-SAME: ptr addrspace(1) nocapture [[ARG:%.*]]) {
; OPT-NEXT: [[BB:.*:]]
; OPT-NEXT: [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR1]]
; OPT-NEXT: [[TMP1:%.*]] = icmp ugt i32 [[TMP]], 32
; OPT-NEXT: br i1 [[TMP1]], label %[[BB2:.*]], label %[[BB3:.*]]
; OPT: [[BB2]]:
; OPT-NEXT: store i32 1, ptr addrspace(1) [[ARG]], align 4
; OPT-NEXT: br label %[[BB3]]
; OPT: [[BB3]]:
; OPT-NEXT: ret void
;
; OPT-W32-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
; OPT-W32-SAME: ptr addrspace(1) nocapture [[ARG:%.*]]) #[[ATTR0]] {
; OPT-W32-NEXT: [[BB:.*:]]
; OPT-W32-NEXT: br i1 false, label %[[BB2:.*]], label %[[BB3:.*]]
; OPT-W32: [[BB2]]:
; OPT-W32-NEXT: br label %[[BB3]]
; OPT-W32: [[BB3]]:
; OPT-W32-NEXT: ret void
;
; OPT-W64-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
; OPT-W64-SAME: ptr addrspace(1) nocapture [[ARG:%.*]]) #[[ATTR0]] {
; OPT-W64-NEXT: [[BB:.*:]]
; OPT-W64-NEXT: br i1 true, label %[[BB2:.*]], label %[[BB3:.*]]
; OPT-W64: [[BB2]]:
; OPT-W64-NEXT: store i32 1, ptr addrspace(1) [[ARG]], align 4
; OPT-W64-NEXT: br label %[[BB3]]
; OPT-W64: [[BB3]]:
; OPT-W64-NEXT: ret void
;
bb:
%tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0
%tmp1 = icmp ugt i32 %tmp, 32
br i1 %tmp1, label %bb2, label %bb3

bb2: ; preds = %bb
store i32 1, ptr addrspace(1) %arg, align 4
br label %bb3

bb3: ; preds = %bb2, %bb
ret void
}

declare i32 @llvm.amdgcn.wavefrontsize() #0

attributes #0 = { nounwind readnone speculatable }