[llvm][AMDGPU] Fold `llvm.amdgcn.wavefrontsize` early #114481

AlexVlx · 2024-10-31T22:51:19Z

Fold llvm.amdgcn.wavefrontsize early, during InstCombine, so that it's concrete value is used throughout subsequent optimisation passes.

…r target specific info), and add handling for `llvm.amdgcn.wavefrontsize`.

llvmbot · 2024-10-31T22:51:55Z

@llvm/pr-subscribers-clang

Author: Alex Voicu (AlexVlx)

Changes

Pseudo-intrinsics are a mechanism for conveying that some target specific constant info / quantity exists, without inducing AST mutation based on said quantity. They enable making e.g. control flow decisions in a HLL that end up with a constant evaluated predicate in the ME and optimised accordingly, without making the AST (even more) target specific. llvm.amdgcn.wavefrontsize is an example of such a pseudo-intrinsic, but we will add more in the near future.

This change adds an opt pass that will handle pseudo-intrinsics as early as possible (roughly immediately after Clang CodeGen), so that all subsequent processing over IR is done with target specific info in place. For the time being it handles only llvm.amdgcn.wavefrontsize, replacing all calls to it with to the actual hardware quantity for concrete targets.

Full diff: https://github.com/llvm/llvm-project/pull/114481.diff

7 Files Affected:

(modified) clang/test/CodeGenOpenCL/builtins-amdgcn.cl (+3-2)
(modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+9)
(added) llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp (+49)
(modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+2)
(modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+2-1)
(modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
(modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll (+73-26)

diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
index bf5f2971cf118c..de6a06dad6a08d 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
@@ -1,6 +1,6 @@
 // REQUIRES: amdgpu-registered-target
 // RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-unknown-unknown -target-cpu tahiti -emit-llvm -o - %s | FileCheck -enable-var-scope --check-prefixes=CHECK-AMDGCN,CHECK %s
-// RUN: %clang_cc1 -cl-std=CL2.0 -triple spirv64-amd-amdhsa -emit-llvm -o - %s | FileCheck -enable-var-scope --check-prefix=CHECK %s
+// RUN: %clang_cc1 -cl-std=CL2.0 -triple spirv64-amd-amdhsa -emit-llvm -o - %s | FileCheck -enable-var-scope --check-prefixes=CHECK,CHECK-SPIRV %s
 
 
 #pragma OPENCL EXTENSION cl_khr_fp64 : enable
@@ -866,7 +866,8 @@ void test_atomic_inc_dec(__attribute__((address_space(3))) uint *lptr, __attribu
 // CHECK-LABEL test_wavefrontsize(
 unsigned test_wavefrontsize() {
 
-  // CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wavefrontsize()
+  // CHECK-AMDGCN: ret i32 {{[0-9]+}}
+  // CHECK-SPIRV: {{.*}}call{{.*}} i32 @llvm.amdgcn.wavefrontsize()
   return __builtin_amdgcn_wavefrontsize();
 }
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 95d0ad0f9dc96a..17d3e6ab7c65ab 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -345,6 +345,15 @@ extern char &AMDGPUPrintfRuntimeBindingID;
 void initializeAMDGPUResourceUsageAnalysisPass(PassRegistry &);
 extern char &AMDGPUResourceUsageAnalysisID;
 
+struct AMDGPUExpandPseudoIntrinsicsPass
+    : PassInfoMixin<AMDGPUExpandPseudoIntrinsicsPass> {
+  const AMDGPUTargetMachine &TM;
+  AMDGPUExpandPseudoIntrinsicsPass(const AMDGPUTargetMachine &ATM) : TM(ATM) {}
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+
+  static bool isRequired() { return true; }
+};
+
 struct AMDGPUPrintfRuntimeBindingPass
     : PassInfoMixin<AMDGPUPrintfRuntimeBindingPass> {
   PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp b/llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp
new file mode 100644
index 00000000000000..faa23bb8550dbc
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp
@@ -0,0 +1,49 @@
+//===- AMDGPUExpandPseudoIntrinsics.cpp - Pseudo Intrinsic Expander Pass --===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+// This file implements a pass that deals with expanding AMDGCN generic pseudo-
+// intrinsics into target specific quantities / sequences. In this context, a
+// pseudo-intrinsic is an AMDGCN intrinsic that does not directly map to a
+// specific instruction, but rather is intended as a mechanism for abstractly
+// conveying target specific info to a HLL / the FE, without concretely
+// impacting the AST. An example of such an intrinsic is amdgcn.wavefrontsize.
+// This pass should run as early as possible / immediately after Clang CodeGen,
+// so that the optimisation pipeline and the BE operate with concrete target
+// data.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "AMDGPUTargetMachine.h"
+#include "GCNSubtarget.h"
+
+#include "llvm/IR/Constants.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/Module.h"
+#include "llvm/Pass.h"
+
+using namespace llvm;
+
+static inline PreservedAnalyses expandWaveSizeIntrinsic(const GCNSubtarget &ST,
+                                                        Function *WaveSize) {
+  if (WaveSize->hasZeroLiveUses())
+    return PreservedAnalyses::all();
+
+  for (auto &&U : WaveSize->users())
+    U->replaceAllUsesWith(ConstantInt::get(WaveSize->getReturnType(),
+                                           ST.getWavefrontSize()));
+
+  return PreservedAnalyses::none();
+}
+
+PreservedAnalyses
+  AMDGPUExpandPseudoIntrinsicsPass::run(Module &M, ModuleAnalysisManager &) {
+
+  if (auto WS = M.getFunction("llvm.amdgcn.wavefrontsize"))
+    return expandWaveSizeIntrinsic(TM.getSubtarget<GCNSubtarget>(*WS), WS);
+
+  return PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 174a90f0aa419d..323c195c329168 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -27,6 +27,8 @@ MODULE_PASS("amdgpu-perf-hint",
               *static_cast<const GCNTargetMachine *>(this)))
 MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
 MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
+MODULE_PASS("amdgpu-expand-pseudo-intrinsics",
+            AMDGPUExpandPseudoIntrinsicsPass(*this))
 #undef MODULE_PASS
 
 #ifndef MODULE_PASS_WITH_PARAMS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index d93ec34a703d3d..2bf8df6588c59c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -739,7 +739,8 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
 #include "llvm/Passes/TargetPassRegistry.inc"
 
   PB.registerPipelineStartEPCallback(
-      [](ModulePassManager &PM, OptimizationLevel Level) {
+      [this](ModulePassManager &PM, OptimizationLevel Level) {
+        PM.addPass(AMDGPUExpandPseudoIntrinsicsPass(*this));
         FunctionPassManager FPM;
         PM.addPass(createModuleToFunctionPassAdaptor(std::move(FPM)));
         if (EnableHipStdPar)
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index fed29c3e14aae2..c9d4452b4a035c 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -54,6 +54,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUCodeGenPrepare.cpp
   AMDGPUCombinerHelper.cpp
   AMDGPUCtorDtorLowering.cpp
+  AMDGPUExpandPseudoIntrinsics.cpp
   AMDGPUExportClustering.cpp
   AMDGPUFrameLowering.cpp
   AMDGPUGlobalISelDivergenceLowering.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll
index 824d3708c027db..efa53def5ee686 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll
@@ -1,3 +1,4 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
 ; RUN: llc -mtriple=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W64 %s
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize32 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W32 %s
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W64 %s
@@ -5,28 +6,43 @@
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+wavefrontsize64 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W64 %s
 
 ; RUN: opt -O3 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -O3 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -passes='default<O3>' -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=tonga -O3 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s
+; RUN: opt -mtriple=amdgcn-- -O3 -S < %s | FileCheck -check-prefix=OPT-W64 %s
+; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
+; RUN: opt -mtriple=amdgcn-- -passes='default<O3>' -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
+; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT-W64 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=tonga -O3 -S < %s | FileCheck -check-prefix=OPT-W64 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT-W64 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT-W64 %s
 
 ; GCN-LABEL: {{^}}fold_wavefrontsize:
-; OPT-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
 
 ; W32:       v_mov_b32_e32 [[V:v[0-9]+]], 32
 ; W64:       v_mov_b32_e32 [[V:v[0-9]+]], 64
 ; GCN:       store_{{dword|b32}} v{{.+}}, [[V]]
 
-; OPT:   %tmp = tail call i32 @llvm.amdgcn.wavefrontsize()
-; OPT:   store i32 %tmp, ptr addrspace(1) %arg, align 4
-; OPT-NEXT:  ret void
 
 define amdgpu_kernel void @fold_wavefrontsize(ptr addrspace(1) nocapture %arg) {
+; OPT-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
+; OPT-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; OPT-NEXT:  [[BB:.*:]]
+; OPT-NEXT:    [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR2:[0-9]+]]
+; OPT-NEXT:    store i32 [[TMP]], ptr addrspace(1) [[ARG]], align 4
+; OPT-NEXT:    ret void
+;
+; OPT-W64-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
+; OPT-W64-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; OPT-W64-NEXT:  [[BB:.*:]]
+; OPT-W64-NEXT:    store i32 64, ptr addrspace(1) [[ARG]], align 4
+; OPT-W64-NEXT:    ret void
+;
+; OPT-W32-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
+; OPT-W32-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; OPT-W32-NEXT:  [[BB:.*:]]
+; OPT-W32-NEXT:    store i32 32, ptr addrspace(1) [[ARG]], align 4
+; OPT-W32-NEXT:    ret void
+;
 bb:
   %tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0
   store i32 %tmp, ptr addrspace(1) %arg, align 4
@@ -34,20 +50,35 @@ bb:
 }
 
 ; GCN-LABEL: {{^}}fold_and_optimize_wavefrontsize:
-; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
 
 ; W32:       v_mov_b32_e32 [[V:v[0-9]+]], 1{{$}}
 ; W64:       v_mov_b32_e32 [[V:v[0-9]+]], 2{{$}}
 ; GCN-NOT:   cndmask
 ; GCN:       store_{{dword|b32}} v{{.+}}, [[V]]
 
-; OPT:   %tmp = tail call i32 @llvm.amdgcn.wavefrontsize()
-; OPT:   %tmp1 = icmp ugt i32 %tmp, 32
-; OPT:   %tmp2 = select i1 %tmp1, i32 2, i32 1
-; OPT:   store i32 %tmp2, ptr addrspace(1) %arg
-; OPT-NEXT:  ret void
 
 define amdgpu_kernel void @fold_and_optimize_wavefrontsize(ptr addrspace(1) nocapture %arg) {
+; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
+; OPT-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-NEXT:  [[BB:.*:]]
+; OPT-NEXT:    [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR2]]
+; OPT-NEXT:    [[TMP1:%.*]] = icmp ugt i32 [[TMP]], 32
+; OPT-NEXT:    [[TMP2:%.*]] = select i1 [[TMP1]], i32 2, i32 1
+; OPT-NEXT:    store i32 [[TMP2]], ptr addrspace(1) [[ARG]], align 4
+; OPT-NEXT:    ret void
+;
+; OPT-W64-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
+; OPT-W64-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-W64-NEXT:  [[BB:.*:]]
+; OPT-W64-NEXT:    store i32 2, ptr addrspace(1) [[ARG]], align 4
+; OPT-W64-NEXT:    ret void
+;
+; OPT-W32-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
+; OPT-W32-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-W32-NEXT:  [[BB:.*:]]
+; OPT-W32-NEXT:    store i32 1, ptr addrspace(1) [[ARG]], align 4
+; OPT-W32-NEXT:    ret void
+;
 bb:
   %tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0
   %tmp1 = icmp ugt i32 %tmp, 32
@@ -57,15 +88,31 @@ bb:
 }
 
 ; GCN-LABEL: {{^}}fold_and_optimize_if_wavefrontsize:
-; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
-
-; OPT:       bb:
-; OPT:   %tmp = tail call i32 @llvm.amdgcn.wavefrontsize()
-; OPT:   %tmp1 = icmp ugt i32 %tmp, 32
-; OPT:   bb3:
-; OPT-NEXT:  ret void
 
 define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(ptr addrspace(1) nocapture %arg) {
+; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
+; OPT-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-NEXT:  [[BB:.*:]]
+; OPT-NEXT:    [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR2]]
+; OPT-NEXT:    [[TMP1:%.*]] = icmp ugt i32 [[TMP]], 32
+; OPT-NEXT:    br i1 [[TMP1]], label %[[BB2:.*]], label %[[BB3:.*]]
+; OPT:       [[BB2]]:
+; OPT-NEXT:    store i32 1, ptr addrspace(1) [[ARG]], align 4
+; OPT-NEXT:    br label %[[BB3]]
+; OPT:       [[BB3]]:
+; OPT-NEXT:    ret void
+;
+; OPT-W64-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
+; OPT-W64-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-W64-NEXT:  [[BB:.*:]]
+; OPT-W64-NEXT:    store i32 1, ptr addrspace(1) [[ARG]], align 4
+; OPT-W64-NEXT:    ret void
+;
+; OPT-W32-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
+; OPT-W32-SAME: ptr addrspace(1) nocapture readnone [[ARG:%.*]]) local_unnamed_addr #[[ATTR1:[0-9]+]] {
+; OPT-W32-NEXT:  [[BB:.*:]]
+; OPT-W32-NEXT:    ret void
+;
 bb:
   %tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0
   %tmp1 = icmp ugt i32 %tmp, 32

llvmbot · 2024-10-31T22:51:55Z

@llvm/pr-subscribers-backend-amdgpu

Author: Alex Voicu (AlexVlx)

Changes

Pseudo-intrinsics are a mechanism for conveying that some target specific constant info / quantity exists, without inducing AST mutation based on said quantity. They enable making e.g. control flow decisions in a HLL that end up with a constant evaluated predicate in the ME and optimised accordingly, without making the AST (even more) target specific. llvm.amdgcn.wavefrontsize is an example of such a pseudo-intrinsic, but we will add more in the near future.

This change adds an opt pass that will handle pseudo-intrinsics as early as possible (roughly immediately after Clang CodeGen), so that all subsequent processing over IR is done with target specific info in place. For the time being it handles only llvm.amdgcn.wavefrontsize, replacing all calls to it with to the actual hardware quantity for concrete targets.

Full diff: https://github.com/llvm/llvm-project/pull/114481.diff

7 Files Affected:

(modified) clang/test/CodeGenOpenCL/builtins-amdgcn.cl (+3-2)
(modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+9)
(added) llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp (+49)
(modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+2)
(modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+2-1)
(modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
(modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll (+73-26)

diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
index bf5f2971cf118c..de6a06dad6a08d 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
@@ -1,6 +1,6 @@
 // REQUIRES: amdgpu-registered-target
 // RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-unknown-unknown -target-cpu tahiti -emit-llvm -o - %s | FileCheck -enable-var-scope --check-prefixes=CHECK-AMDGCN,CHECK %s
-// RUN: %clang_cc1 -cl-std=CL2.0 -triple spirv64-amd-amdhsa -emit-llvm -o - %s | FileCheck -enable-var-scope --check-prefix=CHECK %s
+// RUN: %clang_cc1 -cl-std=CL2.0 -triple spirv64-amd-amdhsa -emit-llvm -o - %s | FileCheck -enable-var-scope --check-prefixes=CHECK,CHECK-SPIRV %s
 
 
 #pragma OPENCL EXTENSION cl_khr_fp64 : enable
@@ -866,7 +866,8 @@ void test_atomic_inc_dec(__attribute__((address_space(3))) uint *lptr, __attribu
 // CHECK-LABEL test_wavefrontsize(
 unsigned test_wavefrontsize() {
 
-  // CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wavefrontsize()
+  // CHECK-AMDGCN: ret i32 {{[0-9]+}}
+  // CHECK-SPIRV: {{.*}}call{{.*}} i32 @llvm.amdgcn.wavefrontsize()
   return __builtin_amdgcn_wavefrontsize();
 }
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 95d0ad0f9dc96a..17d3e6ab7c65ab 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -345,6 +345,15 @@ extern char &AMDGPUPrintfRuntimeBindingID;
 void initializeAMDGPUResourceUsageAnalysisPass(PassRegistry &);
 extern char &AMDGPUResourceUsageAnalysisID;
 
+struct AMDGPUExpandPseudoIntrinsicsPass
+    : PassInfoMixin<AMDGPUExpandPseudoIntrinsicsPass> {
+  const AMDGPUTargetMachine &TM;
+  AMDGPUExpandPseudoIntrinsicsPass(const AMDGPUTargetMachine &ATM) : TM(ATM) {}
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+
+  static bool isRequired() { return true; }
+};
+
 struct AMDGPUPrintfRuntimeBindingPass
     : PassInfoMixin<AMDGPUPrintfRuntimeBindingPass> {
   PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp b/llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp
new file mode 100644
index 00000000000000..faa23bb8550dbc
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp
@@ -0,0 +1,49 @@
+//===- AMDGPUExpandPseudoIntrinsics.cpp - Pseudo Intrinsic Expander Pass --===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+// This file implements a pass that deals with expanding AMDGCN generic pseudo-
+// intrinsics into target specific quantities / sequences. In this context, a
+// pseudo-intrinsic is an AMDGCN intrinsic that does not directly map to a
+// specific instruction, but rather is intended as a mechanism for abstractly
+// conveying target specific info to a HLL / the FE, without concretely
+// impacting the AST. An example of such an intrinsic is amdgcn.wavefrontsize.
+// This pass should run as early as possible / immediately after Clang CodeGen,
+// so that the optimisation pipeline and the BE operate with concrete target
+// data.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "AMDGPUTargetMachine.h"
+#include "GCNSubtarget.h"
+
+#include "llvm/IR/Constants.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/Module.h"
+#include "llvm/Pass.h"
+
+using namespace llvm;
+
+static inline PreservedAnalyses expandWaveSizeIntrinsic(const GCNSubtarget &ST,
+                                                        Function *WaveSize) {
+  if (WaveSize->hasZeroLiveUses())
+    return PreservedAnalyses::all();
+
+  for (auto &&U : WaveSize->users())
+    U->replaceAllUsesWith(ConstantInt::get(WaveSize->getReturnType(),
+                                           ST.getWavefrontSize()));
+
+  return PreservedAnalyses::none();
+}
+
+PreservedAnalyses
+  AMDGPUExpandPseudoIntrinsicsPass::run(Module &M, ModuleAnalysisManager &) {
+
+  if (auto WS = M.getFunction("llvm.amdgcn.wavefrontsize"))
+    return expandWaveSizeIntrinsic(TM.getSubtarget<GCNSubtarget>(*WS), WS);
+
+  return PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 174a90f0aa419d..323c195c329168 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -27,6 +27,8 @@ MODULE_PASS("amdgpu-perf-hint",
               *static_cast<const GCNTargetMachine *>(this)))
 MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
 MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
+MODULE_PASS("amdgpu-expand-pseudo-intrinsics",
+            AMDGPUExpandPseudoIntrinsicsPass(*this))
 #undef MODULE_PASS
 
 #ifndef MODULE_PASS_WITH_PARAMS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index d93ec34a703d3d..2bf8df6588c59c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -739,7 +739,8 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
 #include "llvm/Passes/TargetPassRegistry.inc"
 
   PB.registerPipelineStartEPCallback(
-      [](ModulePassManager &PM, OptimizationLevel Level) {
+      [this](ModulePassManager &PM, OptimizationLevel Level) {
+        PM.addPass(AMDGPUExpandPseudoIntrinsicsPass(*this));
         FunctionPassManager FPM;
         PM.addPass(createModuleToFunctionPassAdaptor(std::move(FPM)));
         if (EnableHipStdPar)
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index fed29c3e14aae2..c9d4452b4a035c 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -54,6 +54,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPUCodeGenPrepare.cpp
   AMDGPUCombinerHelper.cpp
   AMDGPUCtorDtorLowering.cpp
+  AMDGPUExpandPseudoIntrinsics.cpp
   AMDGPUExportClustering.cpp
   AMDGPUFrameLowering.cpp
   AMDGPUGlobalISelDivergenceLowering.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll
index 824d3708c027db..efa53def5ee686 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll
@@ -1,3 +1,4 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
 ; RUN: llc -mtriple=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W64 %s
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize32 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W32 %s
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W64 %s
@@ -5,28 +6,43 @@
 ; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+wavefrontsize64 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W64 %s
 
 ; RUN: opt -O3 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -O3 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -passes='default<O3>' -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=tonga -O3 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s
+; RUN: opt -mtriple=amdgcn-- -O3 -S < %s | FileCheck -check-prefix=OPT-W64 %s
+; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
+; RUN: opt -mtriple=amdgcn-- -passes='default<O3>' -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
+; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT-W64 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=tonga -O3 -S < %s | FileCheck -check-prefix=OPT-W64 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT-W64 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT-W64 %s
 
 ; GCN-LABEL: {{^}}fold_wavefrontsize:
-; OPT-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
 
 ; W32:       v_mov_b32_e32 [[V:v[0-9]+]], 32
 ; W64:       v_mov_b32_e32 [[V:v[0-9]+]], 64
 ; GCN:       store_{{dword|b32}} v{{.+}}, [[V]]
 
-; OPT:   %tmp = tail call i32 @llvm.amdgcn.wavefrontsize()
-; OPT:   store i32 %tmp, ptr addrspace(1) %arg, align 4
-; OPT-NEXT:  ret void
 
 define amdgpu_kernel void @fold_wavefrontsize(ptr addrspace(1) nocapture %arg) {
+; OPT-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
+; OPT-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; OPT-NEXT:  [[BB:.*:]]
+; OPT-NEXT:    [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR2:[0-9]+]]
+; OPT-NEXT:    store i32 [[TMP]], ptr addrspace(1) [[ARG]], align 4
+; OPT-NEXT:    ret void
+;
+; OPT-W64-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
+; OPT-W64-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; OPT-W64-NEXT:  [[BB:.*:]]
+; OPT-W64-NEXT:    store i32 64, ptr addrspace(1) [[ARG]], align 4
+; OPT-W64-NEXT:    ret void
+;
+; OPT-W32-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
+; OPT-W32-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; OPT-W32-NEXT:  [[BB:.*:]]
+; OPT-W32-NEXT:    store i32 32, ptr addrspace(1) [[ARG]], align 4
+; OPT-W32-NEXT:    ret void
+;
 bb:
   %tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0
   store i32 %tmp, ptr addrspace(1) %arg, align 4
@@ -34,20 +50,35 @@ bb:
 }
 
 ; GCN-LABEL: {{^}}fold_and_optimize_wavefrontsize:
-; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
 
 ; W32:       v_mov_b32_e32 [[V:v[0-9]+]], 1{{$}}
 ; W64:       v_mov_b32_e32 [[V:v[0-9]+]], 2{{$}}
 ; GCN-NOT:   cndmask
 ; GCN:       store_{{dword|b32}} v{{.+}}, [[V]]
 
-; OPT:   %tmp = tail call i32 @llvm.amdgcn.wavefrontsize()
-; OPT:   %tmp1 = icmp ugt i32 %tmp, 32
-; OPT:   %tmp2 = select i1 %tmp1, i32 2, i32 1
-; OPT:   store i32 %tmp2, ptr addrspace(1) %arg
-; OPT-NEXT:  ret void
 
 define amdgpu_kernel void @fold_and_optimize_wavefrontsize(ptr addrspace(1) nocapture %arg) {
+; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
+; OPT-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-NEXT:  [[BB:.*:]]
+; OPT-NEXT:    [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR2]]
+; OPT-NEXT:    [[TMP1:%.*]] = icmp ugt i32 [[TMP]], 32
+; OPT-NEXT:    [[TMP2:%.*]] = select i1 [[TMP1]], i32 2, i32 1
+; OPT-NEXT:    store i32 [[TMP2]], ptr addrspace(1) [[ARG]], align 4
+; OPT-NEXT:    ret void
+;
+; OPT-W64-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
+; OPT-W64-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-W64-NEXT:  [[BB:.*:]]
+; OPT-W64-NEXT:    store i32 2, ptr addrspace(1) [[ARG]], align 4
+; OPT-W64-NEXT:    ret void
+;
+; OPT-W32-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
+; OPT-W32-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-W32-NEXT:  [[BB:.*:]]
+; OPT-W32-NEXT:    store i32 1, ptr addrspace(1) [[ARG]], align 4
+; OPT-W32-NEXT:    ret void
+;
 bb:
   %tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0
   %tmp1 = icmp ugt i32 %tmp, 32
@@ -57,15 +88,31 @@ bb:
 }
 
 ; GCN-LABEL: {{^}}fold_and_optimize_if_wavefrontsize:
-; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
-
-; OPT:       bb:
-; OPT:   %tmp = tail call i32 @llvm.amdgcn.wavefrontsize()
-; OPT:   %tmp1 = icmp ugt i32 %tmp, 32
-; OPT:   bb3:
-; OPT-NEXT:  ret void
 
 define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(ptr addrspace(1) nocapture %arg) {
+; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
+; OPT-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-NEXT:  [[BB:.*:]]
+; OPT-NEXT:    [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR2]]
+; OPT-NEXT:    [[TMP1:%.*]] = icmp ugt i32 [[TMP]], 32
+; OPT-NEXT:    br i1 [[TMP1]], label %[[BB2:.*]], label %[[BB3:.*]]
+; OPT:       [[BB2]]:
+; OPT-NEXT:    store i32 1, ptr addrspace(1) [[ARG]], align 4
+; OPT-NEXT:    br label %[[BB3]]
+; OPT:       [[BB3]]:
+; OPT-NEXT:    ret void
+;
+; OPT-W64-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
+; OPT-W64-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-W64-NEXT:  [[BB:.*:]]
+; OPT-W64-NEXT:    store i32 1, ptr addrspace(1) [[ARG]], align 4
+; OPT-W64-NEXT:    ret void
+;
+; OPT-W32-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
+; OPT-W32-SAME: ptr addrspace(1) nocapture readnone [[ARG:%.*]]) local_unnamed_addr #[[ATTR1:[0-9]+]] {
+; OPT-W32-NEXT:  [[BB:.*:]]
+; OPT-W32-NEXT:    ret void
+;
 bb:
   %tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0
   %tmp1 = icmp ugt i32 %tmp, 32

…le_wavefrontsize_early

github-actions · 2024-10-31T22:57:12Z

✅ With the latest revision this PR passed the C/C++ code formatter.

arsenm

We do not want or need a new pass to handle this. This is not a fix to the structural issue of wavesize. The problem is there is no such thing as a "no wavesize" IR. There is only wave32 or wave64. Querying the target gives the wrong answer for faux "generic" IR. Throwing in a pass that happens to know where it runs in the pipeline to decide when to lower is not a real fix; that is not a modular IR.

The correct solution is to use separate wave32 and wave64 builds. InstCombine can then just directly fold the intrinsic based on the known target.

arsenm · 2024-10-31T22:59:15Z

llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp

+PreservedAnalyses
+  AMDGPUExpandPseudoIntrinsicsPass::run(Module &M, ModuleAnalysisManager &) {
+
+  if (auto WS = M.getFunction("llvm.amdgcn.wavefrontsize"))


You can query by intrinsic ID now

arsenm · 2024-10-31T22:59:35Z

llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp

+  AMDGPUExpandPseudoIntrinsicsPass::run(Module &M, ModuleAnalysisManager &) {
+
+  if (auto WS = M.getFunction("llvm.amdgcn.wavefrontsize"))
+    return expandWaveSizeIntrinsic(TM.getSubtarget<GCNSubtarget>(*WS), WS);


Subtarget is per function, cannot rely on getting the subtarget from an intrinsic declaration

jhuber6 · 2024-10-31T23:03:16Z

I think the current use of this intrinsic in 'generic' IR is sound so long as it's not guarding anything ABI related. Right now it's just used for loop bounds and array offsets pretty much. Though long-term I agree that it's probably most sound to just put these as separate builds, but that would require all of our device runtime stuff going through my interface / using multilibs.

I think the only thing important here is that we don't do this prematurely, i.e. clang --target=amdgcn-amd-amdhsa -c -flto does not fold it since it's before codegen.

@JonChesterfield was working on 'generic' IR as well, so maybe he has some opinions.

arsenm · 2024-10-31T23:05:39Z

Just adding this to the pass pipeline where it is is no better than just doing it in instcombine, which is the natural place to do this. This patch, like instcombine, still has the problem that we don't know if we're producing the final code.

jhuber6 · 2024-10-31T23:07:44Z

Just adding this to the pass pipeline where it is is no better than just doing it in instcombine, which is the natural place to do this. This patch, like instcombine, still has the problem that we don't know if we're producing the final code.

Yeah that was my concern, it could go in the backend passes but then that would be pretty late. However, this is probably legal in general if the function has wavefrontsize attrs.

arsenm · 2024-10-31T23:33:53Z

Yeah that was my concern, it could go in the backend passes but then that would be pretty late. However, this is probably legal in general if the function has wavefrontsize attrs.

To be more specific, there is currently no wavefrontsize attribute. There are the wavefrontsize32/64 subtarget features, which may be contained in target-features. There is also an implied default by the target-cpu. Both of these may also come from the global target machine configuration. We should not impart meaning to the presence or absence of an attribute

AlexVlx · 2024-11-01T00:41:36Z

We do not want or need a new pass to handle this. This is not a fix to the structural issue of wavesize. The problem is there is no such thing as a "no wavesize" IR. There is only wave32 or wave64. Querying the target gives the wrong answer for faux "generic" IR. Throwing in a pass that happens to know where it runs in the pipeline to decide when to lower is not a real fix; that is not a modular IR.

The correct solution is to use separate wave32 and wave64 builds. InstCombine can then just directly fold the intrinsic based on the known ta

The new pass is not just to handle this, it happens to handle this since it already exists. Having an unspecified, abstract quantity is not the same thing as it being absent. Faux "generic" IR sounds like a problematic concept, do you have an example? Multi-builds might be the correct solution for something, but it's unclear what that something is - yes, if you already "fix" the wave size value, then the intrinsic is fairly spurious anyway, but it does not address the need to NOT encode it early. InstCombine might be an idea, but it runs a wee bit late.

jhuber6 · 2024-11-01T00:47:56Z

Faux "generic" IR sounds like a problematic concept, do you have an example?

It's what libc and the ROCm DeviceLibs do, compile or IR without -mcpu and don't use any target specific attributes or intrinsics, then link it into a TU later when the target is known. It's fine in principle if you hold it right, but the wavefrontsize is the one sticking issue, hence why Matt would suggest having two builds of libc, one for amdgcn-amd-amdhsa-wave32 and amdgcn-amd-amdhsa-wave64 or something.

AlexVlx · 2024-11-01T00:57:35Z

Just adding this to the pass pipeline where it is is no better than just doing it in instcombine, which is the natural place to do this. This patch, like instcombine, still has the problem that we don't know if we're producing the final code.

We could just turn this off for a particular compilation and maintain the current unfoldable state. Which makes it possibly preferable to keep this as a pass. Or do the two separate builds. We probably need to appreciate that the vast majority of clients are neither libc nor ROCDL in their needs, and might not be keen on multi-builds for something that is actually intended to be generic, like, in our case, AMDGCN flavoured SPIR-V.

AlexVlx · 2024-11-01T00:59:44Z

Faux "generic" IR sounds like a problematic concept, do you have an example?

It's what libc and the ROCm DeviceLibs do, compile or IR without -mcpu and don't use any target specific attributes or intrinsics, then link it into a TU later when the target is known. It's fine in principle if you hold it right, but the wavefrontsize is the one sticking issue, hence why Matt would suggest having two builds of libc, one for amdgcn-amd-amdhsa-wave32 and amdgcn-amd-amdhsa-wave64 or something.

As per my other reply, this is not an invalid use case, but somewhat niche. We can have a control value for disabling this early fold, for such builds, to avoid the need to do two builds (which might also be fine for libc). I don't think ROCDL uses the intrinsic at all.

jhuber6 · 2024-11-01T01:02:00Z

As per my other reply, this is not an invalid use case, but somewhat niche. We can have a control value for disabling this early fold, for such builds, to avoid the need to do two builds (which might also be fine for libc). I don't think ROCDL uses the intrinsic at all.

ROCDL does something worse IMO, which is linking in a magic global constant instead of just using an intrinsic. I figured it would be safe to check if the caller has target attributes at all, but apparently +wavefrontsize32 on the function isn't enough as per Matt's reply.

arsenm · 2024-11-01T02:52:22Z

Mechanically, this pass can be replaced with trivial handling of the intrinsic in AMDGPUInstCombineIntrinsic; we don't need a new module pass. As inserted into the pipeline here, this does not have any advantage over handling it directly in instcombine.

We could just turn this off for a particular compilation and maintain the current unfoldable state.

This violates the fundamental principles of a modular compiler IR. Any mechanism which we would have to invent to stop this fold from happening in a specific bitcode library build will be quite unsavory, and require handholding of every user to not run into the same issue. I'd like to systematically avoid this class problem by having a separate library build.

but apparently +wavefrontsize32 on the function isn't enough as per Matt's reply.

The toothpaste is out of the tube once the IR is produced. If some toolchain were relying on the global target machine features, there are opportunities for error on each tool invocation. The absence of the attribute does not tell you what the final compilation context will be.

AlexVlx · 2024-11-04T17:30:22Z

This is really just a constant folding rather than a new pass. If the concern is that InstCombine works too late that is possible to add an earlier invocation.

Done.

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll

jhuber6 · 2024-11-04T17:34:58Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

@@ -1024,6 +1024,15 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
    }
    break;
  }
+  case Intrinsic::amdgcn_wavefrontsize: {
+    // TODO: this is a workaround for the pseudo-generic target one gets with no
+    // specified mcpu, which spoofs its wave size to 64; it should be removed.


A real solution would be two builds, but spoofing it as 64 works, (likely unintentinally) because we don't do any w64 specific changes yet and w64 can always be narrowed to w32 and not the other way around.

I don't think that this interpretation is actually correct, if you rely on lockstep of a full wave and you optimise around wavesize this will break in bad ways on wave32. The current generic is not particularly god, but we have to live with it for now I guess.

We already do some light 64->32 folds, that are only sort of correct.

Technically we could make exec_hi an allocatable scratch register in wave32, but what we do now bakes in an assumption that exec_hi must always be 0.

But yes, the only way to really avoid any possible edge cases (and support a future of machine linked libraries) requires just having totally separate builds

rampitec · 2024-11-04T17:44:31Z

llvm/lib/Target/AMDGPU/AMDGPU.h

@@ -345,6 +345,15 @@ extern char &AMDGPUPrintfRuntimeBindingID;
 void initializeAMDGPUResourceUsageAnalysisPass(PassRegistry &);
 extern char &AMDGPUResourceUsageAnalysisID;

+struct AMDGPUExpandPseudoIntrinsicsPass


The pass isn't needed now?

rampitec · 2024-11-04T17:47:03Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+  case Intrinsic::amdgcn_wavefrontsize: {
+    // TODO: this is a workaround for the pseudo-generic target one gets with no
+    // specified mcpu, which spoofs its wave size to 64; it should be removed.
+    if ((ST->getCPU().empty() || ST->getCPU() == "generic") &&


Less than ideal... I am not sure if there is a way to check that a fixed wavefront size is in the subtarget description and not added as an -mattr?

None that I could find because we spoof the Wave64 in when it's not specified, so the only differentiator that I could think of is that the mattr is never set. If you have an alternative, that'd be great.

I do not really have one. Maybe it is OK for now.

This is really gross. We also do have a "generic-hsa" target-cpu name

Thanks for the input. What is the suggested solution?

rampitec

In general LTGM.

rampitec · 2024-11-04T18:26:47Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+  case Intrinsic::amdgcn_wavefrontsize: {
+    // TODO: this is a workaround for the pseudo-generic target one gets with no
+    // specified mcpu, which spoofs its wave size to 64; it should be removed.
+    if ((ST->getCPU().empty() || ST->getCPU() == "generic") &&


I do not really have one. Maybe it is OK for now.

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

arsenm · 2024-11-04T19:07:37Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+  case Intrinsic::amdgcn_wavefrontsize: {
+    // TODO: this is a workaround for the pseudo-generic target one gets with no
+    // specified mcpu, which spoofs its wave size to 64; it should be removed.
+    if ((ST->getCPU().empty() || ST->getCPU() == "generic") &&


This is really gross. We also do have a "generic-hsa" target-cpu name

arsenm · 2024-11-04T19:09:45Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

@@ -1024,6 +1024,15 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
    }
    break;
  }
+  case Intrinsic::amdgcn_wavefrontsize: {
+    // TODO: this is a workaround for the pseudo-generic target one gets with no
+    // specified mcpu, which spoofs its wave size to 64; it should be removed.


We already do some light 64->32 folds, that are only sort of correct.

Technically we could make exec_hi an allocatable scratch register in wave32, but what we do now bakes in an assumption that exec_hi must always be 0.

But yes, the only way to really avoid any possible edge cases (and support a future of machine linked libraries) requires just having totally separate builds

arsenm · 2024-11-04T19:12:38Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+    // TODO: this is a workaround for the pseudo-generic target one gets with no
+    // specified mcpu, which spoofs its wave size to 64; it should be removed.
+    if ((ST->getCPU().empty() || ST->getCPU() == "generic") &&
+        !ST->getFeatureString().contains("+wavefrontsize"))


The feature string may also contain a -wavefrontsize.

It's probably safest to ignore the target-features. If we're really going to rely on this target-cpu hack for the library uses, rocm-device-libs is not using an explicit wavefrontsize feature anymore (all the uses were converted to the ballot wave64->wave32 hack)

You also do have the subtarget already. Should probably move the logic in there, instead of spreading the default CPU logic parsing into a new place

Since this is a temporary hack, what's the point of putting it in the subtarget so that people get ideas, start using it, and then there's even more technical debt? The parsing might have to change anyway once we have a proper generic target (which the current hack is not).

To keep the hack isolated to one place, instead of spreading it around. You've already missed "generic-hsa" for example. The wavesize target parsing is also hacky, and we already have other hacky parsing in the subtarget constructor.

We could also implement this by making the generic target actually have 0 wavesize, and replacing the isWave64 predicates with wavesize != 64

That seems reasonable, there's also an argument that the backend likely can't do anything useful without -mcpu= set so we could make that an error. The NVIDIA tools basically do that themselves.

Well, I don't think we should be doing live design of generic (which is part of what got us here anyway), so I'd rather not build even more technical debt around its current form which was meant to be a test only kludge:

// The code produced for "generic" is only useful for tests and cannot // reasonably be expected to execute on any particular target.

Which is to say I don't want to what is there now, I want it to not break. I've adjusted the check to cover generic-hsa and corrected the commit message.

I'd be fine with multiple builds, but right now the AMDGCN infra doesn't support it very well since we'd need to port the ROCm Device Libs to use my build system. Beyond that it'd be pretty easy to just default the triple depending on mcpu and -mwavefrontsize.

I guess we couldn't make a helper that is like ST.hasCPU(). The comments for 'generic-hsa' are different because the 'generic' IR in the RTlibs never touches the backend without having a CPU set.

Perhaps we can simply live with starts_with for now, as the check won't be infectious and we'll remove it in the future anyway.

Unsure if we even need to bother checking for 'generic' since that's not what any of the existing targets use for generic AFAIC. It's just not setting -mcpu when you compile, which I think should probably be an error in the backend.

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll

arsenm · 2024-11-04T19:13:39Z

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll

+; RUN: opt -mtriple=amdgcn-- -passes='default<O3>' -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
+; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT-W64 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=tonga -O3 -S < %s | FileCheck -check-prefix=OPT-W64 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s


This codegen test shouldn't be running all of these passes

It already was, mostly? It seems worthwhile to individualise the possible / plausible scenarios.

Simplified.

…le_wavefrontsize_early

JonChesterfield · 2024-11-07T17:21:20Z

Awesome! This is absolutely something that has been on my todo stack for ages and it's very good to see someone else writing the thing.

It looks like the implementation is contentious so I'll leave that for the moment. Under some time constraints so please forgive the length of the following - TLDR is I love this and definitely want the feature.

A magic intrinsic which can be summoned from clang and used in things like if statements is the right construct for all the stuff which we known by codegen time and don't really need to care about in the front end. The code object version should probably be reified as one. The number of compute units. All the things currently handled by magic global variables in the rocm device library and relying on O1 to constant fold them out of existence. We do this, throw away the magic globals, everything is better.

The magic intrinsic can have guaranteed constant folding even at O0. That kills that class of O0 doesn't work bugs. This means it needs something that is certain to remove it ahead of a simplifycfg pass, or for the pass which removes it to also do the trivial fold of a branch. So it's not just constant folding, though having instcombine also constant fold it is fine as an optimisation.

The real value here in my opinion is towards being able to write IR libraries that don't know or care what target they're going to run on. Either because they're associated with spir-v, or because they're the libc which currently handwaves that problem, or the rocm device libs which handwaves it in a slightly different way, or the openmp runtime which currently builds K identical copies of the bitcode with different names in the spirit of correctness and stashes them in an archive. Lots of that is on sketchy QoI ground at present. But if we have an wavesize intrinsic that turns into 32 or 64 once the target is known, and hangs around in the IR until some later information about the target is revealed, we can have single IR and predictable correct semantics. Much better than the status quo.

Thanks Alex!

…le_wavefrontsize_early

arsenm

Commit message should be adjusted, it's talking about the old pass

arsenm · 2024-11-18T16:24:04Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+    // TODO: this is a workaround for the pseudo-generic target one gets with no
+    // specified mcpu, which spoofs its wave size to 64; it should be removed.
+    if ((ST->getCPU().empty() || ST->getCPU() == "generic") &&
+        !ST->getFeatureString().contains("+wavefrontsize"))


To keep the hack isolated to one place, instead of spreading it around. You've already missed "generic-hsa" for example. The wavesize target parsing is also hacky, and we already have other hacky parsing in the subtarget constructor.

We could also implement this by making the generic target actually have 0 wavesize, and replacing the isWave64 predicates with wavesize != 64

…le_wavefrontsize_early

…lx/llvm-project into handle_wavefrontsize_early

jhuber6 · 2024-11-23T02:32:42Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+  case Intrinsic::amdgcn_wavefrontsize: {
+    // TODO: this is a workaround for the pseudo-generic target one gets with no
+    // specified mcpu, which spoofs its wave size to 64; it should be removed.
+    if ((ST->getCPU().empty() || ST->getCPU().starts_with("generic")) &&


Still don't know if this is even worth checking being .empty., but I suppose we could codify that -mcpu="" and -mcpu="generic" are the same thing? My understanding is that generic was some testing thing.

arsenm · 2024-11-23T02:42:56Z

#117410 gives you a way to do this without explicitly looking at the features or CPU

Add pass to handle AMDGCN pseudo-intrinsics (abstract placeholders fo…

3ba88ce

…r target specific info), and add handling for `llvm.amdgcn.wavefrontsize`.

llvmbot added clang Clang issues not falling into any other category backend:AMDGPU labels Oct 31, 2024

AlexVlx requested review from arsenm, rampitec and jhuber6 October 31, 2024 22:51

AlexVlx added tools:opt and removed clang Clang issues not falling into any other category labels Oct 31, 2024

AlexVlx changed the title ~~[opt][AMDGPU] Add pass to handle AMDGCN pseudo-intrinsics target specific info), start with llvm.amdgcn.wavefrontsize.~~ [opt][AMDGPU] Add pass to handle AMDGCN pseudo-intrinsics target specific info), start with llvm.amdgcn.wavefrontsize Oct 31, 2024

AlexVlx mentioned this pull request Oct 31, 2024

[Clang][HIP] Deprecate the AMDGCN_WAVEFRONT_SIZE macros #112849

Merged

Merge branch 'main' of https://github.com/llvm/llvm-project into hand…

1376596

…le_wavefrontsize_early

llvmbot added the clang Clang issues not falling into any other category label Oct 31, 2024

arsenm requested changes Oct 31, 2024

View reviewed changes

arsenm reviewed Oct 31, 2024

View reviewed changes

Implement review feedback.

826c291

AlexVlx added 2 commits November 1, 2024 02:02

Do not fold early for generic mcpu.

ab6f5a2

Fix formatting (again).

f8705fb

AlexVlx changed the title ~~[opt][AMDGPU] Add pass to handle AMDGCN pseudo-intrinsics, start with llvm.amdgcn.wavefrontsize~~ [llvm][AMDGPU] Fold llvm.amdgcn.wavefrontsize early Nov 4, 2024

AlexVlx requested review from arsenm and shiltian November 4, 2024 17:29

AlexVlx added llvm:instcombine and removed tools:opt clang Clang issues not falling into any other category labels Nov 4, 2024

jhuber6 reviewed Nov 4, 2024

View reviewed changes

rampitec reviewed Nov 4, 2024

View reviewed changes

Remove leftovers.

195decc

llvmbot added the clang Clang issues not falling into any other category label Nov 4, 2024

AlexVlx added 2 commits November 4, 2024 19:54

Remove pass.

1a7abaf

Fix formatting.

9aed76c

rampitec reviewed Nov 4, 2024

View reviewed changes

Really fix formatting.

246c22f

arsenm reviewed Nov 4, 2024

View reviewed changes

AlexVlx added 4 commits November 6, 2024 16:40

Merge branch 'main' of https://github.com/llvm/llvm-project into hand…

5a11720

…le_wavefrontsize_early

Split tests.

7cf7558

Merge branch 'main' of https://github.com/llvm/llvm-project into hand…

6a77b8a

…le_wavefrontsize_early

Merge branch 'main' of https://github.com/llvm/llvm-project into hand…

be414a8

…le_wavefrontsize_early

Merge branch 'main' of https://github.com/llvm/llvm-project into hand…

dedc593

…le_wavefrontsize_early

AlexVlx requested a review from arsenm November 17, 2024 12:39

arsenm reviewed Nov 18, 2024

View reviewed changes

AlexVlx added 3 commits November 18, 2024 16:53

Merge branch 'main' of https://github.com/llvm/llvm-project into hand…

c634b4e

…le_wavefrontsize_early

Merge branch 'handle_wavefrontsize_early' of https://github.com/AlexV…

c7be46f

…lx/llvm-project into handle_wavefrontsize_early

Tweak generic mcpu handling.

ed9f19f

jhuber6 reviewed Nov 23, 2024

View reviewed changes

[llvm][AMDGPU] Fold llvm.amdgcn.wavefrontsize early #114481

Are you sure you want to change the base?

[llvm][AMDGPU] Fold llvm.amdgcn.wavefrontsize early #114481

Conversation

AlexVlx commented Oct 31, 2024 • edited Loading

llvmbot commented Oct 31, 2024

llvmbot commented Oct 31, 2024

github-actions bot commented Oct 31, 2024 • edited Loading

arsenm left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jhuber6 commented Oct 31, 2024

arsenm commented Oct 31, 2024

jhuber6 commented Oct 31, 2024

arsenm commented Oct 31, 2024 • edited Loading

AlexVlx commented Nov 1, 2024

jhuber6 commented Nov 1, 2024 • edited Loading

AlexVlx commented Nov 1, 2024

AlexVlx commented Nov 1, 2024

jhuber6 commented Nov 1, 2024

arsenm commented Nov 1, 2024

AlexVlx commented Nov 4, 2024

jhuber6 Nov 4, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rampitec left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JonChesterfield commented Nov 7, 2024

arsenm left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jhuber6 Nov 23, 2024 • edited Loading

Choose a reason for hiding this comment

arsenm commented Nov 23, 2024

[llvm][AMDGPU] Fold `llvm.amdgcn.wavefrontsize` early #114481

[llvm][AMDGPU] Fold `llvm.amdgcn.wavefrontsize` early #114481

AlexVlx commented Oct 31, 2024 •

edited

Loading

github-actions bot commented Oct 31, 2024 •

edited

Loading

arsenm commented Oct 31, 2024 •

edited

Loading

jhuber6 commented Nov 1, 2024 •

edited

Loading

jhuber6 Nov 4, 2024 •

edited

Loading

jhuber6 Nov 23, 2024 •

edited

Loading