[llvm][AMDGPU] Fold llvm.amdgcn.wavefrontsize early #114481
base: main
Conversation
…r target specific info), and add handling for `llvm.amdgcn.wavefrontsize`.
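To make the motivation concrete, the kind of HLL code this enables looks roughly like the hypothetical HIP-style snippet below (illustrative only, not taken from this patch; the function name and the wave-count computation are invented). Once `llvm.amdgcn.wavefrontsize` is folded to a constant, the branch predicate is constant and the dead arm is removed by the middle end.

```cpp
// Hypothetical HIP-style device code (not from this patch). The clang builtin
// lowers to llvm.amdgcn.wavefrontsize; after the early fold the comparison
// below is against a constant, so only one branch survives optimisation.
__device__ void store_wave_count(unsigned *out, unsigned num_threads) {
  if (__builtin_amdgcn_wavefrontsize() == 64)
    *out = (num_threads + 63) / 64; // wave64 path
  else
    *out = (num_threads + 31) / 32; // wave32 path
}
```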
@llvm/pr-subscribers-clang

Author: Alex Voicu (AlexVlx)

Changes

Pseudo-intrinsics are a mechanism for conveying that some target specific constant info / quantity exists, without inducing AST mutation based on said quantity. They enable making e.g. control flow decisions in a HLL that end up with a constant evaluated predicate in the ME and optimised accordingly, without making the AST (even more) target specific. This change adds an …

Full diff: https://github.com/llvm/llvm-project/pull/114481.diff

7 Files Affected:
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
index bf5f2971cf118c..de6a06dad6a08d 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
@@ -1,6 +1,6 @@
// REQUIRES: amdgpu-registered-target
// RUN: %clang_cc1 -cl-std=CL2.0 -triple amdgcn-unknown-unknown -target-cpu tahiti -emit-llvm -o - %s | FileCheck -enable-var-scope --check-prefixes=CHECK-AMDGCN,CHECK %s
-// RUN: %clang_cc1 -cl-std=CL2.0 -triple spirv64-amd-amdhsa -emit-llvm -o - %s | FileCheck -enable-var-scope --check-prefix=CHECK %s
+// RUN: %clang_cc1 -cl-std=CL2.0 -triple spirv64-amd-amdhsa -emit-llvm -o - %s | FileCheck -enable-var-scope --check-prefixes=CHECK,CHECK-SPIRV %s
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
@@ -866,7 +866,8 @@ void test_atomic_inc_dec(__attribute__((address_space(3))) uint *lptr, __attribu
// CHECK-LABEL test_wavefrontsize(
unsigned test_wavefrontsize() {
- // CHECK: {{.*}}call{{.*}} i32 @llvm.amdgcn.wavefrontsize()
+ // CHECK-AMDGCN: ret i32 {{[0-9]+}}
+ // CHECK-SPIRV: {{.*}}call{{.*}} i32 @llvm.amdgcn.wavefrontsize()
return __builtin_amdgcn_wavefrontsize();
}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 95d0ad0f9dc96a..17d3e6ab7c65ab 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -345,6 +345,15 @@ extern char &AMDGPUPrintfRuntimeBindingID;
void initializeAMDGPUResourceUsageAnalysisPass(PassRegistry &);
extern char &AMDGPUResourceUsageAnalysisID;
+struct AMDGPUExpandPseudoIntrinsicsPass
+ : PassInfoMixin<AMDGPUExpandPseudoIntrinsicsPass> {
+ const AMDGPUTargetMachine &TM;
+ AMDGPUExpandPseudoIntrinsicsPass(const AMDGPUTargetMachine &ATM) : TM(ATM) {}
+ PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+
+ static bool isRequired() { return true; }
+};
+
struct AMDGPUPrintfRuntimeBindingPass
: PassInfoMixin<AMDGPUPrintfRuntimeBindingPass> {
PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp b/llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp
new file mode 100644
index 00000000000000..faa23bb8550dbc
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPUExpandPseudoIntrinsics.cpp
@@ -0,0 +1,49 @@
+//===- AMDGPUExpandPseudoIntrinsics.cpp - Pseudo Intrinsic Expander Pass --===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+// This file implements a pass that deals with expanding AMDGCN generic pseudo-
+// intrinsics into target specific quantities / sequences. In this context, a
+// pseudo-intrinsic is an AMDGCN intrinsic that does not directly map to a
+// specific instruction, but rather is intended as a mechanism for abstractly
+// conveying target specific info to a HLL / the FE, without concretely
+// impacting the AST. An example of such an intrinsic is amdgcn.wavefrontsize.
+// This pass should run as early as possible / immediately after Clang CodeGen,
+// so that the optimisation pipeline and the BE operate with concrete target
+// data.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPU.h"
+#include "AMDGPUTargetMachine.h"
+#include "GCNSubtarget.h"
+
+#include "llvm/IR/Constants.h"
+#include "llvm/IR/Function.h"
+#include "llvm/IR/Module.h"
+#include "llvm/Pass.h"
+
+using namespace llvm;
+
+static inline PreservedAnalyses expandWaveSizeIntrinsic(const GCNSubtarget &ST,
+ Function *WaveSize) {
+ if (WaveSize->hasZeroLiveUses())
+ return PreservedAnalyses::all();
+
+ for (auto &&U : WaveSize->users())
+ U->replaceAllUsesWith(ConstantInt::get(WaveSize->getReturnType(),
+ ST.getWavefrontSize()));
+
+ return PreservedAnalyses::none();
+}
+
+PreservedAnalyses
+ AMDGPUExpandPseudoIntrinsicsPass::run(Module &M, ModuleAnalysisManager &) {
+
+ if (auto WS = M.getFunction("llvm.amdgcn.wavefrontsize"))
+ return expandWaveSizeIntrinsic(TM.getSubtarget<GCNSubtarget>(*WS), WS);
+
+ return PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index 174a90f0aa419d..323c195c329168 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -27,6 +27,8 @@ MODULE_PASS("amdgpu-perf-hint",
*static_cast<const GCNTargetMachine *>(this)))
MODULE_PASS("amdgpu-printf-runtime-binding", AMDGPUPrintfRuntimeBindingPass())
MODULE_PASS("amdgpu-unify-metadata", AMDGPUUnifyMetadataPass())
+MODULE_PASS("amdgpu-expand-pseudo-intrinsics",
+ AMDGPUExpandPseudoIntrinsicsPass(*this))
#undef MODULE_PASS
#ifndef MODULE_PASS_WITH_PARAMS
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index d93ec34a703d3d..2bf8df6588c59c 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -739,7 +739,8 @@ void AMDGPUTargetMachine::registerPassBuilderCallbacks(PassBuilder &PB) {
#include "llvm/Passes/TargetPassRegistry.inc"
PB.registerPipelineStartEPCallback(
- [](ModulePassManager &PM, OptimizationLevel Level) {
+ [this](ModulePassManager &PM, OptimizationLevel Level) {
+ PM.addPass(AMDGPUExpandPseudoIntrinsicsPass(*this));
FunctionPassManager FPM;
PM.addPass(createModuleToFunctionPassAdaptor(std::move(FPM)));
if (EnableHipStdPar)
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index fed29c3e14aae2..c9d4452b4a035c 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -54,6 +54,7 @@ add_llvm_target(AMDGPUCodeGen
AMDGPUCodeGenPrepare.cpp
AMDGPUCombinerHelper.cpp
AMDGPUCtorDtorLowering.cpp
+ AMDGPUExpandPseudoIntrinsics.cpp
AMDGPUExportClustering.cpp
AMDGPUFrameLowering.cpp
AMDGPUGlobalISelDivergenceLowering.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll
index 824d3708c027db..efa53def5ee686 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wavefrontsize.ll
@@ -1,3 +1,4 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W64 %s
; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize32 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W32 %s
; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W64 %s
@@ -5,28 +6,43 @@
; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+wavefrontsize64 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W64 %s
; RUN: opt -O3 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -O3 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -passes='default<O3>' -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=tonga -O3 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s
-; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s
+; RUN: opt -mtriple=amdgcn-- -O3 -S < %s | FileCheck -check-prefix=OPT-W64 %s
+; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
+; RUN: opt -mtriple=amdgcn-- -passes='default<O3>' -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
+; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT-W64 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=tonga -O3 -S < %s | FileCheck -check-prefix=OPT-W64 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT-W64 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
+; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT-W64 %s
; GCN-LABEL: {{^}}fold_wavefrontsize:
-; OPT-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
; W32: v_mov_b32_e32 [[V:v[0-9]+]], 32
; W64: v_mov_b32_e32 [[V:v[0-9]+]], 64
; GCN: store_{{dword|b32}} v{{.+}}, [[V]]
-; OPT: %tmp = tail call i32 @llvm.amdgcn.wavefrontsize()
-; OPT: store i32 %tmp, ptr addrspace(1) %arg, align 4
-; OPT-NEXT: ret void
define amdgpu_kernel void @fold_wavefrontsize(ptr addrspace(1) nocapture %arg) {
+; OPT-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
+; OPT-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; OPT-NEXT: [[BB:.*:]]
+; OPT-NEXT: [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR2:[0-9]+]]
+; OPT-NEXT: store i32 [[TMP]], ptr addrspace(1) [[ARG]], align 4
+; OPT-NEXT: ret void
+;
+; OPT-W64-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
+; OPT-W64-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; OPT-W64-NEXT: [[BB:.*:]]
+; OPT-W64-NEXT: store i32 64, ptr addrspace(1) [[ARG]], align 4
+; OPT-W64-NEXT: ret void
+;
+; OPT-W32-LABEL: define amdgpu_kernel void @fold_wavefrontsize(
+; OPT-W32-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] {
+; OPT-W32-NEXT: [[BB:.*:]]
+; OPT-W32-NEXT: store i32 32, ptr addrspace(1) [[ARG]], align 4
+; OPT-W32-NEXT: ret void
+;
bb:
%tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0
store i32 %tmp, ptr addrspace(1) %arg, align 4
@@ -34,20 +50,35 @@ bb:
}
; GCN-LABEL: {{^}}fold_and_optimize_wavefrontsize:
-; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
; W32: v_mov_b32_e32 [[V:v[0-9]+]], 1{{$}}
; W64: v_mov_b32_e32 [[V:v[0-9]+]], 2{{$}}
; GCN-NOT: cndmask
; GCN: store_{{dword|b32}} v{{.+}}, [[V]]
-; OPT: %tmp = tail call i32 @llvm.amdgcn.wavefrontsize()
-; OPT: %tmp1 = icmp ugt i32 %tmp, 32
-; OPT: %tmp2 = select i1 %tmp1, i32 2, i32 1
-; OPT: store i32 %tmp2, ptr addrspace(1) %arg
-; OPT-NEXT: ret void
define amdgpu_kernel void @fold_and_optimize_wavefrontsize(ptr addrspace(1) nocapture %arg) {
+; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
+; OPT-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-NEXT: [[BB:.*:]]
+; OPT-NEXT: [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR2]]
+; OPT-NEXT: [[TMP1:%.*]] = icmp ugt i32 [[TMP]], 32
+; OPT-NEXT: [[TMP2:%.*]] = select i1 [[TMP1]], i32 2, i32 1
+; OPT-NEXT: store i32 [[TMP2]], ptr addrspace(1) [[ARG]], align 4
+; OPT-NEXT: ret void
+;
+; OPT-W64-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
+; OPT-W64-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-W64-NEXT: [[BB:.*:]]
+; OPT-W64-NEXT: store i32 2, ptr addrspace(1) [[ARG]], align 4
+; OPT-W64-NEXT: ret void
+;
+; OPT-W32-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize(
+; OPT-W32-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-W32-NEXT: [[BB:.*:]]
+; OPT-W32-NEXT: store i32 1, ptr addrspace(1) [[ARG]], align 4
+; OPT-W32-NEXT: ret void
+;
bb:
%tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0
%tmp1 = icmp ugt i32 %tmp, 32
@@ -57,15 +88,31 @@ bb:
}
; GCN-LABEL: {{^}}fold_and_optimize_if_wavefrontsize:
-; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
-
-; OPT: bb:
-; OPT: %tmp = tail call i32 @llvm.amdgcn.wavefrontsize()
-; OPT: %tmp1 = icmp ugt i32 %tmp, 32
-; OPT: bb3:
-; OPT-NEXT: ret void
define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(ptr addrspace(1) nocapture %arg) {
+; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
+; OPT-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-NEXT: [[BB:.*:]]
+; OPT-NEXT: [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR2]]
+; OPT-NEXT: [[TMP1:%.*]] = icmp ugt i32 [[TMP]], 32
+; OPT-NEXT: br i1 [[TMP1]], label %[[BB2:.*]], label %[[BB3:.*]]
+; OPT: [[BB2]]:
+; OPT-NEXT: store i32 1, ptr addrspace(1) [[ARG]], align 4
+; OPT-NEXT: br label %[[BB3]]
+; OPT: [[BB3]]:
+; OPT-NEXT: ret void
+;
+; OPT-W64-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
+; OPT-W64-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] {
+; OPT-W64-NEXT: [[BB:.*:]]
+; OPT-W64-NEXT: store i32 1, ptr addrspace(1) [[ARG]], align 4
+; OPT-W64-NEXT: ret void
+;
+; OPT-W32-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(
+; OPT-W32-SAME: ptr addrspace(1) nocapture readnone [[ARG:%.*]]) local_unnamed_addr #[[ATTR1:[0-9]+]] {
+; OPT-W32-NEXT: [[BB:.*:]]
+; OPT-W32-NEXT: ret void
+;
bb:
%tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0
%tmp1 = icmp ugt i32 %tmp, 32
✅ With the latest revision this PR passed the C/C++ code formatter.
We do not want or need a new pass to handle this. This is not a fix to the structural issue of wavesize. The problem is there is no such thing as a "no wavesize" IR. There is only wave32 or wave64. Querying the target gives the wrong answer for faux "generic" IR. Throwing in a pass that happens to know where it runs in the pipeline to decide when to lower is not a real fix; that is not a modular IR.
The correct solution is to use separate wave32 and wave64 builds. InstCombine can then just directly fold the intrinsic based on the known target.
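A minimal sketch of that direct fold, assuming it is added to the existing switch in `GCNTTIImpl::instCombineIntrinsic` and that `ST->getWavefrontSize()` already reflects the configuration final codegen will use; this is illustrative rather than the exact code the patch ends up with:

```cpp
// Sketch only: fold the pseudo-intrinsic to the subtarget's wave size inside
// GCNTTIImpl::instCombineIntrinsic. Assumes ST is the GCNSubtarget of the
// calling function and that its wave size is the one used for final codegen.
case Intrinsic::amdgcn_wavefrontsize: {
  // Rewrite all uses of the call to a constant i32; InstCombine then erases
  // the dead call and simplifies any control flow guarded by it.
  return IC.replaceInstUsesWith(
      II, ConstantInt::get(II.getType(), ST->getWavefrontSize()));
}
```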
PreservedAnalyses
AMDGPUExpandPseudoIntrinsicsPass::run(Module &M, ModuleAnalysisManager &) {

  if (auto WS = M.getFunction("llvm.amdgcn.wavefrontsize"))
You can query by intrinsic ID now
Done.
AMDGPUExpandPseudoIntrinsicsPass::run(Module &M, ModuleAnalysisManager &) {

  if (auto WS = M.getFunction("llvm.amdgcn.wavefrontsize"))
    return expandWaveSizeIntrinsic(TM.getSubtarget<GCNSubtarget>(*WS), WS);
Subtarget is per function, cannot rely on getting the subtarget from an intrinsic declaration
Done.
I think the current use of this intrinsic in 'generic' IR is sound so long as it's not guarding anything ABI related. Right now it's just used for loop bounds and array offsets pretty much. Though long-term I agree that it's probably most sound to just put these as separate builds, but that would require all of our device runtime stuff going through my interface / using multilibs. I think the only thing important here is that we don't do this prematurely, i.e. @JonChesterfield was working on 'generic' IR as well, so maybe he has some opinions.

Just adding this to the pass pipeline where it is is no better than just doing it in instcombine, which is the natural place to do this. This patch, like instcombine, still has the problem that we don't know if we're producing the final code.

Yeah that was my concern, it could go in the backend passes but then that would be pretty late. However, this is probably legal in general if the function has …

To be more specific, there is currently no wavefrontsize attribute. There are the wavefrontsize32/64 subtarget features, which may be contained in target-features. There is also an implied default by the target-cpu. Both of these may also come from the global target machine configuration. We should not impart meaning to the presence or absence of an attribute.

The new pass is not just to handle this, it happens to handle this since it already exists. Having an unspecified, abstract quantity is not the same thing as it being absent. Faux "generic" IR sounds like a problematic concept, do you have an example? Multi-builds might be the correct solution for something, but it's unclear what that something is - yes, if you already "fix" the wave size value, then the intrinsic is fairly spurious anyway, but it does not address the need to NOT encode it early.

It's what …

We could just turn this off for a particular compilation and maintain the current unfoldable state. Which makes it possibly preferable to keep this as a pass. Or do the two separate builds. We probably need to appreciate that the vast majority of clients are neither …

As per my other reply, this is not an invalid use case, but somewhat niche. We can have a control value for disabling this early fold, for such builds, to avoid the need to do two builds (which might also be fine for …

ROCDL does something worse IMO, which is linking in a magic global constant instead of just using an intrinsic. I figured it would be safe to check if the caller has target attributes at all, but apparently …
Mechanically, this pass can be replaced with trivial handling of the intrinsic in AMDGPUInstCombineIntrinsic; we don't need a new module pass. As inserted into the pipeline here, this does not have any advantage over handling it directly in instcombine.
This violates the fundamental principles of a modular compiler IR. Any mechanism which we would have to invent to stop this fold from happening in a specific bitcode library build will be quite unsavory, and require handholding of every user to not run into the same issue. I'd like to systematically avoid this class of problem by having a separate library build.

The toothpaste is out of the tube once the IR is produced. If some toolchain were relying on the global target machine features, there are opportunities for error on each tool invocation. The absence of the attribute does not tell you what the final compilation context will be.
Done.
@@ -1024,6 +1024,15 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const {
    }
    break;
  }
  case Intrinsic::amdgcn_wavefrontsize: {
    // TODO: this is a workaround for the pseudo-generic target one gets with no
    // specified mcpu, which spoofs its wave size to 64; it should be removed.
A real solution would be two builds, but spoofing it as 64 works (likely unintentionally) because we don't do any w64 specific changes yet and w64 can always be narrowed to w32 and not the other way around.
I don't think that this interpretation is actually correct: if you rely on lockstep of a full wave and you optimise around wavesize, this will break in bad ways on wave32. The current generic is not particularly good, but we have to live with it for now, I guess.
We already do some light 64->32 folds, that are only sort of correct.
Technically we could make exec_hi an allocatable scratch register in wave32, but what we do now bakes in an assumption that exec_hi must always be 0.
But yes, the only way to really avoid any possible edge cases (and support a future of machine linked libraries) requires just having totally separate builds
llvm/lib/Target/AMDGPU/AMDGPU.h
Outdated
@@ -345,6 +345,15 @@ extern char &AMDGPUPrintfRuntimeBindingID;
void initializeAMDGPUResourceUsageAnalysisPass(PassRegistry &);
extern char &AMDGPUResourceUsageAnalysisID;

struct AMDGPUExpandPseudoIntrinsicsPass
The pass isn't needed now?
case Intrinsic::amdgcn_wavefrontsize: {
  // TODO: this is a workaround for the pseudo-generic target one gets with no
  // specified mcpu, which spoofs its wave size to 64; it should be removed.
  if ((ST->getCPU().empty() || ST->getCPU() == "generic") &&
Less than ideal... I am not sure if there is a way to check that a fixed wavefront size is in the subtarget description and not added as an -mattr?
None that I could find, because we spoof the Wave64 in when it's not specified, so the only differentiator that I could think of is that the mattr is never set. If you have an alternative, that'd be great.
I do not really have one. Maybe it is OK for now.
This is really gross. We also do have a "generic-hsa" target-cpu name
Thanks for the input. What is the suggested solution?
In general LGTM.
// TODO: this is a workaround for the pseudo-generic target one gets with no
// specified mcpu, which spoofs its wave size to 64; it should be removed.
if ((ST->getCPU().empty() || ST->getCPU() == "generic") &&
    !ST->getFeatureString().contains("+wavefrontsize"))
The feature string may also contain a -wavefrontsize.
It's probably safest to ignore the target-features. If we're really going to rely on this target-cpu hack for the library uses, rocm-device-libs is not using an explicit wavefrontsize feature anymore (all the uses were converted to the ballot wave64->wave32 hack)
You also do have the subtarget already. Should probably move the logic in there, instead of spreading the default CPU logic parsing into a new place
Since this is a temporary hack, what's the point of putting it in the subtarget so that people get ideas, start using it, and then there's even more technical debt? The parsing might have to change anyway once we have a proper generic target (which the current hack is not).
To keep the hack isolated to one place, instead of spreading it around. You've already missed "generic-hsa" for example. The wavesize target parsing is also hacky, and we already have other hacky parsing in the subtarget constructor.
We could also implement this by making the generic target actually have 0 wavesize, and replacing the isWave64 predicates with wavesize != 64
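For illustration only, the alternative described above might look roughly like this on the subtarget side (the `isWaveSizeKnown` helper and the 0-means-unknown convention are assumptions, not existing API):

```cpp
// Illustrative only: let an unconfigured ("generic") subtarget report an
// unknown wave size instead of spoofing 64, and make callers check for it.
unsigned getWavefrontSize() const { return WavefrontSize; } // 0 == unknown
bool isWaveSizeKnown() const { return WavefrontSize != 0; } // hypothetical helper
bool isWave32() const { return WavefrontSize == 32; }
bool isWave64() const { return WavefrontSize == 64; }
```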
That seems reasonable; there's also an argument that the backend likely can't do anything useful without -mcpu= set, so we could make that an error. The NVIDIA tools basically do that themselves.
Well, I don't think we should be doing live design of generic (which is part of what got us here anyway), so I'd rather not build even more technical debt around its current form, which was meant to be a test-only kludge:

// The code produced for "generic" is only useful for tests and cannot
// reasonably be expected to execute on any particular target.

Which is to say I don't want to change what is there now, I want it to not break. I've adjusted the check to cover generic-hsa and corrected the commit message.
I'd be fine with multiple builds, but right now the AMDGCN infra doesn't support it very well since we'd need to port the ROCm Device Libs to use my build system. Beyond that it'd be pretty easy to just default the triple depending on mcpu and -mwavefrontsize.
I guess we couldn't make a helper that is like ST.hasCPU(). The comments for 'generic-hsa' are different because the 'generic' IR in the RTlibs never touches the backend without having a CPU set.
Perhaps we can simply live with starts_with for now, as the check won't be infectious and we'll remove it in the future anyway.
Unsure if we even need to bother checking for 'generic' since that's not what any of the existing targets use for generic AFAIC. It's just not setting -mcpu when you compile, which I think should probably be an error in the backend.
; RUN: opt -mtriple=amdgcn-- -passes='default<O3>' -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT-W64 %s
; RUN: opt -mtriple=amdgcn-- -mcpu=tonga -O3 -S < %s | FileCheck -check-prefix=OPT-W64 %s
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s
This codegen test shouldn't be running all of these passes
It already was, mostly? It seems worthwhile to individualise the possible / plausible scenarios.
Simplified.
Awesome! This is absolutely something that has been on my todo stack for ages and it's very good to see someone else writing the thing. It looks like the implementation is contentious so I'll leave that for the moment. Under some time constraints so please forgive the length of the following - TLDR is I love this and definitely want the feature.

A magic intrinsic which can be summoned from clang and used in things like if statements is the right construct for all the stuff which we know by codegen time and don't really need to care about in the front end. The code object version should probably be reified as one. The number of compute units. All the things currently handled by magic global variables in the rocm device library and relying on O1 to constant fold them out of existence. We do this, throw away the magic globals, everything is better.

The magic intrinsic can have guaranteed constant folding even at O0. That kills that class of "O0 doesn't work" bugs. This means it needs something that is certain to remove it ahead of a simplifycfg pass, or for the pass which removes it to also do the trivial fold of a branch. So it's not just constant folding, though having instcombine also constant fold it is fine as an optimisation.

The real value here in my opinion is towards being able to write IR libraries that don't know or care what target they're going to run on. Either because they're associated with spir-v, or because they're the libc which currently handwaves that problem, or the rocm device libs which handwave it in a slightly different way, or the openmp runtime which currently builds K identical copies of the bitcode with different names in the spirit of correctness and stashes them in an archive. Lots of that is on sketchy QoI ground at present. But if we have a wavesize intrinsic that turns into 32 or 64 once the target is known, and hangs around in the IR until some later information about the target is revealed, we can have single IR and predictable correct semantics. Much better than the status quo.

Thanks Alex!
Commit message should be adjusted, it's talking about the old pass
case Intrinsic::amdgcn_wavefrontsize: {
  // TODO: this is a workaround for the pseudo-generic target one gets with no
  // specified mcpu, which spoofs its wave size to 64; it should be removed.
  if ((ST->getCPU().empty() || ST->getCPU().starts_with("generic")) &&
Still don't know if this is even worth checking being .empty(), but I suppose we could codify that -mcpu="" and -mcpu="generic" are the same thing? My understanding is that generic was some testing thing.
#117410 gives you a way to do this without explicitly looking at the features or CPU.
Fold llvm.amdgcn.wavefrontsize early, during InstCombine, so that its concrete value is used throughout subsequent optimisation passes.