[llvm][AMDGPU] Fold llvm.amdgcn.wavefrontsize early #114481
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
//===- AMDGPUExpandPseudoIntrinsics.cpp - Pseudo Intrinsic Expander Pass --===// | ||
// | ||
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. | ||
// See https://llvm.org/LICENSE.txt for license information. | ||
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
// | ||
//===----------------------------------------------------------------------===// | ||
// This file implements a pass that deals with expanding AMDGCN generic pseudo- | ||
// intrinsics into target specific quantities / sequences. In this context, a | ||
// pseudo-intrinsic is an AMDGCN intrinsic that does not directly map to a | ||
// specific instruction, but rather is intended as a mechanism for abstractly | ||
// conveying target specific info to a HLL / the FE, without concretely | ||
// impacting the AST. An example of such an intrinsic is amdgcn.wavefrontsize. | ||
// This pass should run as early as possible / immediately after Clang CodeGen, | ||
// so that the optimisation pipeline and the BE operate with concrete target | ||
// data. | ||
//===----------------------------------------------------------------------===// | ||
|
||
#include "AMDGPU.h" | ||
#include "AMDGPUTargetMachine.h" | ||
#include "GCNSubtarget.h" | ||
|
||
#include "llvm/IR/Constants.h" | ||
#include "llvm/IR/Function.h" | ||
#include "llvm/IR/IntrinsicsAMDGPU.h" | ||
#include "llvm/IR/Module.h" | ||
#include "llvm/Pass.h" | ||
|
||
using namespace llvm; | ||
|
||
static inline PreservedAnalyses expandWaveSizeIntrinsic(const GCNSubtarget &ST, | ||
Function *WaveSize) { | ||
if (WaveSize->hasZeroLiveUses()) | ||
return PreservedAnalyses::all(); | ||
|
||
for (auto &&U : WaveSize->users()) | ||
U->replaceAllUsesWith( | ||
ConstantInt::get(WaveSize->getReturnType(), ST.getWavefrontSize())); | ||
|
||
return PreservedAnalyses::none(); | ||
} | ||
|
||
PreservedAnalyses | ||
AMDGPUExpandPseudoIntrinsicsPass::run(Module &M, ModuleAnalysisManager &) { | ||
if (M.empty()) | ||
return PreservedAnalyses::all(); | ||
|
||
const auto &ST = TM.getSubtarget<GCNSubtarget>(*M.begin()); | ||
|
||
// This is not a concrete target, we should not fold early. | ||
if (ST.getCPU().empty() || ST.getCPU() == "generic") | ||
return PreservedAnalyses::all(); | ||
|
||
if (auto WS = Intrinsic::getDeclarationIfExists( | ||
&M, Intrinsic::amdgcn_wavefrontsize)) | ||
return expandWaveSizeIntrinsic(ST, WS); | ||
|
||
return PreservedAnalyses::all(); | ||
} |
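The control flow of the pass above can be followed without LLVM in scope. The following is a minimal standalone sketch of the same decision sequence; `ModelModule`, `ModelCall`, `Preserved` and `runModel` are hypothetical stand-ins invented for illustration and are not LLVM types or APIs.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the LLVM types used by the pass; none of these
// names exist in LLVM.
struct ModelCall {
  std::optional<unsigned> FoldedTo; // set once the call is replaced by a constant
};

struct ModelModule {
  bool HasFunctions = true;               // models !M.empty()
  std::string CPU;                        // models ST.getCPU()
  unsigned WaveSize = 64;                 // models ST.getWavefrontSize()
  bool DeclaresWaveSizeIntrinsic = false; // models getDeclarationIfExists(...)
  std::vector<ModelCall> Calls;           // live calls to the intrinsic
};

enum class Preserved { All, None };

// Mirrors the order of checks in the pass: bail out on an empty module, on a
// non-concrete CPU, and when there is nothing to fold; otherwise replace every
// call with the subtarget's wave size.
Preserved runModel(ModelModule &M) {
  if (!M.HasFunctions)
    return Preserved::All;
  if (M.CPU.empty() || M.CPU == "generic")
    return Preserved::All;
  if (!M.DeclaresWaveSizeIntrinsic || M.Calls.empty())
    return Preserved::All;
  for (ModelCall &C : M.Calls)
    C.FoldedTo = M.WaveSize;
  return Preserved::None;
}
```

In this model, a module compiled for "generic" keeps its calls untouched, while one compiled for a named CPU has each call folded to that CPU's wave size.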
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1024,6 +1024,15 @@ GCNTTIImpl::instCombineIntrinsic(InstCombiner &IC, IntrinsicInst &II) const { | |
} | ||
break; | ||
} | ||
case Intrinsic::amdgcn_wavefrontsize: { | ||
// TODO: this is a workaround for the pseudo-generic target one gets with no | ||
// specified mcpu, which spoofs its wave size to 64; it should be removed. | ||
A real solution would be two builds, but spoofing it as 64 works (likely unintentionally) because we don't do any w64-specific changes yet, and w64 can always be narrowed to w32 but not the other way around.

I don't think that this interpretation is actually correct: if you rely on lockstep of a full wave and you optimise around wavesize, this will break in bad ways on wave32. The current

We already do some light 64->32 folds that are only sort of correct. Technically we could make exec_hi an allocatable scratch register in wave32, but what we do now bakes in an assumption that exec_hi must always be 0. But yes, the only way to really avoid any possible edge cases (and support a future of machine linked libraries) requires just having totally separate builds.
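To make the concern in this thread concrete, here is a small standalone sketch, assuming the fold happened against the spoofed wave64 default and the code later ran on a wave32 device. It models the `icmp ugt ..., 32` / `select ... 2, 1` pattern from the `fold_and_optimize_wavefrontsize` test in this PR; the function names are hypothetical and exist only for illustration.

```cpp
#include <cassert>

// Models the select in the fold_and_optimize_wavefrontsize test:
//   %tmp1 = icmp ugt i32 %wavesize, 32
//   %tmp2 = select i1 %tmp1, i32 2, i32 1
constexpr unsigned selectForWaveSize(unsigned WaveSize) {
  return WaveSize > 32 ? 2u : 1u;
}

// The constant that ends up in the IR if the fold is performed while the
// subtarget reports the spoofed wave size of 64.
constexpr unsigned foldedOnSpoofedTarget() { return selectForWaveSize(64); }

// True when the early-folded constant matches what the actual device needs.
constexpr bool earlyFoldIsCorrectFor(unsigned ActualWaveSize) {
  return foldedOnSpoofedTarget() == selectForWaveSize(ActualWaveSize);
}
```

Under these assumptions the folded value is right for a wave64 device and wrong for a wave32 one, which is why the patch skips the fold when no concrete CPU is given.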
||
if ((ST->getCPU().empty() || ST->getCPU() == "generic") && | ||
Less than ideal... I am not sure if there is a way to check that a fixed wavefront size is in the subtarget description and not added as an -mattr?

None that I could find because we spoof the Wave64 in when it's not specified, so the only differentiator that I could think of is that the

I do not really have one. Maybe it is OK for now.

This is really gross. We also do have a "generic-hsa" target-cpu name

Thanks for the input. What is the suggested solution?
||
!ST->getFeatureString().contains("+wavefrontsize")) | ||
The feature string may also contain a -wavefrontsize. It's probably safest to ignore the target-features. If we're really going to rely on this target-cpu hack for the library uses, rocm-device-libs is not using an explicit wavefrontsize feature anymore (all the uses were converted to the ballot wave64->wave32 hack)

You also do have the subtarget already. Should probably move the logic in there, instead of spreading the default CPU logic parsing into a new place

Since this is a temporary hack, what's the point of putting it in the subtarget so that people get ideas, start using it, and then there's even more technical debt? The parsing might have to change anyway once we have a proper generic target (which the current hack is not).

To keep the hack isolated to one place, instead of spreading it around. You've already missed "generic-hsa" for example. The wavesize target parsing is also hacky, and we already have other hacky parsing in the subtarget constructor. We could also implement this by making the generic target actually have 0 wavesize, and replacing the isWave64 predicates with wavesize != 64

That seems reasonable, there's also an argument that the backend likely can't do anything useful without

Well, I don't think we should be doing live design of generic (which is part of what got us here anyway), so I'd rather not build even more technical debt around its current form which was meant to be a test only kludge:

    // The code produced for "generic" is only useful for tests and cannot
    // reasonably be expected to execute on any particular target.

Which is to say I don't want to what is there now, I want it to not break. I've adjusted the check to cover

I'd be fine with multiple builds, but right now the AMDGCN infra doesn't support it very well since we'd need to port the ROCm Device Libs to use my build system. Beyond that it'd be pretty easy to just default the triple depending on mcpu and

I guess we couldn't make a helper that is like

Perhaps we can simply live with

Unsure if we even need to bother checking for 'generic' since that's not what any of the existing targets use for generic AFAIC. It's just not setting
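Two of the ideas raised in this thread, keeping the placeholder-CPU check in one place and letting the generic target report a wave size of 0, can be sketched together in a standalone model. This is only an illustration under stated assumptions: `ModelSubtarget`, `isPlaceholderCPU` and `getFoldableWaveSize` are hypothetical names and do not correspond to members of `GCNSubtarget`, and the list of placeholder CPU names is taken from the comments above rather than from the backend.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical model of a subtarget; not the real GCNSubtarget interface.
class ModelSubtarget {
  unsigned WaveSize; // 0 means "no concrete wave size is known"

public:
  // The placeholder-CPU check lives only here, so callers never parse CPU
  // names themselves.
  ModelSubtarget(std::string_view CPU, unsigned ConcreteWaveSize)
      : WaveSize(isPlaceholderCPU(CPU) ? 0u : ConcreteWaveSize) {}

  // Assumed set of non-concrete names, based on the review discussion.
  static bool isPlaceholderCPU(std::string_view CPU) {
    return CPU.empty() || CPU == "generic" || CPU == "generic-hsa";
  }

  bool isWave32() const { return WaveSize == 32; }
  bool isWave64() const { return WaveSize == 64; }

  // Returns the wave size only when it is safe to fold it into a constant.
  std::optional<unsigned> getFoldableWaveSize() const {
    if (WaveSize == 0)
      return std::nullopt;
    return WaveSize;
  }
};
```

With this shape, the InstCombine fold would reduce to a single query and the CPU-name parsing would not be repeated at each use site. Whether that is preferable to the localized check in the patch is the open question in the thread.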
||
break; | ||
return IC.replaceInstUsesWith(II, ConstantInt::get(II.getType(), | ||
ST->getWavefrontSize())); | ||
rampitec marked this conversation as resolved.
|
||
} | ||
jhuber6 marked this conversation as resolved.
|
||
case Intrinsic::amdgcn_wqm_vote: { | ||
// wqm_vote is identity when the argument is constant. | ||
if (!isa<Constant>(II.getArgOperand(0))) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,4 @@ | ||
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5 | ||
AlexVlx marked this conversation as resolved.
|
||
; RUN: llc -mtriple=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W64 %s | ||
; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize32 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W32 %s | ||
; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,W64 %s | ||
|
@@ -6,48 +7,78 @@ | |
|
||
; RUN: opt -O3 -S < %s | FileCheck -check-prefix=OPT %s | ||
; RUN: opt -mtriple=amdgcn-- -O3 -S < %s | FileCheck -check-prefix=OPT %s | ||
; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s | ||
; RUN: opt -mtriple=amdgcn-- -passes='default<O3>' -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s | ||
; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s | ||
jhuber6 marked this conversation as resolved.
|
||
; RUN: opt -mtriple=amdgcn-- -mcpu=tonga -O3 -S < %s | FileCheck -check-prefix=OPT %s | ||
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s | ||
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s | ||
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT %s | ||
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT %s | ||
; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s | ||
; RUN: opt -mtriple=amdgcn-- -passes='default<O3>' -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s | ||
; RUN: opt -mtriple=amdgcn-- -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT-W64 %s | ||
; RUN: opt -mtriple=amdgcn-- -mcpu=tonga -O3 -S < %s | FileCheck -check-prefix=OPT-W64 %s | ||
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s | ||
This codegen test shouldn't be running all of these passes

It already was, mostly? It seems worthwhile to individualise the possible / plausible scenarios.

Simplified.
||
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1010 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT-W64 %s | ||
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize32 -S < %s | FileCheck -check-prefix=OPT-W32 %s | ||
; RUN: opt -mtriple=amdgcn-- -mcpu=gfx1100 -O3 -mattr=+wavefrontsize64 -S < %s | FileCheck -check-prefix=OPT-W64 %s | ||
|
||
; GCN-LABEL: {{^}}fold_wavefrontsize: | ||
; OPT-LABEL: define amdgpu_kernel void @fold_wavefrontsize( | ||
|
||
; W32: v_mov_b32_e32 [[V:v[0-9]+]], 32 | ||
; W64: v_mov_b32_e32 [[V:v[0-9]+]], 64 | ||
; GCN: store_{{dword|b32}} v{{.+}}, [[V]] | ||
|
||
; OPT: %tmp = tail call i32 @llvm.amdgcn.wavefrontsize() | ||
; OPT: store i32 %tmp, ptr addrspace(1) %arg, align 4 | ||
; OPT-NEXT: ret void | ||
|
||
define amdgpu_kernel void @fold_wavefrontsize(ptr addrspace(1) nocapture %arg) { | ||
; OPT-LABEL: define amdgpu_kernel void @fold_wavefrontsize( | ||
; OPT-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] { | ||
; OPT-NEXT: [[BB:.*:]] | ||
; OPT-NEXT: [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR2:[0-9]+]] | ||
; OPT-NEXT: store i32 [[TMP]], ptr addrspace(1) [[ARG]], align 4 | ||
; OPT-NEXT: ret void | ||
; | ||
; OPT-W64-LABEL: define amdgpu_kernel void @fold_wavefrontsize( | ||
; OPT-W64-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] { | ||
; OPT-W64-NEXT: [[BB:.*:]] | ||
; OPT-W64-NEXT: store i32 64, ptr addrspace(1) [[ARG]], align 4 | ||
; OPT-W64-NEXT: ret void | ||
; | ||
; OPT-W32-LABEL: define amdgpu_kernel void @fold_wavefrontsize( | ||
; OPT-W32-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0:[0-9]+]] { | ||
; OPT-W32-NEXT: [[BB:.*:]] | ||
; OPT-W32-NEXT: store i32 32, ptr addrspace(1) [[ARG]], align 4 | ||
; OPT-W32-NEXT: ret void | ||
; | ||
bb: | ||
%tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0 | ||
store i32 %tmp, ptr addrspace(1) %arg, align 4 | ||
ret void | ||
} | ||
|
||
; GCN-LABEL: {{^}}fold_and_optimize_wavefrontsize: | ||
; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize( | ||
|
||
; W32: v_mov_b32_e32 [[V:v[0-9]+]], 1{{$}} | ||
; W64: v_mov_b32_e32 [[V:v[0-9]+]], 2{{$}} | ||
; GCN-NOT: cndmask | ||
; GCN: store_{{dword|b32}} v{{.+}}, [[V]] | ||
|
||
; OPT: %tmp = tail call i32 @llvm.amdgcn.wavefrontsize() | ||
; OPT: %tmp1 = icmp ugt i32 %tmp, 32 | ||
; OPT: %tmp2 = select i1 %tmp1, i32 2, i32 1 | ||
; OPT: store i32 %tmp2, ptr addrspace(1) %arg | ||
; OPT-NEXT: ret void | ||
|
||
define amdgpu_kernel void @fold_and_optimize_wavefrontsize(ptr addrspace(1) nocapture %arg) { | ||
; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize( | ||
; OPT-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] { | ||
; OPT-NEXT: [[BB:.*:]] | ||
; OPT-NEXT: [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR2]] | ||
; OPT-NEXT: [[TMP1:%.*]] = icmp ugt i32 [[TMP]], 32 | ||
; OPT-NEXT: [[TMP2:%.*]] = select i1 [[TMP1]], i32 2, i32 1 | ||
; OPT-NEXT: store i32 [[TMP2]], ptr addrspace(1) [[ARG]], align 4 | ||
; OPT-NEXT: ret void | ||
; | ||
; OPT-W64-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize( | ||
; OPT-W64-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] { | ||
; OPT-W64-NEXT: [[BB:.*:]] | ||
; OPT-W64-NEXT: store i32 2, ptr addrspace(1) [[ARG]], align 4 | ||
; OPT-W64-NEXT: ret void | ||
; | ||
; OPT-W32-LABEL: define amdgpu_kernel void @fold_and_optimize_wavefrontsize( | ||
; OPT-W32-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] { | ||
; OPT-W32-NEXT: [[BB:.*:]] | ||
; OPT-W32-NEXT: store i32 1, ptr addrspace(1) [[ARG]], align 4 | ||
; OPT-W32-NEXT: ret void | ||
; | ||
bb: | ||
%tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0 | ||
%tmp1 = icmp ugt i32 %tmp, 32 | ||
|
@@ -57,15 +88,31 @@ bb: | |
} | ||
|
||
; GCN-LABEL: {{^}}fold_and_optimize_if_wavefrontsize: | ||
; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize( | ||
|
||
; OPT: bb: | ||
; OPT: %tmp = tail call i32 @llvm.amdgcn.wavefrontsize() | ||
; OPT: %tmp1 = icmp ugt i32 %tmp, 32 | ||
; OPT: bb3: | ||
; OPT-NEXT: ret void | ||
|
||
define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize(ptr addrspace(1) nocapture %arg) { | ||
; OPT-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize( | ||
; OPT-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] { | ||
; OPT-NEXT: [[BB:.*:]] | ||
; OPT-NEXT: [[TMP:%.*]] = tail call i32 @llvm.amdgcn.wavefrontsize() #[[ATTR2]] | ||
; OPT-NEXT: [[TMP1:%.*]] = icmp ugt i32 [[TMP]], 32 | ||
; OPT-NEXT: br i1 [[TMP1]], label %[[BB2:.*]], label %[[BB3:.*]] | ||
; OPT: [[BB2]]: | ||
; OPT-NEXT: store i32 1, ptr addrspace(1) [[ARG]], align 4 | ||
; OPT-NEXT: br label %[[BB3]] | ||
; OPT: [[BB3]]: | ||
; OPT-NEXT: ret void | ||
; | ||
; OPT-W64-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize( | ||
; OPT-W64-SAME: ptr addrspace(1) nocapture writeonly [[ARG:%.*]]) local_unnamed_addr #[[ATTR0]] { | ||
; OPT-W64-NEXT: [[BB:.*:]] | ||
; OPT-W64-NEXT: store i32 1, ptr addrspace(1) [[ARG]], align 4 | ||
; OPT-W64-NEXT: ret void | ||
; | ||
; OPT-W32-LABEL: define amdgpu_kernel void @fold_and_optimize_if_wavefrontsize( | ||
; OPT-W32-SAME: ptr addrspace(1) nocapture readnone [[ARG:%.*]]) local_unnamed_addr #[[ATTR1:[0-9]+]] { | ||
; OPT-W32-NEXT: [[BB:.*:]] | ||
; OPT-W32-NEXT: ret void | ||
; | ||
bb: | ||
%tmp = tail call i32 @llvm.amdgcn.wavefrontsize() #0 | ||
%tmp1 = icmp ugt i32 %tmp, 32 | ||
|
The pass isn't needed now?