-
Notifications
You must be signed in to change notification settings - Fork 939
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Subgroup Operations #4190
Subgroup Operations #4190
Conversation
414f844
to
c9df3cd
Compare
I'm personally super excited for subgroup support, thank you so much for working on it! If you're interested, the other parts of subgroup support I would find valuable are:
|
To give you a bit of help in the correct direction: To determine if we should split up features, it depends on what devices in the wild already support. You can check the metal feature tables, the d3d documentation, and use this tool https://github.com/kainino0x/gpuinfo-vulkan-query to determine how the subgroup ops should be grouped. Additionally, definitely write some tests that check the functionality. It's the most sure fire way to make sure things are working cross platform. |
Oh and ofc, this is awesome! Thank you for working on it! |
First of all, thank you for working on this. It is a highly anticipated feature for me. About the feature fragmentation, looking at what the other APIs did:
So Vulkan has the most feature fragmentation and DirectX the least. I would personally prefer only a single feature to cover all operations, because these are all related (by the concept of subgroups) and splitting them does not make it easier for developers. The trade-off being that some devices which could be partially supported won't be supported at all, but it seems such devices are rather rare. |
@exrook I continued your work on this branch: https://github.com/Lichtso/wgpu/tree/subgroup_operations |
b5a67eb
to
6dbd21e
Compare
6dbd21e
to
2cd83cc
Compare
59c44dc
to
f67005e
Compare
I've decided to provide one feature bit per shader stage to require support for subgroup operations. i.e. we'll have:
Each feature bit requires support of the following operations in vulkan terms: annotated gpuinfo vulkan query resultsOperations basic, vote, arithmetic, ballot, shuffle, shuffle_relativeIt seems the below that the underlying hardware all supports these operations, these reports are just from older driver versions.
Compute stage supportAfter removing devices lacking support for our selected operations. If the device supports any subgroup operations at all they must be supported in compute stage according to vulkan standard anyways.
Fragment stage support
Vertex stage supportThese also seem to only lack support due to old drivers. To be clear, this is after devices have been removed by the previous requirements.
Support on VulkanSupport for these subgroup operations in compute shaders seems pretty universal among devices that already meet WebGPU minimum requirements. Fragment and vertex subgroup operations also seem widely supported, if you have the latest drivers. Support on older Apple GPUsThe Apple A13 GPU is interesting in that it supports shuffle operations but not reduction operations. In theory one could provide an implementation of reductions in terms of shuffle (permutes in MSL) operations, like how mesa does for older AMD gpus, though I'll leave that exercise to someone else 😆 . On MoltenVK A12 and A11 also advertise subgroup operations besides reductions, but report a subgroup size of 4 so really these are just quad operations MoltenVK is pretending are subgroup operations. These gpus do not support any real subgroup operations, so we couldn't do the emulation above either. A separate group of flags for quad operations may be useful for users wanting to implement algorithms targeting these older devices. Vulkan devices supporting the above operations but not quad operations
Vulkan devices supporting the above operations but not clustered operations
|
f67005e
to
af5415e
Compare
Interesting that the mac CI is now failing with a shader compiler error on MoltenVK after this reorganization:
The MoltenVK test was previously being skipped |
08227ef
to
21549e6
Compare
Forcing MoltenVK debug on when running the test gets it to print out the translated MSL, which seems to compile fine when pasted into shader-playground, so I've got no idea what's going wrong here :) |
…sl-out TODO: metal out, figure out what needs to be done in validation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small changes from wgpu side
Please re-request a review from me once the changes are addressed to make sure I see it!
The control flow tests immediately pay off with a failure on Metal :) Minimized example: WGSL source@group(0)
@binding(0)
var<storage, read_write> storage_buffer: array<u32>;
@compute
@workgroup_size(128)
fn main(
@builtin(global_invocation_id) global_id: vec3<u32>,
@builtin(subgroup_size) subgroup_size: u32,
@builtin(subgroup_invocation_id) subgroup_invocation_id: u32,
) {
var value = 0u;
if subgroup_invocation_id % 2u == 0u {
value = subgroupAdd(1u);
} else {
value = subgroup_size / 2u;
}
storage_buffer[global_id.x] = value;
} naga MSL output// language: metal1.0
#include <metal_stdlib>
#include <simd/simd.h>
using metal::uint;
struct _mslBufferSizes {
uint size0;
};
typedef uint type_1[1];
struct main_Input {
};
kernel void main_(
metal::uint3 global_id [[thread_position_in_grid]]
, uint subgroup_size [[threads_per_simdgroup]]
, uint subgroup_invocation_id [[thread_index_in_simdgroup]]
, device type_1& storage_buffer [[user(fake0)]]
, constant _mslBufferSizes& _buffer_sizes [[user(fake0)]]
) {
uint value = 0u;
if ((subgroup_invocation_id % 2u) == 0u) {
uint unnamed = metal::simd_sum(1u);
value = unnamed;
} else {
value = subgroup_size / 2u;
}
uint _e16 = value;
storage_buffer[global_id.x] = _e16;
return;
} This program results in a storage buffer with contents Interestingly, moving the store inside the if subgroup_invocation_id % 2u == 0u {
storage_buffer[global_id.x] = subgroupAdd(1u);
} else {
storage_buffer[global_id.x] = subgroup_size/2u;
} Additionally, adding a dummy store to one of the branches also forces the correct behavior: if subgroup_invocation_id % 2u == 0u {
storage_buffer[global_id.x] = 0u;
value = subgroupAdd(1u);
} else {
value = subgroup_size/2u;
}
storage_buffer[global_id.x] = value; Substituting Given that the metal shaders we produce seem to be correct as far as I can tell, this looks like a bug in metal where the compiler is incorrectly reordering subgroup operations with respect to our desired control flow, producing resultant control flow similar to below: var value = 0u;
let sum = subgroupAdd(1u);
let size_half = subgroup_size / 2u;
if subgroup_invocation_id % 2u == 0u {
value = sum;
} else {
value = size_half;
}
storage_buffer[global_id.x] = value; I'm not really an apple user, but maybe someone else knows how we could report a bug to metal about this behavior? |
Interestingly, inserting a |
Putting in (any) global memory operation on a conditional branch probably introduces a memory barrier under the hood. And yes, figuring out the precise semantics of control-flow and memory barriers will be tough, especially because it seems to me that the developers of the underlying APIs did not think them through either. I already hinted at the barrier problem in the official spec PR because that proposal is currently missing them. |
Not much we can do about it if metal is miscompiling even the most basic of shader. This sounds like it should be an expected failure with a corresponding issue. Preferably the infra doing the testing will assume that metal will be wrong, based on the expected miscompilation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems we still have some CI failures
@exrook are you planning to finish this PR? I would love to be able to use subgroups. |
@JMS55 here's an update of the status of this PR as I see it The CI failures are due to bugs in the backends implementation of subgroups (lavapipe & metal) that are unrelated to our mapping from wgsl. I've done some investigation into the underlying causes but wasn't able to find anything. I'll disable the triggering tests for now, I don't believe they should block this getting merged. That being said I believe the Naga side of this PR has still not been reviewed, so that's still a blocker for merging. We could also consider updating the function names to match the draft webGPU subgroups proposal, though personally I prefer the names already used in this PR for a few of the operations that differ. This PR also does not currently implement the quad operations specified in the proposal, I don't really have a desire to use these but if someone would like to use them I'd be able to add them. |
What's the issue on lavapipe? I would expect lavapipe to be conformant.
All test expectations should be as narrowly scoped as possible, so we assert that we will get this exact result on lavapipe and on metal. |
We are missing clustered operations, and if I am not mistaken quad operations (as they can be used outside of fragment shaders) are just a special case of 4-way-clustered operations. IIRC, metal has no general clustered operations (other than quads), so I am ok with leaving them out for now. The differences to the webGPU draft are:
|
Thanks for the summary, I will try to review this next week. |
Glad to see movement on this PR! |
As a followup to this PR, it would also be great to get a cooperative matrix multiplication extension, for neural net applications like burn. |
https://github.com/Lichtso/wgpu/tree/subgroup_feature @teoxoy I merged trunk into it as requested and fixed the tests. So it is ready for review now. |
@Lichtso could you submit a new PR against that branch, and we'll go from there? |
Checklist
cargo clippy
.cargo clippy --target wasm32-unknown-unknown
if applicable.Connections
merged with gfx-rs/naga#2523
Closes: #4428
Description
Allows wgsl shaders to perform subgroup operations
Adds wgpu features:
SUBGROUP_COMPUTE
SUBGROUP_FRAGMENT
SUBGROUP_VERTEX
Each feature enables
BASIC
,VOTE
,ARITHMETIC
,SHUFFLE
, andSHUFFLE_RELATIVE
operations to be used in their respective stage. See #4190 (comment) for a short discussion of the portability of this feature set.Adds new naga capability
SUBGROUP
, required for a shader to use subgroup builtin functions or parameters.Adds new naga validator settings
subgroup_operations
andsubgroup_stages
determining which sets of the below subgroup operations are valid, and which stages they are valid in.BASIC
operations:VOTE
operations:ARITHMETIC
operations:BALLOT
operations:SHUFFLE
operations:SHUFFLE_RELATIVE
operationsNew builtins:
Testing
Naga snapshot tests for wgsl, spv input
wgpu gpu test exercising the feature
Thanks @Lichtso for many contributions to this work!