fix race in async copy #3438
base: main
Conversation
!test
!test --diff
All code diffs are due to the newly added …
All H100 failures are due to CI script/hardware issues; a rerun is pending.
Can you please show what the typical code pattern looks like before and after this?
The code change is simple, just adding a …. Typical code is:
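A minimal sketch of the pattern after this PR (hypothetical names and indexing, not the actual nvFuser-generated code; `cp.async` needs sm_80+):

```cuda
// Sketch only: predicated async copy into shared memory, wait, barrier, read.
// Assumes blockDim.x <= 128.
__global__ void copy_then_read(const float* gmem_in, float* gmem_out) {
  __shared__ float smem[128];
  if (threadIdx.y == 0) {  // hypothetical thread predicate: only these threads copy
    unsigned dst = static_cast<unsigned>(
        __cvta_generic_to_shared(&smem[threadIdx.x]));
    asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n" ::
                 "r"(dst), "l"(gmem_in + threadIdx.x));
  }
  // After this PR: wait for the issuing thread's copies to land in shared memory...
  asm volatile("cp.async.wait_all;\n" ::: "memory");
  // ...and only then let the barrier make the data visible to the reading threads.
  __syncthreads();
  gmem_out[threadIdx.y * 128 + threadIdx.x] = smem[threadIdx.x];
}
```

Before this PR, the `cp.async.wait_all` could instead be emitted on the other side of the `__syncthreads()`.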
Ref: |
Why did we insert the wait instruction only for T2 but not for T3?
The original wait is for both T2 & T3, because it waits for the completion of all previously issued load instructions. Originally, we inserted the wait before the first read of T3, so it waits for both T2 & T3.
After this PR, we insert the wait before …
Here is the full kernel for the newly added test:
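A hypothetical kernel-shaped sketch (not the actual test kernel) with two shared-memory buffers, named T2 and T3 only for illustration, showing why a single wait before the first read covers both copies: `cp.async.wait_all` waits for every async copy previously issued by the executing thread.

```cuda
// Sketch only (sm_80+): one wait_all covers every cp.async issued earlier by this thread.
// Assumes blockDim.x <= 128.
__global__ void two_buffers(const float* in2, const float* in3, float* out) {
  __shared__ float T2[128];
  __shared__ float T3[128];
  unsigned d2 = static_cast<unsigned>(__cvta_generic_to_shared(&T2[threadIdx.x]));
  unsigned d3 = static_cast<unsigned>(__cvta_generic_to_shared(&T3[threadIdx.x]));
  asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n" :: "r"(d2), "l"(in2 + threadIdx.x));
  asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n" :: "r"(d3), "l"(in3 + threadIdx.x));
  // One wait before the first shared-memory read is enough: it blocks until both
  // the T2 copy and the T3 copy issued above have completed.
  asm volatile("cp.async.wait_all;\n" ::: "memory");
  out[threadIdx.x] = T2[threadIdx.x] + T3[threadIdx.x];
}
```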
For the newly added case, which doesn't require a thread predicate:
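As a hedged, hypothetical illustration of the case handled by `cpasync_wait_before_` (an async copy that needs no thread sync because each thread reads only what it copied itself):

```cuda
// Sketch only (sm_80+): writer and reader are the same thread, so a per-thread
// wait suffices and no __syncthreads() is required. Assumes blockDim.x <= 128.
__global__ void per_thread_copy(const float* in, float* out) {
  __shared__ float smem[128];
  unsigned dst = static_cast<unsigned>(__cvta_generic_to_shared(&smem[threadIdx.x]));
  asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n" :: "r"(dst), "l"(in + threadIdx.x));
  asm volatile("cp.async.wait_all;\n" ::: "memory");  // no barrier needed here
  out[threadIdx.x] = smem[threadIdx.x];
}
```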
When total_reduction_numel <= 1024, the scheduler may use multiple reductions per block with bdimy > 1, which leads to a race condition in shared memory when using async copy. Adding `cp.async.wait_all` after the 1st async copy can avoid the race, but we need to figure out the root cause before we can safely use it. So, here we set bdimy = 1 as a WAR. Should be reverted after #3438 is merged. Race detected with:
```
NVFUSER_DUMP=scheduler_params,cuda_to_file NVFUSER_ENABLE=kernel_debug PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool racecheck --racecheck-detect-level info ./nvfuser_tests --gtest_filter='CombinedSchedulerTest.LayerNormBackward/dtype_double_batch_216_hidden_96'
```
Why is the thread predicate related here? I have a feeling that this PR "fixes" the issue only because, for this specific example, the problematic schedule happens to have a thread predicate. For example, assume that there is a fusion where …
More precisely, … generates a kernel …
Fix #3428
What's in this PR?
Revise `ReadAfterWriteSyncs` to:
- `cpasync_wait_before_` only handles cpasync that doesn't need thread syncs.
- `sync_before_` handles regular syncs and cpasync that requires thread syncs.

Why?
Before this fix, `cpasync_wait_before_` also handled cpasync with thread syncs, which could lead to a case where `cp.async.wait_all` is inserted after `__syncthreads()`, causing the race condition seen in #3428.
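To illustrate why that ordering races, a hedged sketch (not the actual codegen): `cp.async.wait_all` only waits for copies issued by the executing thread, so placing it after the barrier does not protect the threads that only read.

```cuda
// Racy ordering that could previously be emitted (sketch only, sm_80+).
// Assumes blockDim.x <= 128.
__global__ void racy_order(const float* in, float* out) {
  __shared__ float smem[128];
  if (threadIdx.y == 0) {  // hypothetical thread predicate: only these threads copy
    unsigned dst = static_cast<unsigned>(__cvta_generic_to_shared(&smem[threadIdx.x]));
    asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n" :: "r"(dst), "l"(in + threadIdx.x));
  }
  __syncthreads();  // readers can pass here while the copy is still in flight
  asm volatile("cp.async.wait_all;\n" ::: "memory");  // no-op for threads that issued no copy
  out[threadIdx.y * 128 + threadIdx.x] = smem[threadIdx.x];  // races with the pending cp.async
}
```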