Commit
use bdimy = 1 to WAR smem race (#3423)
When total_reduction_numel <= 1024, the scheduler may place multiple reductions in one block by using bdimy > 1. This leads to a race condition in shared memory when async copy is used. Adding `cp.async.wait_all` after the first async copy avoids the race, but the root cause needs to be understood before that fix can be used safely. So this commit sets bdimy = 1 as a workaround (WAR). It should be reverted after #3438 is merged.

Race detected with:
```
NVFUSER_DUMP=scheduler_params,cuda_to_file NVFUSER_ENABLE=kernel_debug PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool racecheck --racecheck-detect-level info ./nvfuser_tests --gtest_filter='CombinedSchedulerTest.LayerNormBackward/dtype_double_batch_216_hidden_96'
```
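To make the workaround concrete, here is a minimal C++ sketch of the kind of block-shape decision described above. It is not nvFuser's scheduler code: `BlockShape`, `packedShape`, `workaroundShape`, and the packing rule (fill a 1024-thread block with as many reductions as fit) are all hypothetical and assumed for illustration only. The only parts taken from the commit message are the 1024 threshold and the forced `bdimy = 1`.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical types and names; NOT nvFuser's actual scheduler code.
struct BlockShape {
  int64_t bdimx; // threads cooperating on a single reduction
  int64_t bdimy; // independent reductions sharing one block
};

constexpr int64_t kMaxThreadsPerBlock = 1024;

// Assumed pre-workaround heuristic: when a reduction is small, pack
// several of them into one block so the block stays full.
inline BlockShape packedShape(int64_t total_reduction_numel) {
  const int64_t bdimx = std::max<int64_t>(
      1, std::min<int64_t>(total_reduction_numel, kMaxThreadsPerBlock));
  const int64_t bdimy = std::max<int64_t>(1, kMaxThreadsPerBlock / bdimx);
  return {bdimx, bdimy};
}

// Workaround from the commit message: keep bdimx, but force one
// reduction per block so no two reductions share the block's
// shared-memory buffers.
inline BlockShape workaroundShape(int64_t total_reduction_numel) {
  BlockShape shape = packedShape(total_reduction_numel);
  shape.bdimy = 1;
  return shape;
}
```

Under these assumptions, a reduction of 96 elements (the hidden size in the failing test name) would be packed 10 per block by `packedShape`, while `workaroundShape` keeps it at one per block. The trade-off is lower block occupancy for small reductions until the root cause is fixed.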