Make acc matrix allocation on each call for XeTLA GEMM benchmarks #3026
Comparing main (https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12382880184/job/34564504020) with this PR (https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12390456716/job/34585505155), the degradation from this pull request for XeTLA is ~3%.
However, this also points to a potential improvement for the Triton kernel: allocating the accumulation matrix only once. If that is implemented for Triton, this pull request will need to be reverted for XeTLA.
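To illustrate the trade-off, here is a minimal sketch of the two benchmarking styles: allocating the accumulator on every call versus allocating it once and reusing it. It is not the XeTLA or Triton benchmark code. The names `make_gemm_like` and `mean_wall_time` are hypothetical, a `bytearray` stands in for the acc matrix, and the kernel work is reduced to a placeholder write.

```python
import time


def make_gemm_like(n, alloc_per_call):
    """Return a callable standing in for a benchmarked GEMM (hypothetical)."""
    # When reusing, the accumulator is allocated once, outside the timed call.
    shared_acc = None if alloc_per_call else bytearray(n)

    def run():
        # Either allocate (and zero-fill) a fresh accumulator inside the
        # timed region, or reuse the buffer allocated up front.
        acc = bytearray(n) if alloc_per_call else shared_acc
        acc[0] = 1  # placeholder for the actual kernel work
        return acc

    return run


def mean_wall_time(fn, reps=100):
    """Mean wall-clock time per call, in seconds."""
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - start) / reps


per_call = make_gemm_like(1 << 20, alloc_per_call=True)
reused = make_gemm_like(1 << 20, alloc_per_call=False)
t_per_call = mean_wall_time(per_call)
t_reused = mean_wall_time(reused)
```

With per-call allocation the allocation cost is part of every measurement, which is the kind of overhead this pull request adds to the XeTLA numbers so that both sides are measured the same way.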
CI runs:
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12379634873
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12381457034
Wall time is used instead of elapsed_time
(apparently chose the wrong runner)