Compare microkernel benchmark IR against main for PRs #2974

Open
alexbaden opened this issue Dec 10, 2024 · 6 comments
Labels: ci, codegen: mlir, enhancement (New feature or request), upstream: rebase (PR to be up-streamed)

Comments

@alexbaden
Contributor

It would be useful to be able to detect whether or not the generated IR (Triton IR or TTGIR) changes for a given pull request (code change). For example, if the IR changes for a particular microkernel, then we should strongly suspect performance changes for that microkernel. Performance measurements are often noisy, whereas IR changes are essentially binary. This would also be useful when we sync OpenAI commits or make changes to the pass pipeline and want to know all the places that could be affected.

The proposal here is to implement a CI job that runs on every PR and compiles the microkernel benchmark kernels to TTIR/TTGIR using triton.compile (a rough sketch of this step follows the list below). To keep things simple and keep runtime down, we will use the nightly wheels for our "golden reference". A suggested flow is:

  1. Create a script which generates the microkernel benchmark IR. Perhaps we can import the kernels from the microkernel benchmarks directly?
  2. Get the latest Triton nightly wheels and run the microkernel benchmark IR script to generate TTIR/TTGIR for each microkernel. Note that a given microkernel benchmark may actually generate multiple TTIR/TTGIR files. We can start with one and improve as needed.
  3. Run the same script as part of the build_and_test pipeline on the rolling driver.
  4. Diff the IRs.
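
A minimal sketch of step 2, using a hypothetical `add_kernel`: the `ASTSource`/`triton.compile` interface is internal and its exact arguments vary across Triton versions, so treat this as illustrative rather than the final script.

```python
# Rough sketch: dump TTIR/TTGIR for one (hypothetical) microkernel via
# triton.compile. The ASTSource/compile interface is internal and its exact
# arguments differ across Triton versions; adjust to the version in use.
import triton
import triton.language as tl
from triton.compiler import ASTSource


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


# Compile ahead of time for the active backend, without launching the kernel.
src = ASTSource(
    fn=add_kernel,
    signature={"x_ptr": "*fp32", "y_ptr": "*fp32", "out_ptr": "*fp32", "n_elements": "i32"},
    constants={"BLOCK_SIZE": 1024},
)
compiled = triton.compile(src)

# The compiled artifact keeps the intermediate IRs keyed by stage name.
for stage in ("ttir", "ttgir"):
    with open(f"add_kernel.{stage}", "w") as f:
        f.write(compiled.asm[stage])
```

The same script would run once against the nightly wheels and once against the PR build, writing each stage's IR into a separate directory so step 4 can diff them.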

Note that many PRs will be expected to change the IR, so we do not want the job to fail if the IR differs (otherwise we will have a lot of red PRs that are actually fine, and merging might be blocked). But it would be nice to have some way of notifying the user when the IR differs, instead of having to remember to check the output of a given GitHub Actions run.
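
One possible shape for that notification step, as a sketch: diff the two IR dumps with the standard library and build a markdown summary that a later workflow step could post as a PR comment. The directory layout and file names here are made up.

```python
# Sketch: compare baseline vs. PR IR dumps and build a markdown summary.
# The job prints the summary (or writes it to a file) for a later step to
# post on the PR, rather than failing the build when the IR differs.
import difflib
from pathlib import Path


def summarize_ir_diffs(baseline_dir: str, pr_dir: str) -> str:
    sections = []
    for base_file in sorted(Path(baseline_dir).glob("*.ttgir")):
        pr_file = Path(pr_dir) / base_file.name
        if not pr_file.exists():
            sections.append(f"### {base_file.name}\nMissing in PR build.")
            continue
        diff_lines = list(difflib.unified_diff(
            base_file.read_text().splitlines(),
            pr_file.read_text().splitlines(),
            fromfile=f"baseline/{base_file.name}",
            tofile=f"pr/{base_file.name}",
            lineterm="",
        ))
        if diff_lines:
            sections.append(f"### {base_file.name}\n" + "\n".join(diff_lines))
    if not sections:
        return "No TTGIR changes detected."
    return "## TTGIR changes detected\n\n" + "\n\n".join(sections)


if __name__ == "__main__":
    print(summarize_ir_diffs("ir_baseline", "ir_pr"))
```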

@whitneywhtsang
Contributor

The motivation of this issue is good: we want a way to quickly identify whether a performance impact is variance or not.

Note that a given microkernel benchmark may actually generate multiple TTIR/TTGIR files. We can start with one and improve as needed.

What's your expected behavior when there is more than one TTIR/TTGIR?

In general, I worry that this approach may be too fragile and end up reporting differences in most cases, at which point developers will start to ignore the report. MLIR changes all the time; if we use the Triton nightly wheels' output as the baseline, there can be many PRs merged between that build and the PR, or the PR could be based on a commit that predates the nightly wheels.

@alexbaden
Contributor Author

What's your expected behavior when there is more than one TTIR/TTGIR?

Eventually I would like to be able to diff as much IR as possible, but to start I think grabbing one representative set of IRs from each microbenchmark would be OK. I am not sure how best to grab the IR without running the benchmark; figuring that out is part of this issue.

In general, I worry that this approach may be too fragile, and end up reporting differences in most cases, and then developers start to ignore the report.

I expect the reports to be somewhat noisy, but the noise should be fairly easy to scan through if we can highlight the diff appropriately. I think knowing how the IR is changing as we merge from upstream / add features to main is very useful. Do you have another approach that would be less noisy that we could try?

MLIR changes all the time; if we use the Triton nightly wheels' output as the baseline, there can be many PRs merged between that build and the PR, or the PR could be based on a commit that predates the nightly wheels.

Good point; I suggested the Triton nightly wheels because they are already built. But we might want something more recent or more relevant to a given PR. Maybe we can try the Triton nightly wheels to start and consider alternatives (like keeping the builds from every commit in main somewhere, or building the last commit from main for a given PR). We could also make this an ad-hoc job and let the user supply both the PR and the commit to compare, but I'd like it to run automatically so we have a record of IR changes to refer to.

@whitneywhtsang
Contributor

Eventually I would like to be able to diff as much IR as possible, but to start I think grabbing one representative set of IRs from each microbenchmark would be OK. I am not sure how best to grab the IR without running the benchmark; figuring that out is part of this issue.

Just to clarify, do you mean diffing the final MLIR of one input shape?

Do you have another approach that would be less noisy that we could try?

If we reconsider the motivation, which is to have a way to quickly identify whether a performance impact is variance or not, another way to address it would be for CI to automatically determine whether the performance results are within the standard deviation and report only real gains and regressions. Or CI could automatically rerun benchmarks with a potential performance impact.
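
As a rough illustration of that first alternative (the 2-sigma threshold and the sample numbers are made up), the check could be as simple as:

```python
# Sketch: flag a benchmark result only when it falls outside the baseline's
# noise band. The 2-sigma threshold and the sample values are made up.
import statistics


def classify(baseline_runs: list[float], pr_runs: list[float], n_sigma: float = 2.0) -> str:
    base_mean = statistics.mean(baseline_runs)
    base_std = statistics.stdev(baseline_runs)
    pr_mean = statistics.mean(pr_runs)
    if abs(pr_mean - base_mean) <= n_sigma * base_std:
        return "within noise"
    # Assuming a higher-is-better metric such as TFLOPS.
    return "gain" if pr_mean > base_mean else "regression"


# Example: TFLOPS for one GEMM shape over repeated runs.
print(classify([102.0, 101.5, 103.1, 102.4], [95.2, 94.8, 95.9, 95.0]))  # regression
```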

@alexbaden
Contributor Author

The motivation is to be able to determine if a given PR is going to change the TTIR/TTGIR (and possibly even the MLIR). Detecting performance regressions, etc. is just one benefit of knowing whether a change is mutating the IR.

Yes, it seems like most microbenchmarks do parameter sweeps across input shapes, but I imagine there are other possible parametrizations.

@sommerlukas
Contributor

Offline, I mentioned some scripts from the LLVM infrastructure that can help to abstract over SSA value names.

Even if we don't use the scripts themselves, we can maybe learn something from how they detect and abstract SSA value names.
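
As a rough illustration of what such abstraction could look like for textual MLIR (this is not one of the LLVM scripts, just a simplified sketch), SSA values can be renamed to sequential placeholders before diffing:

```python
# Sketch: normalize SSA value names in textual MLIR so that renumbering alone
# does not show up in a diff. Simplified on purpose, e.g. block labels and
# multi-result values are not treated specially.
import re

_SSA_NAME = re.compile(r"%[A-Za-z0-9_$.-]+")


def normalize_ssa_names(ir_text: str) -> str:
    mapping: dict[str, str] = {}

    def rename(match: re.Match) -> str:
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"%v{len(mapping)}"
        return mapping[name]

    return _SSA_NAME.sub(rename, ir_text)


# Two IRs that differ only in value numbering normalize to the same text.
a = "%0 = arith.addi %arg0, %arg1 : i32\n%1 = arith.muli %0, %arg0 : i32"
b = "%5 = arith.addi %arg0, %arg1 : i32\n%7 = arith.muli %5, %arg0 : i32"
assert normalize_ssa_names(a) == normalize_ssa_names(b)
```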

@whitneywhtsang
Contributor

Before starting the effort, we should clearly define the motivation of this work, as it can help decide whether we prefer false positives or false negatives. For example, if we want to ensure no performance regressions, then we cannot accept false negatives.

When comparing IR, it is best to leverage existing tools like lit, if possible.

Not sure if the transformation below is useful here, but FYI:
https://llvm.org/docs/Passes.html#instnamer-assign-names-to-anonymous-instructions

This is a little utility pass that gives instructions names, this is mostly useful when diffing the effect of an optimization because deleting an unnamed instruction can change all other instruction numbering, making the diff very noisy.
