Compare microkernel benchmark IR against main for PRs #2974

Open
alexbaden opened this issue Dec 10, 2024 · 6 comments
Labels: ci, codegen: mlir, enhancement (New feature or request), upstream: rebase (PR to be up-streamed)

Comments

@alexbaden
Contributor

It would be useful to be able to detect whether or not the generated IR (Triton IR or TTGIR) changes for a given pull request (code change). For example, if the IR changes for a particular microkernel, then we should strongly suspect performance changes for that microkernel. Performance measurements are often noisy, whereas IR changes are essentially binary. This would also be useful when we sync OpenAI commits or make changes to the pass pipeline and want to know all the places that could be affected.

The proposal here is to implement a CI job that runs on every PR and compiles the microkernel benchmark kernels to TTIR/TTGIR using triton.compile (a rough sketch of this step follows the list below). To keep things simple and keep runtime down, we will use the nightly wheels for our "golden reference". A suggested flow is:

  1. Create a script which generates the microkernel benchmark IR. Perhaps we can import the kernels from the microkernel benchmarks directly?
  2. Get the latest Triton nightly wheels and run the microkernel benchmark IR script to generate TTIR/TTGIR for each microkernel. Note that a given microkernel benchmark may actually generate multiple TTIR/TTGIR files. We can start with one and improve as needed.
  3. Run the same script as part of the build_and_test pipeline on the rolling driver.
  4. Diff the IRs.
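
A minimal sketch of step 2, using a hypothetical `add_kernel`: the `ASTSource`/`triton.compile` interface is internal and its exact arguments vary across Triton versions, so treat this as illustrative rather than the final script.

```python
# Rough sketch: dump TTIR/TTGIR for one (hypothetical) microkernel via
# triton.compile. The ASTSource/compile interface is internal and its exact
# arguments differ across Triton versions; adjust to the version in use.
import triton
import triton.language as tl
from triton.compiler import ASTSource


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


# Compile ahead of time for the active backend, without launching the kernel.
src = ASTSource(
    fn=add_kernel,
    signature={"x_ptr": "*fp32", "y_ptr": "*fp32", "out_ptr": "*fp32", "n_elements": "i32"},
    constants={"BLOCK_SIZE": 1024},
)
compiled = triton.compile(src)

# The compiled artifact keeps the intermediate IRs keyed by stage name.
for stage in ("ttir", "ttgir"):
    with open(f"add_kernel.{stage}", "w") as f:
        f.write(compiled.asm[stage])
```

The same script would run once against the nightly wheels and once against the PR build, writing each stage's IR into a separate directory so step 4 can diff them.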

Note that many PRs will be expected to change the IR, so we do not want the job to fail if the IR differs (otherwise we will have a lot of red PRs that are actually fine, and merging might be blocked). But it would be nice to have some way of notifying the user when the IR differs, instead of having to remember to check the output of a given GitHub Actions run.
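
One possible shape for that notification step, as a sketch: diff the two IR dumps with the standard library and build a markdown summary that a later workflow step could post as a PR comment. The directory layout and file names here are made up.

```python
# Sketch: compare baseline vs. PR IR dumps and build a markdown summary.
# The job prints the summary (or writes it to a file) for a later step to
# post on the PR, rather than failing the build when the IR differs.
import difflib
from pathlib import Path


def summarize_ir_diffs(baseline_dir: str, pr_dir: str) -> str:
    sections = []
    for base_file in sorted(Path(baseline_dir).glob("*.ttgir")):
        pr_file = Path(pr_dir) / base_file.name
        if not pr_file.exists():
            sections.append(f"### {base_file.name}\nMissing in PR build.")
            continue
        diff_lines = list(difflib.unified_diff(
            base_file.read_text().splitlines(),
            pr_file.read_text().splitlines(),
            fromfile=f"baseline/{base_file.name}",
            tofile=f"pr/{base_file.name}",
            lineterm="",
        ))
        if diff_lines:
            sections.append(f"### {base_file.name}\n" + "\n".join(diff_lines))
    if not sections:
        return "No TTGIR changes detected."
    return "## TTGIR changes detected\n\n" + "\n\n".join(sections)


if __name__ == "__main__":
    print(summarize_ir_diffs("ir_baseline", "ir_pr"))
```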

@whitneywhtsang
Contributor

The motivation of this issue is good: we want a way to quickly identify whether a performance impact is variance or not.

Note that a given microkernel benchmark may actually generate multiple TTIR/TTGIR files. We can start with one and improve as needed.

What's your expected behavior when there is more than one TTIR/TTGIR?

In general, I worry that this approach may be too fragile and end up reporting differences in most cases, at which point developers will start to ignore the report. MLIR changes all the time; if we use the Triton nightly wheels' output as the baseline, there can be many PRs merged between that build and the PR, or the PR could be based on a commit that predates the nightly wheels.

@alexbaden
Contributor Author

What's your expected behavior when there is more than one TTIR/TTGIR?

Eventually I would like to be able to diff as much IR as possible, but to start I think grabbing one representative set of IRs from each microbenchmark would be OK. I am not sure how best to grab the IR without running the benchmark; figuring that out is part of this issue.

In general, I worry that this approach may be too fragile, and end up reporting differences in most cases, and then developers start to ignore the report.

I expect the reports to be somewhat noisy, but the noise should be fairly easy to scan through if we can highlight the diff appropriately. I think knowing how the IR is changing as we merge from upstream / add features to main is very useful. Do you have another approach that would be less noisy that we could try?

MLIR changes all the time; if we use the Triton nightly wheels' output as the baseline, there can be many PRs merged between that build and the PR, or the PR could be based on a commit that predates the nightly wheels.

Good point; I suggested the Triton nightly wheels because they are already built. But we might want something more recent or more relevant to a given PR. Maybe we can try the Triton nightly wheels to start and consider alternatives (like keeping the builds from every commit in main somewhere, or building the last commit from main for a given PR). We could also make this an ad-hoc job and let the user supply both the PR and the commit to compare, but I'd like it to run automatically so we have a record of IR changes to refer to.

@whitneywhtsang
Contributor

Eventually I would like to be able to diff as much IR as possible, but to start I think grabbing one representative set of IRs from each microbenchmark would be OK. I am not sure how best to grab the IR without running the benchmark; figuring that out is part of this issue.

Just to clarify, do you mean diffing the final MLIR of one input shape?

Do you have another approach that would be less noisy that we could try?

If we reconsider the motivation, which is to have a way to quickly identify whether a performance impact is variance or not, another way to address it would be for CI to automatically determine whether the performance results are within the standard deviation and report only real gains and regressions. Or CI could automatically rerun benchmarks with a potential performance impact.
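
As a rough illustration of that first alternative (the 2-sigma threshold and the sample numbers are made up), the check could be as simple as:

```python
# Sketch: flag a benchmark result only when it falls outside the baseline's
# noise band. The 2-sigma threshold and the sample values are made up.
import statistics


def classify(baseline_runs: list[float], pr_runs: list[float], n_sigma: float = 2.0) -> str:
    base_mean = statistics.mean(baseline_runs)
    base_std = statistics.stdev(baseline_runs)
    pr_mean = statistics.mean(pr_runs)
    if abs(pr_mean - base_mean) <= n_sigma * base_std:
        return "within noise"
    # Assuming a higher-is-better metric such as TFLOPS.
    return "gain" if pr_mean > base_mean else "regression"


# Example: TFLOPS for one GEMM shape over repeated runs.
print(classify([102.0, 101.5, 103.1, 102.4], [95.2, 94.8, 95.9, 95.0]))  # regression
```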

@alexbaden
Contributor Author

The motivation is to be able to determine if a given PR is going to change the TTIR/TTGIR (and possibly even the MLIR). Detecting performance regressions, etc. is just one benefit of knowing whether a change is mutating the IR.

Yes, it seems like most microbenchmarks do parameter sweeps across input shapes, but I imagine there are other possible parametrizations.

@sommerlukas
Contributor

Offline, I mentioned some scripts from the LLVM infrastructure that can help to abstract over SSA value names.

Even if we don't use the scripts themselves, we can maybe learn something from how they detect and abstract SSA value names.
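
As a rough illustration of what such abstraction could look like for textual MLIR (this is not one of the LLVM scripts, just a simplified sketch), SSA values can be renamed to sequential placeholders before diffing:

```python
# Sketch: normalize SSA value names in textual MLIR so that renumbering alone
# does not show up in a diff. Simplified on purpose, e.g. block labels and
# multi-result values are not treated specially.
import re

_SSA_NAME = re.compile(r"%[A-Za-z0-9_$.-]+")


def normalize_ssa_names(ir_text: str) -> str:
    mapping: dict[str, str] = {}

    def rename(match: re.Match) -> str:
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"%v{len(mapping)}"
        return mapping[name]

    return _SSA_NAME.sub(rename, ir_text)


# Two IRs that differ only in value numbering normalize to the same text.
a = "%0 = arith.addi %arg0, %arg1 : i32\n%1 = arith.muli %0, %arg0 : i32"
b = "%5 = arith.addi %arg0, %arg1 : i32\n%7 = arith.muli %5, %arg0 : i32"
assert normalize_ssa_names(a) == normalize_ssa_names(b)
```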

@whitneywhtsang
Contributor

Before starting the effort, we should clearly define the motivation of this work, as it can help decide whether we prefer false positives or false negatives. For example, if we want to ensure no performance regressions, then we cannot accept false negatives.

When comparing IR, it is best to leverage existing tools like lit, if possible.

Not sure if the transformation below is useful here, but FYI:
https://llvm.org/docs/Passes.html#instnamer-assign-names-to-anonymous-instructions

This is a little utility pass that gives instructions names, this is mostly useful when diffing the effect of an optimization because deleting an unnamed instruction can change all other instruction numbering, making the diff very noisy.
