Evaluate Profile-Guided Optimization (PGO) and LLVM BOLT #4840
Replies: 16 comments
-
Thank you for the great work around PGO and the instructions. I verified that this project builds with PGO by following them. One quick question: is the test data good enough if I plan to run my current test suite (which has lots of conformance tests), and then run it on some real projects?
-
That's a good question. Usually, test data (mostly unit tests, I guess) is not a good candidate for collecting PGO profiles, since tests try to cover all cases (including rare corner cases), while for PGO you are interested in optimizing the "happy" path of the program. From my experience, good candidates for collecting PGO profiles are representative real-world workloads.
-
Closing as not planned for now. I'm satisfied with the current performance of oxc, and I really don't know how to set this thing up easily 😞
-
@Boshen I just played with PGO locally, and got about a 10% perf improvement across all 6 repos I tested. Both the PGO training and the perf comparison were done using these 6 repositories.

And here is what I added to my `justfile`:

```just
ecosystem_dir := "C:/source/ecosystem"
oxlint_bin := "C:/source/rust/oxc/target/release/oxlint.exe"
threads := "12"
pgo_data_dir := "C:/source/rust/oxc/pgo-data"
llvm_profdata_bin := "~/.rustup/toolchains/1.78.0-x86_64-pc-windows-msvc/lib/rustlib/x86_64-pc-windows-msvc/bin/llvm-profdata.exe"

build-pgo:
    just build-pgo-init
    just oxlint_bin=C:/source/rust/oxc/target/x86_64-pc-windows-msvc/release/oxlint.exe ecosystem
    {{llvm_profdata_bin}} merge -o {{pgo_data_dir}}/merged.profdata {{pgo_data_dir}}
    just build-pgo-final

build-pgo-init $RUSTFLAGS="-Cprofile-generate=C:/source/rust/oxc/pgo-data":
    cargo build --release -p oxc_cli --bin oxlint --features allocator --target x86_64-pc-windows-msvc

build-pgo-final $RUSTFLAGS="-Cprofile-use=C:/source/rust/oxc/pgo-data/merged.profdata -Cllvm-args=-pgo-warn-missing-function":
    cargo build --release -p oxc_cli --bin oxlint --features allocator --target x86_64-pc-windows-msvc

ecosystem:
    -cd "{{ecosystem_dir}}/DefinitelyTyped" && {{oxlint_bin}} --threads={{threads}} --quiet -D all
    cd "{{ecosystem_dir}}/affine" && {{oxlint_bin}} --threads={{threads}} --deny-warnings -c oxlint.json --import-plugin -D correctness -D perf
    cd "{{ecosystem_dir}}/napi-rs" && {{oxlint_bin}} --threads={{threads}} --deny-warnings --ignore-path=.oxlintignore --import-plugin -D correctness -A no-export
    cd "{{ecosystem_dir}}/preact" && {{oxlint_bin}} --threads={{threads}} --deny-warnings -c oxlint.json oxlint src test debug compat hooks test-utils
    cd "{{ecosystem_dir}}/rolldown" && {{oxlint_bin}} --threads={{threads}} --deny-warnings --ignore-path=.oxlintignore --import-plugin
    cd "{{ecosystem_dir}}/vscode" && {{oxlint_bin}} --threads={{threads}} --quiet -D all
```

Paths are specific to my environment, but this should give you a good idea of how to do it. You need to have all the repos cloned into `ecosystem_dir`, and `llvm-profdata` installed via `rustup component add llvm-tools-preview`. The Rust toolchain and target are also hard-coded to Windows in my example, so they would need to be updated. Just a heads up, running the `ecosystem` recipe takes a while.
-
Given @valeneiko's impressive speed-up findings, I think this is worth considering again. It may not be viable to integrate with our CI setup, or the compile time increase may be a blocker, but re-opening this issue so we can at least consider it. @valeneiko Can I ask a favour? Would you be able to run the same kind of test on the parser, and see what (if any) speed-up it gets?
-
@overlookmotel if you can share the command to run the parser, I can do it tomorrow.
-
This is really amazing work! I can see some good potential if people are building a service for really intense workloads. As for oxlint, I'm unsure about adding this costly final release build step for a 10% performance improvement.
-
Agree that this build step is too much for regular CI.
-
By the way, the only reason I've not come back on your request for the command to run the parser is that there isn't one! The parser is only exposed as a Rust crate (and an NPM package, but we shouldn't use that as it's slow due to the cost of serializing the AST to pass it from Rust to JS). So I'll need to build one for you!

If you don't want to wait for me, you know Rust, and are willing, you could probably knock one up yourself pretty quickly. But please feel free to say "no, I don't have time for that". Very much appreciate you testing this out and putting it on our radar, and I'm very willing to do what I can to assist you in testing it further. Just am tied up right now, so it will take me a few days to get to it.

I tend to agree with you on this.
-
@overlookmotel the results are below. Between 0% and 20% faster. The first number is wall time; the second one is cumulative time in just the parser.

You can find the source here:
-
@valeneiko Amazing! Thanks loads for doing this. I suspect the ones which show 0% improvement just don't run long enough for the speed-up to show at millisecond measurement granularity.
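To illustrate the granularity point with made-up numbers: a genuine 10% speed-up on a run of only a few milliseconds can vanish entirely once the timer rounds to whole milliseconds.

```python
# Illustration (hypothetical numbers): a timer with millisecond granularity
# cannot distinguish a 4.0 ms parse from a 3.6 ms parse.

def measured_ms(true_time_ms: float) -> int:
    """Simulate a timer that rounds to whole milliseconds."""
    return round(true_time_ms)

baseline_ms = 4.0            # hypothetical non-PGO parse time
pgo_ms = baseline_ms * 0.9   # a real 10% improvement

print(measured_ms(baseline_ms), measured_ms(pgo_ms))  # prints "4 4"
```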
-
I can suggest running such benchmarks with hyperfine; it will let you get results at the required granularity.
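A sketch of such a hyperfine run (the binary names and arguments here are placeholders, not from this thread):

```shell
# Compare a baseline and a PGO build of the same binary. hyperfine reports
# mean ± stddev with sub-millisecond resolution and handles warmup runs.
hyperfine --warmup 3 \
  './oxlint-baseline --threads=12 --quiet -D all' \
  './oxlint-pgo --threads=12 --quiet -D all'
```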
-
The reason I was interested in the parser is that it is absolutely stuffed full of branching, so there's a lot of room there for incorrect branch prediction to incur costs. I am guessing that a lot of the 10% speed boost that PGO gives the linter comes from PGO reducing branch mis-prediction in the parser (or re-ordering branches so that the commonly taken path is the default). The results above seem to at least partially confirm that hypothesis.

The tricky thing is that the parser is provided as a library, not a binary. So, if I've understood correctly, for external consumers it'd be on them to implement PGO; it's not something we can do at this end in a library. Have I understood that right?

What we could do in the parser is figure out what changes PGO is making to the parser's codegen, and try to replicate the largest gains by manually guiding the non-PGO compiler to do the same thing with manual hints. Is there any way to get a picture of what PGO is doing to the parser, in a format which is feasible to interpret?

2nd question: is there any chance we're overfitting, if the files we're "training" PGO on are the same files that we're then measuring the gain of using PGO on?
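On manually replicating part of what PGO infers: stable Rust exposes the `#[cold]` attribute (and `#[inline]` variants) to mark rarely-taken paths, which is one of the hot/cold hints PGO derives from profiles automatically. A minimal sketch; the token classifier below is invented for illustration and is not oxc code:

```rust
// Sketch: hand-written hot/cold hints, mimicking one thing PGO does
// automatically from profile data.

#[cold]
#[inline(never)]
fn report_syntax_error(byte: u8) -> String {
    // Rarely-taken error path: #[cold] tells LLVM to move this code out of
    // the hot instruction stream and predict branches into it as unlikely.
    format!("unexpected byte 0x{byte:02x}")
}

fn classify_byte(byte: u8) -> Result<&'static str, String> {
    // Hot path first: identifier characters dominate real-world source.
    match byte {
        b'a'..=b'z' | b'A'..=b'Z' | b'_' => Ok("identifier"),
        b'0'..=b'9' => Ok("number"),
        b' ' | b'\t' | b'\n' => Ok("whitespace"),
        _ => Err(report_syntax_error(byte)),
    }
}

fn main() {
    assert_eq!(classify_byte(b'x'), Ok("identifier"));
    assert_eq!(classify_byte(b'7'), Ok("number"));
    assert!(classify_byte(0xFF).is_err());
    println!("ok");
}
```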
-
If we are publishing a pre-built library, we can still PGO-optimize it. We just need something to dynamically link to it (instead of the usual static linking). But yes, if people are building the lib from source, they would need to do the PGO optimization on their side.
-
It's possible to extract some statistics about the most frequently executed functions with `llvm-profdata`. However, if you want to get more insights about the performed optimizations, you need to use a disassembler and take a look at the generated assembly. Then you can try to figure out the difference between the PGOed and non-PGOed versions. This way will take more time, I suppose.
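For reference, the per-function statistics can be pulled out of a merged profile like this (the profile filename is taken from the justfile earlier in the thread):

```shell
# Show the 20 hottest functions recorded in the merged PGO profile.
llvm-profdata show --topn=20 merged.profdata
```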
-
@overlookmotel I just discovered that when compiling with PGO we can also tell LLVM to print out the stats for branch probabilities by adding flags to `RUSTFLAGS`.

There is a long list of options that can be passed to LLVM this way, including one that prints basic block frequency. To discover these flags I have used:

```shell
rustc -Cllvm-args="--help-list-hidden"
```

This whole idea was inspired by this lecture:
-
Hi!

Recently I did many Profile-Guided Optimization (PGO) benchmarks on multiple projects (including many compilers and compiler-like workloads like static analyzers, code formatters, etc.); the results are available here. Since `oxc` is a performance-oriented project, I think PGO can help here too.

We need to evaluate PGO's applicability to `oxc` tooling. And if it helps to achieve better performance, add a note to the documentation about it. In this case, users and maintainers will be aware of another optimization opportunity for `oxc`. Also, PGO integration into the build scripts can help users and maintainers easily apply PGO for their own workloads. Even the binaries distributed by Oxc can be pre-optimized with PGO on a generic-enough sample workload (e.g. `rustc` already does it).

After PGO, I can suggest evaluating LLVM BOLT as an additional optimization step.

For Rust projects, I recommend starting with cargo-pgo; it makes PGO optimization easier in many cases.
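A sketch of the cargo-pgo workflow, per its README (the workload command is a placeholder; substitute a representative run of your own binary):

```shell
cargo install cargo-pgo            # one-time setup; also needs llvm-tools-preview
cargo pgo build                    # build an instrumented binary
./target/<target>/release/oxlint   # run a representative workload to collect profiles
cargo pgo optimize                 # rebuild using the collected profiles
```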