Annotating a function with #[inline] leads to 10x slowdown in criterion #649

Closed
CeleritasCelery opened this issue Feb 1, 2023 · 3 comments

Comments

@CeleritasCelery

CeleritasCelery commented Feb 1, 2023

I am encountering an odd issue where adding the #[inline] annotation to a public function makes it perform 10x worse in the criterion benchmark. The function is a SIMD implementation of character counting. Interestingly, I only see the major slowdown when benchmarking the SIMD version. If I switch it out for the scalar version, the inline annotation has no effect. I am also unable to reproduce this issue on x86_64; it only shows up on AArch64 (Apple Silicon).
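For readers without the repository at hand, the kind of comparison described here can be sketched as two functions with the same body, one of which carries the #[inline] hint. This is only an illustration under assumptions: the names count_chars and count_chars_inline are hypothetical, and the body is a plain scalar counter, not the SIMD code from str_indices that actually shows the slowdown.

```rust
/// Counts the characters in `text` by counting every byte that is not a
/// UTF-8 continuation byte (continuation bytes have the form 0b10xx_xxxx).
/// Hypothetical scalar stand-in, not the str_indices implementation.
pub fn count_chars(text: &str) -> usize {
    text.as_bytes()
        .iter()
        .filter(|&&byte| (byte & 0xC0) != 0x80)
        .count()
}

/// Identical body, but with the `#[inline]` hint, which makes the function
/// eligible for inlining across crate boundaries (for example into a
/// benchmark crate).
#[inline]
pub fn count_chars_inline(text: &str) -> usize {
    text.as_bytes()
        .iter()
        .filter(|&&byte| (byte & 0xC0) != 0x80)
        .count()
}
```

Both functions return the same result for any input; the only difference is where the compiler is allowed to place the generated code, which is the variable under test in the issue.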

Here is the repo/branch used to reproduce the issue.

https://github.com/CeleritasCelery/str_indices/tree/benchmark_issue

Clone that branch and run cargo criterion

My first question is, how can I generate the assembly for the benchmarks? I want to look at the two versions to see what is actually going on. My best guess is that this is a codegen issue.

Do you have any ideas as to why I might be seeing this? I couldn't find any other instances of this in the issue tracker.

I am using Rust 1.67 and criterion 0.4.0.

cessen/str_indices#10

@workingjubilee

Normally rustc supports --emit=asm for this case, but that is hard to support through cargo commands, so most of them simply do not. Your choices are to find a way to reproduce the result without criterion, or to find the object file that criterion produces and run objdump on it.
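The first option, reproducing the measurement without criterion, could look roughly like the sketch below. It is a minimal stand-in under assumptions: count_chars and time_count are hypothetical names, the body is a scalar counter instead of the SIMD function from the issue, and a plain timing loop has none of criterion's warm-up or statistics, so it may not reproduce the same numbers.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical function under test: counts bytes that are not UTF-8
/// continuation bytes, which equals the number of characters.
pub fn count_chars(text: &str) -> usize {
    text.as_bytes()
        .iter()
        .filter(|&&byte| (byte & 0xC0) != 0x80)
        .count()
}

/// Calls `count_chars` `iterations` times and returns the last result
/// together with the total elapsed time. `black_box` keeps the optimizer
/// from hoisting the call out of the loop or discarding its result.
pub fn time_count(text: &str, iterations: u32) -> (usize, Duration) {
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..iterations {
        last = black_box(count_chars(black_box(text)));
    }
    (last, start.elapsed())
}
```

Calling time_count from a small main function and compiling that single file with rustc -O --emit=asm would write the assembly to a .s file, where the loop can be compared between the annotated and unannotated versions.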

@saethlin

saethlin commented Feb 3, 2023

I'm on x86_64. I cannot reproduce a 10x difference, but I can reproduce a smaller difference. I'll post the output of criterion below, with the unimportant lines deleted. The inconsistent indentation is from criterion.

By default I see this:

chars::count/en_10000   time:   [201.76 ns 202.10 ns 202.48 ns]
chars::count_inline/en_10000
                        time:   [311.65 ns 311.99 ns 312.36 ns]

Adding codegen-units = 1 to [profile.release] I get this:

chars::count/en_10000   time:   [199.83 ns 199.99 ns 200.16 ns]
chars::count_inline/en_10000
                        time:   [192.16 ns 192.28 ns 192.41 ns]

Adding codegen-units = 1 and lto = "fat" I get this:

chars::count/en_10000   time:   [309.54 ns 309.66 ns 309.82 ns]
chars::count_inline/en_10000
                        time:   [198.52 ns 198.68 ns 198.83 ns]

The sequence of instructions in the tiny loop that's the target of the benchmark is exactly the same in all cases. I think this benchmark is highly sensitive to some aspect that LLVM doesn't or can't control. The alignment of the loop seems likely.

This is a common problem in microbenchmarking, and I am not aware of any good solutions to it. Emery Berger did a project called Stabilizer, which doesn't work anymore (it's tightly coupled to LLVM internals, and an academic doesn't have the time to keep up with LLVM versions), but its explanation of the problem is pretty good: https://youtu.be/r-TLSBdHe1A There is also a GitHub repo and a paper; they're pretty easy to find if you want to learn more.

@CeleritasCelery

Closed in favor of rust-lang/rust#107617
