-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RangeInclusive iteration performance improvement. #57378
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @rkruppe (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
The failure seems legitimate, LLVM apparently fails to constant-fold the loop over a I've reproduced the issue on the playground (https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=c6ad080bf6386dab551c2bed1ad6dbfb); so I'll have to fiddle with this to understand what is blocking LLVM. |
This is somewhat surprising, given that fn foo3c(n: u64) -> u64 {
let mut count = 0;
(0..n).for_each(|_| {
(0..n).chain(::std::iter::once(n)).rev().for_each(|j| {
count += j;
})
});
count
} constant-folds just fine 🤔 |
Alright, let's go nuts: https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=c23c205c5f6dcdfeb958c2a6cf83ecdb .
At this point, I'm really wondering what trips up LLVM. Note: use of explicit @rkruppe : I am thinking that this test, as written, is bad. Whether LLVM const-fold or not seems to have no relation to the "tightness" of the generated LLVM IR, or its overall performance. It seems that we would be better serve by a check which actually verifies the number of conditional jump involved in the inner loop, rather than using const-folding as a proxy for performance. |
Cross-reference: #56563 |
Performance discussion should be accompanied by benchmarks, so I put together a number of benchmarks and used criterion to evaluate the relative performance of:
And the results are the following:
(see gist for details of each benchmark, I reported only the black-hole cases: https://gist.github.com/matthieu-m/df8dcfed3e23ca83ea5abf9e7b3ca4d3) This yields two conclusions:
Also, it is notable that LLVM's closed formula transformation kicks in for I guess either understanding or fixing this hole is the key to getting an implementation of inclusive ranges which both yields good assembly and let LLVM perform the closed formula transformation. In the absence of such understanding/fixing, I would tend to prefer better straightforward assembly at the expense of the closed formula transformation: it is easier for the user to substitute a closed formula rather than re-implement an inclusive range, and I am doubtful that a closed formula exists in many cases. I also have to revise my statement about performance; while on the Add Mul example, inclusive ranges perform as good as exclusive one, there is still some overhead remaining in the Pythagorean Triples case. It may simply be the slight overhead of the inner loop magnified by the number of times it is executed, of course, and this PR still significantly improves performance: from x1.91 to x1.27 slow-down. Does anyone have any idea as to what could prevent LLVM from effecting the closed formula transformation? |
@kennytm As the author of the current version of |
@matthieu-m I haven't investigated what causes LLVM to const-fold a loop, it just happened that the test works after tweaking the representation and putting |
Updated performance number after specializing
This reinforces the conclusion that the A custom Unfortunately, it does nothing to improve the performance of "simple" loops using external iteration, where LLVM just fails to perform Loop Splitting and subsequently to transform the loop into a closed form. My experiments with Loop Splitting have found it extremely finicky, with very similar cases falling on either side of the divide. This is pretty frustrating 😢 1 The performance penalty observed is specific to the absence of Loop Splitting by LLVM; however in interior iteration we can manually split the loop between the loop itself and either a header or trailer, thereby gaining all our due performance without relying on getting lucky during optimizations. 2 As a more general note, it also means that (a) it is likely beneficial to implement a specialized |
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
Many were -- including |
Sorry, it doesn't seem like I'll be able to give this PR proper attention in the near future. Please assign someone else. |
Specialize Iterator::try_fold and DoubleEndedIterator::try_rfold to improve code generation in all internal iteration scenarios. This changes brings the performance of internal iteration with RangeInclusive on par with the performance of iteration with Range: - Single conditional jump in hot loop, - Unrolling and vectorization, - And even Closed Form substitution. Unfortunately, it only applies to internal iteration. Despite various attempts at stream-lining the implementation of next and next_back, LLVM has stubbornly refused to optimize external iteration appropriately, leaving me with a choice between: - The current implementation, for which Closed Form substitution is performed, but which uses 2 conditional jumps in the hot loop when optimization fail. - An implementation using a "is_done" boolean, which uses 1 conditional jump in the hot loop when optimization fail, allowing unrolling and vectorization, but for which Closed Form substitution fails. In the absence of any conclusive evidence as to which usecase matters most, and with no assurance that the lack of Closed Form substitution is not indicative of other optimizations being foiled, there is no way to pick one implementation over the other, and thus I defer to the statu quo as far as next and next_back are concerned.
80aa9e4
to
eb5b096
Compare
Unfortunately, I have yet to find a way to get LLVM to play nice with external iteration. I'll open another PR to improve internal iteration; as force-push corrupted this one, it seems. |
The current implementation of Iterator::{next, next_back} for
RangeInclusive leads to sub-optimal performance of loops as LLVM is not
capable of splitting the loop into a first-pass initialization
(computing is_empty) followed by the actual loop. This results in each
iteration performing two conditional jumps, which not only impacts the
performance of unoptimized loops, but also inhibits unrolling and
vectorization.
The proposed implementation switches things around, performing extra
work only on the last iteration of the loop. This results in even
unoptimized loops performing a single conditional jump in all but the
last iteration, matching Range's performance, as well as letting LLVM
unroll and vectorize when it would do so for Range's loop.
As a result, it should make iterating on inclusive ranges as fast as
iterating on exclusive ones; avoiding a papercut performance pitfall.
Unfortunately, it also appears to foil LLVM Loop Splitting optimization.