optimize to_byte_idx #17
Conversation
Previously it was dividing the char count by `MAX_ACC`, which is 255 for the SIMD functions. This meant that a char index of 1000 would only run the fast path for 3 chunks, and anything below 255 would skip the fast path completely. Fixing this led to a 40% improvement on all non-trivial benchmarks.
Thanks for this! I'm currently on an extended vacation, so unless this is urgent I'll get to it some time in April. Sorry for the delay!
Sorry for the delay on this! As I mentioned in the other PR, I ended up moving, etc., and it's taken me a while to settle in and get back into rhythm.
Overall this looks great. Just a few nits.
Done. Glad you are back!
Thanks!
I got a chance to optimize `to_byte_idx`. There are a few things I changed here:

**small string fast path**
Added a fast path for small strings. We already had this for `chars::count`. This resulted in about a 10% speedup on the trivial strings.
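As a rough sketch of the idea (assuming a 16-byte chunk width; `CHUNK_SIZE` and the scalar loop are illustrative, not the crate's actual internals), the fast path simply bypasses the SIMD setup for inputs smaller than one chunk:

```rust
const CHUNK_SIZE: usize = 16; // assumed chunk width

fn to_byte_idx(text: &str, char_idx: usize) -> usize {
    let bytes = text.as_bytes();

    // Small-string fast path: for inputs shorter than one chunk, the SIMD
    // setup cost outweighs its benefit, so count chars with a scalar scan.
    if bytes.len() < CHUNK_SIZE {
        let mut chars_seen = 0;
        for (byte_idx, &b) in bytes.iter().enumerate() {
            // A byte starts a new char unless it's a UTF-8 continuation
            // byte (0b10xx_xxxx).
            if (b & 0xC0) != 0x80 {
                if chars_seen == char_idx {
                    return byte_idx;
                }
                chars_seen += 1;
            }
        }
        return bytes.len();
    }

    // ...otherwise fall through to the SIMD path (omitted here).
    unimplemented!()
}
```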
**fixed chunking bug**

I found a performance bug where `max_round_length` was being divided by `MAX_ACC` instead of `SIZE`. This made the fast path much shorter than it needed to be. For example, if your `char_idx` was 1000, the old logic would only use the fast path for the first 3 chunks (48 bytes), and if `char_idx` was below 255 it would skip the fast path entirely. Fixing this one line resulted in up to a 40% speedup on some benchmarks.
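To make the arithmetic concrete, here is a reconstruction of the two bounds (not the crate's literal code; `SIZE` = 16 follows from "3 chunks (48 bytes)" above, and `MAX_ACC` = 255 is given):

```rust
// SIZE is the chunk width in bytes; MAX_ACC is how many rounds the
// saturating SIMD accumulator can run before it must be flushed.
const SIZE: usize = 16;
const MAX_ACC: usize = 255;

// Buggy bound: dividing by MAX_ACC made the fast path far too short.
fn max_round_len_buggy(char_idx: usize) -> usize {
    char_idx / MAX_ACC
}

// Fixed bound: a chunk of SIZE bytes can contain at most SIZE chars, so
// processing char_idx / SIZE chunks can never count past the target index.
fn max_round_len_fixed(char_idx: usize) -> usize {
    char_idx / SIZE
}

fn main() {
    // char_idx = 1000: the bug limited the fast path to 3 chunks (48 bytes);
    // the fix allows 62 chunks (992 bytes).
    assert_eq!(max_round_len_buggy(1000), 3);
    assert_eq!(max_round_len_fixed(1000), 62);
    // char_idx < 255: the buggy bound is 0, skipping the fast path entirely.
    assert_eq!(max_round_len_buggy(200), 0);
}
```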
**loop unrolling**

This is the same approach that was used last time: unrolling the loop for faster performance. Now the routine goes brrrr and I am getting over 50 GB/s on my ARM machine.
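As a rough scalar illustration of the unrolling pattern (the real hot loop is SIMD, and the block width here is an assumption), each iteration handles several independent blocks, which cuts loop overhead and gives the CPU independent work to run in parallel:

```rust
fn count_chars_unrolled(bytes: &[u8]) -> usize {
    // A byte starts a char unless it's a UTF-8 continuation byte.
    #[inline(always)]
    fn char_starts(block: &[u8]) -> usize {
        block.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
    }

    const BLOCK: usize = 16;
    let mut count = 0;
    let mut chunks = bytes.chunks_exact(BLOCK * 4);

    // Unrolled by 4: each iteration processes four independent blocks.
    for quad in &mut chunks {
        count += char_starts(&quad[..BLOCK]);
        count += char_starts(&quad[BLOCK..BLOCK * 2]);
        count += char_starts(&quad[BLOCK * 2..BLOCK * 3]);
        count += char_starts(&quad[BLOCK * 3..]);
    }

    // Tail that didn't fill a full group of four blocks.
    count + char_starts(chunks.remainder())
}

fn main() {
    let s = "こんにちは".repeat(100);
    assert_eq!(count_chars_unrolled(s.as_bytes()), s.chars().count());
}
```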
**benchmark text file path**
Lastly, I updated the benchmarks so that they can be run from any directory instead of just the crate root.
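One common way to do this in a Rust crate, presumably along these lines (the `benches/text` layout and helper name are assumptions), is to resolve data files against the crate's manifest directory rather than the working directory:

```rust
use std::path::PathBuf;

// CARGO_MANIFEST_DIR is set by Cargo at compile time, so this resolves
// relative to the crate root no matter where `cargo bench` is invoked.
fn bench_data_path(file: &str) -> PathBuf {
    PathBuf::from(env!("CARGO_MANIFEST_DIR"))
        .join("benches")
        .join("text")
        .join(file)
}
```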
benchmarks
Everything looks like it improved except for `jp_0102`, which shows an 8% performance regression on both ARM and x86. I haven't been able to explain the regression.

**arm benchmarks**

**x86 benchmarks**
Note that the x86 results are from a really old Intel Core 2 Duo E8500 running Linux.