inversion performance improvements #84
Thanks for the suggestions! I spent a while optimizing the …

Ah yes, decoding is an area that I haven't focused on optimizing very much.
After "First Phase", the In order to create the submatrix After this all that is left to do is back solving |
Ah yes, I think I understand your suggestion.
I do this by ignoring all columns left of … (see line 665 in 95b6b5a).
Oh yes, I think I see how this works, and I don't have this optimization. Do you know why the RFC recommends performing the Third Phase by doing this?
I followed the RFC (see line 695 in 95b6b5a).
Because converting the … However, like I mentioned in my initial comment:
Instead of maintaining the matrix … This effectively combines the 3rd, 4th, and 5th phases, as you surmised, but without requiring the …
Got it, right. Thanks for explaining that! Hopefully I'll have time this weekend to try implementing this and see how much it improves performance :)
@sleepybishop, I started on this today, but after reading through the RFC & my implementation again, I'm not sure how to …
Sure, after "First Phase", the
These steps avoid the need to track changes to the |
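For anyone skimming, a minimal sketch of what the back-solving step looks like in GF(2), assuming the earlier phases left an upper-triangular submatrix. The layout and names here are hypothetical; the real code applies the same eliminations to the symbol data rather than to an explicit boolean matrix:

```rust
/// Back substitution on an upper-triangular GF(2) system, applied
/// directly to the symbols: for each row from bottom to top, XOR
/// away the already-solved symbols to the right of the diagonal.
/// Hypothetical sketch, not the crate's actual data layout.
fn back_solve(upper: &[Vec<bool>], symbols: &mut [Vec<u8>]) {
    let n = upper.len();
    for i in (0..n).rev() {
        for j in (i + 1)..n {
            if upper[i][j] {
                // symbols[i] ^= symbols[j], byte by byte
                let (left, right) = symbols.split_at_mut(j);
                for (a, b) in left[i].iter_mut().zip(&right[0]) {
                    *a ^= *b;
                }
            }
        }
    }
}
```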
Ah yes, I think I understand now. Thanks for explaining it in detail! I'll try it out.
Just implemented this and got a nice ~15% speedup: #87. Thanks for the help! Do you know of any tricks for optimizing the graph selection for r=2 in the first phase? I've already optimized the data structures quite heavily, but that's still the most expensive processing step. I was thinking maybe I could cache the results, but the elimination process of the first phase seems to frequently create new rows with r=2, which invalidates the cached graph.
Good work!

One optimization is to use the transpose of … Beyond that, it is possible to prune and grow the graph as needed without recomputing it; however, graph maintenance is still relatively expensive.
Ah yep, I already keep an index of values by column to speed up those adjustments. Ya, I already tried doing graph maintenance for each operation, but that turned out to be slower than just recomputing for each r=2 selection.
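For reference, a minimal sketch of the kind of per-column index described above, assuming a simple set-based layout (the names and representation are hypothetical; the crate's actual sparse matrix types differ):

```rust
use std::collections::BTreeSet;

/// Row-major sparse binary matrix plus a per-column index of the
/// rows holding a nonzero, so column scans avoid a full row sweep.
struct IndexedMatrix {
    rows: Vec<BTreeSet<usize>>, // nonzero columns, per row
    cols: Vec<BTreeSet<usize>>, // nonzero rows, per column (the index)
}

impl IndexedMatrix {
    fn new(height: usize, width: usize) -> Self {
        IndexedMatrix {
            rows: vec![BTreeSet::new(); height],
            cols: vec![BTreeSet::new(); width],
        }
    }

    /// XOR entry (i, j) with 1, keeping both views consistent.
    fn flip(&mut self, i: usize, j: usize) {
        if self.rows[i].insert(j) {
            self.cols[j].insert(i);
        } else {
            self.rows[i].remove(&j);
            self.cols[j].remove(&i);
        }
    }

    /// Rows with a nonzero in column j, without scanning every row.
    fn ones_in_column(&self, j: usize) -> impl Iterator<Item = usize> + '_ {
        self.cols[j].iter().copied()
    }
}
```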
@sleepybishop, one other question: do you know if implementing this part of the RFC …?

After the latest round of optimizations, essentially all of encoding time is spent in three areas, the last of which can't really be optimized further: …
Yes, computing …
Cool, thanks. Ya, I'd like to implement that decoding optimization at some point. Right now I'm trying to finish up optimizing the encode path.
The RFC says the intermediate symbols can be calculated via `C = (A^^-1) * D` …
I'm not familiar with Markowitz pivoting; do you have any pointers to how it works? And is the tradeoff that the plan compilation is faster, but the resulting plan will have more symbol ops?
Sure, see this paper, page 52/section 3.2.
Yep, that's the idea: a suboptimal strategy will create more fill-in, so it's a tradeoff between the performance of graph methods and the performance of XOR row ops, which you've already optimized heavily with SIMD.
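For anyone else reading along, a minimal sketch of the Markowitz criterion in its usual formulation: pick the nonzero pivot minimizing `(r - 1) * (c - 1)`, a cheap estimate of the fill-in it will cause. The accessors here are hypothetical:

```rust
/// Markowitz pivot selection: among remaining nonzero entries, pick
/// (i, j) minimizing (r_i - 1) * (c_j - 1), where r_i and c_j count
/// nonzeros in row i and column j. A smaller cost means less
/// expected fill-in when row i is used to eliminate column j.
fn markowitz_pivot(
    row_weights: &[usize],
    col_weights: &[usize],
    nonzero: impl Fn(usize, usize) -> bool, // is entry (i, j) nonzero?
) -> Option<(usize, usize)> {
    let mut best: Option<(usize, usize, usize)> = None;
    for i in 0..row_weights.len() {
        for j in 0..col_weights.len() {
            if !nonzero(i, j) {
                continue;
            }
            let cost = (row_weights[i] - 1) * (col_weights[j] - 1);
            if best.map_or(true, |(_, _, c)| cost < c) {
                best = Some((i, j, cost));
            }
        }
    }
    best.map(|(i, j, _)| (i, j))
}
```

In practice, implementations typically keep rows bucketed by weight so the search only scans the lightest rows rather than the whole matrix.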
Cool, thanks! I may give that a try. I'm still thinking through some potential data structure changes to speed up the graph calculation. I feel like there might be a significant speedup to be had, but I'm not sure yet.
With graph methods the thing to exploit is that typically one component will be really large compared to the rest.
Yep, I already leverage that by selecting an edge from the longest cycle during the …
I optimized the graph substep using a union-find data structure: #95. It's now a much smaller fraction of the time.
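For context, a minimal sketch of a size-tracking union-find, assuming each r=2 row is treated as an edge between its two nonzero columns (the actual implementation in #95 may differ):

```rust
/// Union-find with component sizes, so the largest component among
/// r=2 rows can be tracked without rebuilding the graph.
struct UnionFind {
    parent: Vec<usize>,
    size: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        UnionFind { parent: (0..n).collect(), size: vec![1; n] }
    }

    fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let root = self.find(self.parent[x]);
            self.parent[x] = root; // path compression
        }
        self.parent[x]
    }

    /// Merge the components of a and b; returns the merged size.
    fn union(&mut self, a: usize, b: usize) -> usize {
        let (mut ra, mut rb) = (self.find(a), self.find(b));
        if ra == rb {
            return self.size[ra];
        }
        if self.size[ra] < self.size[rb] {
            std::mem::swap(&mut ra, &mut rb); // union by size
        }
        self.parent[rb] = ra;
        self.size[ra] += self.size[rb];
        self.size[ra]
    }
}
```

Unioning the two endpoint columns of every r=2 row and remembering the maximum size answers "which component is biggest" in near-constant time per elimination step.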
Cool, did you get a significant boost in encoding throughput? What's the largest bottleneck now? |
Ya, about 10% on large symbol counts. I just released it in 1.6.2. It's spread out over a fair number of things now, but the sparse FMAs are the largest one. Another large piece is generating the HDPC rows of the constraint matrix. Here's a profile: flamegraph.zip
What do you mean by sparse FMAs? I thought you had refactored to only use dense ops on …

Optimizing the generation of … Use Section 5.7.2 from the RFC as a reference.
You can work this out yourself by working backwards from a …
I removed most of the sparse additions, but the ones in phase 1 are still there. They were a relatively small amount of time before. They're next up on my list of things to optimize now that the graph data structure is more efficient.

Ah, thanks for the tip on generating the HDPC matrix! I'll give that a go.
Ah, actually the sparse additions in phase 1 weren't taking much time. I removed them and it improved throughput by ~2%. The profile shows 4.1% in …
I tried the HDPC generation optimization you suggested, but wasn't able to make it work. Perhaps I implemented it wrong? #101

For this part, "set each column in every row to …" …

I wasn't able to re-derive this part either: "namely add …"
You're almost there; the order of operations is off. This needs to happen first:

```rust
for i in 0..H {
    result[i][Kprime + S - 1] = Octet::alpha(i).byte();
}
```

because the remaining steps depend on the last column.

… So then it's just a matter of shuffling your logic around a bit:

```rust
for j in (0..=(Kprime + S - 2)).rev() { // edited: the loop should count down
    // fill in column j from column j+1
    for i in 0..H {
        result[i][j] = (Octet::alpha(1) * Octet::new(result[i][j + 1])).byte();
    }
    // recover the changes by multiplication with MT
    let rand6 = rand((j + 1) as u32, 6u32, H as u32) as usize;
    let rand7 = rand((j + 1) as u32, 7u32, (H - 1) as u32) as usize;
    let i1 = rand6;
    let i2 = (rand6 + rand7 + 1) % H;
    result[i1][j] ^= 1;
    result[i2][j] ^= 1;
}
```
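For anyone following along, the identity behind this recursion, as I read RFC 6330 §5.3.3.3 (where `G_HDPC = MT * GAMMA` and `GAMMA[i][j] = alpha^(i-j)` for `i >= j`, zero otherwise):

```latex
% Column j of GAMMA in terms of column j+1 (e_j = j-th unit vector):
%   GAMMA[i][j] = alpha^(i-j) for i >= j, so
\Gamma_{:,j} = \alpha\,\Gamma_{:,j+1} + e_j
% Left-multiplying by MT gives the recursion the code implements:
(MT\,\Gamma)_{:,j} = \alpha\,(MT\,\Gamma)_{:,j+1} + MT_{:,j}
% For j < K'+S-1, MT[:,j] has ones at rows i1 and i2 (the two `^= 1`
% lines above); the seed is MT[:,K'+S-1] = (alpha^0, ..., alpha^(H-1)).
```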
Oh yes, of course, compute it recursively, very clever! Thanks, I'm adding this in.
Nice. Another ~10% speedup from that HDPC optimization :) Looks like the next thing I need to do is revisit some of the matrix methods. Things like iterating through a column of values and swapping column indices are some of the most expensive ops now.
Any reason to not merge the current performance speedup?
It's already released. Not sure which you're referring to?
The issue is still open and I was wondering why.
Some ideas to improve performance of precode matrix inversion:

- Since `r == 2` the vast majority of the time, optimizing for this and `r <= 3` will improve performance.
- When the system can be solved in gf2 (`rank(gf2_matrix) == L`), inversion can stay in gf2 and avoid expensive hdpc operations completely (see the sketch below).

Hopefully these are helpful insights.
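To make the gf2 point concrete, a scalar sketch of the cost asymmetry, assuming precomputed log/exp tables (names are illustrative; real implementations use SIMD for the GF(256) path):

```rust
/// A GF(2) row addition is just a word-wide XOR: 64 matrix entries
/// per instruction.
fn gf2_add_assign(dst: &mut [u64], src: &[u64]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d ^= *s;
    }
}

/// A GF(256) fused multiply-add (dst += scalar * src) needs a table
/// lookup per byte; `exp` has 510 entries so the summed index never
/// needs a mod-255 reduction. Illustrative scalar version only.
fn gf256_fma(dst: &mut [u8], src: &[u8], scalar: u8, log: &[u8; 256], exp: &[u8; 510]) {
    if scalar == 0 {
        return;
    }
    let log_scalar = log[scalar as usize] as usize;
    for (d, s) in dst.iter_mut().zip(src) {
        if *s != 0 {
            *d ^= exp[log_scalar + log[*s as usize] as usize];
        }
    }
}
```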