Add Avx backed Block8x8F Transpose method #1374
Vector256<float> r6 = Unsafe.As<Vector4, Vector256<float>>(ref this.V6L);
Vector256<float> r7 = Unsafe.As<Vector4, Vector256<float>>(ref this.V7L);

Vector256<float> t0 = Avx.UnpackLow(r0, r1);
I'd expect there are a few spills as part of this due to having 8-16 live registers at a time.
You might see better perf by intermixing the three steps given here. That is, handle t0, t2, t4, t6 to produce r0, r1, r4, r5 and store them, and then handle t1, t3, t5, t7 to produce r2, r3, r6, r7 and store them.
I imagine this will help minimize the number of stack spills that happen and may also allow better pipelining on the latest CPUs (without hindering older CPUs)
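The interleaving suggested above can be sketched roughly as follows. This is a minimal stand-alone sketch, not the PR's code: `Transpose8x8Sketch`, `Load4`, `LoadPair` and `Store8` are hypothetical names, the block is assumed to be 64 row-major floats in a span rather than a `Block8x8F`, and `source` and `dest` are assumed not to overlap. The `0x4E` shuffle and `0xCC` blend appear in the diff; the `0x33` blend for the paired row is taken from the linked Stack Overflow answer.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical stand-alone sketch; not the PR's Block8x8F implementation.
public static class Transpose8x8Sketch
{
    // Reads 4 floats starting at element `index`, without pinning.
    private static Vector128<float> Load4(ref float start, int index)
        => Unsafe.ReadUnaligned<Vector128<float>>(
            ref Unsafe.As<float, byte>(ref Unsafe.Add(ref start, index)));

    // Builds [4 floats at `lo` | 4 floats at `hi`].
    private static Vector256<float> LoadPair(ref float start, int lo, int hi)
        => Vector256.Create(Load4(ref start, lo), Load4(ref start, hi));

    // Writes 8 floats starting at element `index`, without pinning.
    private static void Store8(ref float start, int index, Vector256<float> value)
        => Unsafe.WriteUnaligned(
            ref Unsafe.As<float, byte>(ref Unsafe.Add(ref start, index)), value);

    // Assumes row-major 8x8 data and non-overlapping buffers.
    public static void Transpose(ReadOnlySpan<float> source, Span<float> dest)
    {
        if (source.Length < 64 || dest.Length < 64)
            throw new ArgumentException("Both buffers must hold at least 64 floats.");

        if (!Avx.IsSupported)
        {
            // Scalar fallback so the sketch also runs without AVX.
            for (int r = 0; r < 8; r++)
                for (int c = 0; c < 8; c++)
                    dest[(c * 8) + r] = source[(r * 8) + c];
            return;
        }

        ref float s = ref MemoryMarshal.GetReference(source);
        ref float d = ref MemoryMarshal.GetReference(dest);

        // rN pairs a 4-float slice of row N % 4 with the same slice of row N % 4 + 4.
        Vector256<float> r0 = LoadPair(ref s, 0, 32), r1 = LoadPair(ref s, 8, 40);
        Vector256<float> r2 = LoadPair(ref s, 16, 48), r3 = LoadPair(ref s, 24, 56);
        Vector256<float> r4 = LoadPair(ref s, 4, 36), r5 = LoadPair(ref s, 12, 44);
        Vector256<float> r6 = LoadPair(ref s, 20, 52), r7 = LoadPair(ref s, 28, 60);

        // First pass: t0, t2, t4, t6 -> output rows 0, 1, 4, 5, stored immediately.
        Vector256<float> ta = Avx.UnpackLow(r0, r1), tb = Avx.UnpackLow(r2, r3);
        Vector256<float> v = Avx.Shuffle(ta, tb, 0x4E);
        Store8(ref d, 0, Avx.Blend(ta, v, 0xCC));
        Store8(ref d, 8, Avx.Blend(tb, v, 0x33));
        ta = Avx.UnpackLow(r4, r5); tb = Avx.UnpackLow(r6, r7);
        v = Avx.Shuffle(ta, tb, 0x4E);
        Store8(ref d, 32, Avx.Blend(ta, v, 0xCC));
        Store8(ref d, 40, Avx.Blend(tb, v, 0x33));

        // Second pass: t1, t3, t5, t7 -> output rows 2, 3, 6, 7.
        ta = Avx.UnpackHigh(r0, r1); tb = Avx.UnpackHigh(r2, r3);
        v = Avx.Shuffle(ta, tb, 0x4E);
        Store8(ref d, 16, Avx.Blend(ta, v, 0xCC));
        Store8(ref d, 24, Avx.Blend(tb, v, 0x33));
        ta = Avx.UnpackHigh(r4, r5); tb = Avx.UnpackHigh(r6, r7);
        v = Avx.Shuffle(ta, tb, 0x4E);
        Store8(ref d, 48, Avx.Blend(ta, v, 0xCC));
        Store8(ref d, 56, Avx.Blend(tb, v, 0x33));
    }
}
```

Calling `Transpose8x8Sketch.Transpose(src, dst)` with two 64-element `float[]` buffers should leave `dst[c * 8 + r] == src[r * 8 + c]`. Whether this ordering actually reduces spills would still need to be confirmed in the disassembly.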
Noting that I did not check the disassembly here, I am just speculating based on past experience 😄
It helped! 🎉
Codecov Report
@@            Coverage Diff             @@
##           master    #1374      +/-   ##
==========================================
+ Coverage   82.77%   82.80%   +0.02%
==========================================
  Files         690      690
  Lines       30975    31033     +58
  Branches     3511     3512      +1
==========================================
+ Hits        25641    25696     +55
- Misses       4613     4615      +2
- Partials      721      722      +1
For a while I was afraid the gains would be negligible, but this looks like a visible improvement. Good job!
Mostly LGTM, but let's see what others think.
@@ -172,14 +172,34 @@ public void TransposeInto()
source.LoadFrom(Create8x8FloatData());

var dest = default(Block8x8F);
source.TransposeInto(ref dest);
source.TransposeIntoFallback(ref dest);
Let's rename the test method as well.
Vector256<float> t5 = Avx.UnpackHigh(r4, r5);
Vector256<float> t7 = Avx.UnpackHigh(r6, r7);
v = Avx.Shuffle(t5, t7, 0x4E);
Unsafe.As<Vector4, Vector256<float>>(ref d.V6L) = Avx.Blend(t5, v, 0xCC);
@tannergooding can we rely on codegen producing optimal assembly when working with Unsafe.As? (In comparison to Store / LoadVector256.)
We wanted to avoid pinning, if possible.
I would also like to subscribe to this newsletter.
We should still fold the loads/stores. I believe there are a couple of edge cases where we may generate an additional lea instruction, but the disassembly would need to be checked to see if it's an issue.
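For reference, the two load styles being compared can be sketched side by side. This is a minimal sketch under assumptions: `Row8` is a hypothetical stand-in for one row of the block (two adjacent `Vector4` fields), `LoadStyles` and its method names are illustrative, and the pointer variant needs unsafe code enabled in the project. It only shows that both forms read the same 32 bytes; it does not show what the JIT emits for either.

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical stand-in for one 8-float row: two adjacent Vector4 fields.
[StructLayout(LayoutKind.Sequential)]
public struct Row8
{
    public Vector4 L;
    public Vector4 R;
}

public static class LoadStyles
{
    // No pinning: reinterpret the managed reference and read 8 floats through it.
    public static Vector256<float> LoadViaUnsafeAs(ref Row8 row)
        => Unsafe.As<Vector4, Vector256<float>>(ref row.L);

    // Pinning: fix the reference for the duration of the load, then use the
    // pointer-based intrinsic. Only call this when Avx.IsSupported is true.
    public static unsafe Vector256<float> LoadViaPointer(ref Row8 row)
    {
        fixed (Vector4* p = &row.L)
        {
            return Avx.LoadVector256((float*)p);
        }
    }
}
```

On hardware with AVX, both methods should return the same vector for the same `Row8`; comparing their disassembly is the way to see whether the extra `lea` mentioned above appears.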
Is there a new language feature I am unaware of that allows this without pinning?
https://source.dot.net/#System.Private.CoreLib/Matrix4x4.cs,273
Vector128<float> M11 = AdvSimd.LoadVector128(&value1.M11);
In this case Matrix4x4 is passed by value to the method, so the compiler knows for sure that value1 is living on the stack, which means it never gets moved around by the GC. Therefore the address can be used safely without pinning anything.
Actually: JpegBlockPostProcessor is the holder of all 3 blocks passed to TransformIDCT:
ImageSharp/src/ImageSharp/Formats/Jpeg/Components/Decoder/JpegBlockPostProcessor.cs
Lines 80 to 86 in 78a584e
ref Block8x8F b = ref this.SourceBlock;
b.LoadFrom(ref sourceBlock);
// Dequantize:
b.MultiplyInplace(ref this.DequantiazationTable);
FastFloatingPointDCT.TransformIDCT(ref b, ref this.WorkspaceBlock1, ref this.WorkspaceBlock2);
And that struct always lives on the stack:
ImageSharp/src/ImageSharp/Formats/Jpeg/Components/Decoder/JpegComponentPostProcessor.cs
Line 79 in 78a584e
var blockPp = new JpegBlockPostProcessor(this.ImagePostProcessor.RawJpeg, this.Component);
Therefore, you can get a pointer to the whole Block8x8 buffer at the beginning of the method "safely" with Unsafe:
float* dPtr = (float*)Unsafe.AsPointer(ref d);
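The difference between the by-value case and the by-ref case can be sketched like this. It is a minimal sketch, assuming unsafe code is enabled; `StackAddressSketch` and its method names are hypothetical, and `Matrix4x4` is used only because it is a convenient struct of 16 sequential floats.

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;

public static class StackAddressSketch
{
    // By-value parameter: the copy lives in this method's frame, so the
    // address-of operator is allowed without a fixed statement.
    public static unsafe float SumByValue(Matrix4x4 value)
    {
        float* p = &value.M11;
        float sum = 0;
        for (int i = 0; i < 16; i++)
        {
            sum += p[i];
        }

        return sum;
    }

    // By-ref parameter: the compiler cannot tell where the target lives.
    // Unsafe.AsPointer skips pinning, so this is only valid under the
    // assumption that every caller passes a stack-allocated value.
    public static unsafe float SumByRef(ref Matrix4x4 value)
    {
        float* p = (float*)Unsafe.AsPointer(ref value);
        float sum = 0;
        for (int i = 0; i < 16; i++)
        {
            sum += p[i];
        }

        return sum;
    }
}
```

The second method compiles for any caller, including one that passes a reference into a heap object, which is the kind of change in external code that could silently break the pointer approach.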
Ah yes of course. This all makes sense thanks!
I might give the pointer approach a try tomorrow just to see if there is any difference.
Decided against it. I don't want to risk changes to external code breaking this.
Vector256<float> r0 = Avx.InsertVector128(
Unsafe.As<Vector4, Vector128<float>>(ref this.V0L).ToVector256(),
Unsafe.As<Vector4, Vector128<float>>(ref this.V4L),
1);
This only slightly changes the assembly that is generated (vmovups vs vmovupd, see comment below), but it might show the intention better as:
Vector256<float> r0 = Vector256.Create(
Unsafe.As<Vector4, Vector128<float>>(ref this.V0L),
Unsafe.As<Vector4, Vector128<float>>(ref this.V4L)
);
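A small sketch of the two spellings, assuming nothing beyond the public `System.Runtime.Intrinsics` API; `CombineSketch`, `ViaInsert` and `ViaCreate` are hypothetical names. Both build a 256-bit vector from a lower and an upper 128-bit half.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class CombineSketch
{
    // Widen the lower half (upper bits zeroed), then insert the upper half
    // into lane 1. Requires Avx.IsSupported.
    public static Vector256<float> ViaInsert(Vector128<float> lower, Vector128<float> upper)
        => Avx.InsertVector128(lower.ToVector256(), upper, 1);

    // States the intent directly: build the vector from its two halves.
    public static Vector256<float> ViaCreate(Vector128<float> lower, Vector128<float> upper)
        => Vector256.Create(lower, upper);
}
```

The two should produce equal vectors for the same inputs; any difference would be in the generated instructions, not in the result.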
Can anyone explain the difference in the precision of the output instruction here (vmovups to vmovupd at L0006)?
Block8x8F.TransposeIntoAvx(Block8x8F ByRef)
L0000: sub esp, 0x20
L0003: vzeroupper
L0006: vmovups xmm0, [ecx]
L000a: add ecx, 0x80
L0010: vmovupd xmm1, [ecx]
L0014: vinsertf128 ymm0, ymm0, xmm1, 1
L001a: vmovupd [esp], ymm0
L001f: vzeroupper
L0022: add esp, 0x20
L0025: ret
Block8x8F.TransposeIntoAvx(Block8x8F ByRef)
L0000: sub esp, 0x20
L0003: vzeroupper
L0006: vmovupd xmm0, [ecx]
L000a: add ecx, 0x80
L0010: vmovupd xmm1, [ecx]
L0014: vinsertf128 ymm0, ymm0, xmm1, 1
L001a: vmovupd [esp], ymm0
L001f: vzeroupper
L0022: add esp, 0x20
L0025: ret
Could be measurement error, but I spotted a slight slowdown with the change. Keeping as-is for now.
Designed to fail to ensure RemoteExecutor is running.
@JimBobSquarePants wouldn't it make sense to isolate the test infra changes into a follow-up PR? (1) I'm getting unsure about the right approach for providing proper test coverage for the different codepaths, (2) we also need to deal with #1376. Regarding (1): We probably want to introduce one or more entire test targets instead of, or in addition to, the unit test helpers to make 100% sure we cover end-to-end scenarios on systems where certain instruction set extensions are unavailable.
I could revert. I just didn’t want to add more clutter (I don’t like having to write duplicate tests for the same feature).
I don’t fancy extra targets either. Builds already take far too long.
My point is that the transpose stuff is good to merge as is. Would be nice to have it done, and take our time to discuss/finish/review the rest.
/// </summary>
/// <param name="action">The test action to run.</param>
/// <param name="intrinsics">The intrinsics features.</param>
public static void RunWithHwIntrinsicsFeature(
Anyway, great stuff! It would be nice to add tests for the test utility itself, but that is currently blocked.
I do have tests waiting to push. Just hit the executor issue so stopped. I’ll revert later, merge the current state and add as a separate PR.
Add Avx backed Block8x8F Transpose method
Prerequisites
Description
Adds a new Avx backed Block8x8F.TransposeInto method. Used to help reduce a bottleneck in the Jpeg Discrete Cosine Transform:
ImageSharp/src/ImageSharp/Formats/Jpeg/Components/FastFloatingPointDCT.cs
Lines 53 to 55 in 4dadf24
Benchmarks aren't anywhere near what I hoped they'd be. Maybe I've done something wrong. The only obvious difference I can see so far between this and the implementations in things like Matrix4x4 is that I'm using Unsafe.As over Avx.Load.
Benchmarking around 16% faster with some savings in jpeg decoding also. Thanks @tannergooding !
I've ported both methods from here using a preprocessor directive to hide one from the compiler. I've just focused on the implementation that had fewer instructions.
https://stackoverflow.com/questions/25622745/transpose-an-8x8-float-using-avx-avx2/25627536#25627536
Tagging the brains trust to see if you chaps have any suggestions.
@antonfirsov
@tannergooding
@saucecontrol
[Benchmark results for Block8x8F.Transpose and Jpeg Decode, before and after; tables not preserved]