Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Avx backed Block8x8F Transpose method #1374

Merged
merged 10 commits into from
Oct 12, 2020

Conversation

JimBobSquarePants
Copy link
Member

@JimBobSquarePants JimBobSquarePants commented Oct 7, 2020

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following matches the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Adds a new Avx backed Block8x8F.TransposeInto method. Used to help reduce bottleneck in Jpeg Discrete Cosine Transform

// TODO: Transpose is a bottleneck now. We need full AVX support to optimize it:
// https://github.com/dotnet/corefx/issues/22940
src.TransposeInto(ref temp);

Benchmarks aren't anywhere near what I hoped they'd be. Maybe I've done something wrong. The only obvious difference I can see so far between this and the implementations in things like Matrix4x4 is that I'm using Unsafe.As over Avx.Load
Benchmarking around 16% faster with some savings in jpeg decoding also. Thanks @tannergooding !

I've ported both methods from here using a processor directive to hide one from the compiler.
I've just focused on the implementation that had less instructions.
https://stackoverflow.com/questions/25622745/transpose-an-8x8-float-using-avx-avx2/25627536#25627536

Tagging the brains trust to see if you chaps have any suggestions.

@antonfirsov
@tannergooding
@saucecontrol

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.508 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.1.20452.10
  [Host]     : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), X64 RyuJIT
  DefaultJob : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), X64 RyuJIT

Block8x8F.Transpose

Method Mean Error StdDev Median Ratio RatioSD
TransposeIntoVector4 46.71 ns 0.956 ns 1.277 ns 45.96 ns 1.00 0.00
TransposeIntoAvx 39.97 ns 0.157 ns 0.139 ns 39.97 ns 0.84 0.02

Jpeg Decode

Before

Method TestImage Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
'Decode Jpeg - System.Drawing' Jpg/b(...)e.jpg [21] 5.201 ms 0.5051 ms 0.0277 ms 1.00 0.00 - - - 176 B
'Decode Jpeg - ImageSharp' Jpg/b(...)e.jpg [21] 11.378 ms 0.5299 ms 0.0290 ms 2.19 0.01 - - - 15888 B
'Decode Jpeg - System.Drawing' Jpg/b(...)f.jpg [28] 14.157 ms 0.6681 ms 0.0366 ms 1.00 0.00 - - - 176 B
'Decode Jpeg - ImageSharp' Jpg/b(...)f.jpg [28] 27.588 ms 1.7655 ms 0.0968 ms 1.95 0.00 - - - 16896 B
'Decode Jpeg - System.Drawing' Jpg/i(...)e.jpg [43] 337.594 ms 39.9392 ms 2.1892 ms 1.00 0.00 - - - 216 B
'Decode Jpeg - ImageSharp' Jpg/i(...)e.jpg [43] 264.551 ms 84.5057 ms 4.6320 ms 0.78 0.02 - - - 36022512 B

After

Method TestImage Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
'Decode Jpeg - System.Drawing' Jpg/b(...)e.jpg [21] 5.464 ms 0.8390 ms 0.0460 ms 1.00 0.00 - - - 176 B
'Decode Jpeg - ImageSharp' Jpg/b(...)e.jpg [21] 10.433 ms 0.6750 ms 0.0370 ms 1.91 0.02 - - - 15918 B
'Decode Jpeg - System.Drawing' Jpg/b(...)f.jpg [28] 15.075 ms 32.4878 ms 1.7808 ms 1.00 0.00 - - - 176 B
'Decode Jpeg - ImageSharp' Jpg/b(...)f.jpg [28] 28.504 ms 8.1014 ms 0.4441 ms 1.91 0.19 - - - 16896 B
'Decode Jpeg - System.Drawing' Jpg/i(...)e.jpg [43] 339.520 ms 24.3315 ms 1.3337 ms 1.00 0.00 - - - 216 B
'Decode Jpeg - ImageSharp' Jpg/i(...)e.jpg [43] 254.203 ms 108.8514 ms 5.9665 ms 0.75 0.02 - - - 36022512 B

Vector256<float> r6 = Unsafe.As<Vector4, Vector256<float>>(ref this.V6L);
Vector256<float> r7 = Unsafe.As<Vector4, Vector256<float>>(ref this.V7L);

Vector256<float> t0 = Avx.UnpackLow(r0, r1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect there are a few spills as part of this due to having 8-16 live registers at a time.

You might see better perf by intermixing the three steps given here. That is, handle
t0, t2, t4, t6 to produce r0, r1, r4, r5 and store them
and then handle t1, t3, t5, t7 to produce r2, r3, r6, r7 and store them

I imagine this will help minimize the number of stack spills that happen and may also allow better pipelining on the latest CPUs (without hindering older CPUs)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting that I did not check the disassembly here, I am just speculating based on past experience 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It helped! 🎉

@codecov
Copy link

codecov bot commented Oct 7, 2020

Codecov Report

Merging #1374 into master will increase coverage by 0.02%.
The diff coverage is 94.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1374      +/-   ##
==========================================
+ Coverage   82.77%   82.80%   +0.02%     
==========================================
  Files         690      690              
  Lines       30975    31033      +58     
  Branches     3511     3512       +1     
==========================================
+ Hits        25641    25696      +55     
- Misses       4613     4615       +2     
- Partials      721      722       +1     
Flag Coverage Δ
#unittests 82.80% <94.82%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...arp/Formats/Jpeg/Components/Block8x8F.Generated.cs 100.00% <ø> (ø)
...rp/Formats/Jpeg/Components/FastFloatingPointDCT.cs 100.00% <ø> (ø)
...rc/ImageSharp/Formats/Jpeg/Components/Block8x8F.cs 88.51% <94.82%> (+1.53%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4dadf24...685693a. Read the comment docs.

@JimBobSquarePants JimBobSquarePants changed the title WIP Add Avx backed Block8x8F Transpose method Add Avx backed Block8x8F Transpose method Oct 7, 2020
@JimBobSquarePants JimBobSquarePants marked this pull request as ready for review October 7, 2020 21:43
Copy link
Member

@antonfirsov antonfirsov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a while I was frightened that the gains are negligible, but seems like a visible improvement, good job!

Mostly LGTM, but let's see what do others think.

@@ -172,14 +172,34 @@ public void TransposeInto()
source.LoadFrom(Create8x8FloatData());

var dest = default(Block8x8F);
source.TransposeInto(ref dest);
source.TransposeIntoFallback(ref dest);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename the test method as well.

Vector256<float> t5 = Avx.UnpackHigh(r4, r5);
Vector256<float> t7 = Avx.UnpackHigh(r6, r7);
v = Avx.Shuffle(t5, t7, 0x4E);
Unsafe.As<Vector4, Vector256<float>>(ref d.V6L) = Avx.Blend(t5, v, 0xCC);
Copy link
Member

@antonfirsov antonfirsov Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding can we rely on codegen producing optimal assembly when working with Unsafe.As? (In comparison to Store / LoadVector256 )

We wanted to avoid pinning, if possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also like to subscribe to this newsletter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should still fold the loads/stores. I believe there are a couple edge cases where we may generate an additional lea instruction, but the disassembly would nerd to be checked to see if it's an issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a new language feature I am unaware of that allows this without pinning?

https://source.dot.net/#System.Private.CoreLib/Matrix4x4.cs,273

Vector128<float> M11 = AdvSimd.LoadVector128(&value1.M11);

Copy link
Member

@antonfirsov antonfirsov Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case Matrix4x4 is passed by value to the method, so the compiler knows for sure that value1 is living on the stack, which means it never gets moved around by the GC. => the address can be used safely without pinning anything.

Copy link
Member

@antonfirsov antonfirsov Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually: JpegBlockPostprocessor is the holder of all 3 blocks passed to TransformIDCT:

ref Block8x8F b = ref this.SourceBlock;
b.LoadFrom(ref sourceBlock);
// Dequantize:
b.MultiplyInplace(ref this.DequantiazationTable);
FastFloatingPointDCT.TransformIDCT(ref b, ref this.WorkspaceBlock1, ref this.WorkspaceBlock2);

And that struct always lives on the stack:

var blockPp = new JpegBlockPostProcessor(this.ImagePostProcessor.RawJpeg, this.Component);

Therefore, you can get a pointer to the whole Block8x8 buffer at the beginning of the method "safely" with Unsafe:

float* dPtr = (float*)Unsafe.AsPointer(ref d);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes of course. This all makes sense thanks!

I might give the pointer approach a try tomorrow just to see if there is any difference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided against it. I don't want to risk changes to external code breaking this.

Comment on lines +631 to +634
Vector256<float> r0 = Avx.InsertVector128(
Unsafe.As<Vector4, Vector128<float>>(ref this.V0L).ToVector256(),
Unsafe.As<Vector4, Vector128<float>>(ref this.V4L),
1);
Copy link
Contributor

@Turnerj Turnerj Oct 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't change only slightly changes the assembly that is generated (vmovups vs vmovupd - see comment below) but it might show the intention better as:

Suggested change
Vector256<float> r0 = Avx.InsertVector128(
Unsafe.As<Vector4, Vector128<float>>(ref this.V0L).ToVector256(),
Unsafe.As<Vector4, Vector128<float>>(ref this.V4L),
1);
Vector256<float> r0 = Vector256.Create(
Unsafe.As<Vector4, Vector128<float>>(ref this.V0L),
Unsafe.As<Vector4, Vector128<float>>(ref this.V4L)
);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can anyone explain the difference in the precision of the output instruction here?

vmovups to vmovupd at L0006

Original

Block8x8F.TransposeIntoAvx(Block8x8F ByRef)
    L0000: sub esp, 0x20
    L0003: vzeroupper
    L0006: vmovups xmm0, [ecx]
    L000a: add ecx, 0x80
    L0010: vmovupd xmm1, [ecx]
    L0014: vinsertf128 ymm0, ymm0, xmm1, 1
    L001a: vmovupd [esp], ymm0
    L001f: vzeroupper
    L0022: add esp, 0x20
    L0025: ret

Suggested Change

Block8x8F.TransposeIntoAvx(Block8x8F ByRef)
    L0000: sub esp, 0x20
    L0003: vzeroupper
    L0006: vmovupd xmm0, [ecx]
    L000a: add ecx, 0x80
    L0010: vmovupd xmm1, [ecx]
    L0014: vinsertf128 ymm0, ymm0, xmm1, 1
    L001a: vmovupd [esp], ymm0
    L001f: vzeroupper
    L0022: add esp, 0x20
    L0025: ret

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be error but I spotted a slight slow down in the change. Keeping as-is for now.

@antonfirsov
Copy link
Member

antonfirsov commented Oct 11, 2020

@JimBobSquarePants wouldn't it make sense to isolate testinfra changes into a follow-up PR?

(1) I'm getting unsure about the right approach for providing proper test coverage for the different codepaths, (2) we also need to deal with #1376.

Regarding (1): We may probably want to introduce one or more entire test targets instead/besides the unit test helpers to make 100% sure we cover end to end scenarios on systems where certain instruction set extensions are unavailable.

  • Pro: less reliance on Unit Test quality, and extensiveness of code reviews when adding SIMD features
  • Con: running more targets

@JimBobSquarePants
Copy link
Member Author

I could revert. I just didn’t want to add more clutter (I don’t like having to write duplicate tests for the same feature)

@JimBobSquarePants
Copy link
Member Author

I don’t fancy extra targets either. Builds already take far too long

@antonfirsov
Copy link
Member

My point is that the transpose stuff is good to merge as is. Would be nice to have it done, and take our time to discuss/finish/review the rest.

/// </summary>
/// <param name="action">The test action to run.</param>
/// <param name="intrinsics">The intrinsics features.</param>
public static void RunWithHwIntrinsicsFeature(
Copy link
Member

@antonfirsov antonfirsov Oct 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyways, great stuff! Would be nice to add tests for the test utility itself but, it is currently blocked.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do have tests waiting to push. Just hit the executor issue so stopped. I’ll revert later, merge the current state and add as a separate PR.

@JimBobSquarePants JimBobSquarePants merged commit a1784a6 into master Oct 12, 2020
@JimBobSquarePants JimBobSquarePants deleted the js/Block8x8F_TransposeAVX branch October 12, 2020 13:44
JimBobSquarePants added a commit that referenced this pull request Mar 13, 2021
Add Avx backed Block8x8F Transpose method
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants