Add Avx backed Block8x8F Transpose method #1374

JimBobSquarePants · 2020-10-07T16:46:34Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

Adds a new AVX-backed Block8x8F.TransposeInto method, used to help reduce the transpose bottleneck in the JPEG Discrete Cosine Transform.

ImageSharp/src/ImageSharp/Formats/Jpeg/Components/FastFloatingPointDCT.cs

Lines 53 to 55 in 4dadf24

    // TODO: Transpose is a bottleneck now. We need full AVX support to optimize it:
    // https://github.com/dotnet/corefx/issues/22940
    src.TransposeInto(ref temp);

Benchmarks aren't anywhere near what I hoped they'd be. Maybe I've done something wrong. The only obvious difference I can see so far between this and the implementations in things like Matrix4x4 is that I'm using Unsafe.As over Avx.Load.
Benchmarking around 16% faster, with some savings in JPEG decoding as well. Thanks @tannergooding!

~~I've ported both methods from here using a preprocessor directive to hide one from the compiler.~~
I've just focused on the implementation that had fewer instructions.
https://stackoverflow.com/questions/25622745/transpose-an-8x8-float-using-avx-avx2/25627536#25627536
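For readers following along, here is a minimal sketch of the technique from the linked answer, assuming a flat row-major buffer of 64 floats rather than the real Block8x8F fields. The class name Transpose8x8Sketch and its method names are hypothetical and not part of ImageSharp; the scalar method is only a reference to compare against.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper for illustration only; not the ImageSharp implementation.
public static class Transpose8x8Sketch
{
    // Transposes a row-major 8x8 block, using AVX when the CPU supports it.
    public static void Transpose(ReadOnlySpan<float> src, Span<float> dst)
    {
        if (src.Length != 64 || dst.Length != 64)
        {
            throw new ArgumentException("Both buffers must hold exactly 64 floats.");
        }

        if (Avx.IsSupported)
        {
            TransposeAvx(src, dst);
        }
        else
        {
            TransposeScalar(src, dst);
        }
    }

    // Portable reference: element (row, col) moves to (col, row).
    public static void TransposeScalar(ReadOnlySpan<float> src, Span<float> dst)
    {
        for (int row = 0; row < 8; row++)
        {
            for (int col = 0; col < 8; col++)
            {
                dst[(col * 8) + row] = src[(row * 8) + col];
            }
        }
    }

    // Caller must ensure Avx.IsSupported.
    public static void TransposeAvx(ReadOnlySpan<float> src, Span<float> dst)
    {
        // s[2 * row] is the left half of a row, s[(2 * row) + 1] the right half.
        ReadOnlySpan<Vector128<float>> s = MemoryMarshal.Cast<float, Vector128<float>>(src);
        Span<Vector256<float>> d = MemoryMarshal.Cast<float, Vector256<float>>(dst);

        // Step 1: pair row n with row n + 4 (left halves in r0..r3, right halves in r4..r7).
        Vector256<float> r0 = Pair(s[0], s[8]);
        Vector256<float> r1 = Pair(s[2], s[10]);
        Vector256<float> r2 = Pair(s[4], s[12]);
        Vector256<float> r3 = Pair(s[6], s[14]);
        Vector256<float> r4 = Pair(s[1], s[9]);
        Vector256<float> r5 = Pair(s[3], s[11]);
        Vector256<float> r6 = Pair(s[5], s[13]);
        Vector256<float> r7 = Pair(s[7], s[15]);

        // Step 2: interleave neighbouring rows within each 128-bit lane.
        Vector256<float> t0 = Avx.UnpackLow(r0, r1);
        Vector256<float> t1 = Avx.UnpackHigh(r0, r1);
        Vector256<float> t2 = Avx.UnpackLow(r2, r3);
        Vector256<float> t3 = Avx.UnpackHigh(r2, r3);
        Vector256<float> t4 = Avx.UnpackLow(r4, r5);
        Vector256<float> t5 = Avx.UnpackHigh(r4, r5);
        Vector256<float> t6 = Avx.UnpackLow(r6, r7);
        Vector256<float> t7 = Avx.UnpackHigh(r6, r7);

        // Step 3: one shuffle plus two blends finish two output rows at a time.
        Vector256<float> v = Avx.Shuffle(t0, t2, 0x4E);
        d[0] = Avx.Blend(t0, v, 0xCC);
        d[1] = Avx.Blend(t2, v, 0x33);
        v = Avx.Shuffle(t1, t3, 0x4E);
        d[2] = Avx.Blend(t1, v, 0xCC);
        d[3] = Avx.Blend(t3, v, 0x33);
        v = Avx.Shuffle(t4, t6, 0x4E);
        d[4] = Avx.Blend(t4, v, 0xCC);
        d[5] = Avx.Blend(t6, v, 0x33);
        v = Avx.Shuffle(t5, t7, 0x4E);
        d[6] = Avx.Blend(t5, v, 0xCC);
        d[7] = Avx.Blend(t7, v, 0x33);
    }

    // Builds a 256-bit register from two 128-bit halves.
    private static Vector256<float> Pair(Vector128<float> lower, Vector128<float> upper)
        => Avx.InsertVector128(lower.ToVector256(), upper, 1);
}
```

The sketch goes through spans and MemoryMarshal.Cast only to stay self-contained; the PR itself reinterprets the Vector4 fields with Unsafe.As.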

Tagging the brains trust to see if you chaps have any suggestions.

@antonfirsov
@tannergooding
@saucecontrol

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.508 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.1.20452.10
  [Host]     : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), X64 RyuJIT
  DefaultJob : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), X64 RyuJIT

Block8x8F.Transpose

Method	Mean	Error	StdDev	Median	Ratio	RatioSD
TransposeIntoVector4	46.71 ns	0.956 ns	1.277 ns	45.96 ns	1.00	0.00
TransposeIntoAvx	39.97 ns	0.157 ns	0.139 ns	39.97 ns	0.84	0.02

Jpeg Decode

Before

Method	TestImage	Mean	Error	StdDev	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
'Decode Jpeg - System.Drawing'	Jpg/b(...)e.jpg [21]	5.201 ms	0.5051 ms	0.0277 ms	1.00	0.00	-	-	-	176 B
'Decode Jpeg - ImageSharp'	Jpg/b(...)e.jpg [21]	11.378 ms	0.5299 ms	0.0290 ms	2.19	0.01	-	-	-	15888 B

'Decode Jpeg - System.Drawing'	Jpg/b(...)f.jpg [28]	14.157 ms	0.6681 ms	0.0366 ms	1.00	0.00	-	-	-	176 B
'Decode Jpeg - ImageSharp'	Jpg/b(...)f.jpg [28]	27.588 ms	1.7655 ms	0.0968 ms	1.95	0.00	-	-	-	16896 B

'Decode Jpeg - System.Drawing'	Jpg/i(...)e.jpg [43]	337.594 ms	39.9392 ms	2.1892 ms	1.00	0.00	-	-	-	216 B
'Decode Jpeg - ImageSharp'	Jpg/i(...)e.jpg [43]	264.551 ms	84.5057 ms	4.6320 ms	0.78	0.02	-	-	-	36022512 B

After

Method	TestImage	Mean	Error	StdDev	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
'Decode Jpeg - System.Drawing'	Jpg/b(...)e.jpg [21]	5.464 ms	0.8390 ms	0.0460 ms	1.00	0.00	-	-	-	176 B
'Decode Jpeg - ImageSharp'	Jpg/b(...)e.jpg [21]	10.433 ms	0.6750 ms	0.0370 ms	1.91	0.02	-	-	-	15918 B

'Decode Jpeg - System.Drawing'	Jpg/b(...)f.jpg [28]	15.075 ms	32.4878 ms	1.7808 ms	1.00	0.00	-	-	-	176 B
'Decode Jpeg - ImageSharp'	Jpg/b(...)f.jpg [28]	28.504 ms	8.1014 ms	0.4441 ms	1.91	0.19	-	-	-	16896 B

'Decode Jpeg - System.Drawing'	Jpg/i(...)e.jpg [43]	339.520 ms	24.3315 ms	1.3337 ms	1.00	0.00	-	-	-	216 B
'Decode Jpeg - ImageSharp'	Jpg/i(...)e.jpg [43]	254.203 ms	108.8514 ms	5.9665 ms	0.75	0.02	-	-	-	36022512 B

tannergooding · 2020-10-07T17:34:24Z

src/ImageSharp/Formats/Jpeg/Components/Block8x8F.cs

+            Vector256<float> r6 = Unsafe.As<Vector4, Vector256<float>>(ref this.V6L);
+            Vector256<float> r7 = Unsafe.As<Vector4, Vector256<float>>(ref this.V7L);
+
+            Vector256<float> t0 = Avx.UnpackLow(r0, r1);


I'd expect there are a few spills as part of this due to having 8-16 live registers at a time.

You might see better perf by intermixing the three steps given here. That is, handle t0, t2, t4, t6 to produce r0, r1, r4, r5 and store them, and then handle t1, t3, t5, t7 to produce r2, r3, r6, r7 and store them.

I imagine this will help minimize the number of stack spills that happen and may also allow better pipelining on the latest CPUs (without hindering older CPUs).

Noting that I did not check the disassembly here, I am just speculating based on past experience 😄
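A rough sketch of the regrouping suggested here, assuming a flat row-major buffer of 64 floats instead of the Block8x8F fields. The class name InterleavedTransposeSketch is hypothetical. Whether this ordering actually reduces spills depends on the JIT, so the disassembly would still need checking.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper for illustration only; not the ImageSharp implementation.
public static class InterleavedTransposeSketch
{
    public static void TransposeAvx(ReadOnlySpan<float> src, Span<float> dst)
    {
        if (!Avx.IsSupported)
        {
            throw new PlatformNotSupportedException("AVX is required.");
        }

        if (src.Length != 64 || dst.Length != 64)
        {
            throw new ArgumentException("Both buffers must hold exactly 64 floats.");
        }

        // s[2 * row] is the left half of a row, s[(2 * row) + 1] the right half.
        ReadOnlySpan<Vector128<float>> s = MemoryMarshal.Cast<float, Vector128<float>>(src);
        Span<Vector256<float>> d = MemoryMarshal.Cast<float, Vector256<float>>(dst);

        // Pair row n with row n + 4 (left halves in r0..r3, right halves in r4..r7).
        Vector256<float> r0 = Pair(s[0], s[8]);
        Vector256<float> r1 = Pair(s[2], s[10]);
        Vector256<float> r2 = Pair(s[4], s[12]);
        Vector256<float> r3 = Pair(s[6], s[14]);
        Vector256<float> r4 = Pair(s[1], s[9]);
        Vector256<float> r5 = Pair(s[3], s[11]);
        Vector256<float> r6 = Pair(s[5], s[13]);
        Vector256<float> r7 = Pair(s[7], s[15]);

        // Group 1: only the "low" unpacks (t0, t2, t4, t6); store rows 0, 1, 4, 5 right away.
        Vector256<float> t0 = Avx.UnpackLow(r0, r1);
        Vector256<float> t2 = Avx.UnpackLow(r2, r3);
        Vector256<float> v = Avx.Shuffle(t0, t2, 0x4E);
        d[0] = Avx.Blend(t0, v, 0xCC);
        d[1] = Avx.Blend(t2, v, 0x33);

        Vector256<float> t4 = Avx.UnpackLow(r4, r5);
        Vector256<float> t6 = Avx.UnpackLow(r6, r7);
        v = Avx.Shuffle(t4, t6, 0x4E);
        d[4] = Avx.Blend(t4, v, 0xCC);
        d[5] = Avx.Blend(t6, v, 0x33);

        // Group 2: only the "high" unpacks (t1, t3, t5, t7); store rows 2, 3, 6, 7.
        Vector256<float> t1 = Avx.UnpackHigh(r0, r1);
        Vector256<float> t3 = Avx.UnpackHigh(r2, r3);
        v = Avx.Shuffle(t1, t3, 0x4E);
        d[2] = Avx.Blend(t1, v, 0xCC);
        d[3] = Avx.Blend(t3, v, 0x33);

        Vector256<float> t5 = Avx.UnpackHigh(r4, r5);
        Vector256<float> t7 = Avx.UnpackHigh(r6, r7);
        v = Avx.Shuffle(t5, t7, 0x4E);
        d[6] = Avx.Blend(t5, v, 0xCC);
        d[7] = Avx.Blend(t7, v, 0x33);
    }

    // Builds a 256-bit register from two 128-bit halves.
    private static Vector256<float> Pair(Vector128<float> lower, Vector128<float> upper)
        => Avx.InsertVector128(lower.ToVector256(), upper, 1);
}
```

Each group keeps only two unpacked temporaries and one shuffle result alive on top of the eight loaded rows, which is the point of the suggestion.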

It helped! 🎉

codecov · 2020-10-07T19:38:42Z

Codecov Report

Merging #1374 into master will increase coverage by 0.02%.
The diff coverage is 94.82%.

@@            Coverage Diff             @@
##           master    #1374      +/-   ##
==========================================
+ Coverage   82.77%   82.80%   +0.02%     
==========================================
  Files         690      690              
  Lines       30975    31033      +58     
  Branches     3511     3512       +1     
==========================================
+ Hits        25641    25696      +55     
- Misses       4613     4615       +2     
- Partials      721      722       +1

Flag	Coverage Δ
#unittests	`82.80% <94.82%> (+0.02%)`	⬆️

Impacted Files	Coverage Δ
...arp/Formats/Jpeg/Components/Block8x8F.Generated.cs	`100.00% <ø> (ø)`
...rp/Formats/Jpeg/Components/FastFloatingPointDCT.cs	`100.00% <ø> (ø)`
...rc/ImageSharp/Formats/Jpeg/Components/Block8x8F.cs	`88.51% <94.82%> (+1.53%)`	⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4dadf24...685693a.

antonfirsov

For a while I feared the gains would be negligible, but it seems like a visible improvement. Good job!

Mostly LGTM, but let's see what others think.

antonfirsov · 2020-10-07T21:52:48Z

tests/ImageSharp.Tests/Formats/Jpg/Block8x8FTests.cs

@@ -172,14 +172,34 @@ public void TransposeInto()
            source.LoadFrom(Create8x8FloatData());

            var dest = default(Block8x8F);
-            source.TransposeInto(ref dest);
+            source.TransposeIntoFallback(ref dest);


Let's rename the test method as well.

antonfirsov · 2020-10-07T22:06:34Z

src/ImageSharp/Formats/Jpeg/Components/Block8x8F.cs

+            Vector256<float> t5 = Avx.UnpackHigh(r4, r5);
+            Vector256<float> t7 = Avx.UnpackHigh(r6, r7);
+            v = Avx.Shuffle(t5, t7, 0x4E);
+            Unsafe.As<Vector4, Vector256<float>>(ref d.V6L) = Avx.Blend(t5, v, 0xCC);


@tannergooding can we rely on codegen producing optimal assembly when working with Unsafe.As (in comparison to Store / LoadVector256)?

We wanted to avoid pinning, if possible.

I would also like to subscribe to this newsletter.

We should still fold the loads/stores. I believe there are a couple of edge cases where we may generate an additional lea instruction, but the disassembly would need to be checked to see if it's an issue.

Is there a new language feature I am unaware of that allows this without pinning?

https://source.dot.net/#System.Private.CoreLib/Matrix4x4.cs,273

Vector128<float> M11 = AdvSimd.LoadVector128(&value1.M11);

In this case Matrix4x4 is passed to the method by value, so the compiler knows for sure that value1 lives on the stack, which means it never gets moved around by the GC. The address can therefore be used safely without pinning anything.
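To illustrate the distinction, here is a small sketch with a hypothetical AddressOfSketch class (not from the runtime or ImageSharp). It assumes the project is compiled with unsafe code enabled and relies on Matrix4x4 storing its 16 float fields contiguously from M11 to M44.

```csharp
using System.Numerics;

// Hypothetical example for illustration only.
public static class AddressOfSketch
{
    // By-value parameter: 'm' is a private copy on the stack, so the language
    // treats it as a fixed variable and '&m.M11' compiles without 'fixed'.
    public static unsafe float SumByValue(Matrix4x4 m)
    {
        float* p = &m.M11;
        float sum = 0;
        for (int i = 0; i < 16; i++)
        {
            sum += p[i];
        }

        return sum;
    }

    // By-reference parameter: the target could be a field of a heap object,
    // so the compiler requires 'fixed' to pin it before taking the address.
    public static unsafe float SumByRef(ref Matrix4x4 m)
    {
        fixed (float* p = &m.M11)
        {
            float sum = 0;
            for (int i = 0; i < 16; i++)
            {
                sum += p[i];
            }

            return sum;
        }
    }
}
```

The by-reference case is the one that matches Block8x8F.TransposeInto, which is why the question of pinning came up.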

Actually, JpegBlockPostProcessor is the holder of all 3 blocks passed to TransformIDCT:

ImageSharp/src/ImageSharp/Formats/Jpeg/Components/Decoder/JpegBlockPostProcessor.cs

Lines 80 to 86 in 78a584e

    ref Block8x8F b = ref this.SourceBlock;
    b.LoadFrom(ref sourceBlock);

    // Dequantize:
    b.MultiplyInplace(ref this.DequantiazationTable);

    FastFloatingPointDCT.TransformIDCT(ref b, ref this.WorkspaceBlock1, ref this.WorkspaceBlock2);

And that struct always lives on the stack:

ImageSharp/src/ImageSharp/Formats/Jpeg/Components/Decoder/JpegComponentPostProcessor.cs

Line 79 in 78a584e

var blockPp = new JpegBlockPostProcessor(this.ImagePostProcessor.RawJpeg, this.Component);

Therefore, you can get a pointer to the whole Block8x8F buffer at the beginning of the method "safely" with Unsafe:

float* dPtr = (float*)Unsafe.AsPointer(ref d);

Ah yes of course. This all makes sense thanks!

I might give the pointer approach a try tomorrow just to see if there is any difference.

Decided against it. I don't want to risk changes to external code breaking this.

Turnerj · 2020-10-08T06:45:16Z

src/ImageSharp/Formats/Jpeg/Components/Block8x8F.cs

+            Vector256<float> r0 = Avx.InsertVector128(
+               Unsafe.As<Vector4, Vector128<float>>(ref this.V0L).ToVector256(),
+               Unsafe.As<Vector4, Vector128<float>>(ref this.V4L),
+               1);


This ~~doesn't change~~ only slightly changes the assembly that is generated (vmovups vs vmovupd - see comment below) but it might show the intention better as:

Suggested change

-            Vector256<float> r0 = Avx.InsertVector128(
-               Unsafe.As<Vector4, Vector128<float>>(ref this.V0L).ToVector256(),
-               Unsafe.As<Vector4, Vector128<float>>(ref this.V4L),
-               1);
+            Vector256<float> r0 = Vector256.Create(
+               Unsafe.As<Vector4, Vector128<float>>(ref this.V0L),
+               Unsafe.As<Vector4, Vector128<float>>(ref this.V4L)
+            );

Can anyone explain the difference in the precision of the output instruction here?

vmovups to vmovupd at L0006

Original

Block8x8F.TransposeIntoAvx(Block8x8F ByRef)
    L0000: sub esp, 0x20
    L0003: vzeroupper
    L0006: vmovups xmm0, [ecx]
    L000a: add ecx, 0x80
    L0010: vmovupd xmm1, [ecx]
    L0014: vinsertf128 ymm0, ymm0, xmm1, 1
    L001a: vmovupd [esp], ymm0
    L001f: vzeroupper
    L0022: add esp, 0x20
    L0025: ret

Suggested Change

Block8x8F.TransposeIntoAvx(Block8x8F ByRef)
    L0000: sub esp, 0x20
    L0003: vzeroupper
    L0006: vmovupd xmm0, [ecx]
    L000a: add ecx, 0x80
    L0010: vmovupd xmm1, [ecx]
    L0014: vinsertf128 ymm0, ymm0, xmm1, 1
    L001a: vmovupd [esp], ymm0
    L001f: vzeroupper
    L0022: add esp, 0x20
    L0025: ret

It could be measurement error, but I spotted a slight slowdown with the change. Keeping as-is for now.

antonfirsov · 2020-10-11T11:37:34Z

@JimBobSquarePants wouldn't it make sense to isolate testinfra changes into a follow-up PR?

(1) I'm getting unsure about the right approach for providing proper test coverage for the different codepaths, (2) we also need to deal with #1376.

Regarding (1): We may want to introduce one or more dedicated test targets instead of, or in addition to, the unit test helpers, to make 100% sure we cover end-to-end scenarios on systems where certain instruction set extensions are unavailable.

Pro: less reliance on Unit Test quality, and extensiveness of code reviews when adding SIMD features
Con: running more targets

JimBobSquarePants · 2020-10-11T11:45:12Z

I could revert. I just didn’t want to add more clutter (I don’t like having to write duplicate tests for the same feature)

JimBobSquarePants · 2020-10-11T11:45:49Z

I don’t fancy extra targets either. Builds already take far too long

antonfirsov · 2020-10-11T12:07:00Z

My point is that the transpose stuff is good to merge as is. Would be nice to have it done, and take our time to discuss/finish/review the rest.

antonfirsov · 2020-10-11T12:20:25Z

tests/ImageSharp.Tests/TestUtilities/FeatureTesting/FeatureTestRunner.cs

+        /// </summary>
+        /// <param name="action">The test action to run.</param>
+        /// <param name="intrinsics">The intrinsics features.</param>
+        public static void RunWithHwIntrinsicsFeature(


Anyway, great stuff! It would be nice to add tests for the test utility itself, but that is currently blocked.

I do have tests waiting to push. Just hit the executor issue so stopped. I’ll revert later, merge the current state and add as a separate PR.

JimBobSquarePants added 2 commits October 7, 2020 15:42

Add AVX backed Block8x8F Transpose method

24d49e5

Add variant 2

7a55662

JimBobSquarePants added area:performance formats:jpeg labels Oct 7, 2020

JimBobSquarePants added this to the 1.1.0 milestone Oct 7, 2020

tannergooding reviewed Oct 7, 2020

antonfirsov reviewed Oct 7, 2020

Update tests/ImageSharp.Benchmarks/Codecs/Jpeg/BlockOperations/Block8…

3b2ade5

…x8F_Transpose.cs Co-authored-by: Anton Firszov <[email protected]>

Use interleaving to prevent stack spills

093fbc4

JimBobSquarePants changed the title ~~WIP Add Avx backed Block8x8F Transpose method~~ Add Avx backed Block8x8F Transpose method Oct 7, 2020

JimBobSquarePants marked this pull request as ready for review October 7, 2020 21:43

antonfirsov reviewed Oct 7, 2020

Update Block8x8FTests.cs

8e5a59f

Turnerj reviewed Oct 8, 2020

JimBobSquarePants added 4 commits October 9, 2020 17:02

First pass at HW feature tests

6dae52e

Designed to fail to ensure RemoteExecutor is running.

Fix build

9c648d7

Test windows only

97c1846

Use single test, enable runners

e33c1cd

antonfirsov reviewed Oct 11, 2020

Revert to 8e5a59f

685693a

JimBobSquarePants merged commit a1784a6 into master Oct 12, 2020

JimBobSquarePants deleted the js/Block8x8F_TransposeAVX branch October 12, 2020 13:44

JimBobSquarePants mentioned this pull request Oct 16, 2020

Optimize Block8x8F low hanging fruit and fix naming #1390

Merged

4 tasks

JimBobSquarePants added a commit that referenced this pull request Mar 13, 2021

Merge pull request #1374 from SixLabors/js/Block8x8F_TransposeAVX

1b173ab
