
convert dwa encoder to use algorithm quantize #1948

Merged

Conversation


@kdt3rd kdt3rd commented Dec 31, 2024

Convert the quantization from a massive table to a bit-fiddling algorithm, which is more code but significantly faster (in some cases, almost 2x). This also eliminates the giant lookup tables from the binary, which reduces the compiled size of the core library by half (measured on x86 with avx2 / f16c enabled).

This will help with the dwa performance problems noted in #1915, where the compressor got much slower.
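For readers unfamiliar with the idea, here is a minimal sketch of table-free, "bit-fiddling" quantization of half floats. The function names and logic are illustrative assumptions, not the PR's actual code (the real algorithm is more involved and reproduces the old table's output exactly); the core idea is the same: find a nearby value with many trailing zero mantissa bits, which compresses better, using bit math instead of a 1.3MB lookup table.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Convert a half bit pattern to float (simple scalar fallback;
 * ignores infinities/NaNs for brevity). */
static float half_to_float (uint16_t h)
{
    uint32_t sign = (uint32_t) (h >> 15) & 1;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t man  = h & 0x3ff;
    float    f;

    if (exp == 0)
        f = ldexpf ((float) man, -24); /* subnormal */
    else
        f = ldexpf ((float) (man | 0x400), (int) exp - 25);
    return sign ? -f : f;
}

/* Zero trailing mantissa bits while the absolute error stays within
 * the tolerance; returns the quantized half bit pattern. More trailing
 * zeros means the value RLE/entropy-codes better downstream. */
static uint16_t quantize_half (uint16_t h, float tol)
{
    float    orig = half_to_float (h);
    uint16_t best = h;

    for (int bits = 1; bits <= 10; ++bits)
    {
        uint16_t cand = h & (uint16_t) ~((1u << bits) - 1);
        if (fabsf (half_to_float (cand) - orig) <= tol)
            best = cand;
        else
            break;
    }
    return best;
}
```

The table-based approach precomputes `quantize_half` for every input, trading memory traffic (and cache pressure) for arithmetic; the PR inverts that trade, which pays off as memory speed lags further behind instruction throughput.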


Signed-off-by: Kimball Thurston <[email protected]>
also propagate cross platform function fixes to test suite

Signed-off-by: Kimball Thurston <[email protected]>
@@ -103,12 +103,68 @@ static exr_result_t LossyDctDecoder_execute (
// value in the buffer - with the index into zig zag
// order data. If we return 0, we have DC only data.
//
static int LossyDctDecoder_unRleAc (
//
// This is assuminging that halfZigBlock is zero'ed
Contributor
"assuminging"

@@ -215,7 +271,7 @@ LossyDctDecoder_base_construct (
d->_toLinear = toLinear;
d->_width = width;
d->_height = height;
if (d->_toLinear == NULL) d->_toLinear = dwaCompressorNoOp;
//if (d->_toLinear == NULL) d->_toLinear = dwaCompressorNoOp;
Contributor

is this commenting-out intentional?

Contributor Author

ah, yeah, I missed fully removing that. I also removed the no-op table in favor of an outer-loop conditional that dispatches to do nothing, instead of always looking values up through an identity table.


meshula commented Dec 31, 2024

This also eliminates the giant lookup tables from the binary, which reduces the compiled size of the core library by half on x86, with avx2 / f16c enabled anyway.

I think, based on the code, that the giant lookup tables are eliminated on non-Intel platforms as well; just checking, because this comment could be read as saying they still exist on non-Intel. I'm asking because some detail might be hidden in platform switches that aren't in the diff.

We still have the giant table for unit testing. Do we need any touch on the cmake script to govern the creation of the table according to whether testing is enabled?

@meshula meshula left a comment (Contributor)

I'm approving on the assumption that the elimination of the tables is cross platform, not intel-only.

Happy New Year!

Contributor Author

kdt3rd commented Dec 31, 2024

Happy New Year! Yes, sorry, bad wording: this is entirely a cross-platform / architecture change, just removing 1.3MB of binary tables. I mentioned avx2 / f16c only to indicate the configuration that any performance numbers I might quote come from (my older core i9-9900k). I have not yet been able to test on newer / alternate hardware, but expect a good boost on all systems, increasingly so as memory speed diverges further from instruction execution speed. This could be viewed as a step towards enabling a gpu version of this, or an easier implementation in an embedded system, as examples, not that I have explicit plans to do any of that.

This is not fully branchless, unfortunately, as I could not yet work out the math to reason about the error tolerance being full 32-bit float precision while quantizing a 16-bit float. This code replicates the table's results precisely for the allowed range of quantization levels (0-100) for DWAA/B. However, if we were to change the error tolerance to be a half float value, we could greatly simplify the code (faster still) and, I believe, have it be fully deterministic and mostly branchless (aside from the odd check for inf / nan).

But I didn't want to change the output values just yet...
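The "odd check for inf / nan" mentioned above comes down to one cheap bit test on the half bit pattern, since a half is infinite or NaN exactly when its exponent field is all ones. This is an illustrative sketch of that general property, not code from the PR:

```c
#include <assert.h>
#include <stdint.h>

/* A half is inf or nan when all 5 exponent bits (0x7c00 after masking
 * off the sign) are set: mantissa == 0 means inf, nonzero means nan. */
static int half_is_inf_or_nan (uint16_t h)
{
    return (h & 0x7c00) == 0x7c00;
}
```

Because it is a single mask-and-compare, this check costs far less than the table lookups it sits alongside, which is why it is the only branch worth keeping in an otherwise branchless path.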

Contributor Author

kdt3rd commented Dec 31, 2024

We still have the giant table for unit testing. Do we need any touch on the cmake script to govern the creation of the table according to whether testing is enabled?

AFAIK, we long ago removed the build rules that create these files, leaving only the code which generated them as reference, and I have not changed that...

Signed-off-by: Kimball Thurston <[email protected]>
@kdt3rd kdt3rd merged commit 22cbb79 into AcademySoftwareFoundation:main Jan 1, 2025
36 checks passed
@kdt3rd kdt3rd deleted the convert_dwa_table_to_code branch January 1, 2025 02:41