[Windows 10] [C LIBRARY] : PNG decoding is slower than OpenCV #72

qinxianyuzi · 2022-02-08T04:05:27Z

Hello, and thanks for helping.
I am trying to use Wuffs to open PNG files in a C++ project compiled with VS2017, but PNG decoding is slower than with OpenCV:
OpenCV: 65ms
wuffs: 93ms

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#define WUFFS_IMPLEMENTATION
#define WUFFS_CONFIG__MODULE__PNG
#include "wuffs-v0.3.c"

uint32_t g_width = 0;
uint32_t g_height = 0;
wuffs_aux::MemOwner g_pixbuf_mem_owner(nullptr, &free);
wuffs_base__pixel_buffer g_pixbuf = { 0 };

bool load_image(const char* filename)
{
	FILE* file = stdin;
	const char* adj_filename = "<stdin>";
	if (filename) {
		FILE* f = fopen(filename, "rb");
		if (f == NULL) {
			printf("%s: could not open file\n", filename);
			return false;
		}
		file = f;
		adj_filename = filename;
	}
	g_width = 0;
	g_height = 0;
	g_pixbuf_mem_owner.reset();
	g_pixbuf = wuffs_base__null_pixel_buffer();
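	wuffs_aux::DecodeImageCallbacks callbacks;
	wuffs_aux::sync_io::FileInput input(file);
	wuffs_aux::DecodeImageResult res = wuffs_aux::DecodeImage(callbacks, input);
	if (filename) {
		fclose(file);  // close the file we opened; leave stdin alone
	}

	// The result owns the decoded pixels; report failure instead of ignoring it.
	if (!res.error_message.empty()) {
		printf("%s: %s\n", adj_filename, res.error_message.c_str());
		return false;
	}
	g_width = res.pixbuf.pixcfg.width();
	g_height = res.pixbuf.pixcfg.height();
	g_pixbuf_mem_owner = std::move(res.pixbuf_mem_owner);
	g_pixbuf = res.pixbuf;
	return true;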
}

inline auto get_time()
{
	return std::chrono::high_resolution_clock::now();
}

int main(int argc, char** argv)
{
	auto start = get_time();
	bool loaded = load_image("C:/Users/huangry/Desktop/8/IMG_1071.PNG");
	if (loaded) 
		std::cout << loaded << "\n";
	auto end = get_time();
	std::chrono::duration<double> elapsed = (end - start);
	printf("Wuffs : %fs\n", elapsed.count());

	return 0;
}


nigeltao · 2022-02-08T22:10:48Z

Are you configuring Visual Studio with /arch:AVX? GCC and clang can use __attribute__((target("avx2"))) on individual functions, but I don't think Microsoft's VS supports that, so you have to manually opt in to SIMD acceleration. If you don't opt in, you'll get the slower (non-SIMD) fallback code.
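One quick way to check whether the opt-in reached the compiler is to look at its predefined macros. This is a minimal sketch, assuming that MSVC defines __AVX__ under /arch:AVX and both __AVX__ and __AVX2__ under /arch:AVX2 (GCC and clang define the same macros under -mavx and -mavx2). The helper names are hypothetical and not part of Wuffs, and this only reports what the translation unit was built for, not which code path Wuffs takes at run time.

```cpp
#include <cstdio>

// Hypothetical helper: true if this translation unit was compiled with
// AVX enabled (e.g. /arch:AVX on MSVC, -mavx on GCC/clang).
constexpr bool built_with_avx() {
#if defined(__AVX__)
  return true;
#else
  return false;
#endif
}

// Hypothetical helper: true if this translation unit was compiled with
// AVX2 enabled (e.g. /arch:AVX2 on MSVC, -mavx2 on GCC/clang).
constexpr bool built_with_avx2() {
#if defined(__AVX2__)
  return true;
#else
  return false;
#endif
}

// Prints the compile-time SIMD settings, for pasting into a bug report.
void print_simd_build_flags() {
  std::printf("AVX:  %s\n", built_with_avx() ? "yes" : "no");
  std::printf("AVX2: %s\n", built_with_avx2() ? "yes" : "no");
}
```

Calling print_simd_build_flags() at the start of main() would show whether the flag set in the project settings actually applies to the file that includes Wuffs.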

nigeltao · 2022-02-08T22:11:25Z

If that doesn't help, can you attach the C:/Users/huangry/Desktop/8/IMG_1071.PNG file so I can try to reproduce the slowness?

nigeltao · 2022-02-08T22:16:33Z

> Are you configuring Visual Studio with /arch:AVX?

Oh, also, for MSVC, make sure that you're compiling an optimized build, not a debug build. I think this is the /O2 option (that's: slash, letter-O, number-2), or its GUI equivalent, but I might be wrong (I don't use Microsoft's toolchain, day-to-day).

qinxianyuzi · 2022-02-09T01:03:56Z

Thanks, I'm trying to configure Visual Studio with AVX2. Maybe clang is indispensable, though.

qinxianyuzi · 2022-02-09T01:11:49Z

This is the PNG file.

qinxianyuzi · 2022-02-09T02:56:44Z

I configure Visual Studio with (/arch:AVX2), but it doesn't work.

nigeltao · 2022-02-09T07:33:29Z

> I configure Visual Studio with (/arch:AVX2), but it doesn't work.

Does "it doesn't work" mean that it didn't get faster, or does it mean that you got a compiler error message, or does it mean something else? If it's an error message, can you copy/paste it here?

qinxianyuzi · 2022-02-14T10:56:00Z

It didn't get faster.

nigeltao · 2022-02-14T11:23:27Z

> I configure Visual Studio with (/arch:AVX2), but it doesn't work.

OK. Does /arch:AVX without the 2 do anything? Do you also pass /O2? It might be easier if you say what compiler flags you are passing.

Is clang faster or is it also as slow?

qinxianyuzi · 2022-03-01T08:22:35Z

With clang, it is 1.2x faster than OpenCV.

pavel-perina · 2022-06-09T16:11:33Z

Hi. I tried it on large data. My program has some internal overhead, but anyway...

First dataset: 1984x1984x1540, 16-bit grayscale (all times include overhead; series of 1540 images):
OpenCV/libpng: 75s
WIC (Windows Imaging Components)/file: 66s
WIC/memory: 58s (the file reader had some overhead reading 26MB PNG files from HDD, so it turned out to be faster to read the whole file and use the memory decoder)
WUFFS: fails (no 16bit support, converted to 8bit leaving half of output buffer empty)

Second dataset: 2048x2048x2048, 8-bit grayscale synthetic data, each PNG roughly 14kB (basically repeating black-and-white patterns):
WUFFS: 18s everything with overhead, 9.4s in decoder
WIC/memory: 10s everything, 2.7s in decoder (3.5x faster!)
OpenCV/libpng: 21s everything, 5.4s in decoder (worse overhead due to another app layer)
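The split between "everything" and "in decoder" above can be measured by loading the file into memory first and timing only the decode call. The following is a minimal sketch of such a harness, not the program used for the numbers above; read_whole_file and best_of are hypothetical helper names, and the decoder is passed in as an arbitrary callable so the same harness can wrap Wuffs, WIC or OpenCV.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <limits>
#include <string>
#include <vector>

// Reads a whole file into memory so that disk I/O stays outside the
// timed region. Returns an empty vector if the file cannot be opened.
std::vector<uint8_t> read_whole_file(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  if (!in) {
    return {};
  }
  return std::vector<uint8_t>(std::istreambuf_iterator<char>(in),
                              std::istreambuf_iterator<char>());
}

// Calls `decode` `runs` times and returns the fastest wall-clock time in
// seconds. Taking the minimum reduces noise from caches and scheduling.
template <typename Decode>
double best_of(int runs, Decode decode) {
  double best = std::numeric_limits<double>::infinity();
  for (int i = 0; i < runs; i++) {
    auto start = std::chrono::steady_clock::now();
    decode();
    auto end = std::chrono::steady_clock::now();
    best = std::min(best, std::chrono::duration<double>(end - start).count());
  }
  return best;
}
```

For Wuffs, the callable would wrap a wuffs_aux::DecodeImage call over the in-memory bytes (for example through wuffs_aux::sync_io::MemoryInput, assuming the Wuffs version in use provides it), so that only decoding is timed.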

About /arch:AVX: it may do something, but MSVC is very good at finding reasons not to vectorize loops. The reasons can be printed with the /Qvec-report:2 option, set under C++/All options/Additional Options.

For very compressible data, the bottleneck is obviously wuffs_base__io_writer__limited_copy_u32_from_history_fast, for which the compiler reports:

1>c:\dev-c\****\imageoperations\include\imageoperations\wuffs-v0.3.h(10427) : info C5002: loop not vectorized due to reason '1301'
1>c:\dev-c\****\imageoperations\include\imageoperations\wuffs-v0.3.h(10432) : info C5002: loop not vectorized due to reason '1301'
1>

And from https://docs.microsoft.com/en-us/cpp/error-messages/tool-errors/vectorizer-and-parallelizer-messages?view=msvc-170#BKMK_ReasonCode130x , 1301 = Loop stride isn't +1.

Here is an example of code that MSVC can vectorize (if OutputType is shorter than or the same size as InputType; otherwise it fails with code 1203, but our code logic chooses an OutputType that won't overflow):

template<typename OutputType, typename InputType>
void updateBufferFromBlock(void *output, void *input, size_t n)
{
    const InputType*  pIn  = static_cast<const InputType*>(input);
          OutputType* pOut = static_cast<OutputType*>(output);

    for (size_t i = 0; i < n; i++) {
        pOut[i] += static_cast<OutputType>(pIn[i]);
    }
}

Top-down function times for the realistic dataset, compiled with /O2 /arch:AVX: https://i.imgur.com/UD5a7MF.jpg
Comparison with other decoders: https://imgur.com/a/ZEtojo9

TL;DR: either write/generate code using AVX intrinsics or don't pre-optimize it for MSVC. Windows Imaging Components seems fastest, but it works only on Windows (since Vista or 7, I don't know).

nigeltao · 2022-06-12T04:56:20Z

FWIW, this patch:

diff --git a/release/c/wuffs-unsupported-snapshot.c b/release/c/wuffs-unsupported-snapshot.c
index 717414f8..ef2105cb 100644
--- a/release/c/wuffs-unsupported-snapshot.c
+++ b/release/c/wuffs-unsupported-snapshot.c
@@ -11743,13 +11743,8 @@ wuffs_base__io_writer__limited_copy_u32_from_history_fast(uint8_t** ptr_iop_w,
                                                           uint32_t distance) {
   uint8_t* p = *ptr_iop_w;
   uint8_t* q = p - distance;
-  uint32_t n = length;
-  for (; n >= 3; n -= 3) {
-    *p++ = *q++;
-    *p++ = *q++;
-    *p++ = *q++;
-  }
-  for (; n; n--) {
+  size_t n = length;
+  for (size_t i = 0; i < n; i++) {
     *p++ = *q++;
   }
   *ptr_iop_w = p;

looks like your updateBufferFromBlock suggestion, but the benchmark results are mixed. clang11 gets worse, gcc10 gets better.

name                                              old speed     new speed     delta

wuffs_deflate_decode_1k_full_init/clang11         181MB/s ± 1%  179MB/s ± 1%  -1.36%  (p=0.008 n=5+5)
wuffs_deflate_decode_1k_part_init/clang11         215MB/s ± 0%  206MB/s ± 0%  -4.53%  (p=0.008 n=5+5)
wuffs_deflate_decode_10k_full_init/clang11        388MB/s ± 0%  362MB/s ± 1%  -6.64%  (p=0.008 n=5+5)
wuffs_deflate_decode_10k_part_init/clang11        398MB/s ± 0%  370MB/s ± 0%  -7.14%  (p=0.008 n=5+5)
wuffs_deflate_decode_100k_just_one_read/clang11   496MB/s ± 0%  489MB/s ± 0%  -1.47%  (p=0.008 n=5+5)
wuffs_deflate_decode_100k_many_big_reads/clang11  313MB/s ± 0%  302MB/s ± 0%  -3.40%  (p=0.008 n=5+5)

wuffs_deflate_decode_1k_full_init/gcc10           177MB/s ± 0%  179MB/s ± 1%    ~     (p=0.056 n=5+5)
wuffs_deflate_decode_1k_part_init/gcc10           206MB/s ± 0%  209MB/s ± 0%  +1.51%  (p=0.008 n=5+5)
wuffs_deflate_decode_10k_full_init/gcc10          384MB/s ± 0%  386MB/s ± 0%  +0.73%  (p=0.008 n=5+5)
wuffs_deflate_decode_10k_part_init/gcc10          393MB/s ± 0%  397MB/s ± 0%  +1.08%  (p=0.008 n=5+5)
wuffs_deflate_decode_100k_just_one_read/gcc10     496MB/s ± 0%  523MB/s ± 0%  +5.30%  (p=0.008 n=5+5)
wuffs_deflate_decode_100k_many_big_reads/gcc10    314MB/s ± 0%  336MB/s ± 1%  +6.96%  (p=0.008 n=5+5)

mimic_deflate_decode_1k_full_init/gcc10           229MB/s ± 1%  228MB/s ± 0%    ~     (p=0.310 n=5+5)
mimic_deflate_decode_10k_full_init/gcc10          275MB/s ± 0%  275MB/s ± 0%    ~     (p=0.310 n=5+5)
mimic_deflate_decode_100k_just_one_read/gcc10     336MB/s ± 0%  335MB/s ± 0%  -0.37%  (p=0.008 n=5+5)
mimic_deflate_decode_100k_many_big_reads/gcc10    263MB/s ± 0%  264MB/s ± 0%    ~     (p=0.310 n=5+5)

In any case, I'm not sure if AVX-ness (or not) would really help here. The destination and source byte slices can overlap, often by only a few bytes, in which case you can't just do a simple memcpy 32 bytes at a time.
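To illustrate the overlap problem, here is a minimal sketch of an LZ77-style copy from history. It is not the Wuffs implementation; both function names are hypothetical, and both assume 1 <= distance <= buf.size(). The first copies one byte at a time, which stays correct when distance < length because later reads see bytes written earlier in the same call. The second uses memcpy, but only on chunks of at most distance bytes so that source and destination never overlap.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Appends `length` bytes to `buf`, each copied from `distance` bytes back.
// Precondition (assumed, not checked): 1 <= distance <= buf.size().
void copy_from_history_bytewise(std::vector<uint8_t>& buf,
                                size_t distance, size_t length) {
  size_t start = buf.size();
  buf.resize(start + length);  // take pointers only after resizing
  uint8_t* p = buf.data() + start;
  const uint8_t* q = p - distance;
  for (size_t i = 0; i < length; i++) {
    p[i] = q[i];  // for i >= distance this reads a byte written above
  }
}

// Same result, using memcpy on chunks of at most `distance` bytes.
// Within one chunk, [q + done, q + done + n) and [p + done, p + done + n)
// are disjoint because p - q == distance >= n.
void copy_from_history_chunked(std::vector<uint8_t>& buf,
                               size_t distance, size_t length) {
  size_t start = buf.size();
  buf.resize(start + length);
  uint8_t* p = buf.data() + start;
  const uint8_t* q = p - distance;
  size_t done = 0;
  while (done < length) {
    size_t n = std::min(distance, length - done);
    std::memcpy(p + done, q + done, n);
    done += n;
  }
}
```

With a distance of only a few bytes, the chunked version degenerates into many tiny memcpy calls, which is why a plain 32-byte-wide copy does not apply directly to highly repetitive data.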

nigeltao · 2022-06-12T04:59:50Z

> WUFFS: fails (no 16bit support, converted to 8bit leaving half of output buffer empty)

Wuffs should be able to decode to WUFFS_BASE__PIXEL_FORMAT__Y_16LE or WUFFS_BASE__PIXEL_FORMAT__Y_16BE, but you have to opt into that (instead of defaulting to WUFFS_BASE__PIXEL_FORMAT__BGRA_PREMUL). If you're using Wuffs' C++ API, then that involves overriding the SelectPixfmt method (like example/sdl-imageviewer/sdl-imageviewer.cc).
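For 16-bit grayscale, each pixel takes two bytes, so the destination buffer and the code that reads it both have to account for that. Below is a minimal sketch of the size arithmetic and of reading one sample from a little-endian 16-bit gray buffer. The helper names are hypothetical, and the sketch assumes tightly packed rows (a stride of exactly 2 * width bytes), which is an assumption: a real Wuffs pixel buffer reports its own stride.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed for a width x height image at 2 bytes per pixel,
// assuming no row padding.
size_t y16_buffer_size(uint32_t width, uint32_t height) {
  return static_cast<size_t>(width) * height * 2;
}

// Reads the 16-bit gray sample at (x, y) from a little-endian buffer:
// low byte first, then high byte. Assumes tightly packed rows.
uint16_t read_y16le(const std::vector<uint8_t>& pixels,
                    uint32_t width, uint32_t x, uint32_t y) {
  size_t i = (static_cast<size_t>(y) * width + x) * 2;
  return static_cast<uint16_t>(pixels[i] | (pixels[i + 1] << 8));
}
```

For the 1984x1984 images in the first dataset, y16_buffer_size gives 7872512 bytes per image.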

I don't have a Windows machine readily available, but according to https://godbolt.org/z/q4MfjzTPh and the https://imgur.com/UD5a7MF profile mentioned in #72, this could improve inner loop performance. Updates #72

nigeltao · 2022-06-12T07:21:14Z

I don't have MSVC myself, but for those who do, I'm curious if commit c226ed6 noticeably improves PNG decode speed.

pavel-perina · 2022-06-16T16:08:43Z

> I don't have MSVC myself, but for those who do, I'm curious if commit c226ed6 noticeably improves PNG decode speed.

I'm sorry, I'm a little busy this week; hopefully I will get to this issue next week.

nigeltao · 2022-07-06T05:02:54Z

@pavel-perina any news?
