Everything in this project is authored by me (Joe Zbiciak, [email protected]), and is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license, aka. CC BY-SA 4.0.
This is a really simple benchmark of different methods for counting 1 bits in a 32-bit value.
The benchmark itself uses the IEEE CRC-32 to generate a sequence of random 32-bit values. The IEEE CRC-32 polynomial is an maximal sequence polynomial for an LFSR.
The benchmark first checks that the three computational methods (popcnt32_a
,
popcnt32_b
, and popcnt32_c
) all return the same values for the full 32-bit
input range. This takes a few seconds on a modern computer.
Next, it initializes a lookup table for a fourth version, popcnt32_z
, that
merely looks up the value in the lookup table. A fifth version, popcnt32_e
,
uses the first 256 entries of this same table for byte-at-a-time lookup.
Finally, it measures the amount of time it takes to run through the entire
2^32 - 1 sequence of the IEEE CRC-32, counting the 1 bits in each state. In
order to force the compiler to measure this, the code uses a volatile
sum
variable to sum up the 1 bits counted.
Because computing the CRC sequence also takes time, the code also measures that
as the "Null" loop. You can see from the run time that the Null loop takes
nearly as long as popcnt32_a
, popcnt32_b
, and popcnt32_c
on many systems.
Meanwhile, popcnt32_z
takes a lot longer than the others.
The following output comes from my M1 Max based MacBook Pro, compiled with
clang -mtune=native -mcpu-native -DHAVE_BUILTIN_POPCOUNT -O3 -o bit_count bit_count.c
with Apple Clang version 13.1.6 (clang-1316.0.21.2.3). (See
below for details on HAVE_BUILTIN_POPCOUNT
.)
$ ./bit_count
Testing implementations...
Errs: 0 OK: 4294967296
Initializing LUT implementation... Done.
Null: 4225023 clocks 7FFFFFFF80000000
Ver A: 7921125 clocks 1000000000
Ver B: 7544140 clocks 1000000000
Ver C: 7032806 clocks 1000000000
Ver D: 4237567 clocks 1000000000
Ver E: 5532954 clocks 1000000000
Ver Z: 35113958 clocks 1000000000
$
And this data comes from my aging AMD Phenom™ II X4 965 machine that has plenty of DDR3 DRAM. My machine wasn't swapping during this test. It just has that much less memory bandwidth than the Apple M1 Max, apparently?
$ ./bit_count
Testing implementations...
Errs: 0 OK: 4294967296
Initializing LUT implementation... Done.
Null: 8650000 clocks 7FFFFFFF80000000
Ver A: 15610000 clocks 1000000000
Ver B: 16450000 clocks 1000000000
Ver C: 13080000 clocks 1000000000
Ver E: 13240000 clocks 1000000000
Ver Z: 304970000 clocks 1000000000
If you want to try the GCC extension __builtin_popcount
, compile with
-DHAVE_BUILTIN_POPCOUNT
. To actually use your CPU's native instruction (if
it has one), you will likely need additional flags, such as -march=native -mtune=native
(GCC) or -march=native -mcpu=native
(Clang).
When built with this macro defined, Ver D
will appear in the results.
The following data is from my Phenom™ II X4 965 built this way:
$ ./bit_count
Testing implementations...
Errs: 0 OK: 4294967296
Initializing LUT implementation... Done.
Null: 8650000 clocks 7FFFFFFF80000000
Ver A: 15610000 clocks 1000000000
Ver B: 16450000 clocks 1000000000
Ver C: 13080000 clocks 1000000000
Ver D: 11330000 clocks 1000000000
Ver E: 13240000 clocks 1000000000
Ver Z: 304970000 clocks 1000000000
$
Copyright © 2023, Joe Zbiciak [email protected]
SPDX-License-Identifier: CC-BY-SA-4.0