Skip to content

Latest commit

 

History

History
161 lines (119 loc) · 4.34 KB

README.adoc

File metadata and controls

161 lines (119 loc) · 4.34 KB

NEON ARMv8 SHA3_2x

Update

This package is now support ARMv8.2-sha3 instruction The result improve significantly when use SHA-3 instruction.

Apple M1

SHA-3 Enabled
2022-05-06T06:04:41-04:00
Running ./benchmark
Run on (8 X 24.1207 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 2.07, 1.91, 1.75
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_F1600x2        156 ns          156 ns      4135039
BM_F1600          218 ns          218 ns      3166704
SHA-3 Disable
2022-05-06T06:09:10-04:00
Running ./benchmark
Run on (8 X 24.0697 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 2.40, 2.16, 1.90
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_F1600x2        355 ns          355 ns      1948260
BM_F1600          217 ns          217 ns      3220716

Anyway it’s still faster than 2 times Keccak-F1600.

NEON ARMv8 Keccak2x Implementation.

Since there is no SIMD128 for ARMv8, so I decide to implement one.

The result is not impressive, due to 2 reasons:

SHA3 uses native bit-wise operation like AND, NOT, XOR, those operation only take about 1 cycle in CPU, therefore:

  • No pipeline happens

  • No significant improvement if SIMD bitwidth is 128-bit, ARMv8 native register width is 64-bit, I suppose frequency in NEON mode is slower than Scalar mode. (I don’t know the term for this, please let me know)

This code can be faster than this benchmark if:

  • SIMD register bitwidth is wider: e.g 256, 512, …​

  • Frequency in NEON mode is at least > 0.5 * (Scalar frequency)

What is inside this package?

NEON (ASIMD) ARMv8 implementation of:

  • KeccakP-1600

  • SHAKE128 : Absorb, Squeeze

  • SHAKE256 : Absorb, Squeeze

  • SHA3_256

  • SHA3_512

Result

System Information

Here is my benchmark on ARMv8 Raspberry Pi 64-bit Majaro:

OS
Distributor ID: Manjaro-ARM
Description:    Manjaro ARM Linux
Release:        20.10
CPU
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
Vendor ID:                       ARM
Model:                           3
Model name:                      Cortex-A72
Stepping:                        r0p3
CPU max MHz:                     1900.0000
CPU min MHz:                     600.0000
Flags:                           fp asimd evtstrm crc32 cpuid

I overclocked Raspberry Pi to 1900 Mhz. The default CPU frequency is 1500 Mhz.

Result

All benchmarks were run via this command:

make all
taskset 0x1 ./benchmark_SHAKE128_256_1000.bin

taskset command pin process to only 1 CPU, avoid switching CPU cost

Table 1. Result

Output Length

Input Length

FIPS202x2 NEON

FIPS202x2 C

42

672

487

514

294

336

390

413

1008

42

586

606

2772

1008

2228

2287

3318

504

2230

2286

4074

1008

3004

3099

The result above iterate 1000 time. As set in #define TESTS 1000

You can view the full result, iterate 1,000 or 1,000,000 times in: data/

Graph

If the data/ is confuse to you, here is some graphs:

shake128

shake256

  • The orange line is the differences between C reference code and NEON implementation

  • The green line is average of 24 samples for C_ref - NEON

    • Orange: C_ref - NEON

    • Green: average of C_ref - NEON

You can notice that in some case, C Ref is better than NEON. For small output length, NEON is better than C Ref at about 5%.

Conclusion

The Keccak2x NEON version is always faster than 2 times Keccak C version. See bench() function

  • If you only call Keccak once, use C version, it’s faster

  • If you call Keccak multiple times, use NEON version, it saves sometimes.