Add code measuring CPU frequency #125

Open
Bulat-Ziganshin opened this issue May 7, 2020 · 8 comments

Comments

@Bulat-Ziganshin

Bulat-Ziganshin commented May 7, 2020

I just wrote a little snippet that measures the actual frequency of the CPU core executing it: https://encode.su/threads/3389-Code-snippet-to-compute-CPU-frequency

Please consider using it to correctly compute the number of CPU cycles spent by hash functions, instead of RDTSC, whose fakeness was discussed here a few years ago.

@rurban
Owner

rurban commented May 8, 2020

Nice, but we already have better measurements than gettimeofday.

And on Linux you can just ask the kernel. The frequency fluctuates constantly, by the way.

@erthink
Contributor

erthink commented May 8, 2020

As I wrote earlier, it seems that the best code for measuring down to individual clock cycles is inside the t1ha benchmark.

It supports x86, arm64, ppc64, s390x, e2k, ia64, etc, as well as perf_event, emscripten_get_now(), mach_absolute_time(), QueryPerformanceCounter(), read_wall_time(), clock_gettime(), gethrtime() and gettimeofday() (i.e. more than google-benchmark).
For instance see logs on Travis-CI.

I was planning to rearrange this code as a separate "mera" library, but I don't have time for this yet.
Therefore, reusing this code is not as convenient as we would like.
However, it is worth mentioning in this context.


PPC64:

Preparing to benchmarking...
 - running on CPU#10
 - use MFSPR(268) as clock source for benchmarking
 - assume it cheap and stable
 - measure granularity and overhead: 6 cycles, 0.166667 iteration/cycle

ARM64:

Preparing to benchmarking...
 - running on CPU#30
 - use CNTVCT_EL0 as clock source for benchmarking
 - assume it cheap and stable
 - measure granularity and overhead: 0.2 tick, 5 iterations/tick

S390X:

Preparing to benchmarking...
 - running on CPU#3
 - use STCKE as clock source for benchmarking
 - assume it cheap and stable
 - measure granularity and overhead: 6 cycles, 0.166667 iteration/cycle

AMD64:

Preparing to benchmarking...
 - perf_event_open(): No such file or directory
 - running on CPU#0
 - use RDTSCP as clock source for benchmarking
 - assume it cheap and floating (RESULTS MAY VARY AND BE USELESS)
 - measure granularity and overhead: 38 cycles, 0.0263158 iteration/cycle
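
The "measure granularity and overhead" step in the logs above can be approximated in a few lines. This is only a minimal sketch, not the t1ha implementation: it uses Python's `time.perf_counter_ns()` as a stand-in clock source, and the function name `timer_granularity_ns` is hypothetical.

```python
import time


def timer_granularity_ns(samples=100_000):
    """Return the smallest positive step observed between consecutive
    clock reads, in nanoseconds, or None if no positive step was seen."""
    best = None
    prev = time.perf_counter_ns()
    for _ in range(samples):
        now = time.perf_counter_ns()
        delta = now - prev
        # Ignore zero deltas: they mean the clock did not advance
        # between two reads, which says nothing about its step size.
        if delta > 0 and (best is None or delta < best):
            best = delta
        prev = now
    return best


if __name__ == "__main__":
    print("observed granularity (ns):", timer_granularity_ns())
```

The observed value includes the cost of the call itself, so it is an upper bound on the clock's true resolution rather than an exact figure.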

@Bulat-Ziganshin
Author

Bulat-Ziganshin commented May 8, 2020

It seems that you are both talking about measuring time intervals, while the code I provided is about measuring the effective CPU frequency, using any of the abovementioned ways to measure the time interval.

My point is that using rdtsc to count CPU cycles has been broken for about 10 years, because it reports cycles of a fixed base frequency (such as 2 GHz in the reports provided in the encode.su thread). So, instead, I wrote a small piece of code for which we know how many CPU cycles it takes to execute, and by measuring the time spent on it, we can easily compute the frequency. Moreover, the method works for almost any superscalar CPU.

Using this approach, we can finally report correctly how many CPU cycles are spent on each hashing operation.
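
The arithmetic behind this approach can be sketched as follows. This is not the snippet from the encode.su thread; the function names and the numbers in the example are hypothetical, and the calibration loop itself (which has to be written so that its cycle count is known) is assumed to exist elsewhere.

```python
def effective_frequency_hz(known_cycles, elapsed_seconds):
    """Frequency implied by a loop with a known cycle count
    that took elapsed_seconds of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return known_cycles / elapsed_seconds


def cycles_spent(elapsed_seconds, frequency_hz):
    """Convert a measured wall-clock interval into CPU cycles
    at the effective frequency computed above."""
    return elapsed_seconds * frequency_hz


if __name__ == "__main__":
    # Hypothetical calibration: a loop known to take 1e9 cycles ran for 0.25 s.
    freq = effective_frequency_hz(1_000_000_000, 0.25)
    # Hypothetical measurement: one hash call took 50 ns of wall-clock time.
    print(freq, cycles_spent(50e-9, freq))
```

The result is only as good as the assumption that the core ran at the same frequency during calibration and during the measured hash call.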

@rurban
Owner

rurban commented May 8, 2020

Yes, I know these loop-counting tricks from gamers, who use them to calculate the frame rate. It's a rather stable way to do it. I'll check whether rdtsc with cpuid is better or worse.

But "better" would be reading the freq from the kernel via proc.
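
Reading the frequency from the kernel could look roughly like this. It is a sketch that assumes a Linux `/proc/cpuinfo` containing `cpu MHz` lines, as on x86; some platforms do not expose this field, in which case the list is simply empty. The function names are hypothetical.

```python
import re
from pathlib import Path

# Matches lines such as "cpu MHz\t\t: 2800.000" (one per logical CPU).
_CPU_MHZ = re.compile(r"^cpu MHz\s*:\s*([0-9.]+)\s*$", re.MULTILINE)


def parse_cpu_mhz(cpuinfo_text):
    """Extract every 'cpu MHz' value from the text of /proc/cpuinfo."""
    return [float(value) for value in _CPU_MHZ.findall(cpuinfo_text)]


def read_cpu_mhz(path="/proc/cpuinfo"):
    """Return the per-CPU MHz values, or an empty list if the file
    cannot be read (for example on a non-Linux system)."""
    try:
        return parse_cpu_mhz(Path(path).read_text())
    except OSError:
        return []


if __name__ == "__main__":
    print(read_cpu_mhz())
```

The value is a snapshot taken at read time, so it may differ from the frequency the core runs at while the benchmark executes.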

rurban added a commit that referenced this issue Oct 1, 2020
not hardcoded to 3 GHz.
Some code is based on GH #125, but this result is not really good.
On linux I found an easy way.
rurban added further commits with the same message that referenced this issue on Oct 1, 2020 (twice more), Nov 26, 2020, Nov 28, 2020, Jan 21, 2021, Nov 19, 2021, Jan 27, 2022 and Apr 2, 2022.

@YellowOnion

> But "better" would be reading the freq from the kernel via proc.

Frequency switching on a modern core usually takes microseconds. AMD's Precision Boost is pretty crazy: my CPU will be anywhere between 4.5 and 5.1 GHz with single-core boost, constantly changing due to power demand etc. I kinda doubt you can get accurate readings through anything that is not atomic with the execution of the code.

Real-world time is also important, especially since AVX-512 on older Intel CPUs will clock the system down below "base" (Zen 4 doesn't have this penalty). Reporting cycles alone can hide some of the performance penalty, because a user might think 30 cycles at 2 GHz is better than 40 cycles at 3 GHz.

There are also other things to consider: I'm pretty sure some AVX units can take upwards of 200 cycles just to turn on, which might not be measured here if the unit is already hot.

https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/
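
The "30 cycles at 2 GHz versus 40 cycles at 3 GHz" comparison above can be checked with a one-line conversion. The helper name `wall_time_ns` is hypothetical and exists only for this illustration.

```python
def wall_time_ns(cycles, frequency_ghz):
    """Wall-clock time in nanoseconds for a given cycle count.
    One GHz is one cycle per nanosecond, so this is a plain division."""
    return cycles / frequency_ghz


if __name__ == "__main__":
    downclocked = wall_time_ns(30, 2.0)  # fewer cycles, lower clock
    full_speed = wall_time_ns(40, 3.0)   # more cycles, higher clock
    print(f"{downclocked:.2f} ns vs {full_speed:.2f} ns")
```

The variant with fewer cycles takes 15 ns, while the one with more cycles takes about 13.3 ns, so the lower cycle count is slower in real time.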

@darkk
Contributor

darkk commented Sep 13, 2024

I've drafted a patch that approaches the issue from a different angle.

I usually know the CPU frequency of the machine I'm working with. Also, there might be some easy way to query it. So, the frequency itself is of purely informational value to me. However, I don't always know whether the cycle counter code is correct, as #241 and #292 highlight, so some "visual control" is handy.

So, I decided to combine the tick counter and the real-time clock in ae7ccd9, which produces output similar to the following:

--- Testing xxHash32 "xxHash, 32-bit for x86" POOR

[[[ Speed Tests ]]]

WARNING: timer resolution is 158 (0x9e) ticks (0x1001f296d4a00eaa - 0x1001f296d4a00e5b). Broken VDSO?
Bulk speed test - 262144-byte keys, cache_linesize 32
Alignment   8 -  0.287 bytes/cycle -  3.482 cycles/byte -   821.65 MiB/sec @ 3 GHz -   158.85 MiB/s @ 580 MHz
Alignment   7 -  0.139 bytes/cycle -  7.204 cycles/byte -   397.15 MiB/sec @ 3 GHz -    76.78 MiB/s @ 580 MHz, -51.7% B/c
Alignment   6 -  0.139 bytes/cycle -  7.216 cycles/byte -   396.51 MiB/sec @ 3 GHz -    76.66 MiB/s @ 580 MHz, -51.7% B/c
Alignment   5 -  0.138 bytes/cycle -  7.226 cycles/byte -   395.94 MiB/sec @ 3 GHz -    76.55 MiB/s @ 580 MHz, -51.8% B/c
Alignment   4 -  0.306 bytes/cycle -  3.269 cycles/byte -   875.23 MiB/sec @ 3 GHz -   169.21 MiB/s @ 580 MHz, +6.5% B/c
Alignment   3 -  0.138 bytes/cycle -  7.226 cycles/byte -   395.92 MiB/sec @ 3 GHz -    76.54 MiB/s @ 580 MHz, -51.8% B/c
Alignment   2 -  0.138 bytes/cycle -  7.226 cycles/byte -   395.94 MiB/sec @ 3 GHz -    76.55 MiB/s @ 580 MHz, -51.8% B/c
Alignment   1 -  0.138 bytes/cycle -  7.226 cycles/byte -   395.93 MiB/sec @ 3 GHz -    76.55 MiB/s @ 580 MHz, -51.8% B/c
Alignment   0 -  0.255 bytes/cycle -  3.926 cycles/byte -   728.66 MiB/sec @ 3 GHz -   140.87 MiB/s @ 580 MHz, -11.3% B/c
Average       -  0.187 bytes/cycle -  5.361 cycles/byte -   533.66 MiB/sec @ 3 GHz -   103.17 MiB/s @ 580 MHz, ~64.0%
Best          -  0.306 bytes/cycle -  3.269 cycles/byte -   875.23 MiB/sec @ 3 GHz -   169.21 MiB/s @ 580 MHz, -54.8% B/c for worst
WARNING: $worst and $best deviate by -54.8% (> 1%). Misaligned read penalty?
	\ Try SMHASHER_ALIGNAS_STEP=2
WARNING: alignas(8) and alignas(0) deviate by -11.3% (> 1%). Insufficient alignas() granularity?
	\ Try SMHASHER_ALIGNAS_MAX=32 or SMHASHER_ALIGNAS_MAX=64
WARNING: key[262144] 0.306 B/c and key[pagesize=4096] 0.421 B/c deviate by -27.4% (> 10%). Memory wall hit?
	\ Try SMHASHER_BLOCKSIZE=$(getconf -a | awk '/[^I]CACHE_SIZE/ && $2 > 0 {print $2/2}' | shuf -n 1)
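
The columns in the output above are related by simple arithmetic, which can be used as a sanity check. This sketch is not taken from the patch; the function names are hypothetical, and because the printed bytes/cycle values are rounded to three digits, the recomputed numbers only match the table approximately.

```python
MIB = 1024 * 1024  # bytes per MiB


def mib_per_sec(bytes_per_cycle, frequency_hz):
    """Throughput in MiB/s for a given bytes/cycle rate and clock frequency."""
    return bytes_per_cycle * frequency_hz / MIB


def cycles_per_byte(bytes_per_cycle):
    """cycles/byte is the reciprocal of bytes/cycle."""
    return 1.0 / bytes_per_cycle


if __name__ == "__main__":
    # "Alignment 8" row: 0.287 bytes/cycle.
    print(round(cycles_per_byte(0.287), 3))     # table shows 3.482
    print(round(mib_per_sec(0.287, 3e9), 2))    # table shows 821.65 @ 3 GHz
    print(round(mib_per_sec(0.287, 580e6), 2))  # table shows 158.85 @ 580 MHz
```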

@rurban
Owner

rurban commented Sep 13, 2024

Our cycle counter code is correct for Intel. In fact, it is one of the only ones that is actually correct; it follows an Intel paper.

@darkk
Contributor

darkk commented Sep 13, 2024

I think so. I'm mostly focused on MIPS (which has a 32-bit cycle counter) and ARM at the moment. The output above comes from my go-to MIPS32 router and reflects its frequency correctly.

The code is basically (rdtsc2 - rdtsc1) / (timeofday2 - timeofday1), adjusted for overflows and timers. It reports 2808 MHz on my Intel laptop, which is roughly in sync with the "cpu MHz : 2800.000" from /proc/cpuinfo under the performance governor. I cannot say anything about the correctness of this number or the extra 8 MHz, as I'm not familiar with dynamic frequency scaling on Intel CPUs.
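
The formula above, including the overflow adjustment for a 32-bit counter, can be sketched like this. It is an illustration only, not the code from ae7ccd9: the function names are hypothetical, and it assumes the counter wraps at most once between the two readings.

```python
def tick_delta(start, end, counter_bits=32):
    """Ticks elapsed between two readings of a free-running counter.
    Modular subtraction handles a single wraparound."""
    return (end - start) % (1 << counter_bits)


def frequency_mhz(tick_start, tick_end, time_start_s, time_end_s, counter_bits=32):
    """(ticks2 - ticks1) / (time2 - time1), reported in MHz."""
    elapsed = time_end_s - time_start_s
    if elapsed <= 0:
        raise ValueError("time interval must be positive")
    return tick_delta(tick_start, tick_end, counter_bits) / elapsed / 1e6


if __name__ == "__main__":
    # Hypothetical readings: the counter wrapped between the two samples.
    print(tick_delta(0xFFFFFF00, 0x00000100))
    print(frequency_mhz(0, 580_000_000, 0.0, 1.0))
```

A 32-bit counter ticking at 580 MHz would wrap roughly every 7.4 seconds, so the measurement interval has to stay shorter than that for the single-wrap assumption to hold.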
