Raspberry Pi 5 4GB is faster (sometimes significantly so) than the 8GB version #1854
Comments
Have you tried with the other 64-bit kernel ( |
Yes, I've run both the Geekbench 5 and stress-ng tests with kernel8.img and get the very same low results @ 2.8 GHz on the 8GB board, unfortunately. I should perhaps also mention that all tests are run on the very same SD card/installation. I'm just moving it between the boards. So there are no setup differences that could explain the discrepancies between the boards. |
One pretty interesting comparison is the 4GB @ 2.4 GHz vs 8GB @ 2.8 GHz in Geekbench 5. See the full comparison here: https://browser.geekbench.com/v5/cpu/compare/22028225?baseline=22028307 It is interesting to note that the single-core results mostly look to be in the 8GB board's favor (which they are of course expected to be, given the higher clock frequency). One notable regression is the AES-XTS test, which is down 5% on the 8GB board. Turning over to the multi-core results, things get a bit more interesting. The overall score of the 8GB board is now 5% lower than that of the 4GB board, despite being overclocked by almost 17%. Big performance differences (-12 to -35 %) can be seen in 7 out of the 21 sub-tests. |
Can you rule out throttling being a factor? |
Yes, no throttling detected on either board (checked via vcgencmd). The 8GB board uses the active cooler and the 4GB board is in the official fan case. I've used force_turbo=1 for all of my benchmarks, to ensure consistency. I've also tried manually setting the overvoltage delta on the 8GB board to give an additional 0.05V on top of the DVFS curve, but it had no effect on the results. Which I expected, but hey, might as well try it all. 😁 I'd recommend that you run the simple stress-ng test on an 8GB board at 2.4 and 2.8 GHz and see if you can reproduce the issue. The test takes only 10 to 20 seconds to run. |
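For anyone wanting to repeat the throttling check: the comment above only says it was done via vcgencmd, so this sketch assumes the `vcgencmd get_throttled` subcommand, which prints a line such as `throttled=0x0`. The helper name `decode_throttled` is made up for illustration, and the bit meanings are the ones I understand the Raspberry Pi documentation to list.

```python
# Bit positions reported by `vcgencmd get_throttled` (assumed subcommand;
# the thread only says "checked via vcgencmd").
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "arm frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "arm frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(line: str) -> list[str]:
    """Turn a line like 'throttled=0x50005' into the list of flags that are set."""
    value = int(line.strip().split("=", 1)[1], 16)
    return [name for bit, name in THROTTLE_FLAGS.items() if value & (1 << bit)]
```

On a Pi, the input line could come from `subprocess.run(["vcgencmd", "get_throttled"], capture_output=True, text=True).stdout`. An empty list means none of the flags has been raised since boot.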
Some difference in performance between sdram devices is expected. A read on this may be useful. This is important:
so physical memory is split into pages, and pages into banks. I think Pi5 has 8 banks, so you can access 8 different pages cheaply if they are all in different banks. As soon as you access a different page in a bank, you have to close the old one, and open the new one, which is expensive. This is a form of thrashing (as you may see with a cache) and reduces your usable memory bandwidth, depending on access patterns. You could naively say, each of the 8 gigabytes is a different bank, but that wouldn't work very well at start of day when perhaps only the first GB is in use, so you effectively only have one bank in use. You might imagine a benchmark that uses 8 different buffers. If luckily, they all sit in physical memory in separate banks, you could get no page opens/closes when running and maximal memory bandwidth. If unluckily they all sat in the same bank, performance would be much worse. I believe the architecture of the 8GB sdram means different address bits are used in the bank segmentation. In theory this is not inherently worse (with my buffers arranged to be in physical memory in a way that maximises bank usage, I'm sure it would be possible to get better results from 8GB than 4GB). In practice it may be that the 4GB layout is typically better, although you tend to find the longer the kernel has been running, the more scattered the virtual->physical mapping gets, and the less likely this is to have a measurable difference. This is just speculation at what might be one of the effects at play here. It may not be the only effect. The overclocking result is more surprising. I'll see if I can reproduce. There may be ways in which multiple cores may thrash against each other more destructively when running faster compared to the fixed sdram speed. 
We can get stats out of the sdram controller that say how busy it is (ratio of active cycles to total cycles), and how efficient it is (ratio of cycles where data is actually read/written, compared to other stuff, like opening/closing pages). Or it may be something simpler. As well as the arm clock, there is a dsu clock which controls the shared L3 cache the arms use, and is clocked at about 90% of arm's speed. Need to confirm that is running as expected. |
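The page/bank argument above can be illustrated with a toy model. This is only a sketch under stated assumptions: 8 banks and 8KB pages as mentioned in the thread, one open page per bank, and a made-up address mapping in which consecutive pages rotate through the banks. The real controller mapping is not described here and will differ.

```python
def count_page_opens(addresses, num_banks=8, page_bytes=8192):
    """Count page opens for a toy SDRAM with one open page per bank.

    Assumed mapping (illustrative only): consecutive pages rotate
    through the banks, so bank = page_index % num_banks.
    """
    open_page = {}  # bank -> page index currently open in that bank
    opens = 0
    for addr in addresses:
        page = addr // page_bytes
        bank = page % num_banks
        if open_page.get(bank) != page:
            open_page[bank] = page  # implicitly closes the previously open page
            opens += 1
    return opens


def interleaved_writes(bases, length=8192, stride=64):
    """Yield addresses that touch each buffer in turn, one stride at a time."""
    for offset in range(0, length, stride):
        for base in bases:
            yield base + offset


PAGE, BANKS = 8192, 8
lucky = [i * PAGE for i in range(8)]            # eight buffers in eight different banks
unlucky = [i * PAGE * BANKS for i in range(8)]  # eight buffers that all map to bank 0

print(count_page_opens(interleaved_writes(lucky)))    # 8
print(count_page_opens(interleaved_writes(unlucky)))  # 1024
```

Both runs issue the same 1024 writes. In the lucky layout each bank opens its page once, while in the unlucky layout every single write forces a page open, which is the thrashing pattern described above.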
Thanks for the detailed response. I agree that part of the issue may be different internal RAM arrangement causing certain performance differences. As you say, though, the adverse effect of overclocking is a bit harder to explain in an intuitive way. Looking forward to your results. As it stands, I'm not sure what else I could do to help in a meaningful way, but let me know if there's anything specific you want me to check. |
I get (6.1 kernel, no display, stress-ng bogo ops/s real): 4GB part So I'm seeing the surprising 8GB + overclock result you see. |
I can confirm this: Kernel 6.6.5-v8+ aarch64 8GB (active cooled): 4GB (active cooled): |
It seems the stress-ng test we are running is basically a multi-threaded memset. I've created a simpler test that runs memset from a variable number of cores and that shows the same behaviour:
It's not unexpected that having more cores thrashing memory reduces overall bandwidth (other cores are interfering by closing pages the original core was using). In the single core case, the overclock does provide a benefit, but it appears overclocking may be giving the additional cores the ability to interfere more often. |
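The test program itself is not shown in the thread, so the following is not the author's code. It is a rough sketch of the same idea: run memset from a variable number of threads and report aggregate bandwidth. It assumes that `ctypes.memset` releases the GIL while the C call runs (ctypes does this for ordinary foreign calls, to my understanding), so the threads can write to memory concurrently. The function name and default sizes are arbitrary.

```python
import ctypes
import time
from concurrent.futures import ThreadPoolExecutor


def memset_bandwidth(threads: int, buf_bytes: int = 32 << 20, passes: int = 4) -> float:
    """Return aggregate bytes/second written by `threads` concurrent memset loops."""
    # Allocate one private buffer per thread before the timed region starts.
    bufs = [ctypes.create_string_buffer(buf_bytes) for _ in range(threads)]

    def fill(buf):
        for i in range(passes):
            # Vary the fill byte so every pass really rewrites the buffer.
            ctypes.memset(buf, i & 0xFF, buf_bytes)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(fill, bufs))  # wait for every thread to finish
    elapsed = time.perf_counter() - start
    return threads * buf_bytes * passes / elapsed
```

Calling `memset_bandwidth(n)` for n = 1..4 at each clock setting gives a table that can be compared between boards. The timing includes thread start-up, so larger buffers or more passes give steadier numbers.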
Interesting. So what are the results of the 4GB board in this test? |
Swapping to handle a large workload is going to more than negate any small speed benefit of the 4GB part. If you need more than 4GB, you need more than 4GB. |
Yes, you're a curious fellow. |
A large workload, as in, anything that needs that additional 4GB of RAM. It could be any number of things. And bear in mind not all use cases are necessarily one single task, like compiling a kernel. |
Does this mean I should refrain from overclocking my Pi 5 8GB? Or is anyone gonna fix this? Or does it mean that I should ditch my Pi 5 8GB and get a Pi 5 4GB instead? |
If all you care about is the performance of a multicore memset benchmark then yes. |
So overclocking the Raspberry Pi 5 8GB will not reduce its real-world performance, or should I refrain from overclocking? I want the best real-world performance. |
It's up to you. Pi5 is generally plenty fast enough that there is little need for overclocking. |
So then what about these benchmark results? |
@popcornmix Did you run your test program yet on the 4GB board to see how it compares? Is anyone going to look further into this from the Raspberry Pi side? I of course understand it's not going to have the highest priority. While there are some theories in this comment thread as to what might be going on, there are no real conclusions as I see it. I will admit that as a hardware guy (although not with extensive DDR RAM knowledge), I do find the behavior interesting. :-) One thing that I can add that I wasn't clear with in my initial post: Sequential RAM bandwidth and latency performance does not seem to go down with increased CPU frequency on either the 4GB or 8GB variant (it actually increases slightly on both with CPU frequency). However, with both boards running at the same frequency, the 4GB board enjoys an advantage of around 3% for read bandwidth/latency and ~4.5% for write bandwidth/latency. This corresponds pretty well with several benchmarks/apps (such as DosBox and Google Octane 2.0 browser bench) running ~3% faster on the 4GB board at default frequencies. I can't tell if this bandwidth/latency discrepancy is related to the larger performance discrepancy for certain workloads/access patterns at higher frequencies. |
One thought springs to mind. There is a config.txt option to limit the total memory.
This should limit the mem to 4G on both boards. Does this change the result? I don't have any 8G versions otherwise I would test it myself. |
@Darkflib I actually did test that a while ago, but it did not change the results. Well, at least not in terms of the major performance regression at higher frequencies. I don't think I measured pure bandwidth and latency at default frequencies to see if it closed the 3-5 % gap that exists between the boards. |
We have been investigating this. The 4GB and 8GB sdram devices have numerous differences in timings based on the spec. The 4GB setting is:
The 8GB setting is:
If you are willing to run unsafely you can switch a single board between the two modes, and typically you find the benchmark behaviour changes in the same way as swapping between a 4GB and 8GB board. (Running a 4GB board with 8GB timings is likely safe but sub-optimal. Running an 8GB board with 4GB timings is not safe, but in my testing has been reliable enough to run benchmarks. YMMV.) When doing a multicore memset type test it seems possible to get into a state where sdram thrashes (see earlier link about pages and banks), which lowers sdram efficiency. Ideally, for perfect efficiency, when you open a page you will access a page size's worth of data (I believe 8KB). But in the thrashing case the statistics suggest we are only writing about 130 bytes per page open before another core gets in and causes that page to close. This thrashing behaviour seems to occur when you approach 0% idle cycles in the sdram controller. It's pretty hard to achieve this typically, but the following effects get you closer:
I can replace the arm overclock with an sdram underclock and observe the same behaviour. We're still trying to understand the exact behaviour, but so far I don't think the dramatic change in benchmark performance will occur in any real world workload (once the arm cores do some actual processing, rather than just writing to memory, you get idle cycles in sdram controller and the thrashing behaviour does not occur). |
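To put the "about 130 bytes per page open" figure in perspective, here is a small back-of-envelope calculation. It assumes the 8KB page size mentioned above (itself given as a belief, not a confirmed value); the function name is made up.

```python
def page_utilisation(bytes_per_open: float, page_bytes: int = 8192) -> float:
    """Fraction of an SDRAM page actually written before the page is closed."""
    return bytes_per_open / page_bytes


thrash = page_utilisation(130)  # figure quoted for the thrashing case
print(f"{thrash:.1%} of each page used per open")                          # 1.6%
print(f"about {1 / thrash:.0f} opens to write one page's worth of data")   # about 63
```

In other words, under these assumptions the thrashing case pays for roughly 63 page opens where the ideal case would pay for one.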
Thank you for the elaborate response. I did a quick test of the two suggested timing settings on my 8GB board, using Geekbench 5. These tests were run on a fully updated installation of Ubuntu 23.10, as that's the only "throwaway" installation I had up and running. The two tests below were both run at 2.8 GHz, with only the tRFC setting changed between them: https://browser.geekbench.com/v5/cpu/compare/22120461?baseline=22120453 As can be seen, the multi-core composite score increases by 7% and the "Text Rendering" sub-test increases by 59%. As you say, the scores with the tighter timing seem to fall in line with the 4GB board. I will experiment with this a bit more on a Raspberry Pi OS installation once I have the time. Nice to know some more details about what causes these effects, even if the behavior is all within spec and not risk-free to mitigate. |
I bet these issues should be resolved soon enough, as (hopefully) the devs at Raspberry Pi will see this. Seems to be a big flaw, however. It might be a firmware issue, but I suspect that it might also have to do a lot with the hardware, so possibly a new revision (i.e., “Raspberry Pi 5 Model B Revision 2.0”) will solve this. |
Two Pi engineers have actually posted in this thread (three now!), and have explained what is causing the results. |
Actually we haven't, because it is still actively being investigated, and the mechanism behind the slow down under benchmark conditions is not yet understood. However, we are really keen to find the explanation. |
Soz! |
Great! Will test it later today! |
I should have searched reviews first, but I just went for the 8GB Pi 5. I never thought memory could play such a big role in overclocking. I can't get mine to run above 2.7GHz while most people can push the 4GB variant above 3GHz. Should I just get the 4GB and use that as my main? |
@senothechad - I have a few 8 GB Pi 5 and some work at 3.0 GHz, others are more stable at 2.8... and a few can only get to 2.6 before they get flakey. I don't think the memory issue here has as much to do with overclock-ability. |
I agree with @geerlingguy. I'm pretty sure that it's more about the silicon lottery than anything when it comes to overclocking. |
Indeed there is no official support for overclocking and Pi5 is a 2.4GHz product. |
Is this really just a matter of silicon lottery, or does the type of memory, in this case 4GB and 8GB variants actually play a role? |
Max ARM CPU speed is purely silicon lottery. Previously, 4GB / 8GB made a difference in terms of the performance for memcpy-like benchmarks that you get at overclocked speeds, but following this PR to increase available memory bandwidth it's less of a difference. |
Note the chip that you overclock (with You could theoretically remove a 4GB SDRAM chip and replace with an 8GB SDRAM chip and the BCM2712 will still overclock in exactly the same way. There is zero correlation between how far you can overclock and whether you have a 4GB or 8GB SDRAM. |
Got it. Though is there any way I can bypass voltage limitations, maybe even with separate hardware? 1.0V is the maximum I can set but this chip could technically handle a little bit more voltage. Just want to try everything for that extra juice. |
No |
What if I didn't care about the risks of destroying it? |
* Adjust the SDRAM refresh interval based on the temperature. This addresses the gap in performance between the 8GB and 4GB variants. See raspberrypi/firmware#1854
* Preliminary support for signed boot.
…river See: raspberrypi/firmware#1854 firmware: arm_loader: mailbox: Optionally return extended board rev See: raspberrypi/firmware#1831 firmware: arm_loader: Set dma-channel-mask as well as brcm,dma-channel-mask firmware: board_info: Add Compute Module 5 model info string
The adjustment to reduce sdram refresh where possible (which gives a performance benefit) that was done for pi5 has now been applied to pi4. On pi4 this is implemented in start4.elf firmware, rather than bootloader. Should be no obvious change in behaviour, except sdram bandwidth limited tasks may be a few percent faster on Pi4. |
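A rough way to see why refreshing less often helps: while a refresh command is in progress (for a time tRFC) the memory cannot serve accesses, and such a command is issued once per refresh interval (tREFI). The busy fraction is therefore roughly tRFC / tREFI, and stretching the interval shrinks it proportionally. The sketch below is a back-of-envelope model only. The timing values are hypothetical placeholders, not taken from this thread or from a datasheet, and it ignores per-bank refresh and controller scheduling.

```python
def refresh_overhead(trfc_ns: float, trefi_ns: float, interval_multiplier: float = 1.0) -> float:
    """Approximate fraction of time the SDRAM spends refreshing instead of serving accesses."""
    return trfc_ns / (trefi_ns * interval_multiplier)


# Hypothetical timings for illustration only (NOT from the thread or a datasheet).
TRFC_NS = 380.0
TREFI_NS = 3900.0

for multiplier in (1.0, 2.0, 4.0):
    overhead = refresh_overhead(TRFC_NS, TREFI_NS, multiplier)
    print(f"refresh interval x{multiplier:g}: {overhead:.1%} of time spent refreshing")
```

The same formula also shows why a part with a longer tRFC loses a little more bandwidth at the same refresh interval.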
Interesting changes since the last automatic update:
* Enable network install
* Enable over-clocking frequencies > 3GHz See: https://github.com/raspberrypi/firmware/issues/1876
* Adjust SDRAM refresh rate according to temperature and address a performance gap between 4GB and 8GB parts in benchmarks. See: raspberrypi/firmware#1854
* Support custom CA certs with HTTPS boot
* Move non Kernel ARM stages back to 512KB raspberrypi/firmware#1868
* Assorted HAT+ and NVMe interop improvements.
* Fix TRYBOOT if secure-boot is enabled.
* Preliminary support for D0 and CM5.
We have been in contact with Micron, and there is some good news. This should remove the lower performance observed with 8GB Pi5 compared to 4GB. The latest Pi 5 bootloader firmware contains this update (and can be obtained with rpi-update). |
@popcornmix That's great. I ran some tests on my 4GB and 8GB boards, using the same image. On average, there were no particular performance discrepancies in Geekbench 5, Passmark or Google Octane v2. There is some variability in the scores between runs and typically the 4GB peaks a bit higher, though. The only test where there's still a bit more of a difference appears to be Geekbench 6. It's not something I personally would worry about, but reporting it for completeness: https://browser.geekbench.com/v6/cpu/compare/7160591?baseline=7159462 |
It's interesting that the single core GB6 results also show the difference. The scores for tests with little sdram bandwidth tend to get 4x with multicore. Ray Tracer is a good example of this with 3.98x. HTML5 Browser is a surprising result. It is somewhat sdram limited with a multicore speedup of 1.64x. If it is sdram causing this difference (which seems a reasonable conclusion based on the sdram size being the main differentiator) it seems surprising that the difference is much greater in the less loaded, single core sdram case. It would be interesting if you are able to run this a couple more times and confirm if the 17% vs 1.7% numbers are consistent across runs. |
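The multicore speedups quoted above can be turned into a scaling efficiency to make the "sdram limited" reading more concrete. This is just arithmetic on the two figures from the comment, assuming the Pi 5's four cores; the function name is made up.

```python
def parallel_efficiency(speedup: float, cores: int = 4) -> float:
    """Multicore speedup as a fraction of ideal linear scaling."""
    return speedup / cores


for name, speedup in (("Ray Tracer", 3.98), ("HTML5 Browser", 1.64)):
    print(f"{name}: {parallel_efficiency(speedup):.1%} of ideal 4-core scaling")
```

A value near 100% suggests the cores rarely wait on shared resources, while a value well below that is consistent with (though not proof of) a memory bandwidth bottleneck.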
Thinking a bit more, the sdram is dual rank, with the top address bit choosing the rank. So a test that spans ranks (i.e. top and bottom halves of memory) may be beneficial. We know that GB6 fails to run on a 2GB Pi with out-of-memory, and does run on a 4GB Pi. We have actually been working on a scheme to try to avoid this type of inefficient behaviour due to precise locations of buffers. See raspberrypi/linux#6273. If you are feeling bold, then
should get you the numa feature which should mitigate some of these effects (and produce better benchmark scores, and performance of the system in general). Perhaps give it a go. |
Just on a side note: with this new firmware, my Pi5 now runs completely stable at 3 GHz; before, 2.9 GHz was the maximum for this very board of mine. |
I've tested the emulated NUMA feature now and it's highly effective. The performance of both the 4GB and 8GB models is improved, but the 8GB significantly more so. With the NUMA feature enabled, both boards perform in a very consistent manner and there is no measurable performance difference between them. The GB6 results are shown below. I did 5 runs on each configuration. The non-NUMA results are with the latest firmware that uses the same memory timings for the 4GB and 8GB. force_turbo=1 was used. The results match with what you've seen, i.e. +6% in single and +18% in multi composite scores. Below is a comparison between non-NUMA and NUMA on the 8GB board. As can be seen, several sub-tests increase by ~30%. There are no regressions. https://browser.geekbench.com/v6/cpu/compare/7189117?baseline=7173406 Out of interest, are you able to give any more background to this? It sounds like this would be something that could affect other platforms as well. Would this be beneficial for non Pi hardware or is it dealing with some peculiarity in your SoCs? And is this not needed or already mitigated on the x86 side? EDIT: Are there any known negatives of applying this emulated NUMA feature? EDIT 2: Running some other stuff as well, performance seems to be improved overall. Geekbench 5 is also much faster. Google Octane v2 web browser test is now at over 31k, where previously it was 29k. I have some older results for the WebGL Aquarium test, so not sure if something else has changed, but it now runs at 60 FPS with 1000 fish at 1080p fullscreen. Previously it was ~38 FPS. |
Good to hear you are seeing the benefit. Every test I've run performs better with NUMA (the extent of the gain depends on how sdram bandwidth constrained the use case was to begin with). Pi4 and Pi5 show similar benefits. There is currently a minor issue with NUMA that means the CMA size is limited to a single NUMA region. Pi5 doesn't really need CMA, as it has MMUs in front of all hardware blocks (e.g. hevc or hvs), so can use system memory. Pi4 doesn't have the MMUs, so does still need CMA (and 512M may be needed for 4K hevc playback). A solution for CMA + NUMA is being investigated (and once found will hopefully lead to the PR being merged). I'd expect it to affect some other platforms. I think it will help any "obvious" implementations of sdram controller. Some sdram controllers have some cleverness in, where they scramble the address lines coming in to break up the structure (note, it needs more than just reordering address lines), and they are effectively doing a similar thing to the NUMA interleave in hardware. I'd expect most x86 sdram controllers to be pretty clever. |
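As a rough illustration of why interleaving across emulated NUMA regions might spread traffic, here is a toy model. It is not how the kernel patch works internally. It assumes an 8GB part whose top address bit selects the rank (as stated earlier in the thread), four equally sized emulated nodes, and a policy that hands out consecutive pages round-robin across nodes. The node count, page size and all function names are assumptions for illustration.

```python
GIB = 1 << 30
PAGE = 4096  # assumed OS page size


def rank_of(phys_addr: int, total_bytes: int = 8 * GIB) -> int:
    """Top address bit selects the rank: 0 for the bottom half, 1 for the top half."""
    return int(phys_addr >= total_bytes // 2)


def contiguous_pages(n_pages: int, base: int = 0) -> list[int]:
    """Physical addresses when a buffer sits in one contiguous run of memory."""
    return [base + i * PAGE for i in range(n_pages)]


def interleaved_pages(n_pages: int, nodes: int = 4, total_bytes: int = 8 * GIB) -> list[int]:
    """Toy interleave policy: page i is placed in emulated node i % nodes."""
    node_bytes = total_bytes // nodes
    return [(i % nodes) * node_bytes + (i // nodes) * PAGE for i in range(n_pages)]


def rank_histogram(addresses) -> list[int]:
    """Count how many pages land in each of the two ranks."""
    counts = [0, 0]
    for addr in addresses:
        counts[rank_of(addr)] += 1
    return counts


print(rank_histogram(contiguous_pages(1024)))   # [1024, 0]
print(rank_histogram(interleaved_pages(1024)))  # [512, 512]
```

In this model a contiguous 4MB buffer sits entirely in one rank, while the interleaved layout splits it evenly across both, which is the kind of spreading the comment above attributes to clever hardware address scrambling.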
Thanks for the additional info. Also, I retested the WebGL Aquarium sample. Turns out it runs equally well without NUMA. Maybe some other SW optimization since I last tested? However, I can confirm that compilation runs faster. Compiling dosbox-staging is 10.5 % faster on the 8GB board with NUMA enabled. |
Describe the bug
The Raspberry Pi 5 4GB performs slightly better (0-10%) than the 8GB version at default 2.4 GHz and the gap widens to >100% for certain workloads when overclocked. These workloads will see a dramatic reduction in performance when overclocking the 8GB board. This reduction in performance is not present at all on the 4GB board.
It's unclear to me whether the small performance difference at default clock frequency has the same root cause as the more dramatic one that emerges as the ARM core frequency is increased. For the time being I'm treating them as related and reporting on both in this issue.
To reproduce
I have found two workloads in particular that expose the issue: Geekbench 5 "Text Rendering" multi-core sub-test and stress-ng "numa" stressor. To reproduce this issue, I suggest benchmarking the 4GB and 8GB boards at both 2.4 GHz and 2.8 GHz.
Geekbench 5 is available here: https://www.geekbench.com/preview/
To run stress-ng with profiling info:
sudo apt install stress-ng linux-perf linux-perf-dbgsym
sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'
stress-ng --numa 4 --numa-ops 1000 --metrics --perf
Expected behaviour
Ideally the 4GB and 8GB boards would perform close to the same. The smaller difference at stock frequencies could be deemed normal/expected (for example due to different RAM ICs being used), but the dramatically increasing gap in performance as ARM core frequency increases suggests something may be misbehaving.
Actual behaviour
The 4GB board is anywhere from a few percent to more than 100% faster than the 8GB board, depending on clock frequency and workload. Below is a summary of tests I've run. As can be seen, the 4GB uses Samsung RAM and the 8GB uses Micron RAM.
These benchmark results are completely reproducible. I've also looked at other people's submission of Geekbench 5 results and can see the same reduction in "Text Rendering" scores on overclocked 8GB boards (but not on overclocked 4GB boards), so this is not limited to my specimen.
Below are the Geekbench 5 results at 2.4 and 2.8 GHz for the runs listed in the table above.
4GB (2400 MHz): https://browser.geekbench.com/v5/cpu/22028307
8GB (2400 MHz): https://browser.geekbench.com/v5/cpu/22028116
4GB (2800 MHz): https://browser.geekbench.com/v5/cpu/22028479
8GB (2800 MHz): https://browser.geekbench.com/v5/cpu/22028225
Below is the "perf" tool output for the stress-ng runs at 2.4 and 2.8 GHz for both boards:
4GB (2.4 GHz):
8GB (2.4 GHz):
4GB (2.8 GHz):
8GB (2.8 GHz):
The 8GB 2.8 GHz result sticks out when compared to the same board running at 2.4 GHz, due to:
Finally, I should mention that RAM bandwidth and latency tests do not show any issues.
System
raspinfo output: https://gist.github.com/Brunnis/4d8242cf757f28e1d5331b3f73b3a446