TMU can't reliably read from MMAL camera frames on the RPi Zero #1
Comments
I made some changes to make debugging easier and did a lot of tests. With frame size 384x288 and no split column code, there are now exactly three programs scheduled to run each frame.
Test command:
Code 1
The reference debug code, taken straight from debug_tiled; it works as expected.
Code 2
Simple TMU testing code. The TMU value is not used, and the output is random VPM content.
(*) It reliably stalls after several seconds to several tens of seconds. One time, a single QPU recovered nearly instantly, and the code continued to run on that single QPU (showing the same random VPM pattern three times), running reliably on it for several minutes. Usually, though, none of the three QPUs ever recover, even after minutes, and the QPU keeps stalling. However, it is NOT a hard stall: re-enabling the QPU causes all QPUs to unstall.
Code 3
Same as Code 2 but with a mutex around the TMU access. This shouldn't be needed, but it does change some things.
(**) It reliably stalls after only a few frames, so even faster than without the mutex. Sometimes, once it stalls, the code breaks in an interesting way: instead of straight blocks, it outputs angled stripes. This happens when the tgtStride is exactly 16 too low. Now I identified the
Code 4
Same as Code 2, but now the TMU value is read into r0 and then written to the VPM, without unpacking.
(***) Just like Code 2, it reliably stalls after several seconds to several tens of seconds.
Code 5
Same as Code 4 but ONE
(****) It reliably stalls, anywhere from before the first frame even finishes (only 100-200 instructions executed on the QPU) up to a very few frames in (a few 100000s of instructions). After it stalls, the image might appear fine, sometimes artifacts occur, and sometimes it generates a ton of errors. These may be due to memory being overwritten, but they look somehow different, and they spam both the SSH console and the HDMI console output.
Code 6
Same as Code 4, but now the TMU value is unpacked into several ra registers and then written to the VPM. This results in different timing and behaves exactly like Code 5, except that it outputs proper grayscale images.
TL;DR
So this is a lot of information, but here is what I read from it:
VPM is not the problem. The TMU access causes the stalls, which last until the QPU is re-enabled.
The timing of the code AROUND the TMU access is incredibly sensitive: small changes cause different stall and error behaviour.
It does matter which QPUs, and thus which TMUs, are used simultaneously; this affects the stall behaviour.
Adding mutexes around the TMU code introduces new errors on stall, seemingly by messing up the code timing, and generally causes it to stall even faster, likely due to the timing changes also observed with Code 5 and 6.
Next tests
The next tests include making the frame size even smaller, using a constant image, and not clearing the cache, so that the TMU always has a cache hit.
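The angled stripes reported for Code 3 are what a stride mismatch generally looks like, and a small model makes the geometry concrete. The sketch below is only an illustration, not code from the project: it assumes a hypothetical true stride of 384 units (borrowed from the 384x288 frame width, with one unit treated as one pixel) and a writer that advances by a stride 16 units too low. It computes where the start of each written row lands when the buffer is displayed with the true stride.

```python
def landing_position(y, true_stride, write_stride):
    """Return (row, col) at which written row `y` starts when displayed.

    The writer advances `write_stride` units per row, while the display
    interprets the same flat buffer with `true_stride` units per row.
    """
    offset = y * write_stride           # flat offset chosen by the writer
    return divmod(offset, true_stride)  # where the display places that offset


if __name__ == "__main__":
    TRUE_STRIDE = 384                  # assumption: stride equals frame width
    WRITE_STRIDE = TRUE_STRIDE - 16    # the "exactly 16 too low" case
    for y in range(4):
        print(y, landing_position(y, TRUE_STRIDE, WRITE_STRIDE))
```

In this model each written row starts 16 units further left than the previous one (wrapping onto the preceding display row), so vertical block edges turn into diagonals, and after 24 rows the output has slipped by one full display row.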
Fixed buffer locking, caused crashes described in #1
Still stalling even at 480p@30 after a couple of seconds
However the camera-emulation mode works fine even at 480p@250 so this bug is likely to be in the camera code
However, increasing QPU program count from 5 to 10 by enabling split columns makes it stall faster, so there might be more to it still
So it turns out the main error seen here wasn't in the QPU program code at all. To Do:
So I tested the above ideas, and the results are not very helpful. I'm thinking about experimenting with the frames a bit further when I get time, namely:
So I tried 1. and 2. so far, no difference.
Just tested on a Raspberry Pi 3 B+ and it worked without a problem...
Added qpu_mask_tiled in the meantime, works fine on the 3 B+, but stalls immediately on the Zero. Example from minimal code (can only execute qpu_mask_tiled):
Reduced the test case even further, to a program that works on the 3 B+ but stalls on the Zero (although a lot of information is lost compared to the above branch, so I've separated it into qpuminimal).
For completeness' sake: I currently circumvent this bug in my use case by first copying the camera buffer to a custom VCSM buffer before processing it with the TMU. The copy is done with the VPM DMA Write and Read, which have no problem accessing the camera frame buffers. Unfortunately this nearly doubles the frame times on the QPU. For the actual algorithm, though, the IO (TMU+VDW) and the computation were well balanced: the QPU was never starved of input (no stalls), but the IO capacity of the TMUs was pretty much exhausted. Redesigning it with VDR as input would reduce the total frame time by making the workaround unnecessary, but it would be slower than TMU+VDW alone, so I am still interested in fixing this issue.
So I have finally circumvented this bug, within a couple of hours, using an idea I had had for months but only now got around to implementing.
Tiled rendering consists of multiple programs each accessing the TMU for reading and their own dedicated space in the VPM for writing.
qpu_debug_tiled demonstrates the tiling pattern, and VPM writing. All QPUs can simultaneously use their part of the VPM to write values. This works fine, even without mutex synchronization.
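As a rough picture of the tiling, the program counts quoted elsewhere in this thread (3 at 384x288, and 5 at 480p rising to 10 with split columns) are consistent with one program per 128-pixel-wide column, if 480p is taken to mean 640x480. That column width, the `ProgramTile` type, and the idea that split columns halve each column vertically are all assumptions made for this sketch; none of them are stated in the thread or taken from the project's code.

```python
from dataclasses import dataclass

COLUMN_WIDTH = 128  # assumption inferred from the program counts, not confirmed


@dataclass(frozen=True)
class ProgramTile:
    """Hypothetical description of the image region one QPU program covers."""
    index: int
    x_start: int
    x_end: int    # exclusive
    y_start: int
    y_end: int    # exclusive


def plan_programs(width, height, split_columns=False):
    """Split a frame into per-program tiles, one column (or half column) each."""
    if width % COLUMN_WIDTH != 0:
        raise ValueError("width must be a multiple of COLUMN_WIDTH in this sketch")
    parts = 2 if split_columns else 1  # assumption: a split halves each column
    tiles = []
    for col in range(width // COLUMN_WIDTH):
        for part in range(parts):
            tiles.append(ProgramTile(
                index=len(tiles),
                x_start=col * COLUMN_WIDTH,
                x_end=(col + 1) * COLUMN_WIDTH,
                y_start=height * part // parts,
                y_end=height * (part + 1) // parts,
            ))
    return tiles


if __name__ == "__main__":
    for tile in plan_programs(384, 288):
        print(tile)
```

Because the tiles are disjoint, each program can own a separate output region, which matches the observation that the programs can write simultaneously without mutex synchronization.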
qpu_blit_tiled is structurally exactly the same, but adds a TMU load and writes the loaded value instead of the debug pattern. However, just adding the TMU loading instructions breaks the program. Commenting them out and writing debug values makes the program work again. From what I gathered, timing is not the issue: putting some nop operations in place of the TMU access doesn't trigger the behaviour.
Executing the programs one right after another, however, works fine, so the functionality itself is fine.
So I tried adding a mutex to synchronize the QPUs at several different granularities: whole program, each line, each VPM access. The whole-program mutex works, but only at low framerates (e.g. 10). Without the mutex, that would break, which indicates the mutex does work to a degree. However, when the framerate is increased, the QPUs quickly (after a few frames) start overwriting the whole memory for no apparent reason.
Mutex synchronization on each line, or even on each VPM access, seems to only worsen this behaviour.
So there are three parts to this problem that I do not understand:
Any help is greatly appreciated. The referenced programs can be easily tested out with the commands found in commands.txt