Debug aes-fvsr-key-batch timing issues #83
Comments
I have investigated the issue and I guess I found the root cause. We have two consecutive serial write commands in the `capture_aes_fvsr_key_batch()` function in the `capture.py` script:
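(The original snippet isn't reproduced here; below is only a rough sketch of the pattern being described. The command characters, payload names, and exact calls are illustrative placeholders, not necessarily what `capture.py` uses.)

```python
# Illustrative sketch only: two back-to-back simpleserial writes, where the
# second packet can arrive while the device is still busy with the first
# operation. Command characters and payloads are placeholders.
target.simpleserial_write('s', prng_seed)      # e.g. start seeding the PRNG
target.simpleserial_write('b', num_segments)   # sent immediately afterwards;
                                               # may be missed by the firmware
```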
In the device firmware `aes_serial.c`, we poll for incoming packets and call the relevant handler function.
After starting the first operation, in this example seeding the PRNG, the second data packet may be sent before the first operation is completed, and because of that it may be missed by the firmware. (Actually, I would expect some kind of buffer to hold incoming serial packets, but there doesn't seem to be any.) We can avoid this issue simply by placing delays between the `simpleserial_write()` calls. We already do that unintentionally when we wait for the result of an operation after starting it with an encrypt request. However, the communication between the computer and the device is not entirely deterministic, so a fixed delay may not be a reliable solution. Maybe it would be better to wait for a success message from the device after seeding the PRNG, setting the key, etc., but not after starting an encryption, because we wait for the result anyway. Waiting for a success message after initiating an encryption request as well might decrease performance. What do you think?
simpleserial has an "ack" command that helps with this, and it can be used to delay the protocol while a long-running process happens. Checking the code, I think the status is only sent on failure by default in the low-level code. In the original CW you can see the status return here. If you try inserting a status print after this line, it may work there:
Then to force this to wait:
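A minimal sketch of what that could look like on the capture side, assuming the ChipWhisperer `SimpleSerial` target API (`simpleserial_write`, `simpleserial_wait_ack`, `simpleserial_read`) and assuming the firmware actually sends the SimpleSerial 'z' status packet after each command; command characters and variable names are placeholders:

```python
# Sketch: block on the simpleserial ack after each setup command, so the next
# write is only sent once the device has finished the previous operation.
# Assumes the firmware sends the 'z' status packet; command characters are
# placeholders for the real ones used by the AES target.
target.simpleserial_write('s', prng_seed)
if target.simpleserial_wait_ack(timeout=500) is None:   # waits for 'z' status
    raise RuntimeError("no ack after seeding the PRNG")

target.simpleserial_write('k', key)
if target.simpleserial_wait_ack(timeout=500) is None:
    raise RuntimeError("no ack after setting the key")

# No explicit wait after the encrypt request: reading the ciphertext result
# already blocks until the device is done.
target.simpleserial_write('p', plaintext)
ciphertext = target.simpleserial_read('r', 16)
```

This keeps the extra round trips on the comparatively rare setup commands and leaves the per-encryption path unchanged.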
I would do similar things on any other write (you can see this is what we do, for example, in this key write function). I haven't been building for OT for a bit, so sorry I can't test quickly here, but I will try to validate. However, I wanted to post this now to avoid someone chasing this down, since I think the issue is relatively simple.
Thanks @colinoflynn! I will try this and create a PR.
Although I had no problems with the …, the behavior was a little bit different and more un(?)predictable. It always failed on my machine, but worked on @wettermo's machine. Further, we needed the … . Does it make sense to add an …?
In my opinion, adding …
Can we estimate the impact on the capture rate?
I am not sure if we can estimate it. Maybe we can try and see :) I agree with you: we may not have a problem now without an ack after every command, but that does not mean we won't, I guess.
I'm working on this issue. EDIT: |
Interestingly, I haven't yet observed any such issues while batch capturing on husky and waverunner using the scripts here: #194 |
Since we will replace the communication layer for post-silicon to unify with the other post-silicon SiVal testing, we may not want to spend much effort on fixing this? :) Generally, I am fully in favor of more acks and more robust communication, even if it sacrifices some speed. We need correct trace-data / text pairs.
PR #82 added the corresponding target binaries to support the `aes-fvsr-key-batch` command. For most of us this seems to work just fine, but I saw lots of errors like:
In words, there are sometimes mismatches between the expected and received ciphertexts. Sometimes the failure occurs for the first batch, sometimes for a later batch, and sometimes it doesn't occur at all. The failures seem to depend on timing (adding some sleep commands on the Python side seems to help), on the Husky firmware (the latest firmware seems to be more affected), and maybe also on the USB connection setup (docking station, hub, or laptop directly).
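For context, the mismatch in question is essentially a comparison of the ciphertext reported by the device against one computed in software. A minimal sketch of such a check, assuming pycryptodome; the actual capture script may implement this differently:

```python
# Sketch of an expected-vs-received ciphertext check (assumes pycryptodome).
from Crypto.Cipher import AES

def check_ciphertext(key: bytes, plaintext: bytes, received: bytes) -> None:
    expected = AES.new(key, AES.MODE_ECB).encrypt(plaintext)
    if expected != received:
        raise RuntimeError(
            f"Ciphertext mismatch: expected {expected.hex()}, got {received.hex()}"
        )
```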
We should root-cause and fix the problem; otherwise we can't reliably do long-running captures. Imagine we collect 10 million traces and get such a failure after 2 hours: all traces would be lost.