Debug aes-fvsr-key-batch timing issues #83

Open
vogelpi opened this issue Jun 20, 2022 · 10 comments

@vogelpi
Collaborator

vogelpi commented Jun 20, 2022

PR #82 added the corresponding target binaries to support the aes-fvsr-key-batch command. For most of us this seems to work just fine, but I saw lots of errors like:

Fixed key: b'811e3731b0120a7842781e22b25cddf9'
Connected to ChipWhisperer (num_samples: 1200, num_samples_actual: 1200, num_segments_actual: 1)
Capturing:  16%|███▋                   | 162/1000 [00:00<00:01, 492.97 traces/s]
Traceback (most recent call last):
  File "./capture.py", line 771, in <module>
    app()
 ...
  File "./capture.py", line 502, in aes_fvsr_key_batch
    capture_aes_fvsr_key_batch(
  File "./capture.py", line 460, in capture_aes_fvsr_key_batch
    check_ciphertext(ot, expected_last_ciphertext, 4)
  File "./capture.py", line 221, in check_ciphertext
    assert actual_last_ciphertext == expected_last_ciphertext[0:ciphertext_len], (
AssertionError: Incorrect encryption result!
actual: bytearray(b'^\xdc\xe8{')
expected: [ 85 154 226   3  30  12  31 147  54 197 151  35   4 134  80  98]

In other words, there are sometimes mismatches between the expected and received ciphertexts. Sometimes the failure occurs for the first batch, sometimes for a later batch, and sometimes it doesn't occur at all. The failures seem to depend on timing (adding a sleep on the Python side seems to help), on the Husky firmware version (the latest firmware seems to be more affected), and maybe also on the USB connection setup (docking station, hub, laptop directly).

We should root-cause and fix the problem. Otherwise we can't reliably do long-running captures. Imagine we collect 10 million traces and get such a failure after 2 hours - all traces will be lost.

@abdullahvarici

abdullahvarici commented Aug 17, 2022

I have investigated the issue and I believe I have found the root cause.

We have two consecutive serial write commands in the capture_aes_fvsr_key_batch() function of the capture.py script:

ot.target.simpleserial_write("s", capture_cfg["batch_prng_seed"].to_bytes(4, "little"))
ot.target.simpleserial_write("t", key_fixed)

In the device firmware aes_serial.c, we poll for incoming packets and call the relevant handler:

while (true) {
    simple_serial_process_packet();
}

After starting the first operation, in this example seeding the PRNG, the second data packet may be sent before the first operation completes, and it may therefore be missed by the firmware. (I would actually expect some kind of buffer to hold incoming serial packets, but there doesn't seem to be one.)

We can avoid this issue by placing delays between simpleserial_write() calls. We already do that unintentionally whenever we send an encrypt request, because we then wait for the result of the operation before issuing the next command.
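
For illustration, a minimal sketch of that delay-based workaround (the 50 ms value is a guess, not a measured bound):

import time

ot.target.simpleserial_write("s", capture_cfg["batch_prng_seed"].to_bytes(4, "little"))
time.sleep(0.05)  # guessed delay; gives the firmware time to finish handling "s"
ot.target.simpleserial_write("t", key_fixed)
time.sleep(0.05)  # same guess before sending the next command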

However, the communication between the computer and the device is not entirely deterministic, so inserting a fixed delay may not be a reliable solution. It might be better to wait for a success message from the device after seeding the PRNG, setting the key, etc., but not after starting an encryption, because we wait for the encryption result anyway. Also waiting for a success message after issuing an encryption request would only reduce performance.

What do you think?

@colinoflynn
Contributor

simpleserial has an "ack" command that helps with this, and can be used to delay the protocol while a long-running process happens.

Checking the code, I think the status is only sent on failure. By default, the low-level simpleserial_write doesn't wait for the ack to return; only simpleserial_read does that.

In the original CW you can see the status return here.

If you try inserting a status print after this line it may work there:

simple_serial_send_status(0);

Then to force this to wait:

ot.target.simpleserial_write("s", capture_cfg["batch_prng_seed"].to_bytes(4, "little"))
ot.target.simpleserial_wait_ack()
ot.target.simpleserial_write("t", key_fixed)

I would do similar things on any other write (you can see this is what we do, for example, in this key write function).

I haven't been building for OT for a bit, so sorry I can't test quickly here, but I will try to validate. I wanted to flag this anyway to save someone from chasing this down, since I think the issue is relatively simple.

@abdullahvarici

Thanks @colinoflynn! I will try this and create a PR.

@m-temp
Collaborator

m-temp commented Nov 2, 2022

Although I had no problems with the kmac-fvsr-key-batch, I ran into similar problems while developing the sha3-fvsr-data-batch.

However, the behavior was a little different and even less predictable: it always failed on my machine, but worked on @wettermo's machine. Furthermore, we needed an ot.target.simpleserial_wait_ack() after ot.target.simpleserial_write("b", scope.num_segments_actual.to_bytes(4, "little")).

Does it make sense to add an ot.target.simpleserial_wait_ack() after each and every command that does not provide an immediate response? I can have a look at that after finishing the sha3-fvsr tests.
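
One way to do that systematically would be a small wrapper like the sketch below. write_and_wait() is a hypothetical helper, not something that exists in capture.py today, and it only helps once the firmware actually sends a status byte for each of these commands:

def write_and_wait(target, cmd, data):
    # Send the command and block until the target acknowledges it, so the
    # next simpleserial_write() cannot race the firmware's packet handler.
    target.simpleserial_write(cmd, data)
    target.simpleserial_wait_ack()

write_and_wait(ot.target, "s", capture_cfg["batch_prng_seed"].to_bytes(4, "little"))
write_and_wait(ot.target, "t", key_fixed)
write_and_wait(ot.target, "b", scope.num_segments_actual.to_bytes(4, "little"))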

@abdullahvarici

In my opinion, adding ot.target.simpleserial_wait_ack() would make it more stable, but the capture rate would also drop.

@m-temp
Collaborator

m-temp commented Nov 7, 2022

Can we estimate the impact on the capture rate?
Personally, I'd favor acknowledging as many commands as possible, as long as the impact on the capture rate is acceptable.
See also the discussion in lowRISC/opentitan#15929 (comment)

@abdullahvarici

I am not sure if we can estimate it. Maybe we can try and see :) I agree with you: we may not have had problems so far without an ack after every command, but that doesn't mean we never will, I guess.
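
To make the "try and see" concrete, a rough host-side measurement could look like the sketch below; run_batches_with_ack and run_batches_without_ack are hypothetical stand-ins for the two variants of the capture loop:

import time

def capture_rate(run_batches, num_batches, num_segments):
    # Time the same workload for both variants and report traces per second.
    start = time.monotonic()
    run_batches(num_batches)
    return (num_batches * num_segments) / (time.monotonic() - start)

rate_no_ack = capture_rate(run_batches_without_ack, 100, 100)
rate_ack = capture_rate(run_batches_with_ack, 100, 100)
print(f"ack overhead: {100 * (1 - rate_ack / rate_no_ack):.1f} %")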

@vrozic
Contributor

vrozic commented Oct 19, 2023

I'm working on this issue.
First experiments show that there is no observable decrease in the capture rate after acknowledging all write commands.
Once this is complete, I will update the scripts and the binaries.

EDIT:
Source for generating aes binaries: here

@johannheyszl
Collaborator

Interestingly, I haven't yet observed any such issues while batch capturing on the Husky and the WaveRunner using the scripts here: #194
But I do check the full ciphertext after every batch, and all texts are AES encryptions of the previous one, going back to the first one sent to OT.
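
For reference, that chained check can be reproduced on the host roughly as sketched below. This is simplified to a single key and uses pycryptodome; the actual fvsr scripts additionally alternate fixed and random keys per trace, so the real check has to follow the same key schedule as the device:

from Crypto.Cipher import AES  # pycryptodome

def expected_last_ciphertext(key, first_text, num_segments):
    # Each segment encrypts the ciphertext of the previous segment, starting
    # from the first text sent to OT, so only the last ciphertext of a batch
    # needs to be checked on the host.
    text = first_text
    for _ in range(num_segments):
        text = AES.new(key, AES.MODE_ECB).encrypt(text)
    return text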

@johannheyszl
Collaborator

Since we will replace the communication layer for post-silicon to unify with the other post-silicon SiVal testing, we may not want to spend much effort on fixing this? :)

Generally, I am fully in favor of more acks and more robust communication, even if it sacrifices some speed. We need correct trace-data / text pairs.
