vTPM communication and error handling refactoring #4400
Force-pushed: 36e4d69 to 16be694
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #4400   +/-   ##
=======================================
  Coverage   20.93%   20.93%
=======================================
  Files          13       13
  Lines        2895     2895
=======================================
  Hits          606      606
  Misses       2163     2163
  Partials      126      126
@shjala, could you please add
sure can do.
I'm looking into the PR, but I'm frequently interrupted, so it will take time. However, it's not abandoned!
status.VirtualTPM = true
defer func(status *types.DomainStatus) {
	if status.BootFailed || status.HasError() {
We have discussed these conditions with @shjala, and it looks like a more reliable way to check that we do not need to terminate the vTPM is status.Activated set to true, as it guarantees that ActivateTails finished successfully.
It would be nice to change the condition in the original commit, where it was introduced, but let it be)
Force-pushed: 624a4e5 to 4fd6d20
CodeQL is not convinced that
It might not know what all current and future callers will pass in as "id". Will it not complain if you pass in an argument of type uuid?
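One way to address this kind of concern is to validate the id as a canonical UUID string before it is used to build any path or command. The following is only a sketch using the standard library; `isValidID` and `uuidPattern` are hypothetical names, and the PR may well validate differently (for example by parsing into a uuid type).

```go
package main

import (
	"fmt"
	"regexp"
)

// uuidPattern matches the canonical 8-4-4-4-12 hexadecimal text form.
var uuidPattern = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// isValidID reports whether id is a well-formed UUID string. Anything else,
// including path fragments, is rejected before it reaches the filesystem.
func isValidID(id string) bool {
	return uuidPattern.MatchString(id)
}

func main() {
	ids := []string{
		"123e4567-e89b-12d3-a456-426614174000",
		"../../etc/passwd",
		"",
	}
	for _, id := range ids {
		fmt.Printf("%q valid=%v\n", id, isValidID(id))
	}
}
```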
Force-pushed: af7618c to 4fb3604
Yetus found a typo =)
if err != nil {
	err := fmt.Sprintf("vTPM failed to read pid file of SWTPM with id %s", id)
	http.Error(w, err, http.StatusExpectationFailed)
	return
Should we remove the ID from the pids slice in this case?
I prefer not to, because I don't know what state SWTPM is in at this point, and I don't want to do anything that might lead to data corruption. I leave it to the user to decide; they may decide to do a system reset to fix the issue.
if isAlive(pid) {
	err := fmt.Sprintf("vTPM SWTPM instance with id %s is already running with pid %d", id, pid)
	http.Error(w, err, http.StatusOK)
	return
Here as well... I think it would make sense to remove the id from the slice, so the next if _, ok := pids[id]; ok fails.
This is http.StatusOK, no need to remove anything.
Sorry, I see why it is confusing; changed it to w.WriteHeader.
@shjala I tried this on a device in the lab (which had previously logged the issue with the pid file), and with this fix I no longer see that in the logs.
This change refactors the control socket communication and error handling in the vTPM (server) and KVM (client). The control socket communication is now handled by HTTP over UDS, and the error handling is improved, since the vTPM server now returns an error message when an error occurs. Signed-off-by: Shahriyar Jalayeri <[email protected]>
Use a defer function to ensure that the virtual TPM is always terminated when the domain manager hits an error during the setup process or boot process. Signed-off-by: Shahriyar Jalayeri <[email protected]>
Force-pushed: 41b5f93 to 66b62bb
When the server gets a launch request, it checks whether the requested instance is already running, but it only checks its internal list and not the actually running instances. This can lead to the server thinking the instance is running while the client fails to get the PID with the error "failed to get pid from file ...". Signed-off-by: Shahriyar Jalayeri <[email protected]>
Validate the ID before using it; it must be in the form of a UUID. Signed-off-by: Shahriyar Jalayeri <[email protected]>
Rename wd kicker in proc utils. Signed-off-by: Shahriyar Jalayeri <[email protected]>
Refactor vTPM setup/term/teardown functions to call the vTPM server endpoints asynchronously; this removes the timeout guesswork and makes the vTPM setup more reliable. Refactor vTPM setup functions to accept all watchdog-related parameters as a struct. Signed-off-by: Shahriyar Jalayeri <[email protected]>
The domainmanager calls the vTPM server asynchronously, so we no longer need to set the wait time very low in order to return quickly and prevent a watchdog kill on pillar. Signed-off-by: Shahriyar Jalayeri <[email protected]>
Add vtpm vendor directory to .spdxignore. Signed-off-by: Shahriyar Jalayeri <[email protected]>
The TestSwtpmAbruptTerminationRequest function verifies that if swtpm is terminated without vTPM noticing, no stale id is left in the vtpm internal bookkeeping and vtpm can launch a new instance with the same id. The TestSwtpmMultipleLaucnhRequest function verifies that if swtpm is launched multiple times with the same id, only one instance is created and the other requests are ignored. Signed-off-by: Shahriyar Jalayeri <[email protected]>
Run tests
As long as the bug is fixed (as I see from @eriknordmark's comment), it's good. I would say we should not postpone merging because of all the nitty fixes I requested. Let's see the Eden results, and we are good.
Again, the same Eden problem =(
I ran the smoke tests locally and it does onboard; on the eden repo with EVE version 13.6.0 it looks like it's working. Based on lf-edge/eden#1040, perhaps it would make sense to try out a specific EVE version inside the runner with tmate to figure out what's wrong...
@OhmSpectator I'm inclined to merge this to master so that it can be backported to 13.4-stable.
Yeah, makes sense... I hope we'll find a solution for the Eden tests problem soon...
This pull request includes changes to the vtpm, domainmgr and kvm components to improve the handling of virtual TPM (vTPM) instances. The changes include a bug fix, enhanced error handling, a refactoring of the vTPM launch and termination processes, and HTTP-based communication for vTPM commands.
vTPM : refactor control socket communication and error handling
This change refactors the control socket communication and error handling
in the vTPM (server) and KVM (client). The control socket communication
is now handled by HTTP over UDS, and the error handling is improved,
since the vTPM server now returns an error message when an error occurs.
Domainmgr : refactor virtual TPM setup and termination
Use a defer function to ensure that the virtual TPM is always terminated
when the domain manager hits an error during the setup process or boot
process.
domainmgr : call vTPM asynchronously
Refactor vTPM setup/term/teardown functions to call the vTPM server
endpoints asynchronously; this removes the timeout guesswork and makes the
vTPM setup more reliable.
Bug Fix
When the server gets a launch request, it checks whether the requested instance is already running, but it only checks its internal list and not the actually running instances. This can lead to the server thinking the instance is running while the client fails to get the PID, with the error "failed to handle request: SWTPM instance with id XXXX already running" followed by "failed to get pid from file XXXX".
This can occur in cases such as an explicit shutdown of the VM (from within the VM) or re-activating the app via the cloud controller; as a result, the VM still boots up but without a vTPM.
This PR fixes this issue.
TODO :
-- Azure IoT Legacy (ptpm) passed
-- Azure IoT vTPM passed