vTPM communication and error handling refactoring #4400

shjala · 2024-10-25T11:02:18Z

This pull request includes changes to the vtpm, domainmgr and kvm components to improve the handling of virtual TPM (vTPM) instances. The changes include a bug fix, enhanced error handling, refactored vTPM launch and termination processes, and HTTP-based communication for vTPM commands.

vTPM : refactor control socket communication and error handling
This change refactors the control socket communication and error handling
in the vTPM (server) and KVM (client). The control socket communication
is now handled by HTTP over UDS, and error handling is improved because
the vTPM server now returns an error message when an error occurs.
Domainmgr : refactor virtual TPM setup and termination
Use a deferred function to ensure that the virtual TPM is always terminated
when the domain manager hits an error during the setup or boot
process.
domainmgr : call vTPM asynchronously
Refactor the vTPM setup/term/teardown functions to call the vTPM server
endpoints asynchronously. This removes the timeout guesswork and makes
vTPM setup more reliable.
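
For readers unfamiliar with HTTP over UDS, here is a minimal sketch of the pattern using only the Go standard library. It is not the code from this PR: the `/launch` endpoint, the `id` query parameter, the socket file name, and the helper names `newUDSClient`, `launchHandler` and `demo` are all assumptions made for illustration. The point it shows is that the server can report a failure as an HTTP status plus a message, which the client can read back.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/url"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// newUDSClient returns an http.Client whose connections always go to the
// Unix domain socket at sockPath; the host part of the URL is ignored.
func newUDSClient(sockPath string, timeout time.Duration) *http.Client {
	return &http.Client{
		Timeout: timeout,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", sockPath)
			},
		},
	}
}

// launchHandler is a stand-in for the server side (hypothetical endpoint):
// it rejects requests without an id and puts the reason in the response body.
func launchHandler(w http.ResponseWriter, r *http.Request) {
	id := r.URL.Query().Get("id")
	if id == "" {
		http.Error(w, "missing id", http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "launched %s", id)
}

// demo starts the stand-in server on a temporary socket, sends one request,
// and returns the status code and the trimmed response body.
func demo(id string) (int, string, error) {
	dir, err := os.MkdirTemp("", "vtpm-uds")
	if err != nil {
		return 0, "", err
	}
	defer os.RemoveAll(dir)
	sock := filepath.Join(dir, "ctrl.sock")

	ln, err := net.Listen("unix", sock)
	if err != nil {
		return 0, "", err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/launch", launchHandler)
	srv := &http.Server{Handler: mux}
	go srv.Serve(ln)
	defer srv.Close()

	client := newUDSClient(sock, 5*time.Second)
	resp, err := client.Get("http://unix/launch?id=" + url.QueryEscape(id))
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, strings.TrimSpace(string(body)), err
}

func main() {
	code, body, err := demo("example-id")
	fmt.Println(code, body, err)
}
```

Compared with a raw control socket, the client gets a status code and a human-readable error for free, and both sides can reuse `net/http` instead of a custom framing protocol.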

Bug Fix

When the server gets a launch request, it checks whether the requested instance is already running, but it only checks its internal list and not the instances that are actually running. This can lead to the server thinking the instance is running while the client fails to get the PID, with the error failed to handle request: SWTPM instance with id XXXX already running followed by failed to get pid from file XXXX.

This can occur in cases such as an explicit shutdown of the VM (from within the VM) or re-activating the app via the cloud controller; as a result, the VM still boots, but without a vTPM.

This PR fixes this issue.
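
As an illustration of the idea behind the fix, the sketch below checks the recorded pid against the real process table before refusing a launch, and drops the entry if the process is gone. This is an assumption-laden sketch, not the PR's code: the `pids` map shape and the `shouldLaunch` helper are hypothetical, and the `isAlive` shown here is one common Unix-only way to probe a pid (signal 0), which may differ from the helper of the same name in the PR.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// isAlive reports whether a process with the given pid exists (Unix only).
// Signal 0 performs the existence check without delivering a signal.
func isAlive(pid int) bool {
	if pid <= 0 {
		return false
	}
	err := syscall.Kill(pid, 0)
	// EPERM means the process exists but we are not allowed to signal it.
	return err == nil || errors.Is(err, syscall.EPERM)
}

// shouldLaunch returns true when no live instance is recorded for id.
// A recorded pid that is no longer alive is treated as stale and removed,
// so the bookkeeping cannot drift away from the real process state.
func shouldLaunch(pids map[string]int, id string, alive func(int) bool) bool {
	pid, ok := pids[id]
	if !ok {
		return true
	}
	if alive(pid) {
		return false
	}
	delete(pids, id)
	return true
}

func main() {
	pids := map[string]int{"instance-a": os.Getpid()}
	// Our own pid is alive, so a second launch is refused.
	fmt.Println(shouldLaunch(pids, "instance-a", isAlive))
	// Unknown ids can always be launched.
	fmt.Println(shouldLaunch(pids, "instance-b", isAlive))
}
```

Passing the liveness check in as a function keeps the bookkeeping logic testable without spawning real processes.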

TODO :

add unit-test for the bug.
Test the bug on master and report here the implication.
Check Azure IoT
-- Azure IoT Legacy (ptpm) passed
-- Azure IoT vTPM passed

pkg/vtpm/swtpm-vtpm/src/main.go

codecov · 2024-10-28T15:09:54Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 20.93%. Comparing base (dd27fe1) to head (4808f88).
Report is 18 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #4400   +/-   ##
=======================================
  Coverage   20.93%   20.93%           
=======================================
  Files          13       13           
  Lines        2895     2895           
=======================================
  Hits          606      606           
  Misses       2163     2163           
  Partials      126      126

OhmSpectator · 2024-10-29T09:13:20Z

@shjala. could you please add pkg/vtpm/swtpm-vtpm/vendor/ into https://github.com/lf-edge/eve/blob/e0b943fd55e773f2000c8e63611f242f615c902c/.spdxignore

shjala · 2024-10-29T10:56:04Z

@shjala. could you please add pkg/vtpm/swtpm-vtpm/vendor/ into https://github.com/lf-edge/eve/blob/e0b943fd55e773f2000c8e63611f242f615c902c/.spdxignore

sure can do.

OhmSpectator · 2024-10-29T11:08:28Z

I'm looking into PR, but I'm frequently interrupted, so it will take time. However, it's not abandoned!

OhmSpectator · 2024-10-29T12:03:24Z

pkg/pillar/cmd/domainmgr/domainmgr.go

 		status.VirtualTPM = true
+		defer func(status *types.DomainStatus) {
+			if status.BootFailed || status.HasError() {


We have discussed these conditions with @shjala, and it looks like a more reliable way to check that we do not need to terminate the vTPM is status.Activated being set to true, as it guarantees that ActivateTails finished successfully.

It would be nice to change the condition in the original commit, where it was introduced, but let it be)
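
To make the suggested condition concrete, here is a small self-contained sketch of a deferred cleanup keyed on an `Activated` flag. The `domainStatus` type, `startDomain`, and the injected `activate` and `terminateVTPM` callbacks are hypothetical stand-ins, not the real domainmgr types or functions.

```go
package main

import (
	"errors"
	"fmt"
)

// domainStatus is a stripped-down stand-in for the real status struct.
type domainStatus struct {
	Activated bool
}

// errBoot is a sample failure used by the demo below.
var errBoot = errors.New("boot failed")

// startDomain runs activate and guarantees that terminateVTPM is called on
// every path that leaves the domain not activated, including early returns.
func startDomain(status *domainStatus, activate func() error, terminateVTPM func()) error {
	defer func() {
		if !status.Activated {
			terminateVTPM()
		}
	}()

	if err := activate(); err != nil {
		return err
	}
	// Only set once activation has fully succeeded.
	status.Activated = true
	return nil
}

func main() {
	terminated := 0
	st := &domainStatus{}
	err := startDomain(st, func() error { return errBoot }, func() { terminated++ })
	fmt.Println(err, terminated, st.Activated)
}
```

A single positive flag set at the very end is easier to reason about than a disjunction of failure flags, because any path that forgets to record its failure still triggers the cleanup.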

shjala · 2024-10-30T14:15:54Z

CodeQL is not convinced that id, which is the "user-provided value" part of paths is validated to be UUID (passed through uuid.FromString).

eriknordmark · 2024-10-30T14:45:30Z

CodeQL is not convinced that id, which is the "user-provided value" part of paths is validated to be UUID (passed through uuid.FromString).

It might not know what all current and future callers will pass in as "id".

Will it not complain if you pass in an argument of type uuid?
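
A rough sketch of validating the id right where the path is built, so that every caller goes through the check. Note the swap: the thread mentions uuid.FromString, but to keep the example free of third-party imports this version uses a standard-library regular expression for the canonical UUID text form. The `pidFilePath` helper, the `.pid` suffix and the `/run/swtpm` directory are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// uuidRe matches the canonical 8-4-4-4-12 hexadecimal UUID text form.
var uuidRe = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// pidFilePath builds the pid file path for id under baseDir, refusing any id
// that is not a UUID so that values like "../../etc/passwd" never reach the
// filesystem layer.
func pidFilePath(baseDir, id string) (string, error) {
	if !uuidRe.MatchString(id) {
		return "", fmt.Errorf("invalid id %q: not a UUID", id)
	}
	return filepath.Join(baseDir, id+".pid"), nil
}

func main() {
	fmt.Println(pidFilePath("/run/swtpm", "123e4567-e89b-12d3-a456-426614174000"))
	fmt.Println(pidFilePath("/run/swtpm", "../../etc/passwd"))
}
```

Taking a dedicated UUID type as the parameter, as suggested above, would move this check to the type system and may be easier for static analysis to follow than a string check inside the function.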

OhmSpectator · 2024-10-31T13:45:52Z

Yetus found a typo =)

OhmSpectator · 2024-10-31T14:01:36Z

pkg/pillar/cmd/domainmgr/domainmgr.go

 		status.VirtualTPM = true
+		defer func(status *types.DomainStatus) {
+			if status.BootFailed || status.HasError() {


It would be nice to change the condition in the original commit, where it was introduced, but let it be)

OhmSpectator · 2024-10-31T14:51:38Z

pkg/vtpm/swtpm-vtpm/src/main.go

+			if err != nil {
+				err := fmt.Sprintf("vTPM faild to read pid file of SWTPM with id %s", id)
+				http.Error(w, err, http.StatusExpectationFailed)
+				return


Should we remove the ID from the pids slice in this case?

I prefer not to, because I don't know what state SWTPM is in at this point, and I don't want to do anything that might lead to data corruption. I leave it to the user to decide; maybe they decide to do a system reset to fix the issue.

OhmSpectator · 2024-10-31T14:52:29Z

pkg/vtpm/swtpm-vtpm/src/main.go

+			if isAlive(pid) {
+				err := fmt.Sprintf("vTPM SWTPM instance with id %s is already running with pid %d", id, pid)
+				http.Error(w, err, http.StatusOK)
+				return


Here, as well... I think it would make sense to remove the id from the slice, so the next if _, ok := pids[id]; ok fails.

this is http.StatusOK, no need to remove anything.

sorry I see why it is confusing, changed it to w.WriteHeader.

pkg/pillar/utils/proc.go

eriknordmark · 2024-11-01T17:33:24Z

@shjala I tried this on a device in the lab (which had previously logged the issue with the pid file), and with this fix I no longer see that in the logs.

This change refactors the control socket communication and error handling in the vTPM (server) and KVM (client). The control socket communication is now handled by HTTP over UDS, and the error handling is improved, since the vTPM server now returns an error message when an error occurs. Signed-off-by: Shahriyar Jalayeri <[email protected]>

Use a defer function to ensure that the virtual TPM is always terminated when the domain manager hits an error during the setup process or boot process. Signed-off-by: Shahriyar Jalayeri <[email protected]>

When the server gets a launch request, it checks whether the requested instance is already running, but it only checks its internal list and not the instances that are actually running. This can lead to the server thinking the instance is running while the client fails to get the PID, with the error "failed to get pid from file ...". Signed-off-by: Shahriyar Jalayeri <[email protected]>

Validate the ID before using it; it must be in the form of a UUID. Signed-off-by: Shahriyar Jalayeri <[email protected]>

Rename wd kicker in proc utils. Signed-off-by: Shahriyar Jalayeri <[email protected]>

Refactor the vTPM setup/term/teardown functions to call the vTPM server endpoints asynchronously. This removes the timeout guesswork and makes vTPM setup more reliable. Refactor the vTPM setup functions to accept all watchdog-related parameters as a struct. Signed-off-by: Shahriyar Jalayeri <[email protected]>

The domain manager calls the vTPM server asynchronously, so we no longer need to set the wait time very low just to return quickly and avoid a watchdog kill on pillar. Signed-off-by: Shahriyar Jalayeri <[email protected]>
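
One possible shape of "call asynchronously while keeping the watchdog happy" is sketched below. It is an assumption about the general pattern, not the code in this PR: `callWithKick`, its parameters, and the `runDemo` driver are hypothetical names, and the real watchdog parameters in pillar are passed differently.

```go
package main

import (
	"fmt"
	"time"
)

// callWithKick runs fn in its own goroutine and calls kick every interval
// until fn returns or timeout elapses. The caller stays responsive to the
// watchdog, so timeout can be generous instead of a carefully guessed value.
func callWithKick(fn func() error, kick func(), interval, timeout time.Duration) error {
	// Buffered so the goroutine can finish even if we already timed out.
	done := make(chan error, 1)
	go func() { done <- fn() }()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	deadline := time.NewTimer(timeout)
	defer deadline.Stop()

	for {
		select {
		case err := <-done:
			return err
		case <-ticker.C:
			kick()
		case <-deadline.C:
			return fmt.Errorf("call did not finish within %v", timeout)
		}
	}
}

// runDemo simulates a slow request and returns how often the watchdog was
// kicked while waiting, together with the result of the call.
func runDemo(workMs, intervalMs, timeoutMs int) (int, error) {
	kicks := 0
	err := callWithKick(
		func() error {
			time.Sleep(time.Duration(workMs) * time.Millisecond)
			return nil
		},
		func() { kicks++ },
		time.Duration(intervalMs)*time.Millisecond,
		time.Duration(timeoutMs)*time.Millisecond,
	)
	return kicks, err
}

func main() {
	kicks, err := runDemo(100, 10, 2000)
	fmt.Println(kicks > 0, err)
}
```

The kick callback runs in the caller's goroutine, so no locking is needed around the counter in this sketch.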

Add vtpm vendor directory to .spdxignore. Signed-off-by: Shahriyar Jalayeri <[email protected]>

The TestSwtpmAbruptTerminationRequest function verifies that if swtpm is terminated without vTPM noticing, no stale id is left in the vtpm internal bookkeeping and vtpm can launch a new instance with the same id. The TestSwtpmMultipleLaucnhRequest function verifies that if swtpm is launched multiple times with the same id, only one instance is created and the other requests are ignored. Signed-off-by: Shahriyar Jalayeri <[email protected]>

eriknordmark

Run tests

OhmSpectator

As long as the bug is fixed (as I see from @eriknordmark's comment), it's good. I would say we should not postpone merging because of all the nitty fixes I requested. Let's see the Eden results, and we are good.

OhmSpectator · 2024-11-04T14:27:45Z

Again, the same Eden problem =(

uncleDecart · 2024-11-05T10:10:09Z

I ran the smoke tests locally and the device does onboard; on the eden repo with EVE version 13.6.0 it looks like it's working. Based on lf-edge/eden#1040, perhaps it would make sense to try out a specific EVE version inside the runner with tmate to figure out what's wrong...

eriknordmark · 2024-11-05T21:40:16Z

@OhmSpectator I'm inclined to merge this to master so that it can be backported to 13.4-stable.
I think we have some other PRs which have the same Eden issues which we can spend more time on figuring out why eden fails to onboard.

OhmSpectator · 2024-11-05T21:46:52Z

@OhmSpectator I'm inclined to merge this to master so that it can be backported to 13.4-stable. I think we have some other PRs which have the same Eden issues which we can spend more time on figuring out why eden fails to onboard.

Yeah, makes sense... I hope we'll find a solution for the Eden tests problem soon...

shjala requested review from OhmSpectator, rene, rouming and milan-zededa as code owners October 25, 2024 11:02

github-actions bot requested review from eriknordmark, jsfakian, rucoder and uncleDecart October 25, 2024 11:02

shjala changed the title ~~Vtpm.server~~ vTPM communication and error handling refactoring Oct 25, 2024

github-advanced-security bot found potential problems Oct 25, 2024

shjala changed the title ~~vTPM communication and error handling refactoring~~ [WIP] vTPM communication and error handling refactoring Oct 25, 2024

shjala force-pushed the vtpm.server branch 7 times, most recently from 36e4d69 to 16be694 Compare October 28, 2024 14:51

shjala changed the title ~~[WIP] vTPM communication and error handling refactoring~~ vTPM communication and error handling refactoring Oct 28, 2024

OhmSpectator reviewed Oct 29, 2024

shjala force-pushed the vtpm.server branch from 4808f88 to 1cb653e Compare October 29, 2024 13:00

github-actions bot requested a review from OhmSpectator October 29, 2024 13:00

shjala force-pushed the vtpm.server branch 4 times, most recently from 624a4e5 to 4fd6d20 Compare October 29, 2024 13:20

shjala force-pushed the vtpm.server branch 2 times, most recently from af7618c to 4fb3604 Compare October 30, 2024 17:01

OhmSpectator reviewed Oct 31, 2024

shjala added 2 commits November 4, 2024 11:17

Domainmgr : refactor virtual TPM setup and termination

20da6cd

Use a defer function to ensure that the virtual TPM is always terminated when the domain manager hits an error during the setup process or boot process. Signed-off-by: Shahriyar Jalayeri <[email protected]>

shjala force-pushed the vtpm.server branch from 4fb3604 to e8cba34 Compare November 4, 2024 10:27

github-actions bot requested a review from OhmSpectator November 4, 2024 10:27

shjala force-pushed the vtpm.server branch 2 times, most recently from 41b5f93 to 66b62bb Compare November 4, 2024 10:36

shjala added 7 commits November 4, 2024 12:44

vTPM : validate id before using it in the request

7294cce

Validate the ID before using it; it must be in the form of a UUID. Signed-off-by: Shahriyar Jalayeri <[email protected]>

proc utils : rename wd kicker

9db3b4d

Rename wd kicker in proc utils. Signed-off-by: Shahriyar Jalayeri <[email protected]>

vTPM : more relaxed timeout

bd856c7

The domain manager calls the vTPM server asynchronously, so we no longer need to set the wait time very low just to return quickly and avoid a watchdog kill on pillar. Signed-off-by: Shahriyar Jalayeri <[email protected]>

add vtpm vendor directory to .spdxignore

5d4f771

Add vtpm vendor directory to .spdxignore. Signed-off-by: Shahriyar Jalayeri <[email protected]>

shjala force-pushed the vtpm.server branch from 66b62bb to bc80a42 Compare November 4, 2024 10:44

eriknordmark approved these changes Nov 4, 2024

OhmSpectator approved these changes Nov 4, 2024

eriknordmark merged commit ede68a1 into lf-edge:master Nov 6, 2024
34 of 57 checks passed

shjala mentioned this pull request Nov 6, 2024

[13.4-stable] vTPM communication and error handling refactoring #4429

Merged
