Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vTPM communication and error handling refactoring #4400

Merged
merged 9 commits into from
Nov 6, 2024

Conversation

shjala
Copy link
Member

@shjala shjala commented Oct 25, 2024

This pull request includes changes to the vtpm, domainmgr and kvm components to improve the handling of virtual TPM (vTPM) instances. The changes includes bug fix, enhance error handling, refactor the vTPM launch and termination processes, and introduce HTTP-based communication for vTPM commands.

  • vTPM : refactor control socket communication and error handling
    This changes refactors the control socket communication and error handling
    in the vTPM (server) and KVM (client). The control socket communication
    is now handled by HTTP over UDS, and the error handling is improved,
    since the vTPM server now returns an error message when an error occurs.

  • Domainmgr : refactor virtual TPM setup and termination
    Use a defer function to ensure that the virtual TPM is always terminated
    when the domain manager hits an error during the setup process or boot
    process.

  • domainmgr : call vTPM asynchronously
    Refactor vTPM setup/term/teardown functions to call the vTPM server
    endpoints asynchronously, this remove the timeout guessworks and make the
    vTPM setup more reliable.

Bug Fix

When server gets a launch request, it checks if the the requested instance is already running, but it only checks the internal list and not actually the running instances. This can lead to server thinking the instance is running but client fails to get the PID with error failed to handle request: SWTPM instance with id XXXX already running followed by failed to get pid from file XXXX.

This can occur in case like explicit shutdown of the VM (from within the VM), or re-activating the app via cloud-controler and as result the VM will still boot up but without a vTPM.

This PR fixes this issue.


TODO :

  • add unit-test for the bug.
  • Test the bug on master and report here the implication.
  • Check Azure IoT
    -- Azure IoT Legacy (ptpm) passed
    -- Azure IoT vTPM passed

@shjala shjala changed the title Vtpm.server vTPM communication and error handling refactoring Oct 25, 2024
pkg/vtpm/swtpm-vtpm/src/main.go Fixed Show fixed Hide fixed
pkg/vtpm/swtpm-vtpm/src/main.go Fixed Show fixed Hide fixed
pkg/vtpm/swtpm-vtpm/src/main.go Fixed Show fixed Hide fixed
pkg/vtpm/swtpm-vtpm/src/main.go Fixed Show fixed Hide fixed
pkg/vtpm/swtpm-vtpm/src/main.go Fixed Show fixed Hide fixed
pkg/vtpm/swtpm-vtpm/src/main.go Fixed Show fixed Hide fixed
@shjala shjala changed the title vTPM communication and error handling refactoring [WIP] vTPM communication and error handling refactoring Oct 25, 2024
@shjala shjala force-pushed the vtpm.server branch 7 times, most recently from 36e4d69 to 16be694 Compare October 28, 2024 14:51
@shjala shjala changed the title [WIP] vTPM communication and error handling refactoring vTPM communication and error handling refactoring Oct 28, 2024
Copy link

codecov bot commented Oct 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 20.93%. Comparing base (dd27fe1) to head (4808f88).
Report is 18 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #4400   +/-   ##
=======================================
  Coverage   20.93%   20.93%           
=======================================
  Files          13       13           
  Lines        2895     2895           
=======================================
  Hits          606      606           
  Misses       2163     2163           
  Partials      126      126           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@OhmSpectator
Copy link
Member

@shjala. could you please add pkg/vtpm/swtpm-vtpm/vendor/ into https://github.com/lf-edge/eve/blob/e0b943fd55e773f2000c8e63611f242f615c902c/.spdxignore

@shjala
Copy link
Member Author

shjala commented Oct 29, 2024

@shjala. could you please add pkg/vtpm/swtpm-vtpm/vendor/ into https://github.com/lf-edge/eve/blob/e0b943fd55e773f2000c8e63611f242f615c902c/.spdxignore

sure can do.

@OhmSpectator
Copy link
Member

I'm looking into PR, but I'm frequently interrupted, so it will take time. However, it's not abandoned!

status.VirtualTPM = true
defer func(status *types.DomainStatus) {
if status.BootFailed || status.HasError() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have discussed these conditions with @shjala, and it looks like a more reliable way to check that we do not need to terminate the vTMP is status.Activated set to true, as it guarantees that ActivateTails finished successfully.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to change the condition in the original commit, where it was introduced, but let it be)

@shjala
Copy link
Member Author

shjala commented Oct 30, 2024

CodeQL is not convinced that id, which is the "user-provided value" part of paths is validated to be UUID (passed through uuid.FromString).

@eriknordmark
Copy link
Contributor

CodeQL is not convinced that id, which is the "user-provided value" part of paths is validated to be UUID (passed through uuid.FromString).

It might not know what all current and future callers will pass in as "id".

Will be not complain if you pass in an argument of type uuid?

@shjala shjala force-pushed the vtpm.server branch 2 times, most recently from af7618c to 4fb3604 Compare October 30, 2024 17:01
@OhmSpectator
Copy link
Member

Yetus found a typo =)

status.VirtualTPM = true
defer func(status *types.DomainStatus) {
if status.BootFailed || status.HasError() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to change the condition in the original commit, where it was introduced, but let it be)

if err != nil {
err := fmt.Sprintf("vTPM faild to read pid file of SWTPM with id %s", id)
http.Error(w, err, http.StatusExpectationFailed)
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove the ID for the pids slice in this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not to, because I don't know what is the state of SWTPM at this point and I don't want to do anything that might lead to data corruption. I leave to the user to decide, maybe they decide to do a system reset to fix the issue.

if isAlive(pid) {
err := fmt.Sprintf("vTPM SWTPM instance with id %s is already running with pid %d", id, pid)
http.Error(w, err, http.StatusOK)
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, as well... I think it would make sense to remove the id from the slice, so the next if _, ok := pids[id]; ok fails.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is http.StatusOK, no need to remove anything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I see why it is confusing, changed it to w.WriteHeader.

@eriknordmark
Copy link
Contributor

@shjala I tried this on a device in the lab (which had previously logged the issue with the pid file), and with this fix I no longer see that in the logs.

This changes refactors the control socket communication and error handling
in the vTPM (server) and KVM (client). The control socket communication
is now handled by HTTP over UDS, and the error handling is improved,
since the vTPM server now returns an error message when an error occurs.

Signed-off-by: Shahriyar Jalayeri <[email protected]>
Use a defer function to ensure that the virtual TPM is always terminated
when the domain manager hits an error during the setup process or boot
process.

Signed-off-by: Shahriyar Jalayeri <[email protected]>
When server gets a launch request, it checks if the the requested
instance is already running, but it only checks the internal list and
not actually the running instances. This can lead to server thinking
the instance is running but client fails to get the PID with error
"failed to get pid from file ...".

Signed-off-by: Shahriyar Jalayeri <[email protected]>
Validate ID before using it in, it must be in form of a UUID.

Signed-off-by: Shahriyar Jalayeri <[email protected]>
Rename wd kicker in proc utils.

Signed-off-by: Shahriyar Jalayeri <[email protected]>
Refactor vTPM setup/term/teardown functions to call the vTPM server
endpoints asynchronously, this remove the timeout guessworks and make the
vTPM setup more reliable.

Refactor vTPM setup functions to accept all watchdog related parameters
as struct.

Signed-off-by: Shahriyar Jalayeri <[email protected]>
The domainmanager calls vTPM server asynchronously, so we dont need to
worry and set the wait time too low to return quicly to prevent a watchdog
kill on pillar.

Signed-off-by: Shahriyar Jalayeri <[email protected]>
Add vtpm vendor directory to .spdxignore.

Signed-off-by: Shahriyar Jalayeri <[email protected]>
The TestSwtpmAbruptTerminationRequest function verifies that if swtpm is
terminated without vTPM notice, no stale id is left in the vtpm internal
bookkeeping and vtpm can launch a new instance with the same id.

The TestSwtpmMultipleLaucnhRequest function verifies that if swtpm is
launched multiple times with the same id, only one instance is created
and other requests are ignored.

Signed-off-by: Shahriyar Jalayeri <[email protected]>
Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run tests

Copy link
Member

@OhmSpectator OhmSpectator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as the bug is fixed (as I see from the @eriknordmark's comment), it's good. I would say we should not postpone merging because of all the nitty fixes I requested. Let's see the Eden results, and we are good.

@OhmSpectator
Copy link
Member

Again, the same Eden problem =(

@uncleDecart
Copy link
Member

I ran locally smoke tests, it does onboard, on eden repo with EVE version 13.6.0 looks like it's working,based on lf-edge/eden#1040 perhaps it would make sense to try out specific EVE version inside runner with tmate to figure out what's wrong...

@eriknordmark
Copy link
Contributor

@OhmSpectator I'm inclined to merge this to master so that it can be backported to 13.4-stable.
I think we have some other PRs which have the same Eden issues which we can spend more time on figuring out why eden fails to onboard.

@OhmSpectator
Copy link
Member

@OhmSpectator I'm inclined to merge this to master so that it can be backported to 13.4-stable. I think we have some other PRs which have the same Eden issues which we can spend more time on figuring out why eden fails to onboard.

Yeah, makes sense... I hope we'll find a solution for the Eden tests problem soon...

@eriknordmark eriknordmark merged commit ede68a1 into lf-edge:master Nov 6, 2024
34 of 57 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stable Should be backported to stable release(s)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants