libpod: fix broken saveContainerError() #23577

Luap99 · 2024-08-12T09:53:03Z

We cannot unlock and then lock again without syncing the state, as this
would then save a potentially stale state, causing very bad things such as
double netns cleanup issues.

The fix here is simple: move the saveContainerError() call under the same
lock. The comment about the re-lock is just wrong. Not doing this under
the same lock would cause us to update the error after something else
had already changed the container.

Most likely this was caused by a misunderstanding of how Go defers work.
Given that they run Last In, First Out (LIFO), it is safe as long as our
defer function is registered after the defer unlock() call.
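
To illustrate the LIFO point, here is a minimal sketch rather than the actual libpod code. The `container` type, its `locked` flag, and the `saveError` helper are hypothetical stand-ins; only the ordering of the two `defer` statements mirrors what the description says.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// container is a hypothetical stand-in for libpod's Container type.
type container struct {
	lock       sync.Mutex
	locked     bool   // illustration only: tracks whether lock is currently held
	stateError string // illustration only: the error stored in the state
}

// saveError stores err in the state and reports whether the lock was
// held at the time of the call.
func (c *container) saveError(err error) bool {
	c.stateError = err.Error()
	return c.locked
}

// start always fails so that the deferred error handler has work to do.
func (c *container) start() (savedUnderLock bool, retErr error) {
	c.lock.Lock()
	c.locked = true
	// Registered first, therefore runs last.
	defer func() {
		c.locked = false
		c.lock.Unlock()
	}()
	// Registered second, therefore runs first, while the lock is still held.
	defer func() {
		if retErr != nil {
			savedUnderLock = c.saveError(retErr)
		}
	}()
	return false, errors.New("start failed")
}

func main() {
	c := &container{}
	saved, err := c.start()
	fmt.Println("error:", err)
	fmt.Println("saved while locked:", saved)
}
```

Because the error-saving defer is registered after the unlock defer, it executes before the unlock, so no second lock/unlock cycle (and no state sync) is needed.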

I think this issue is very bad and might have caused a variety of other
weird flakes. In fact, I am confident that this fixes the double cleanup
errors.

Fixes #21569
Also fixes the netns removal ENOENT issues seen in #19721.

libpod: do not save expected stop errors in ctr state

If we try to stop a container that is not running or paused, we get an
ErrCtrStateInvalid or ErrCtrStopped error. As podman stop is idempotent,
this is not a user-visible error at all, so we should also never log it
in the container state.

libpod: reset state error on start

If we manage to start a container successfully we should unset any
previously stored state errors. Otherwise a user might be confused why
there is an error in the state about some old error even though the
container works/runs.

Does this PR introduce a user-facing change?

Fixed a race condition that caused an invalid container state to be saved to the DB, potentially causing other issues such as double network cleanup.

We cannot unlock and then lock again without syncing the state, as this would then save a potentially stale state, causing very bad things such as double netns cleanup issues.

The fix here is simple: move the saveContainerError() call under the same lock. The comment about the re-lock is just wrong. Not doing this under the same lock would cause us to update the error after something else had already changed the container.

Most likely this was caused by a misunderstanding of how Go defers work. Given that they run Last In, First Out (LIFO), it is safe as long as our defer function is registered after the defer unlock() call.

I think this issue is very bad and might have caused a variety of other weird flakes. In fact, I am confident that this fixes the double cleanup errors.

Fixes containers#21569
Also fixes the netns removal ENOENT issues seen in containers#19721.

Signed-off-by: Paul Holzinger <[email protected]>

openshift-ci · 2024-08-12T09:53:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Luap99]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Luap99 · 2024-08-12T09:54:18Z

@mheon PTAL, I am marking this as 5.2 backport candidate as I think this is rather important

If we try to stop a container that is not running or paused, we get an ErrCtrStateInvalid or ErrCtrStopped error. As podman stop is idempotent, this is not a user-visible error at all, so we should also never log it in the container state.

Signed-off-by: Paul Holzinger <[email protected]>

rhatdan · 2024-08-12T11:58:58Z

LGTM

mheon · 2024-08-12T12:20:50Z

libpod/container_internal.go

@@ -1300,6 +1300,9 @@ func (c *Container) start() error {

 	c.newContainerEvent(events.Start)

+	// Reset any previous errors as we managed to start the container successfully here.
+	c.state.Error = ""


This feels like it should happen in init() instead, as part of resetting the state from the last run of the container, but I won't argue too loud if you want to keep it here

Agreed, it would fit there. I thought it might be best to do this only once the start has actually worked, but thinking about it, if init or start fails we overwrite it with the new error anyway, so there is no practical difference.

mheon · 2024-08-12T12:21:18Z

One small comment, LGTM otherwise, strongly agree this needs to make it into 5.2.1

If we manage to init/start a container successfully we should unset any previously stored state errors. Otherwise a user might be confused why there is an error in the state about some old error even though the container works/runs.

Signed-off-by: Paul Holzinger <[email protected]>

mheon · 2024-08-12T12:35:32Z

/lgtm

baude · 2024-08-12T12:50:53Z

LGTM for the record

baude · 2024-08-12T12:51:04Z

/cherry-pick v5.2

openshift-cherrypick-robot · 2024-08-12T12:51:08Z

@baude: once the present PR merges, I will cherry-pick it on top of v5.2 in a new PR and assign it to you.

In response to this:

/cherry-pick v5.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-cherrypick-robot · 2024-08-12T13:23:44Z

@baude: new pull request created: #23580


openshift-ci bot added the release-note label Aug 12, 2024

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 12, 2024

Luap99 added the 5.2 label Aug 12, 2024

Luap99 force-pushed the save-error branch from 54ff610 to 3ee243f Compare August 12, 2024 10:09

mheon reviewed Aug 12, 2024


Luap99 force-pushed the save-error branch from 3ee243f to ecf88f1 Compare August 12, 2024 12:31

openshift-ci bot assigned mheon Aug 12, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 12, 2024

openshift-merge-bot bot merged commit 6ef3a23 into containers:main Aug 12, 2024
82 of 83 checks passed

openshift-cherrypick-robot mentioned this pull request Aug 12, 2024

[v5.2] libpod: fix broken saveContainerError() #23580

Merged

Luap99 deleted the save-error branch August 12, 2024 13:29

stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Nov 11, 2024

stale-locking-app bot locked as resolved and limited conversation to collaborators Nov 11, 2024
