Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libpod: fix broken saveContainerError() #23577

Merged
merged 3 commits into from
Aug 12, 2024

Conversation

Luap99
Copy link
Member

@Luap99 Luap99 commented Aug 12, 2024

We cannot unlock then lock again without syncing the state as this will
then save a potentially old state causing very bad things, such as
double netns cleanup issues.

The fix here is simple move the saveContainerError() under the same
lock. The comment about the re-lock is just wrong. Not doing this under
the same lock would cause us to update the error after something else
changed the container alreayd.

Most likely this was caused by a misunderstanding on how go defer's work.
Given they run Last In - First Out (LIFO) it is safe as long as out
defer function is after the defer unlock() call.

I think this issue is very bad and might have caused a variety of other
weird flakes. As fact I am confident that this fixes the double cleanup
errors.

Fixes #21569
Also fixes the netns removal ENOENT issues seen in #19721.


libpod: do not save expected stop errors in ctr state

If we try to stop a contianer that is not running or paused we get an
ErrCtrStateInvalid or ErrCtrStopped error. As podman stop is idempotent
this is not a user visable error at all so we should also never log it
in the container state.


libpod: reset state error on start

If we manage to start a container successfully we should unset any
previously stored state errors. Otherwise a user might be confused why
there is an error in the state about some old error even though the
container works/runs.

Does this PR introduce a user-facing change?

Fixed a race condition that caused a invalid container state to be saved to the DB potentially causing other issues such as double network cleanup.

We cannot unlock then lock again without syncing the state as this will
then save a potentially old state causing very bad things, such as
double netns cleanup issues.

The fix here is simple move the saveContainerError() under the same
lock. The comment about the re-lock is just wrong. Not doing this under
the same lock would cause us to update the error after something else
changed the container alreayd.

Most likely this was caused by a misunderstanding on how go defer's work.
Given they run Last In - First Out (LIFO) it is safe as long as out
defer function is after the defer unlock() call.

I think this issue is very bad and might have caused a variety of other
weird flakes. As fact I am confident that this fixes the double cleanup
errors.

Fixes containers#21569
Also fixes the netns removal ENOENT issues seen in containers#19721.

Signed-off-by: Paul Holzinger <[email protected]>
Copy link
Contributor

openshift-ci bot commented Aug 12, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 12, 2024
@Luap99
Copy link
Member Author

Luap99 commented Aug 12, 2024

@mheon PTAL, I am marking this as 5.2 backport candidate as I think this is rather important

@Luap99 Luap99 added the 5.2 label Aug 12, 2024
If we try to stop a contianer that is not running or paused we get an
ErrCtrStateInvalid or ErrCtrStopped error. As podman stop is idempotent
this is not a user visable error at all so we should also never log it
in the container state.

Signed-off-by: Paul Holzinger <[email protected]>
@rhatdan
Copy link
Member

rhatdan commented Aug 12, 2024

LGTM

@@ -1300,6 +1300,9 @@ func (c *Container) start() error {

c.newContainerEvent(events.Start)

// Reset any previous errors as we managed to start the container successfully here.
c.state.Error = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it should happen in init() instead, as part of resetting the state from the last run of the container, but I won't argue too loud if you want to keep it here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree it would fit there, I thought it may best to only do when we now started worked but thinking about it if init or start didn't work we overwrite with the new error anyway so there is no practical difference.

@mheon
Copy link
Member

mheon commented Aug 12, 2024

One small comment, LGTM otherwise, strongly agree this needs to make it into 5.2.1

If we manage to init/start a container successfully we should unset any
previously stored state errors. Otherwise a user might be confused why
there is an error in the state about some old error even though the
container works/runs.

Signed-off-by: Paul Holzinger <[email protected]>
@mheon
Copy link
Member

mheon commented Aug 12, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 12, 2024
@baude
Copy link
Member

baude commented Aug 12, 2024

LGTM for the record

@baude
Copy link
Member

baude commented Aug 12, 2024

/cherry-pick v5.2

@openshift-cherrypick-robot
Copy link
Collaborator

@baude: once the present PR merges, I will cherry-pick it on top of v5.2 in a new PR and assign it to you.

In response to this:

/cherry-pick v5.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit 6ef3a23 into containers:main Aug 12, 2024
82 of 83 checks passed
@openshift-cherrypick-robot
Copy link
Collaborator

@baude: new pull request created: #23580

In response to this:

/cherry-pick v5.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Luap99 Luap99 deleted the save-error branch August 12, 2024 13:29
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Nov 11, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Nov 11, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
5.2 approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

podman stop: rootless netns ref counter out of sync, counter is at -1, resetting it back to 0
5 participants