Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libpod: remove UpdateContainerStatus() #23644

Merged
merged 1 commit into from
Aug 16, 2024

Conversation

Luap99
Copy link
Member

@Luap99 Luap99 commented Aug 16, 2024

There are two major problems with UpdateContainerStatus() First, it can deadlock when the the state json is to big as it tries to read stderr until EOF but it will never hit EOF as long as the runtime process is alive. This means if the runtime json is to big to git into the pipe buffer we deadlock ourselves.
Second, the function modifies the container state struct and even adds and exit code to the db however when it is called from the stop() code path we will be unlocked here.

While the first problem is easy to fix the second one not so much. And when we cannot update the state there is no point in reading the from runtime in the first place as such remove the function as it does more harm then good.

And add some warnings the the functions that might be called unlocked.

Fixes #22246

Does this PR introduce a user-facing change?

Fixed a possible deadlock in the container stop code path when using big annotations 

@Luap99 Luap99 added the No New Tests Allow PR to proceed without adding regression tests label Aug 16, 2024
Copy link
Contributor

openshift-ci bot commented Aug 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 16, 2024
@Luap99
Copy link
Member Author

Luap99 commented Aug 16, 2024

@mheon PTAL

@mheon
Copy link
Member

mheon commented Aug 16, 2024

This was dealing with a legitimate bug (podman kill failing because of ESRCH was not updating container status to Exited, thus making the stop code produce bad results on already-dead containers in some circumstances). Can we restore the code, but in the locked part of stop() if an error was returned from killContainer()?

@Luap99
Copy link
Member Author

Luap99 commented Aug 16, 2024

This was dealing with a legitimate bug (podman kill failing because of ESRCH was not updating container status to Exited, thus making the stop code produce bad results on already-dead containers in some circumstances). Can we restore the code, but in the locked part of stop() if an error was returned from killContainer()?

Should we trust the clean-up process to update the state? I don't follow the logic here. This function was only called when kill failed (i.e. ESRCH) in which case the container process wasn't running so why should podman kill have to do anything in this case besides reporting an error?

@mheon
Copy link
Member

mheon commented Aug 16, 2024

The original issue this was attempting to fix is #8086 - sig-proxy errors, apparently?

UpdateContainerStatus definitely feels like the wrong way to resolve those. It feels like a manual kill with signal 0, check for ESRCH, would be more than sufficient to suppress the error?

@Luap99 Luap99 force-pushed the update-container-status branch from e502095 to 223d363 Compare August 16, 2024 13:28
@Luap99
Copy link
Member Author

Luap99 commented Aug 16, 2024

@mheon updated to return ErrCtrStateInvalid which is already handled in ProxySignals(), this should prevent #8086 but I haven't tried reproducing this

@mheon
Copy link
Member

mheon commented Aug 16, 2024

LGTM on my end. I don't think the original issue is easy to reproduce, so spending that much effort on it probably not a good idea

There are two major problems with UpdateContainerStatus()
First, it can deadlock when the the state json is to big as it tries to
read stderr until EOF but it will never hit EOF as long as the runtime
process is alive. This means if the runtime json is to big to git into
the pipe buffer we deadlock ourselves.
Second, the function modifies the container state struct and even adds
and exit code to the db however when it is called from the stop() code
path we will be unlocked here.

While the first problem is easy to fix the second one not so much. And
when we cannot update the state there is no point in reading the from
runtime in the first place as such remove the function as it does more
harm then good.

And add some warnings the the functions that might be called unlocked.

Fixes containers#22246

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99 Luap99 force-pushed the update-container-status branch from 223d363 to ddece75 Compare August 16, 2024 13:34
@rhatdan
Copy link
Member

rhatdan commented Aug 16, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 16, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 670b245 into containers:main Aug 16, 2024
83 checks passed
@Luap99 Luap99 deleted the update-container-status branch August 16, 2024 15:04
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Nov 15, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Nov 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. No New Tests Allow PR to proceed without adding regression tests release-note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

e2e: kube play, huge annotation: podman rm hangs
3 participants