libpod: remove UpdateContainerStatus() #23644

Luap99 · 2024-08-16T11:05:43Z

There are two major problems with UpdateContainerStatus(). First, it can deadlock when the state JSON is too big: it tries to read stderr until EOF, but it will never hit EOF as long as the runtime process is alive. This means that if the runtime JSON is too big to fit into the pipe buffer, we deadlock ourselves.
Second, the function modifies the container state struct and even adds an exit code to the DB; however, when it is called from the stop() code path, we are not holding the container lock at that point.

While the first problem is easy to fix, the second one is not. And when we cannot update the state, there is no point in reading it from the runtime in the first place, so remove the function, as it does more harm than good.

Also add some warnings to the functions that might be called unlocked.

Fixes #22246

Does this PR introduce a user-facing change?

Fixed a possible deadlock in the container stop code path when using big annotations

openshift-ci · 2024-08-16T11:06:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99


Needs approval from an approver in each of these files:

~~OWNERS~~ [Luap99]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Luap99 · 2024-08-16T11:49:02Z

@mheon PTAL

mheon · 2024-08-16T11:55:57Z

This was dealing with a legitimate bug (podman kill failing because of ESRCH was not updating container status to Exited, thus making the stop code produce bad results on already-dead containers in some circumstances). Can we restore the code, but in the locked part of stop() if an error was returned from killContainer()?

Luap99 · 2024-08-16T12:01:18Z

Should we trust the clean-up process to update the state? I don't follow the logic here. This function was only called when kill failed (i.e. ESRCH) in which case the container process wasn't running so why should podman kill have to do anything in this case besides reporting an error?

mheon · 2024-08-16T12:13:01Z

The original issue this was attempting to fix is #8086 - sig-proxy errors, apparently?

UpdateContainerStatus definitely feels like the wrong way to resolve those. It feels like a manual kill with signal 0 and a check for ESRCH would be more than sufficient to suppress the error?

Luap99 · 2024-08-16T13:30:06Z

@mheon updated to return ErrCtrStateInvalid, which is already handled in ProxySignals(). This should prevent #8086, but I haven't tried reproducing it.

mheon · 2024-08-16T13:31:28Z

LGTM on my end. I don't think the original issue is easy to reproduce, so spending that much effort on it is probably not a good idea.

Fixes containers#22246
Signed-off-by: Paul Holzinger

rhatdan · 2024-08-16T14:59:58Z

/lgtm

openshift-ci bot added the release-note label Aug 16, 2024

Luap99 added the "No New Tests" label (allow PR to proceed without adding regression tests) Aug 16, 2024

openshift-ci bot added the "approved" label (indicates a PR has been approved by an approver from all required OWNERS files) Aug 16, 2024

Luap99 force-pushed the update-container-status branch from e502095 to 223d363 on August 16, 2024 13:28

Luap99 force-pushed the update-container-status branch from 223d363 to ddece75 on August 16, 2024 13:34

openshift-ci bot assigned rhatdan Aug 16, 2024

openshift-ci bot added the "lgtm" label (indicates that a PR is ready to be merged) Aug 16, 2024

openshift-merge-bot bot merged commit 670b245 into containers:main Aug 16, 2024
83 checks passed

Luap99 deleted the update-container-status branch August 16, 2024 15:04

stale-locking-app bot added the "locked - please file new issue/PR" label (assists humans wanting to comment on an old issue or PR with locked comments) Nov 15, 2024

stale-locking-app bot locked as resolved and limited conversation to collaborators Nov 15, 2024
