Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use persist dir for oom file #21523

Merged
merged 1 commit into from
Feb 12, 2024

Conversation

umohnani8
Copy link
Member

@umohnani8 umohnani8 commented Feb 5, 2024

Conmon writes the exit file and oom file (if container
was oom killed) to the persist directory. This directory
is retained across reboots as well.
Update podman to create a persist-dir/ctr-id for the exit
and oom files for each container to be written to. The oom
state of container is set after reading the files
from the persist-dir/ctr-id directory.
The exit code still continues to read the exit file from
the exits directory.

Fixes #13102

Does this PR introduce a user-facing change?

Use the conmon persist-dir for reading the container's oom file if oom killed.

Copy link
Contributor

openshift-ci bot commented Feb 5, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: umohnani8

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 5, 2024
Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

1 similar comment
Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

Copy link

Cockpit tests failed for commit 67b526e0fc51538fff7220f649ff84e781aab34a. @martinpitt, @jelly, @mvollmer please check.

Copy link

Cockpit tests failed for commit 0c224be3c7f9e2034f0819ffb45799db44df1f48. @martinpitt, @jelly, @mvollmer please check.

@umohnani8 umohnani8 force-pushed the memory-final branch 3 times, most recently from d3c5dab to c7e8ec1 Compare February 6, 2024 19:25
Copy link

Cockpit tests failed for commit d3c5dab3c28b1386c034c0451c93c20efef02d62. @martinpitt, @jelly, @mvollmer please check.

Copy link

Cockpit tests failed for commit c7e8ec198eafe553d193993624d8c08319d9de4d. @martinpitt, @jelly, @mvollmer please check.

Copy link

Cockpit tests failed for commit c2d7ab84814f539a8d9529b372937833be95f047. @martinpitt, @jelly, @mvollmer please check.

@umohnani8
Copy link
Member Author

umohnani8 commented Feb 7, 2024

@rhatdan @mheon @haircommander PTAL

This is a breaking change as we are moving from using exit-dir to perisist-dir for the exit and oom files. Conmon writes the oom file to the persist-dir if a container is oom killed as well as the exit file so switching to only using persist-dir.

This fixes the bug also where we were not setting oom killed to true when the container was oom killed.

@haircommander
Copy link
Collaborator

this makes sense to me. exit dir is most useful for cri-o which registers an inotify watcher on it. Having the exists centralized is helpful. Whereas podman really only needs to read from one location, and having that be persist path makes sense

@mheon
Copy link
Member

mheon commented Feb 7, 2024

LGTM

@rhatdan
Copy link
Member

rhatdan commented Feb 7, 2024

/lgtm
/hold

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Feb 7, 2024
@umohnani8
Copy link
Member Author

Is the hyperv failure expected?

@mheon
Copy link
Member

mheon commented Feb 7, 2024

No, but it could be a flake

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this break all existing running containers? They can no longer read the exit file correctly.
I would love to have #21535 running on this before merging. If this doesn't work then we cannot really upgrade test properly.

libpod/oci_conmon_common.go Outdated Show resolved Hide resolved
libpod/oci_conmon_common.go Outdated Show resolved Hide resolved
libpod/oci_missing.go Outdated Show resolved Hide resolved
@Luap99 Luap99 removed the lgtm Indicates that a PR is ready to be merged. label Feb 8, 2024
@umohnani8
Copy link
Member Author

umohnani8 commented Feb 8, 2024

Sure we can wait on #21535 to get in before getting this in.

And yes it is a breaking change, any existing containers will need to be re-created.
This is no longer a breaking change.

Copy link

Ephemeral COPR build failed. @containers/packit-build please check.

@umohnani8 umohnani8 force-pushed the memory-final branch 2 times, most recently from 8b4cf78 to ca960e8 Compare February 8, 2024 19:25
Conmon writes the exit file and oom file (if container
was oom killed) to the persist directory. This directory
is retained across reboots as well.
Update podman to create a persist-dir/ctr-id for the exit
and oom files for each container to be written to. The oom
state of container is set after reading the files
from the persist-dir/ctr-id directory.
The exit code still continues to read the exit file from
the exits directory.

Signed-off-by: Urvashi Mohnani <[email protected]>
@umohnani8 umohnani8 changed the title Use persist dir for exit and oom file Use persist dir for oom file Feb 12, 2024
@umohnani8
Copy link
Member Author

So this is no longer a breaking change as we only use the persist dir for the oom file now. We were running into some timeout issues when switching to persist dir completely for the exit files and I think that was because we also use an inotify watcher.

@mheon @rhatdan @Luap99 this is ready, PTAL.

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code LGTM but given that touches exec parts as well I guess we should test an podman exec oom kill as well here. I am not sure what happens if podman exec runs into an oom kill than does it a) only kill the exec session or b) the whole container or c ) can it be either depending on what the oom killer feels like or something else entirely?

@haircommander
Copy link
Collaborator

haircommander commented Feb 12, 2024

a) only kill the exec session or b) the whole container or c ) can it be either depending on what the oom killer feels like or something else entirely?

should be the exec session unless podman is setting the cgroupv2 knob memory.oom.group to 1 for the container's cgroup

though, with v1 oom killing, sometimes it's c :)

@umohnani8
Copy link
Member Author

Nothing with the exec functionality has been changed by adding the persist dir option here. The only thing is if an exec session is oom killed, I believe it will write oom file in the persist dir location and handle the exit code as it always was.

I am trying to add a test regardless, which works as expected when I do it manually but not in the gingko test suite. These are the commands I am running

podman run -d --name memory-test --memory 3M --oom-score-adj 1000 alpine tail -f /dev/null
podman exec -it memory-test sh -c "dd if=/dev/zero of=/dev/null bs=10M"

Any other possible examples that would work with the test suite?

@rhatdan
Copy link
Member

rhatdan commented Feb 12, 2024

/unhold

If something has to be done for exec lets do it in a separate pr.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 12, 2024
@rhatdan
Copy link
Member

rhatdan commented Feb 12, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 12, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 01bd79b into containers:main Feb 12, 2024
84 checks passed
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label May 13, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators May 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. release-note
Projects
None yet
5 participants