podman start --filter restart-policy=always : container state improper #23246

Closed
edsantiago opened this issue Jul 10, 2024 · 7 comments · Fixed by #23258
Comments


edsantiago commented Jul 10, 2024

Looks related to #22914, although probably not a regression (that one was a reliable panic, this is a flaky podman error):

[+0319s] not ok 151 [045] podman start --filter - start only containers that match the filter in 1914ms
...
<+010ms> # $ podman start --filter restart-policy=always   <<< first restart
<+268ms> # e214e6b649c557e0b24e610cb2bb986694f01f16e5545337195319b5f86763c8
         # 877d80fe7cfa11755af13fe02dbf4a359887ac6ca40e41b17f67b3d1c08392d9
         #
<+074ms> # $ podman start --filter restart-policy=always    <<<< second restart
<+351ms> # Error: unable to start container "e214e6b649c557e0b24e610cb2bb986694f01f16e5545337195319b5f86763c8": container e214e6b649c557e0b24e610cb2bb986694f01f16e5545337195319b5f86763c8 must be in Created or Stopped state to be started: container state improper

For extra credit, maybe someone could fix that error message to indicate the actual current container state.

Seen in: sys(3) podman(3) fedora-39(2) rootless(3) host(3) boltdb(2) rawhide(1) sqlite(1)
edsantiago added the flakes (Flakes from Continuous Integration) and rootless labels on Jul 10, 2024

Luap99 commented Jul 11, 2024

FYI, reproduced locally. I didn't add a timer in my loop so I'm not sure how long it took, but it was somewhere around 30-60 minutes:

for i in {1..3}; do
    podman run --restart always --name c$i quay.io/libpod/testimage:20240123 true
done
while :; do bin/podman start --filter restart-policy=always || break ; done

I patched the binary to show the current state and it was running when it failed.


Luap99 commented Jul 11, 2024

I think I see the issue, given that finding. Our code does:

1. GetContainers() (with the given filter).
2. Then we check the state of each matching container without holding the container lock; if it is not running, we call Start().
3. In Start() we then lock the container, which means the state could have changed by the time we hold the lock. Now that we are locked we check the state again and report the error here, because the container is already running.

My fix would be to move the running-state check into the locked Start() function.
cc @mheon
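
To make the window concrete, here is a small self-contained Go toy that mimics that check-then-act pattern (hypothetical names, not the actual libpod code). Depending on scheduling, the concurrent start wins the race and the second start hits the "already running" error:

package main

import (
	"errors"
	"fmt"
	"sync"
)

// toyContainer stands in for a libpod container: a lock plus a state flag.
type toyContainer struct {
	mu      sync.Mutex
	running bool
}

var errImproperState = errors.New("container state improper")

// start mimics Start(): it takes the container lock, re-checks the state,
// and errors out if the container is already running.
func (c *toyContainer) start() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.running {
		return errImproperState
	}
	c.running = true
	return nil
}

// isRunning reads the state under the lock, then releases it again.
func (c *toyContainer) isRunning() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.running
}

func main() {
	c := &toyContainer{}

	// Simulate the restart-policy handler starting the container concurrently.
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		_ = c.start()
	}()

	// The racy caller: check the state, then call start, without staying
	// locked in between. If the goroutine above runs in that window, the
	// second start sees "running" and reports the error from the flake.
	if !c.isRunning() {
		if err := c.start(); err != nil {
			fmt.Println("flake reproduced:", err)
		}
	}
	wg.Wait()
}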


mheon commented Jul 11, 2024

I feel like locking the existing check would be most correct, to avoid making Start() more complicated than it is right now, but it would also add an extra lock/unlock, so I won't argue too hard.


Luap99 commented Jul 11, 2024

> I feel like locking the existing check would be most correct, to avoid making Start() more complicated than it is right now, but it would also add an extra lock/unlock, so I won't argue too hard.

That wouldn't work, because we must stay locked for both checks; if we unlock in between, it again opens up the window for another state change.


mheon commented Jul 11, 2024

Ew.

Maybe add an ErrCtrRunning, return that from Start() if the container is in the running or paused state, and ignore it at the caller?
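
Sketched out, the typed-error idea could look like this (hypothetical names; the real error and where it lives are up to the PR), with callers that want an idempotent start filtering it out via errors.Is:

package main

import (
	"errors"
	"fmt"
)

// ErrCtrRunning would signal "nothing to do, the container is already
// running (or paused)" rather than a real failure.
var ErrCtrRunning = errors.New("container is running or paused")

// startContainer stands in for Start(): it wraps the sentinel so callers
// still see the container-specific context.
func startContainer(id string, running bool) error {
	if running {
		return fmt.Errorf("starting container %s: %w", id, ErrCtrRunning)
	}
	// ... actual start logic would go here ...
	return nil
}

func main() {
	err := startContainer("e214e6b649c5", true)
	if err != nil && !errors.Is(err, ErrCtrRunning) {
		// Only real failures reach this branch; callers that want
		// idempotence simply ignore the typed "already running" error.
		fmt.Println("start failed:", err)
	}
}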


Luap99 commented Jul 11, 2024

I need to take a look tomorrow, but if all callers then have to ignore it, that just complicates the code. However, if some code paths want this to fail, then a typed error sounds good to me.


Luap99 commented Jul 12, 2024

#23258
I will let my reproducer run for a while, but I am pretty sure this will fix it.

Luap99 added a commit to Luap99/libpod that referenced this issue Jul 12, 2024
The current code did something like this:
lock()
getState()
unlock()

if state != running
  lock()
  getState() == running -> error
  unlock()

This of course is wrong because between the first unlock() and second
lock() call another process could have modified the state. This meant
that sometimes you would get a weird error on start because the internal
setup errored as the container was already running.

In general any state check without holding the lock is incorrect and
will result in race conditions. As such refactor the code to combine
both StartAndAttach and Attach() into one function that can handle both.
With that we can move the running check into the locked code.

Also use a typed error for this specific error case so the callers can check for and ignore it when needed. This also allows us to fix races in the compat API that did a similar racy state check.

This commit changes slightly how we output the result: previously, a start on an already running container would never print the id/name of the container, which is confusing and sort of breaks idempotence. Now it will include the output, except when --all is used; then it only reports the ids that were actually started.

Fixes containers#23246

Signed-off-by: Paul Holzinger <[email protected]>
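
In terms of the toy example from above (reusing its hypothetical toyContainer and ErrCtrRunning, not the actual libpod code), the corrected flow described in that commit message boils down to doing the check and the start under a single lock acquisition and reporting "already running" as the typed error:

// startIfNotRunning holds the lock across both the state check and the
// start, so no other process can change the state in between. An already
// running container yields the typed error instead of a hard failure.
func (c *toyContainer) startIfNotRunning() error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.running {
		return ErrCtrRunning // callers like `podman start --filter` can ignore this
	}
	c.running = true
	return nil
}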
stale-locking-app bot added the locked - please file new issue/PR label on Oct 14, 2024
stale-locking-app bot locked as resolved and limited conversation to collaborators on Oct 14, 2024