podman start --filter restart-policy=always : container state improper #23246

Closed
edsantiago opened this issue Jul 10, 2024 · 7 comments · Fixed by #23258
Comments


edsantiago commented Jul 10, 2024

Looks related to #22914, although probably not a regression (that one was a reliable panic, this is a flaky podman error):

[+0319s] not ok 151 [045] podman start --filter - start only containers that match the filter in 1914ms
...
<+010ms> # $ podman start --filter restart-policy=always   <<< first restart
<+268ms> # e214e6b649c557e0b24e610cb2bb986694f01f16e5545337195319b5f86763c8
         # 877d80fe7cfa11755af13fe02dbf4a359887ac6ca40e41b17f67b3d1c08392d9
         #
<+074ms> # $ podman start --filter restart-policy=always    <<<< second restart
<+351ms> # Error: unable to start container "e214e6b649c557e0b24e610cb2bb986694f01f16e5545337195319b5f86763c8": container e214e6b649c557e0b24e610cb2bb986694f01f16e5545337195319b5f86763c8 must be in Created or Stopped state to be started: container state improper

For extra credit, maybe someone could fix that error message to indicate the actual current container state.

Seen in: sys(3) podman(3) fedora-39(2) rootless(3) host(3) boltdb(2) rawhide(1) sqlite(1)
edsantiago added the flakes (Flakes from Continuous Integration) and rootless labels on Jul 10, 2024

Luap99 commented Jul 11, 2024

FYI, reproduced locally. I didn't add a timer in my loop so I'm not sure how long it took, but it was somewhere around 30-60 minutes:

for i in {1..3}; do
    podman run --restart always --name c$i quay.io/libpod/testimage:20240123 true
done
while :; do bin/podman start --filter restart-policy=always || break ; done

I patched the binary to show the current state and it was running when it failed.


Luap99 commented Jul 11, 2024

I think I see the issue, given that finding. Our code does:

1. GetContainers() (with the given filter).
2. Then we check the state of each matching container without holding the container lock; if it is not running, we call Start().
3. In Start() we then lock the container, which means the state could have changed by the time we hold the lock. Now that we are locked we check the state again and report the error here, because the container is already running.

My fix would be to move the running-state check into the locked Start() function.
cc @mheon
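
To make the window concrete, here is a small self-contained Go toy that mimics that check-then-act pattern (hypothetical names, not the actual libpod code). Depending on scheduling, the concurrent start wins the race and the second start hits the "already running" error:

package main

import (
	"errors"
	"fmt"
	"sync"
)

// toyContainer stands in for a libpod container: a lock plus a state flag.
type toyContainer struct {
	mu      sync.Mutex
	running bool
}

var errImproperState = errors.New("container state improper")

// start mimics Start(): it takes the container lock, re-checks the state,
// and errors out if the container is already running.
func (c *toyContainer) start() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.running {
		return errImproperState
	}
	c.running = true
	return nil
}

// isRunning reads the state under the lock, then releases it again.
func (c *toyContainer) isRunning() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.running
}

func main() {
	c := &toyContainer{}

	// Simulate the restart-policy handler starting the container concurrently.
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		_ = c.start()
	}()

	// The racy caller: check the state, then call start, without staying
	// locked in between. If the goroutine above runs in that window, the
	// second start sees "running" and reports the error from the flake.
	if !c.isRunning() {
		if err := c.start(); err != nil {
			fmt.Println("flake reproduced:", err)
		}
	}
	wg.Wait()
}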


mheon commented Jul 11, 2024

I feel like locking the existing check would be most correct, to avoid making Start() more complicated than it is right now, but it would also add an extra lock/unlock, so I won't argue too hard.


Luap99 commented Jul 11, 2024

> I feel like locking the existing check would be most correct, to avoid making Start() more complicated than it is right now, but it would also add an extra lock/unlock, so I won't argue too hard.

That wouldn't work, because we must stay locked for both checks; if we unlock in between, it again opens up the window for another state change.


mheon commented Jul 11, 2024

Ew.

Maybe add an ErrCtrRunning, return that from Start() if the container is in the running or paused state, and ignore it at the caller?
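
Sketched out, the typed-error idea could look like this (hypothetical names; the real error and where it lives are up to the PR), with callers that want an idempotent start filtering it out via errors.Is:

package main

import (
	"errors"
	"fmt"
)

// ErrCtrRunning would signal "nothing to do, the container is already
// running (or paused)" rather than a real failure.
var ErrCtrRunning = errors.New("container is running or paused")

// startContainer stands in for Start(): it wraps the sentinel so callers
// still see the container-specific context.
func startContainer(id string, running bool) error {
	if running {
		return fmt.Errorf("starting container %s: %w", id, ErrCtrRunning)
	}
	// ... actual start logic would go here ...
	return nil
}

func main() {
	err := startContainer("e214e6b649c5", true)
	if err != nil && !errors.Is(err, ErrCtrRunning) {
		// Only real failures reach this branch; callers that want
		// idempotence simply ignore the typed "already running" error.
		fmt.Println("start failed:", err)
	}
}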


Luap99 commented Jul 11, 2024

I need to take a look tomorrow, but if all callers then have to ignore it, that just complicates the code. However, if some code paths want this to fail, then a typed error sounds good to me.


Luap99 commented Jul 12, 2024

#23258
I will let my reproducer run for a while, but I am pretty sure this will fix it.

Luap99 added a commit to Luap99/libpod that referenced this issue Jul 12, 2024
The current code did something like this:
lock()
getState()
unlock()

if state != running
  lock()
  getState() == running -> error
  unlock()

This of course is wrong because between the first unlock() and second
lock() call another process could have modified the state. This meant
that sometimes you would get a weird error on start because the internal
setup errored as the container was already running.

In general any state check without holding the lock is incorrect and
will result in race conditions. As such refactor the code to combine
both StartAndAttach and Attach() into one function that can handle both.
With that we can move the running check into the locked code.

Also use a typed error for this specific error case so the callers can check for and ignore it when needed. This also allows us to fix races in the compat API that did a similar racy state check.

This commit changes slightly how we output the result: previously, a start on an already running container would never print the id/name of the container, which is confusing and sort of breaks idempotence. Now it will include the output, except when --all is used; then it only reports the ids that were actually started.

Fixes containers#23246

Signed-off-by: Paul Holzinger <[email protected]>
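
In terms of the toy example from above (reusing its hypothetical toyContainer and ErrCtrRunning, not the actual libpod code), the corrected flow described in that commit message boils down to doing the check and the start under a single lock acquisition and reporting "already running" as the typed error:

// startIfNotRunning holds the lock across both the state check and the
// start, so no other process can change the state in between. An already
// running container yields the typed error instead of a hard failure.
func (c *toyContainer) startIfNotRunning() error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.running {
		return ErrCtrRunning // callers like `podman start --filter` can ignore this
	}
	c.running = true
	return nil
}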
stale-locking-app bot added the locked - please file new issue/PR label on Oct 14, 2024
stale-locking-app bot locked as resolved and limited conversation to collaborators on Oct 14, 2024