e2e: kube play, huge annotation: podman rm hangs #22246

Closed
edsantiago opened this issue Apr 3, 2024 · 5 comments · Fixed by #23644
Labels:
  flakes (Flakes from Continuous Integration)
  locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments)
  remote (Problem is in podman-remote)

Comments

@edsantiago
Member

Seeing a new flake recently, so far only in podman-remote root. Not OS-specific:

Podman kube play
  test with annotation size within limits
....
# podman-remote [options] kube play /var/tmp/pme2e-1564132875/pm3097852031/kube.yaml
[works fine]
← Exit  [It] test with annotation size within limits
→ Enter [AfterEach] TOP-LEVEL
# podman-remote [options] stop --all -t 0
[works fine]
# podman-remote [options] pod rm -fa -t 0

[FAILED] Timed out after 90.000s.
  • debian-13 : int remote debian-13 root host sqlite [remote]
    • 04-02 12:13 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 04-01 13:30 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 03-26 14:32 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • fedora-39 : int remote fedora-39 root host sqlite [remote]
    • 04-02 08:52 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • rawhide : int remote rawhide root host sqlite [remote]
    • 04-02 21:12 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
int(5) remote(5) debian-13(3) root(5) host(5) sqlite(5)
rawhide(1)
fedora-39(1)
@edsantiago edsantiago added flakes Flakes from Continuous Integration remote Problem is in podman-remote labels Apr 3, 2024
@edsantiago
Member Author

I instrumented my no-retries PR to dump the YAML; here it is from f39 remote root:

           apiVersion: v1
           kind: Pod
           metadata:
             creationTimestamp: "2019-07-17T14:44:08Z"
             name: testPod
             labels:
               app: testPod
         
         
             annotations:
             
               name: SOMETHING TOO LONG FOR GITHUB TO LET ME PUT IN A COMMENT             
         
         
           spec:
             restartPolicy: Never
             hostname: 
             hostNetwork: false
         
             hostAliases:
             initContainers:
             containers:
             - command:
               - top
               args:
               - -d
               - 1.5
               env:
               - name: HOSTNAME
               image: quay.io/libpod/testimage:20240123
               name: testCtr
               imagePullPolicy: missing
               securityContext:
                 allowPrivilegeEscalation: true
                 privileged: false
                 readOnlyRootFilesystem: false
               ports:
               - containerPort: 
                 hostIP: 
                 hostPort: 
                 protocol: TCP
               workingDir: /
               volumeMounts:
           status: {}
           # podman-remote [options] kube play /var/tmp/podman-e2e-1368536477/subtest-2044718688/kube.yaml
           Pod:
           0ca4807ad93b155bc3c542fb835620d492581d937217f3f43cf260a0b90942b7
           Container:
           7db8ba2673e8a2f2fd99803932a9914b0273438795962bbf2b05b49b14535139

[hang]

@edsantiago
Member Author

Still happening. The recent logs below include a dump of the annotation, in case it helps, but I think it won't.

  • debian-13 : int remote debian-13 root host sqlite [remote]
    • 07-22 11:54 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 07-17 17:11 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 07-11 23:13 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 04-04-2024 06:59 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 04-02-2024 12:13 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 04-01-2024 13:30 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 03-26-2024 14:32 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • fedora-39 : int remote fedora-39 root host boltdb [remote]
    • 07-12 10:23 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • fedora-39 : int remote fedora-39 root host sqlite [remote]
    • 05-02-2024 17:18 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 04-02-2024 08:52 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • rawhide : int remote rawhide root host sqlite [remote]
    • 04-02-2024 21:12 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
int(11) remote(11) debian-13(7) root(11) host(11) sqlite(10)
fedora-39(3) boltdb(1)
rawhide(1)

@edsantiago
Member Author

This one is failing multiple times a day in my no-retry PR. Here's the last two weeks:

  • debian-13 : int remote debian-13 root host sqlite [remote]
  • fedora-39 : int remote fedora-39 root host boltdb [remote]
    • 08-05 09:46 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 07-26 07:52 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • fedora-40 : int remote fedora-40 root host sqlite [remote]
    • 08-06 11:34 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 08-06 10:35 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 07-31 20:49 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 07-31 13:56 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • rawhide : int remote rawhide root host sqlite [remote]
    • 08-06 12:38 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 08-06 10:36 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 08-05 09:50 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 08-01 07:12 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 07-31 23:06 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
int(13) remote(13) rawhide(5) root(13) host(13) sqlite(11)
fedora-40(4) boltdb(2)
fedora-39(2)
debian-13(2)

@mheon
Member

mheon commented Aug 6, 2024

I'll start poking at this one. Probably something to do with the sheer size of the annotation making our REST API rather angry.

@Luap99
Member

Luap99 commented Aug 16, 2024

Based on the error here

I see what is happening: the code reads stderr first until EOF, then stdout until EOF.
There is no error here; stderr is empty, but we never get EOF because the crun process must exit in order for us to get EOF.
The crun process, however, now needs to write a very big JSON to the stdout pipe, and note that we do not start reading from stdout until the crun process exits. So if the JSON is large enough to exceed the pipe buffer, the write on the crun side blocks until we start reading from the pipe, effectively deadlocking us.

In fact this is trivial to reproduce once we know this. The reason it is only flaky is that UpdateContainerStatus() is never called by default; it gets called only when the crun kill command fails, which can happen if the container was already stopped/exited (a normal race condition, because we unlock during stop).

And this isn't really related to remote either; it is just that remote makes the race that triggers this more likely. A minimal sketch of the deadlock pattern is shown below.
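A minimal, self-contained sketch of that deadlock pattern in Go (not Podman's actual code; the child command and sizes are illustrative): the parent reads stderr to EOF before touching stdout, while the child writes more than one pipe buffer's worth of data to stdout.

```go
package main

// Illustrative reproduction of the deadlock described above; not Podman code.
// The child writes ~1 MiB to stdout (far more than the typical 64 KiB Linux
// pipe buffer) and nothing to stderr. Reading stderr to EOF first hangs:
// EOF on stderr only arrives once the child exits, but the child is blocked
// writing into the full stdout pipe that nobody is draining yet.

import (
	"fmt"
	"io"
	"os/exec"
)

func main() {
	cmd := exec.Command("sh", "-c", "head -c 1048576 /dev/zero | tr '\\0' x")

	stdoutPipe, _ := cmd.StdoutPipe()
	stderrPipe, _ := cmd.StderrPipe()
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	errBytes, _ := io.ReadAll(stderrPipe) // blocks forever: the deadlock
	outBytes, _ := io.ReadAll(stdoutPipe) // never reached
	_ = cmd.Wait()

	fmt.Printf("stderr: %d bytes, stdout: %d bytes\n", len(errBytes), len(outBytes))
}
```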

Luap99 added a commit to Luap99/libpod that referenced this issue Aug 16, 2024
There are two major problems with UpdateContainerStatus().
First, it can deadlock when the state JSON is too big, as it tries to
read stderr until EOF but will never hit EOF as long as the runtime
process is alive. This means that if the runtime JSON is too big to fit
into the pipe buffer we deadlock ourselves.
Second, the function modifies the container state struct and even adds
an exit code to the db; however, when it is called from the stop() code
path we will be unlocked here.

While the first problem is easy to fix, the second one is not. And
when we cannot update the state there is no point in reading from the
runtime in the first place, so remove the function as it does more
harm than good.

Also add some warnings to the functions that might be called unlocked.

Fixes containers#22246

Signed-off-by: Paul Holzinger <[email protected]>
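For contrast, a hedged sketch of the usual way to avoid this class of deadlock: drain both streams concurrently so neither pipe can fill up while the child is still running. Illustrative only; the actual fix in #23644 removes UpdateContainerStatus() rather than patching the read.

```go
package main

// Sketch of a non-deadlocking capture pattern, for illustration only.
// Assigning bytes.Buffer writers to cmd.Stdout/cmd.Stderr makes os/exec
// copy each stream in its own goroutine, so large output on either stream
// cannot block the child process.

import (
	"bytes"
	"fmt"
	"os/exec"
)

func runAndCapture(name string, args ...string) (string, string, error) {
	var outBuf, errBuf bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &outBuf
	cmd.Stderr = &errBuf
	err := cmd.Run()
	return outBuf.String(), errBuf.String(), err
}

func main() {
	// Same "big JSON on stdout" scenario as above, but no hang.
	out, errOut, err := runAndCapture("sh", "-c", "head -c 1048576 /dev/zero | tr '\\0' x")
	fmt.Printf("stdout: %d bytes, stderr: %d bytes, err: %v\n", len(out), len(errOut), err)
}
```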
Luap99 added a commit to Luap99/libpod that referenced this issue Nov 5, 2024
Luap99 added a commit to Luap99/libpod that referenced this issue Nov 6, 2024
Luap99 added a commit to Luap99/libpod that referenced this issue Nov 6, 2024
@stale-locking-app bot added the "locked - please file new issue/PR" label Nov 15, 2024
@stale-locking-app bot locked as resolved and limited conversation to collaborators Nov 15, 2024