e2e: kube play, huge annotation: podman rm hangs #22246

Closed
edsantiago opened this issue Apr 3, 2024 · 5 comments · Fixed by #23644
Labels:
  flakes (Flakes from Continuous Integration)
  locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments)
  remote (Problem is in podman-remote)

Comments

@edsantiago
Member

Seeing a new flake recently, so far only in podman-remote root. Not OS-specific:

Podman kube play
  test with annotation size within limits
....
# podman-remote [options] kube play /var/tmp/pme2e-1564132875/pm3097852031/kube.yaml
[works fine]
← Exit  [It] test with annotation size within limits
→ Enter [AfterEach] TOP-LEVEL
# podman-remote [options] stop --all -t 0
[works fine]
# podman-remote [options] pod rm -fa -t 0

[FAILED] Timed out after 90.000s.
  • debian-13 : int remote debian-13 root host sqlite [remote]
    • 04-02 12:13 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 04-01 13:30 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 03-26 14:32 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • fedora-39 : int remote fedora-39 root host sqlite [remote]
    • 04-02 08:52 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • rawhide : int remote rawhide root host sqlite [remote]
    • 04-02 21:12 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
int(5) remote(5) debian-13(3) root(5) host(5) sqlite(5)
rawhide(1)
fedora-39(1)
@edsantiago edsantiago added flakes Flakes from Continuous Integration remote Problem is in podman-remote labels Apr 3, 2024
@edsantiago
Member Author

I instrumented my no-retries PR to dump the YAML; here it is from f39 remote root:

           apiVersion: v1
           kind: Pod
           metadata:
             creationTimestamp: "2019-07-17T14:44:08Z"
             name: testPod
             labels:
               app: testPod
         
         
             annotations:
             
               name: SOMETHING TOO LONG FOR GITHUB TO LET ME PUT IN A COMMENT             
         
         
           spec:
             restartPolicy: Never
             hostname: 
             hostNetwork: false
         
             hostAliases:
             initContainers:
             containers:
             - command:
               - top
               args:
               - -d
               - 1.5
               env:
               - name: HOSTNAME
               image: quay.io/libpod/testimage:20240123
               name: testCtr
               imagePullPolicy: missing
               securityContext:
                 allowPrivilegeEscalation: true
                 privileged: false
                 readOnlyRootFilesystem: false
               ports:
               - containerPort: 
                 hostIP: 
                 hostPort: 
                 protocol: TCP
               workingDir: /
               volumeMounts:
           status: {}
           # podman-remote [options] kube play /var/tmp/podman-e2e-1368536477/subtest-2044718688/kube.yaml
           Pod:
           0ca4807ad93b155bc3c542fb835620d492581d937217f3f43cf260a0b90942b7
           Container:
           7db8ba2673e8a2f2fd99803932a9914b0273438795962bbf2b05b49b14535139

[hang]

@edsantiago
Member Author

Still happening. The recent logs below include a dump of the annotation, in case it helps, but I think it won't.

  • debian-13 : int remote debian-13 root host sqlite [remote]
    • 07-22 11:54 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 07-17 17:11 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 07-11 23:13 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 04-04-2024 06:59 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 04-02-2024 12:13 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 04-01-2024 13:30 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 03-26-2024 14:32 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • fedora-39 : int remote fedora-39 root host boltdb [remote]
    • 07-12 10:23 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • fedora-39 : int remote fedora-39 root host sqlite [remote]
    • 05-02-2024 17:18 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 04-02-2024 08:52 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • rawhide : int remote rawhide root host sqlite [remote]
    • 04-02-2024 21:12 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
int(11) remote(11) debian-13(7) root(11) host(11) sqlite(10)
fedora-39(3) boltdb(1)
rawhide(1)

@edsantiago
Member Author

This one is failing multiple times a day in my no-retry PR. Here's the last two weeks:

  • debian-13 : int remote debian-13 root host sqlite [remote]
  • fedora-39 : int remote fedora-39 root host boltdb [remote]
    • 08-05 09:46 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 07-26 07:52 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • fedora-40 : int remote fedora-40 root host sqlite [remote]
    • 08-06 11:34 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 08-06 10:35 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 07-31 20:49 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 07-31 13:56 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
  • rawhide : int remote rawhide root host sqlite [remote]
    • 08-06 12:38 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 08-06 10:36 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 08-05 09:50 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 08-01 07:12 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
    • 07-31 23:06 in TOP-LEVEL [AfterEach] Podman kube play test with annotation size within limits
int(13) remote(13) rawhide(5) root(13) host(13) sqlite(11)
fedora-40(4) boltdb(2)
fedora-39(2)
debian-13(2)

@mheon
Member

mheon commented Aug 6, 2024

I'll start poking at this one. Probably something to do with the sheer size of the annotation making our REST API rather angry.

@Luap99
Member

Luap99 commented Aug 16, 2024

Based on the error here

I see what is happening: the code reads stderr first until EOF, then stdout until EOF.
There is no error here; stderr is empty, but we never get EOF because the crun process must exit in order for us to get EOF.
The crun process, however, now needs to write a very big JSON to the stdout pipe, and note that we do not start reading from stdout until the crun process exits. So if the JSON is large enough to exceed the pipe buffer, the write on the crun side blocks until we start reading from the pipe, effectively deadlocking us.

In fact this is trivial to reproduce once we know this. The reason it is only flaky is that UpdateContainerStatus() is never called by default; it gets called only when the crun kill command fails, which can happen if the container was already stopped/exited (a normal race condition, because we unlock during stop).

And this isn't really related to remote either; it is just that remote makes the race that triggers this more likely. A minimal sketch of the deadlock pattern is shown below.
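A minimal, self-contained sketch of that deadlock pattern in Go (not Podman's actual code; the child command and sizes are illustrative): the parent reads stderr to EOF before touching stdout, while the child writes more than one pipe buffer's worth of data to stdout.

```go
package main

// Illustrative reproduction of the deadlock described above; not Podman code.
// The child writes ~1 MiB to stdout (far more than the typical 64 KiB Linux
// pipe buffer) and nothing to stderr. Reading stderr to EOF first hangs:
// EOF on stderr only arrives once the child exits, but the child is blocked
// writing into the full stdout pipe that nobody is draining yet.

import (
	"fmt"
	"io"
	"os/exec"
)

func main() {
	cmd := exec.Command("sh", "-c", "head -c 1048576 /dev/zero | tr '\\0' x")

	stdoutPipe, _ := cmd.StdoutPipe()
	stderrPipe, _ := cmd.StderrPipe()
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	errBytes, _ := io.ReadAll(stderrPipe) // blocks forever: the deadlock
	outBytes, _ := io.ReadAll(stdoutPipe) // never reached
	_ = cmd.Wait()

	fmt.Printf("stderr: %d bytes, stdout: %d bytes\n", len(errBytes), len(outBytes))
}
```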

Luap99 added a commit to Luap99/libpod that referenced this issue Aug 16, 2024
There are two major problems with UpdateContainerStatus().
First, it can deadlock when the state JSON is too big, as it tries to
read stderr until EOF but will never hit EOF as long as the runtime
process is alive. This means that if the runtime JSON is too big to fit
into the pipe buffer we deadlock ourselves.
Second, the function modifies the container state struct and even adds
an exit code to the db; however, when it is called from the stop() code
path we will be unlocked here.

While the first problem is easy to fix, the second one is not. And
when we cannot update the state there is no point in reading from the
runtime in the first place, so remove the function as it does more
harm than good.

Also add some warnings to the functions that might be called unlocked.

Fixes containers#22246

Signed-off-by: Paul Holzinger <[email protected]>
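For contrast, a hedged sketch of the usual way to avoid this class of deadlock: drain both streams concurrently so neither pipe can fill up while the child is still running. Illustrative only; the actual fix in #23644 removes UpdateContainerStatus() rather than patching the read.

```go
package main

// Sketch of a non-deadlocking capture pattern, for illustration only.
// Assigning bytes.Buffer writers to cmd.Stdout/cmd.Stderr makes os/exec
// copy each stream in its own goroutine, so large output on either stream
// cannot block the child process.

import (
	"bytes"
	"fmt"
	"os/exec"
)

func runAndCapture(name string, args ...string) (string, string, error) {
	var outBuf, errBuf bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &outBuf
	cmd.Stderr = &errBuf
	err := cmd.Run()
	return outBuf.String(), errBuf.String(), err
}

func main() {
	// Same "big JSON on stdout" scenario as above, but no hang.
	out, errOut, err := runAndCapture("sh", "-c", "head -c 1048576 /dev/zero | tr '\\0' x")
	fmt.Printf("stdout: %d bytes, stderr: %d bytes, err: %v\n", len(out), len(errOut), err)
}
```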
Luap99 added a commit to Luap99/libpod that referenced this issue Nov 5, 2024
Luap99 added a commit to Luap99/libpod that referenced this issue Nov 6, 2024
Luap99 added a commit to Luap99/libpod that referenced this issue Nov 6, 2024
@stale-locking-app bot added the "locked - please file new issue/PR" label Nov 15, 2024
@stale-locking-app bot locked as resolved and limited conversation to collaborators Nov 15, 2024