zombie containerd-shim processes #318

Open
tianon opened this issue Jul 19, 2021 · 12 comments

@tianon
Member

tianon commented Jul 19, 2021

$ docker pull docker:20-dind
20-dind: Pulling from library/docker
Digest: sha256:4e1e22f471afc7ed5e024127396f56db392c1b6fc81fc0c05c0e072fb51909fe
Status: Image is up to date for docker:20-dind
docker.io/library/docker:20-dind

$ docker run -dit --privileged --name test docker:20-dind dockerd
1ee25dc98bf4bc5e232abe27a9e651b18cbfb8b3f6ca981c3ae64c894584e7b4
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   33 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
  154 root      0:00 ps faux
$ docker exec test docker run --rm tianon/true
Unable to find image 'tianon/true:latest' locally
latest: Pulling from tianon/true
c53fb220cbad: Pulling fs layer
c53fb220cbad: Verifying Checksum
c53fb220cbad: Download complete
c53fb220cbad: Pull complete
Digest: sha256:009cce421096698832595ce039aa13fa44327d96beedb84282a69d3dbcf5a81b
Status: Downloaded newer image for tianon/true:latest
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   33 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
  220 root      0:00 [containerd-shim]
  294 root      0:00 ps faux
$ docker exec test docker run --rm tianon/true
$ docker exec test docker run --rm tianon/true
$ docker exec test docker run --rm tianon/true
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   33 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
  220 root      0:00 [containerd-shim]
  331 root      0:00 [containerd-shim]
  429 root      0:00 [containerd-shim]
  529 root      0:00 [containerd-shim]
  600 root      0:00 ps faux

If I run the same test with --init, or with docker-init in front of dockerd (... docker:20-dind docker-init dockerd), then we get no zombies.

I think this is technically a bug in containerd, because I can reproduce with bare containerd as pid1 as well, but it doesn't seem quite the same as containerd/containerd#5708 (although perhaps related).

cc @thaJeztah @cpuguy83

$ docker run -dit --privileged --name test --volume /var/lib/containerd docker:20-dind containerd
2fa1f7a0b543808572a7a2da7ad28fd165d783f1ac8f3e9c59ebb30417f43b9f
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
   44 root      0:00 ps faux
$ docker exec test ctr i pull docker.io/tianon/true:latest
...
$ docker exec test ctr run --rm docker.io/tianon/true:latest foo
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
  110 root      0:00 [containerd-shim]
  152 root      0:00 ps faux
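
The check repeated in these transcripts can be scripted. A minimal sketch, assuming BusyBox-style ps output in which a defunct process is shown with its command name in square brackets (as in the listings above); count_zombie_shims is a hypothetical helper name, not part of Docker or containerd:

```shell
# Hypothetical helper: count zombie containerd-shim entries in ps output
# read from stdin. Assumes defunct processes appear as "[containerd-shim]",
# which is how ps prints them in the transcripts above.
count_zombie_shims() {
  # grep -c prints the number of matching lines; it exits non-zero when
  # the count is 0, so "|| true" keeps callers using "set -e" alive.
  grep -c -F '[containerd-shim]' || true
}
```

For example, piping the output of docker exec test ps faux into count_zombie_shims should print one number per check, which makes it easy to see the count grow after each docker run --rm.
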
@tianon
Member Author

tianon commented Jul 19, 2021

The simplest "fix" (workaround) for this repository is something like adjusting ENTRYPOINT ["dockerd-entrypoint.sh"] to ENTRYPOINT ["docker-init", "dockerd-entrypoint.sh"].
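
Where changing ENTRYPOINT directly is not an option, the same idea could be sketched as a small wrapper. This is an illustration only, assuming the init binary accepts the "init -- command" form (it appears later in this thread as /sbin/docker-init -- containerd); init_prefixed_cmd is a hypothetical name:

```shell
# Hypothetical helper: print the command line a wrapper entrypoint would
# exec. $1 is the init binary to look for; the rest is the real command.
# If the init binary is not on PATH, the command is printed unchanged.
init_prefixed_cmd() {
  init="$1"
  shift
  if command -v "$init" >/dev/null 2>&1; then
    printf '%s -- %s\n' "$init" "$*"
  else
    printf '%s\n' "$*"
  fi
}
```

A real wrapper would exec the resulting command instead of printing it, so that the init process becomes pid1 and reaps orphaned children.
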

@tianon
Member Author

tianon commented Jul 19, 2021

(If you don't trust our entrypoint script [which, fair], you can also reproduce just the same with --entrypoint dockerd 😅)

@tianon
Member Author

tianon commented Jul 20, 2021

Temporary workaround is up in #319 (to just throw docker-init on top of dockerd).

@thaJeztah
Contributor

Did you open a ticket in containerd as well? (If the existing ones don't match this scenario?)

@tianon
Member Author

tianon commented Jul 20, 2021

I didn't file an issue there yet, but I've commented at containerd/containerd#5708 (comment) now (because it feels way too similar to be coincidence, IMO).

@tianon
Member Author

tianon commented Jul 23, 2021

Quoting containerd/containerd#5708 (comment) here for posterity:

I'm facing something that seems really closely related (and IMO it doesn't feel like it can be pure coincidence), although maybe not exactly the same? When running Docker in Docker (or even just raw containerd-in-Docker), I'm seeing 100% reliable behavior where every invocation of a container ends up in a containerd-shim zombie, and it goes away if I run the container with tini as pid1 instead:

$ docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd docker:20-dind
2fa1f7a0b543808572a7a2da7ad28fd165d783f1ac8f3e9c59ebb30417f43b9f
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
   44 root      0:00 ps faux
$ docker exec test ctr i pull docker.io/tianon/true:latest
...
$ docker exec test ctr run --rm docker.io/tianon/true:latest foo
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
  110 root      0:00 [containerd-shim]
  152 root      0:00 ps faux
$ docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd --init docker:20-dind
5d2d6ac195d6fdbb0646b6df8d64de3ac00c4ae3fc0dce62bdd8eb59ac20a322
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 /sbin/docker-init -- containerd
    8 root      0:00 containerd
   32 root      0:00 ps faux
$ docker exec test ctr i pull docker.io/tianon/true:latest
...
$ docker exec test ctr run --rm docker.io/tianon/true:latest foo
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 /sbin/docker-init -- containerd
    8 root      0:00 containerd
  142 root      0:00 ps faux

(See also docker-library/docker#318.)

@tianon ctr uses containerd-shim-runc-v2 by default right now. The shim v2 binary re-execs itself to start the long-running shim server, so the parent pid of the running shim server ends up being 1. But containerd isn't a reaper for exited child processes. That is why there is a zombie shim in dind.
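
The reparenting described here can be checked from a process listing. A minimal sketch, assuming a ps that accepts -o pid,ppid,stat,comm (procps does; BusyBox builds vary) and that no command name contains spaces; zombies_under_pid1 is a hypothetical helper, and the sample rows in the comment are illustrative, not captured output:

```shell
# Hypothetical helper: from "ps -o pid,ppid,stat,comm" output on stdin,
# print "PID COMMAND" for every zombie (STAT starting with Z) whose parent
# is PID 1, i.e. processes that PID 1 was expected to reap but did not.
zombies_under_pid1() {
  # NR > 1 skips the header row; $2 is PPID, $3 is STAT, $4 is COMMAND.
  awk 'NR > 1 && $2 == 1 && $3 ~ /^Z/ { print $1, $4 }'
}

# Illustrative input (not captured output):
#   PID PPID STAT COMMAND
#     1    0 Ss   containerd
#   110    1 Z    containerd-shim
```

If the explanation holds, the zombie shims should show up in this list when containerd is pid1, and the list should stay empty when an init such as docker-init is pid1.
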

When io.containerd.runtime.v1.linux is used as the runtime, the runtime calls the containerd binary to publish the exit event.

https://github.com/containerd/containerd/blob/a963242f78c8a05967dfe050cab1016ac7aeabee/cmd/containerd-shim/main_unix.go#L287-L318

But ctr run deletes the task as soon as the task has stopped.

https://github.com/containerd/containerd/blob/a963242f78c8a05967dfe050cab1016ac7aeabee/runtime/v1/shim/service.go#L509-L541

The p.SetExited(e.Status) call notifies ctr that the task has exited, so the task.Delete in ctr and the event publish action are handled at the same time. containerd then force-kills the shim, so the containerd process created by the shim becomes a zombie.

➜  vagrant docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd docker:20-dind
82f541cbb604077d99f76da45d9b866e03de577ffb209bf88b437e41ddca8440
➜  vagrant docker exec test ctr i pull docker.io/tianon/true:latest > /dev/null
➜  vagrant docker exec test ctr run --runtime io.containerd.runtime.v1.linux docker.io/tianon/true:latest foo

➜  vagrant docker exec test ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
  107 root      0:00 [containerd]
  122 root      0:00 ps -ef

If you run the foo container in detached mode, the shim reaps that containerd command.

➜  vagrant docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd docker:20-dind
97243d2c9667a246827a07eca736f666dc9f0864744f532fb7bf16f7d80dda08
➜  vagrant docker exec test ctr i pull docker.io/tianon/true:latest > /dev/null
➜  vagrant docker exec test ctr run -d --runtime io.containerd.runtime.v1.linux docker.io/tianon/true:latest foo

➜  vagrant docker exec test ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
   74 root      0:00 containerd-shim -namespace default -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/default/foo -address /run/containerd/containerd.sock -containerd-binary /usr/local/bin/containerd
  112 root      0:00 ps -ef

➜  vagrant docker exec test ctr c rm foo

➜  vagrant docker exec test ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
  140 root      0:00 ps -ef

@tianon
Member Author

tianon commented Jun 11, 2022

FWIW, I can still reproduce (using --entrypoint this time to avoid #319): 😞

$ docker run -dit --privileged --name test --entrypoint dockerd --pull=always docker:dind
dind: Pulling from library/docker
Digest: sha256:a7a9383d0631b5f6b59f0a8138912d20b63c9320127e3fb065cb9ca0257a58b2
Status: Downloaded newer image for docker:dind
41749ef585c457ff1e737f7ef2efc6ac8d3395219a6526c25f042c31bc43ca01
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   22 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
  138 root      0:00 ps faux
$ docker exec test docker run --rm tianon/true
Unable to find image 'tianon/true:latest' locally
latest: Pulling from tianon/true
c53fb220cbad: Pulling fs layer
c53fb220cbad: Download complete
c53fb220cbad: Pull complete
Digest: sha256:009cce421096698832595ce039aa13fa44327d96beedb84282a69d3dbcf5a81b
Status: Downloaded newer image for tianon/true:latest
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   22 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
  196 root      0:00 [containerd-shim]
  270 root      0:00 ps faux
$ docker exec test docker run --rm tianon/true
$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   22 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
  196 root      0:00 [containerd-shim]
  303 root      0:00 [containerd-shim]
  376 root      0:00 ps faux
$ docker exec test docker version
Client:
 Version:           20.10.17
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        100c701
 Built:             Mon Jun  6 22:56:42 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.17
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.11
  Git commit:       a89b842
  Built:            Mon Jun  6 23:01:45 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.6.6
  GitCommit:        10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 runc:
  Version:          1.1.2
  GitCommit:        v1.1.2-0-ga916309f
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

0lmi pushed a commit to 0lmi/mender-qa that referenced this issue Jan 4, 2023
dockerd might fail from time to time which looks related to the
known issue docker-library/docker#318
and using docker-init is the workaround used by the community

Changelog: None
Ticket: QA-508
Signed-off-by: Alex Miliukov <[email protected]>
@tianon tianon mentioned this issue May 15, 2023
@tianon
Member Author

tianon commented Jul 1, 2024

Coming back a year later to ring the bell again: 😭

$ docker run -dit --privileged --name test --entrypoint dockerd --pull=always docker:dind
dind: Pulling from library/docker
Digest: sha256:87d892c14d2b755ac4e8268b21e8c8a7ff7f44b52753e265b7a300d2fa065d50
Status: Image is up to date for docker:dind
99217162d401fa0c9785053345702d946c7e5fb241be3a6faf84dfb4056a13ce

$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   23 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml
  189 root      0:00 ps faux

$ docker exec test docker run --rm tianon/true
Unable to find image 'tianon/true:latest' locally
latest: Pulling from tianon/true
4e30b577f37b: Pulling fs layer
4e30b577f37b: Verifying Checksum
4e30b577f37b: Download complete
4e30b577f37b: Pull complete
Digest: sha256:45b95352fad44acee2c35a4ddc2205b61448b1daf2ba2c949b7136582446e682
Status: Downloaded newer image for tianon/true:latest

$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   23 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml
  248 root      0:00 [containerd-shim]
  316 root      0:00 ps faux

$ docker exec test docker run --rm tianon/true

$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   23 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml
  248 root      0:00 [containerd-shim]
  346 root      0:00 [containerd-shim]
  411 root      0:00 ps faux

$ docker exec test docker version
Client:
 Version:           27.0.2
 API version:       1.46
 Go version:        go1.21.11
 Git commit:        912c1dd
 Built:             Wed Jun 26 18:46:21 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.0.2
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       e953d76
  Built:            Wed Jun 26 18:47:59 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc:
  Version:          1.1.13
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

@thaJeztah
Contributor

@tianon does the same happen with docker 26.1 with the same containerd version, or only 27.0? (I know we updated to containerd 1.7, but I think the DIND image already had it?)

cc @dmcgowan

@maoxuner

maoxuner commented Jul 2, 2024

It's been a long time, so I'm not sure whether this is the same condition, but I changed the host kernel from a realtime one to a generic one, and then the problem was solved.

@tianon
Member Author

tianon commented Jul 2, 2024

Yes, 26 is also affected:

$ docker run -dit --privileged --name test --entrypoint dockerd --pull=always docker:26-dind
26-dind: Pulling from library/docker
Digest: sha256:dfaffff209798d9efe4ec07243d172ba8706918859c87869656a5d3091df44bb
Status: Image is up to date for docker:26-dind
94ddbbe9823bad23454556b690c854e6ac8b7e06adc71095676d7ccf2c7ef9d2

$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   26 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml
  163 root      0:00 ps faux

$ docker exec test docker run --rm tianon/true
Unable to find image 'tianon/true:latest' locally
latest: Pulling from tianon/true
4e30b577f37b: Pulling fs layer
4e30b577f37b: Verifying Checksum
4e30b577f37b: Download complete
4e30b577f37b: Pull complete
Digest: sha256:45b95352fad44acee2c35a4ddc2205b61448b1daf2ba2c949b7136582446e682
Status: Downloaded newer image for tianon/true:latest

$ docker exec test docker run --rm tianon/true

$ docker exec test ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 dockerd
   26 root      0:00 containerd --config /var/run/docker/containerd/containerd.toml
  197 root      0:00 [containerd-shim]
  277 root      0:00 [containerd-shim]
  336 root      0:00 ps faux

$ docker exec test docker version
Client:
 Version:           26.1.4
 API version:       1.45
 Go version:        go1.21.11
 Git commit:        5650f9b
 Built:             Wed Jun  5 11:27:57 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          26.1.4
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       de5c9cf
  Built:            Wed Jun  5 11:29:25 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

@tianon
Member Author

tianon commented Jul 3, 2024

This isn't specific to the way dockerd runs/supervises containerd either:

$ docker run -dit --rm --name test --privileged --pull=always tianon/containerd:rc
rc: Pulling from tianon/containerd
Digest: sha256:bc0d7e7f36b2963769c4924a11bf1da09f501cbccdc7cb8c2f5d011d0d066440
Status: Image is up to date for tianon/containerd:rc
9f2cb8622b6ac98c90a0d2fbe325993199d71f5c469941a7c2117492c1d8ad12

$ docker exec test ctr i pull docker.io/tianon/true:latest > /dev/null

$ docker exec test ctr run --rm docker.io/tianon/true:latest test

$ docker exec test ctr run --rm docker.io/tianon/true:latest test

$ docker exec test ctr run --rm docker.io/tianon/true:latest test

$ # "tianon/containerd" doesn't have "ps" and I can't convince "docker top" to show zombies 🙈
$ docker run --rm --pid container:test bash ps faux
PID   USER     TIME  COMMAND
    1 root      0:00 containerd
   91 root      0:00 [containerd-shim]
  166 root      0:00 [containerd-shim]
  248 root      0:00 [containerd-shim]
  299 root      0:00 ps faux

$ docker exec test ctr version
Client:
  Version:  v2.0.0-rc.3
  Revision: 27de5fea738a38345aa1ac7569032261a6b1e562
  Go version: go1.22.4

Server:
  Version:  v2.0.0-rc.3
  Revision: 27de5fea738a38345aa1ac7569032261a6b1e562
  UUID: 46bfcb40-716f-46fb-8887-6010373bed51
