Fix KVM not detecting containerd preload #17658

Merged · 1 commit merged into kubernetes:master on Jan 10, 2024

Conversation

spowelljr
Member

@spowelljr spowelljr commented Nov 22, 2023

Cause

We were running sudo systemctl restart containerd immediately before checking whether the preloaded images were present. On KVM it was common for containerd to not have finished restarting before sudo crictl images --output json ran; the command would fail, we would report the images as not preloaded, and the images would be redownloaded unnecessarily.

Fix

Added a retry to the sudo crictl images --output json command so that if containerd is still restarting we don't report the preloaded images as missing.

Result

First time start

Before: 59s
After: 38.6s
20.4s speedup (35%)

Normal start

Before: 48.7s
After: 30.5s
18.2s speedup (37%)

Also fixes the TestErrorSpam/setup test failure on KVM with containerd, caused by the ❌ Unable to load cached images message that was previously being output.

Before:

$ time minikube start --driver kvm --container-runtime containerd
😄  minikube v1.32.0 on Ubuntu 20.04
✨  Using the kvm2 driver based on user configuration
💾  Downloading driver docker-machine-driver-kvm2:
    > docker-machine-driver-kvm2-...:  65 B / 65 B [---------] 100.00% ? p/s 0s
    > docker-machine-driver-kvm2-...:  13.01 MiB / 13.01 MiB [ 100.00% ? p/s 0s
💿  Downloading VM boot image ...
    > minikube-v1.32.1-1700142131...:  65 B / 65 B [---------] 100.00% ? p/s 0s
    > minikube-v1.32.1-1700142131...:  293.18 MiB / 293.18 MiB  100.00% 240.78 
👍  Starting control plane node minikube in cluster minikube
💾  Downloading Kubernetes v1.28.4 preload ...
    > preloaded-images-k8s-v18-v1...:  436.27 MiB / 436.27 MiB  100.00% 235.22 
🔥  Creating kvm2 VM (CPUs=2, Memory=6000MB, Disk=20000MB) ...
📦  Preparing Kubernetes v1.28.4 on containerd 1.7.9 ...
❌  Unable to load cached images: loading cached images: stat /home/powellsteven/.minikube/cache/images/amd64/registry.k8s.io/kube-proxy_v1.28.4: no such file or directory
    ▪ Generating certificates and keys ...
    ▪ Booting up control plane ...
    ▪ Configuring RBAC rules ...
🔗  Configuring bridge CNI (Container Networking Interface) ...
🔎  Verifying Kubernetes components...
    ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
🌟  Enabled addons: storage-provisioner, default-storageclass
🏄  Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default

real	0m59.048s
user	0m4.741s
sys	0m2.471s

After:

$ time minikube start --driver kvm --container-runtime containerd
😄  minikube v1.32.0 on Ubuntu 20.04
✨  Using the kvm2 driver based on user configuration
💾  Downloading driver docker-machine-driver-kvm2:
    > docker-machine-driver-kvm2-...:  65 B / 65 B [---------] 100.00% ? p/s 0s
    > docker-machine-driver-kvm2-...:  13.01 MiB / 13.01 MiB [ 100.00% ? p/s 0s
💿  Downloading VM boot image ...
    > minikube-v1.32.1-1700142131...:  65 B / 65 B [---------] 100.00% ? p/s 0s
    > minikube-v1.32.1-1700142131...:  293.18 MiB / 293.18 MiB  100.00% 283.11 
👍  Starting control plane node minikube in cluster minikube
💾  Downloading Kubernetes v1.28.4 preload ...
    > preloaded-images-k8s-v18-v1...:  436.27 MiB / 436.27 MiB  100.00% 212.95 
🔥  Creating kvm2 VM (CPUs=2, Memory=6000MB, Disk=20000MB) ...
📦  Preparing Kubernetes v1.28.4 on containerd 1.7.9 ...
    ▪ Generating certificates and keys ...
    ▪ Booting up control plane ...
    ▪ Configuring RBAC rules ...
🔗  Configuring bridge CNI (Container Networking Interface) ...
🔎  Verifying Kubernetes components...
    ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
🌟  Enabled addons: storage-provisioner, default-storageclass
🏄  Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default

real	0m38.648s
user	0m4.669s
sys	0m2.661s

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 22, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: spowelljr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@spowelljr
Member Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 22, 2023
if err != nil {
var rr *command.RunResult

imageList := func() (err error) {
Member

hm... I am trying to think, is there a better condition to wait for instead of retrying?

Contributor

@prezha prezha Nov 27, 2023


EDIT: we probably don't have to wait/retry - pls see the comment below and the followup discussion

@minikube-pr-bot

kvm2 driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17658) |
+----------------+----------+---------------------+
| minikube start | 50.4s    | 51.8s               |
| enable ingress | 27.3s    | 26.6s               |
+----------------+----------+---------------------+

Times for minikube start: 49.2s 51.3s 48.2s 50.7s 52.6s
Times for minikube (PR 17658) start: 51.3s 51.5s 51.6s 52.4s 52.1s

Times for minikube (PR 17658) ingress: 27.2s 23.7s 27.1s 27.6s 27.2s
Times for minikube ingress: 28.1s 25.6s 27.1s 27.1s 28.7s

docker driver with docker runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17658) |
+----------------+----------+---------------------+
| minikube start | 23.5s    | 24.8s               |
| enable ingress | 20.2s    | 20.5s               |
+----------------+----------+---------------------+

Times for minikube start: 20.9s 23.4s 24.4s 24.0s 25.0s
Times for minikube (PR 17658) start: 24.2s 25.6s 22.5s 26.0s 25.6s

Times for minikube ingress: 20.8s 20.3s 20.8s 20.8s 18.3s
Times for minikube (PR 17658) ingress: 18.8s 21.8s 20.8s 22.3s 18.8s

docker driver with containerd runtime

+----------------+----------+---------------------+
|    COMMAND     | MINIKUBE | MINIKUBE (PR 17658) |
+----------------+----------+---------------------+
| minikube start | 22.8s    | 22.9s               |
| enable ingress | 34.5s    | 31.3s               |
+----------------+----------+---------------------+

Times for minikube start: 23.4s 21.0s 23.1s 23.1s 23.4s
Times for minikube (PR 17658) start: 23.6s 23.0s 23.2s 20.8s 23.9s

Times for minikube ingress: 31.3s 31.3s 47.3s 31.3s 31.3s
Times for minikube (PR 17658) ingress: 31.3s 31.3s 30.3s 31.3s 32.3s

@minikube-pr-bot

These are the flake rates of all failed tests.

Environment | Failed Tests                                                          | Flake Rate (%)
------------|-----------------------------------------------------------------------|---------------
QEMU_macOS  | TestIngressAddonLegacy/serial/ValidateIngressAddonActivation (gopogh) | 1.92
QEMU_macOS  | TestIngressAddonLegacy/StartLegacyK8sCluster (gopogh)                 | 1.92
QEMU_macOS  | TestImageBuild/serial/Setup (gopogh)                                  | 2.56
QEMU_macOS  | TestMinikubeProfile (gopogh)                                          | 3.21
QEMU_macOS  | TestJSONOutput/pause/Command (gopogh)                                 | 7.05
QEMU_macOS  | TestJSONOutput/start/Command (gopogh)                                 | 7.05
QEMU_macOS  | TestJSONOutput/unpause/Command (gopogh)                               | 7.05

To see the flake rates of all tests by environment, click here.

@prezha
Copy link
Contributor

prezha commented Nov 27, 2023

hey @spowelljr i think the actual problem is in (how we handle/expect) the ctr output:

from logs:

I1126 22:25:11.544084   26459 cache_images.go:88] LoadImages start: [registry.k8s.io/kube-apiserver:v1.28.4 registry.k8s.io/kube-controller-manager:v1.28.4 registry.k8s.io/kube-scheduler:v1.28.4 registry.k8s.io/kube-proxy:v1.28.4 registry.k8s.io/pause:3.9 registry.k8s.io/etcd:3.5.9-0 registry.k8s.io/coredns/coredns:v1.10.1 gcr.io/k8s-minikube/storage-provisioner:v5]
...
I1126 22:25:11.544147   26459 image.go:134] retrieving image: registry.k8s.io/kube-proxy:v1.28.4
...
I1126 22:25:11.547965   26459 image.go:177] daemon lookup for registry.k8s.io/kube-proxy:v1.28.4: Error response from daemon: No such image: registry.k8s.io/kube-proxy:v1.28.4
...
I1126 22:25:11.963349   26459 ssh_runner.go:195] Run: /bin/bash -c "sudo ctr -n=k8s.io images check | grep registry.k8s.io/kube-proxy:v1.28.4"
...
I1126 22:25:12.937500   26459 cache_images.go:116] "registry.k8s.io/kube-proxy:v1.28.4" needs transfer: "registry.k8s.io/kube-proxy:v1.28.4" does not exist at hash "83f6cc407eed88d214aad97f3539bde5c8e485ff14424cd021a3a2899304398e" in container runtime
...
I1126 22:25:12.937543   26459 cri.go:218] Removing image: registry.k8s.io/kube-proxy:v1.28.4
...
I1126 22:25:13.509165   26459 ssh_runner.go:195] Run: sudo /usr/bin/crictl rmi registry.k8s.io/kube-proxy:v1.28.4
...
I1126 22:25:14.276116   26459 cache_images.go:286] Loading image from: /home/prezha/.minikube/cache/images/amd64/registry.k8s.io/kube-proxy_v1.28.4
...
W1126 22:25:14.361961   26459 out.go:239] ❌  Unable to load cached images: loading cached images: stat /home/prezha/.minikube/cache/images/amd64/registry.k8s.io/kube-proxy_v1.28.4: no such file or directory
❌  Unable to load cached images: loading cached images: stat /home/prezha/.minikube/cache/images/amd64/registry.k8s.io/kube-proxy_v1.28.4: no such file or directory
...

so, the above underlying error:

I1126 22:25:12.937500 26459 cache_images.go:116] "registry.k8s.io/kube-proxy:v1.28.4" needs transfer: "registry.k8s.io/kube-proxy:v1.28.4" does not exist at hash "83f6cc407eed88d214aad97f3539bde5c8e485ff14424cd021a3a2899304398e" in container runtime

comes from this line, and that effectively comes from this block

looking at the ctr output, we have:

$ sudo ctr -n=k8s.io images list
REF                                                                                                             TYPE                                                      DIGEST                                                                  SIZE      PLATFORMS                                                                     LABELS
...
registry.k8s.io/kube-proxy:v1.28.4                                                                              application/vnd.docker.distribution.manifest.list.v2+json sha256:e63408a0f5068a7e9d4b34fd72b4a2b0e5512509b53cd2123a37fc991b0ef532 23.4 MiB  linux/amd64,linux/arm64,linux/ppc64le,linux/s390x                             io.cri-containerd.image=managed
...
registry.k8s.io/kube-proxy@sha256:e63408a0f5068a7e9d4b34fd72b4a2b0e5512509b53cd2123a37fc991b0ef532              application/vnd.docker.distribution.manifest.list.v2+json sha256:e63408a0f5068a7e9d4b34fd72b4a2b0e5512509b53cd2123a37fc991b0ef532 23.4 MiB  linux/amd64,linux/arm64,linux/ppc64le,linux/s390x                             io.cri-containerd.image=managed
...
sha256:83f6cc407eed88d214aad97f3539bde5c8e485ff14424cd021a3a2899304398e                                         application/vnd.docker.distribution.manifest.list.v2+json sha256:e63408a0f5068a7e9d4b34fd72b4a2b0e5512509b53cd2123a37fc991b0ef532 23.4 MiB  linux/amd64,linux/arm64,linux/ppc64le,linux/s390x                             io.cri-containerd.image=managed
...

which has the image ref and the image digest sha on each line, whereas we want the image ref and the image id sha on the same line, and hence the grep fails (the image id is on a separate line)

to prove the point, crictl gives both the image ref and the id on the same line (from the same minikube cluster):

$ sudo crictl images --no-trunc --digests
IMAGE                                     TAG                  DIGEST                                                                    IMAGE ID                                                                  SIZE
...
registry.k8s.io/kube-proxy                v1.28.4              sha256:e63408a0f5068a7e9d4b34fd72b4a2b0e5512509b53cd2123a37fc991b0ef532   sha256:83f6cc407eed88d214aad97f3539bde5c8e485ff14424cd021a3a2899304398e   24.6MB
...
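the mismatch can be reproduced in isolation (a sketch with abbreviated stand-in strings and a hypothetical helper, not minikube's actual ImageExists):

```go
package main

import (
	"fmt"
	"strings"
)

// imageExists contrasts the two strategies discussed above. With grepFirst
// set it emulates the old `ctr ... images check | grep name` pipeline;
// otherwise it searches the whole, ungrepped output for both strings.
func imageExists(output, name, sha string, grepFirst bool) bool {
	if grepFirst {
		var kept []string
		for _, line := range strings.Split(output, "\n") {
			if strings.Contains(line, name) {
				kept = append(kept, line)
			}
		}
		// the ref matched, but the id sha lives on a line grep threw away
		output = strings.Join(kept, "\n")
	}
	return strings.Contains(output, name) && (sha == "" || strings.Contains(output, sha))
}

func main() {
	// Abbreviated stand-in for `sudo ctr -n=k8s.io images check` output:
	// the image ref and the image id sha appear on different lines.
	ctrOut := "registry.k8s.io/kube-proxy:v1.28.4 ... sha256:e63408a0...\n" +
		"sha256:83f6cc407eed... ... sha256:e63408a0..."
	name := "registry.k8s.io/kube-proxy:v1.28.4"
	id := "sha256:83f6cc407eed"

	fmt.Println(imageExists(ctrOut, name, id, true))  // false: grep drops the id line
	fmt.Println(imageExists(ctrOut, name, id, false)) // true: whole-output search finds both
}
```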

i've tested the following:

diff --git a/pkg/minikube/cruntime/containerd.go b/pkg/minikube/cruntime/containerd.go
index aa99dc696..6001a33a0 100644
--- a/pkg/minikube/cruntime/containerd.go
+++ b/pkg/minikube/cruntime/containerd.go
@@ -248,12 +248,11 @@ func (r *Containerd) Disable() error {

 // ImageExists checks if image exists based on image name and optionally image sha
 func (r *Containerd) ImageExists(name string, sha string) bool {
-       c := exec.Command("/bin/bash", "-c", fmt.Sprintf("sudo ctr -n=k8s.io images check | grep %s", name))
+       c := exec.Command("sudo", "ctr", "-n=k8s.io", "images", "check")
        rr, err := r.Runner.RunCmd(c)
-       if err != nil {
-               return false
-       }
-       if sha != "" && !strings.Contains(rr.Output(), sha) {
+       if err != nil ||
+               !strings.Contains(rr.Output(), name) ||
+               (sha != "" && !strings.Contains(rr.Output(), sha)) {
                return false
        }
        return true

and it worked:

I1127 23:06:49.993578  115508 cache_images.go:88] LoadImages start: [registry.k8s.io/kube-apiserver:v1.28.4 registry.k8s.io/kube-controller-manager:v1.28.4 registry.k8s.io/kube-scheduler:v1.28.4 registry.k8s.io/kube-proxy:v1.28.4 registry.k8s.io/pause:3.9 registry.k8s.io/etcd:3.5.9-0 registry.k8s.io/coredns/coredns:v1.10.1 gcr.io/k8s-minikube/storage-provisioner:v5]
...
I1127 23:06:49.993720  115508 image.go:134] retrieving image: registry.k8s.io/kube-proxy:v1.28.4
...
I1127 23:06:49.994852  115508 image.go:177] daemon lookup for registry.k8s.io/kube-proxy:v1.28.4: Error response from daemon: No such image: registry.k8s.io/kube-proxy:v1.28.4
...
I1127 23:06:50.401707  115508 ssh_runner.go:195] Run: sudo ctr -n=k8s.io images check
...
I1127 23:06:50.906378  115508 cache_images.go:123] Successfully loaded all cached images
...

before:

real    0m58.370s
user    0m1.498s
sys     0m0.620s

after:

real    0m33.686s
user    0m1.356s
sys     0m0.404s

i'm happy to make a separate pr to fully test this, if it would help

@spowelljr
Copy link
Member Author

spowelljr commented Nov 27, 2023

@prezha I wonder if you're running into another issue. If I output the error in pkg/minikube/cruntime/containerd.go line 571 I get the following:

sudo crictl images --output json: Process exited with status 1
stdout:

stderr:
time="2023-11-27T23:43:08Z" level=fatal msg="validate service connection: validate CRI v1 image API for endpoint \"unix:///run/containerd/containerd.sock\": rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: no such file or directory\""

I believe what's happening is that the sudo crictl images --output json command I'm referencing is failing because containerd is not running due to the restart. When that command fails, the preloaded images get marked as not loaded, and execution then reaches the code you're referencing, which actually outputs the error message we're seeing. So it looks like we have two issues: containerd not being ready after the restart, and the broken logic you're referencing. My fix for the containerd restart means the code you're referencing no longer runs, but it's still broken.

@prezha
Copy link
Contributor

prezha commented Nov 28, 2023

@spowelljr i think you're right: we probably have two separate issues here - the one that you observed happens before this other code that i referred to is hit

here is the pr to address the latter in the comment above: #17671

so it makes sense to wait for containerd to come up after the restart, and if we wait here, then i think it's reasonable to retry

@medyagh medyagh merged commit 6a290dc into kubernetes:master Jan 10, 2024
35 of 50 checks passed
@spowelljr spowelljr deleted the fixKVMPreload branch January 10, 2024 19:28