Replies: 4 comments 4 replies
-
Have you checked the etcd pod for restarts, or looked at the etcd pod logs? You might also consider looking at the apiserver pod logs.
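For reference, a quick way to do those checks looks something like this (the pod names assume the node is called rke2-master-prod, as in the attached log; adjust to your setup):

```sh
# Check whether the etcd and kube-apiserver static pods have been restarting
kubectl -n kube-system get pods -o wide | grep -E 'etcd|kube-apiserver'

# Tail the current logs, plus the previous container's logs if it restarted
kubectl -n kube-system logs etcd-rke2-master-prod
kubectl -n kube-system logs etcd-rke2-master-prod --previous
kubectl -n kube-system logs kube-apiserver-rke2-master-prod

# RKE2 itself also logs etcd/apiserver health problems to the journal
journalctl -u rke2-server --since "1 hour ago"
```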
-
Moved the etcd database to a dedicated NVMe SSD and the issue is resolved.
-
@kenho811 how did you move the etcd database to a dedicated NVMe SSD? I am facing the same problem. Can you share the solution step by step?
-
Hi! Like @kenho811, I was having the same issue. I have a three-node cluster on Ubuntu 22.04 running Kubernetes 1.30 with RKE2, also on Proxmox, with one master node running etcd and two worker nodes. This is a development cluster, so IOPS are not as heavy as on a prod one. The problem I was having: after installing the Prometheus stack, which generates a lot of IOPS, the cluster started to fail every time there was an upgrade or a new installation (i.e. whenever etcd was a bit busy). I was also using an HDD for the whole cluster; after mounting an SSD onto the etcd directory the problem was resolved. Here are the steps for how I solved it:
Then restart the kube-system pods that were having issues when etcd failed: controller-manager, scheduler, etc.
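In case it helps anyone, here is a rough sketch of what that can look like on an RKE2 server node. These are not the exact commands from above: /dev/nvme0n1 is an example device name, and the etcd path is the RKE2 default; adjust both to your environment.

```sh
# 1. Stop RKE2 on the server node so etcd is not writing during the copy
systemctl stop rke2-server

# 2. Format the SSD and copy the existing etcd data onto it
#    (/dev/nvme0n1 is an example device, adjust to your disk)
mkfs.ext4 /dev/nvme0n1
mount /dev/nvme0n1 /mnt
cp -a /var/lib/rancher/rke2/server/db/etcd/. /mnt/
umount /mnt

# 3. Mount the SSD over the etcd data directory and make it persistent
mount /dev/nvme0n1 /var/lib/rancher/rke2/server/db/etcd
echo '/dev/nvme0n1 /var/lib/rancher/rke2/server/db/etcd ext4 defaults 0 2' >> /etc/fstab

# 4. Start RKE2 again and recreate the affected static pods
systemctl start rke2-server
kubectl -n kube-system get pods -o name \
  | grep -E 'kube-controller-manager|kube-scheduler' \
  | xargs kubectl -n kube-system delete
```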
-
I installed RKE2 on a Debian Bookworm VM.
I notice that every now and then, my kube-apiserver becomes unhealthy. I checked the Kubernetes logs and noticed this. Apparently it is related to an etcd failure, but I am not too sure how to debug this. Full log attached below.
kube-apiserver-rke2-master-prod.17d2eca167fb7aca.txt
=========
Observation
I observe that the warning event occurs mostly when new pods are being created.
I have an Airflow Scheduler which schedules new Kubernetes Pods every now and then.
Below is a screenshot of the events in my cluster.
When I suddenly schedule a lot of pods, the WARNING event occurs.
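One way to check whether the disk is what's holding etcd back during those bursts is the usual fio fsync benchmark for etcd, run on the same mount point as the etcd data directory (the path below assumes the RKE2 default):

```sh
# Benchmark fdatasync latency on the disk backing etcd (RKE2 default path assumed)
mkdir -p /var/lib/rancher/rke2/server/db/etcd/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/rancher/rke2/server/db/etcd/fio-test \
    --size=22m --bs=2300 --name=etcd-disk-check
rm -rf /var/lib/rancher/rke2/server/db/etcd/fio-test

# In the output, look at the fsync/fdatasync percentiles: etcd's guidance is that
# the 99th percentile should stay below roughly 10 ms, which an HDD often cannot meet.
```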