Pods are going into pending state after upgrading from v1.26.12-k3s1 to v1.27.11-k3s1 and v1.28.5-k3s1 (Issue is quite random) #10044
-
Environmental Info:
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
Describe the bug:
I am currently facing one scenario that is quite weird. Say I am running 3 replicas of an nginx deployment; when the issue occurs and I scale the replicas to 5, the 2 new pods stay in Pending state. At the same time, I deleted one of the older nginx pods and described the svc, and the Service endpoints were still showing the IPs of the original 3 pods.
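For reference, the checks went roughly like this (the deployment name is the nginx-deployment example above; the pod and Service names are placeholders):

```bash
# Scale the example deployment from 3 to 5 replicas
kubectl scale deployment nginx-deployment --replicas=5

# The two new pods stay in Pending
kubectl get pods -o wide

# Delete one of the original pods, then inspect the Service;
# its endpoints still list the IPs of the original 3 pods
kubectl delete pod <one-of-the-older-nginx-pods>
kubectl describe svc <nginx-service>
kubectl get endpoints <nginx-service> -o yaml
```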
Node Conditions
Steps To Reproduce:
Expected behavior:
Actual behavior:
-
Why are they pending? If you describe the pod and/or check the kubelet logs on the node, it should tell you why. Just saying that they are pending doesn't really provide enough information to work off of.
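A minimal sketch of those checks (names are placeholders; on k3s the kubelet is embedded in the k3s service, so its logs come from journalctl rather than a separate kubelet unit):

```bash
# The Events section at the bottom usually explains the Pending state
# (unschedulable, image pull, volume attach, or no scheduler activity at all)
kubectl describe pod <pending-pod> -n <namespace>

# Recent events in the namespace, oldest first
kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp

# Kubelet logs on the node that should be running the pod
journalctl -u k3s          # on a server node
journalctl -u k3s-agent    # on an agent node
```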
-
So you have a single server node? I don't see it listed in your nodes list; are you running with --disable-agent, or did you just not show it? Have you checked the server logs to see if there are any errors from the scheduler or controller-manager? If you are running a single server with sqlite there is no leader election, so the components should always be running and active on the server.
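A rough way to check that, assuming a systemd-based install (on k3s the scheduler and controller-manager run embedded in the k3s server process, so their output goes to the k3s service log):

```bash
# Look for scheduler / controller-manager errors in the server log
journalctl -u k3s --since "1 hour ago" | grep -iE 'scheduler|controller-manager|error'

# Sanity-check that the apiserver itself reports healthy
kubectl get --raw='/readyz?verbose'
```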
-
@brandond I guess the issue was related to a change in Kubernetes 1.27 that caused issues with watching events for certain clients. The change introduced in v1.27 allowed certain watches to go directly to etcd/kine completely unfiltered, potentially starving all other watches and leading to the observed problems.
Issue: kubernetes/kubernetes#123448
Fix: kubernetes/kubernetes#123532
I have tried k3s versions 1.27.13-k3s1 and 1.28.9-k3s1, and I haven't faced this issue yet. Still monitoring, but I feel this was the issue and the fix resolves it.
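For anyone else hitting this, a minimal sketch of pinning an install or upgrade to one of those releases with the standard k3s install script (note that the release tags use the v1.27.13+k3s1 / v1.28.9+k3s1 form):

```bash
# Install or upgrade to a release containing the watch fix
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.27.13+k3s1" sh -
```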