[SURE-9012] Node drain blocked with pods stuck in terminating state during rke2 rolling updates #507

kkaempf commented Nov 28, 2024

SURE-9012

See the issue for more context information.

Issue description:

Node drain blocked with pods stuck in terminating state during rke2 rolling updates
During rolling upgrades, the control-plane machine is removed from the etcd cluster as soon as it starts being rolled out, here:

=> https://github.com/rancher/cluster-api-provider-rke2/blob/v0.6.0/controlplane/internal/controllers/scale.go#L154-L166

	// If etcd leadership is on machine that is about to be deleted, move it to the newest member available.
	etcdLeaderCandidate := controlPlane.Machines.Newest()
	if err := r.workloadCluster.ForwardEtcdLeadership(ctx, machineToDelete, etcdLeaderCandidate); err != nil {
		logger.Error(err, "Failed to move leadership to candidate machine", "candidate", etcdLeaderCandidate.Name)

		return ctrl.Result{}, err
	}

	if err := r.workloadCluster.RemoveEtcdMemberForMachine(ctx, machineToDelete); err != nil {
		logger.Error(err, "Failed to remove etcd member for machine")

		return ctrl.Result{}, err
	}

	logger = logger.WithValues("machine", machineToDelete)
	if err := r.Client.Delete(ctx, machineToDelete); err != nil && !apierrors.IsNotFound(err) {
		logger.Error(err, "Failed to delete control plane machine")
		r.recorder.Eventf(rcp, corev1.EventTypeWarning, "FailedScaleDown",
			"Failed to delete control plane Machine %s for cluster %s/%s control plane: %v", machineToDelete.Name, cluster.Namespace, cluster.Name, err)

		return ctrl.Result{}, err
	} 

The issue is that in rke2 deployments, kubelet is configured to use the local api-server (127.0.0.1:6443), which in turn relies on the local etcd pod. As soon as this node is removed from the etcd cluster, kubelet can no longer reach the API, so the node fails to drain properly: all its pods remain stuck in Terminating state from the Kubernetes perspective.
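
To illustrate the failure mode outside of kubelet, here is a minimal probe of that local endpoint; the 127.0.0.1:6443 address is taken from the kubelet log further below, while the /readyz path and the 10-second timeout are assumptions chosen to mirror kubelet's status-patch timeout:

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Probe the local api-server endpoint that RKE2's kubelet is configured to use.
	// Once the local etcd member has been removed, the api-server crash-loops and
	// this request times out, which is exactly what kubelet's status patches hit.
	client := &http.Client{
		Timeout: 10 * time.Second, // same order as kubelet's ?timeout=10s
		Transport: &http.Transport{
			// Skipping certificate verification keeps the sketch self-contained;
			// a real check would use the kubelet client certificates.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	resp, err := client.Get("https://127.0.0.1:6443/readyz")
	if err != nil {
		fmt.Println("api-server unreachable:", err) // e.g. Client.Timeout exceeded
		return
	}
	defer resp.Body.Close()
	fmt.Println("api-server readyz:", resp.Status)
}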

We should probably avoid removing the etcd member so early during rolling upgrades. Instead, we could rely on the periodic reconcileEtcdMembers, which ensures that the number of etcd members is in sync with the number of machines/nodes; this way, etcd members would only be removed after the node has been properly drained and removed from the cluster by the CAPI controller (see the sketch after the quoted snippet below).

==> https://github.com/rancher/cluster-api-provider-rke2/blob/v0.6.0/controlplane/internal/controllers/rke2controlplane_controller.go#L511-L515

	// Ensures the number of etcd members is in sync with the number of machines/nodes.
	// NOTE: This is usually required after a machine deletion.
	if err := r.reconcileEtcdMembers(ctx, controlPlane); err != nil {
		return ctrl.Result{}, err
	} 
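
To make the proposal concrete, here is a rough, non-authoritative sketch of the scale-down path from scale.go quoted above, with the early RemoveEtcdMemberForMachine call dropped; member cleanup would then happen solely through the periodic reconcileEtcdMembers shown above, once CAPI has drained and deleted the Machine. This is only an illustration of the idea, not a tested patch:

	// Sketch: leadership is still moved off the machine that is about to be
	// deleted, but the etcd member itself is NOT removed here any more.
	etcdLeaderCandidate := controlPlane.Machines.Newest()
	if err := r.workloadCluster.ForwardEtcdLeadership(ctx, machineToDelete, etcdLeaderCandidate); err != nil {
		logger.Error(err, "Failed to move leadership to candidate machine", "candidate", etcdLeaderCandidate.Name)

		return ctrl.Result{}, err
	}

	// NOTE: no RemoveEtcdMemberForMachine call at this point. The local etcd and
	// api-server on machineToDelete keep working while kubelet drains the node;
	// the stale member is removed later by reconcileEtcdMembers, once the node
	// is gone.
	logger = logger.WithValues("machine", machineToDelete)
	if err := r.Client.Delete(ctx, machineToDelete); err != nil && !apierrors.IsNotFound(err) {
		logger.Error(err, "Failed to delete control plane machine")

		return ctrl.Result{}, err
	}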

It seems to us that the longer a drain takes, the more likely this failure is to occur. Failures due to this behavior (which looks like a bug to us) are a frequent cause of CI failures in Sylva project pipelines, so this problem is quite "hot" for us.

If we look at the node, we can see that kubelet has stopped reporting its status:

  spec:
    podCIDR: 100.72.3.0/24
    podCIDRs:
    - 100.72.3.0/24
    providerID: metal3://sylva-system/mgmt-1440551165-rke2-capm3-virt-management-cp-1/mgmt-1440551165-rke2-capm3-virt-cp-af8bd00850-x5v6f
    taints:
    - effect: NoSchedule
      key: node.kubernetes.io/unreachable
      timeAdded: "2024-09-04T21:52:01Z"
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      timeAdded: "2024-09-04T21:52:09Z"
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable
      timeAdded: "2024-09-04T21:54:25Z"
    unschedulable: true
  status:
    [...]
    conditions:
    - lastHeartbeatTime: "2024-09-04T21:49:39Z"    # <<< last Heartbeat
      lastTransitionTime: "2024-09-04T21:52:01Z"
      message: Kubelet stopped posting node status.
      reason: NodeStatusUnknown
      status: Unknown
      type: Ready
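
As a side note, the same state can be spotted across nodes with a small client-go check like the sketch below (the kubeconfig path is an assumption; any admin kubeconfig for the workload cluster works):

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a kubeconfig; the path below is an assumption
	// (RKE2's default admin kubeconfig on a server node).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/rancher/rke2/rke2.yaml")
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	nodes, err := cs.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// Print the Ready condition of every node; a node whose kubelet has stopped
	// posting status shows status=Unknown and a stale lastHeartbeatTime.
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			if c.Type == corev1.NodeReady {
				fmt.Printf("%s Ready=%s lastHeartbeat=%s reason=%s\n",
					n.Name, c.Status, c.LastHeartbeatTime.Time, c.Reason)
			}
		}
	}
}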

While looking at the kubelet logs, we can see that it starts failing to reach the API at 21:51:41:

E0904 21:51:41.639187    1995 kubelet_node_status.go:540] "Error updating node status, will retry" err="failed to patch status \"" for node \"mgmt-1440551165-rke2-capm3-virt-management-cp-1\": Patch \"https://127.0.0.1:6443/api/v1/nodes/mgmt-1440551165-rke2-capm3-virt-management-cp-1/status?timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)" 

This can be explained by the fact that the api-server is in CrashLoopBackOff, failing to reach etcd:

2024-09-04T22:50:18.161588374Z stderr F W0904 22:50:18.161465       1 logging.go:59] [core] [Channel #2 SubChannel #3] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
2024-09-04T22:50:18.171177365Z stderr F W0904 22:50:18.171091       1 logging.go:59] [core] [Channel #5 SubChannel #6] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
2024-09-04T22:50:19.67234822Z stderr F W0904 22:50:19.672219       1 logging.go:59] [core] [Channel #1 SubChannel #4] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
2024-09-04T22:50:19.783642922Z stderr F W0904 22:50:19.783530       1 logging.go:59] [core] [Channel #2 SubChannel #3] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused"
2024-09-04T22:50:19.953395062Z stderr F W0904 22:50:19.953280       1 logging.go:59] [core] [Channel #5 SubChannel #6] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused" 

etcd itself is unavailable because the local etcd member has been removed from the cluster by the CAPI controller (as described in #1420 (closed)).
=> https://gitlab.com/sylva-projects/sylva-core/-/issues/1420

On kubeadm, we don't have the same issue, as kubelet uses the VIP to reach the api-server.
This issue may previously have been hidden by the drainTimeout that was set on nodes: sylva-projects/sylva-elements/helm-charts/sylva-capi-cluster!421 (merged)
=> https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-capi-cluster/-/merge_requests/421
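
For reference, in Cluster API terms such a drain timeout corresponds to the Machine's nodeDrainTimeout field (cluster-api v1beta1); whether the chart's drainTimeout maps exactly onto this field is an assumption. A minimal sketch:

package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

func main() {
	// With a nodeDrainTimeout set, CAPI stops waiting for the drain once the
	// timeout expires and proceeds with Machine deletion, which can mask pods
	// stuck in Terminating instead of fixing the etcd-removal ordering.
	m := clusterv1.Machine{
		Spec: clusterv1.MachineSpec{
			ClusterName:      "example", // hypothetical cluster name
			NodeDrainTimeout: &metav1.Duration{Duration: 10 * time.Minute},
		},
	}
	fmt.Println("nodeDrainTimeout:", m.Spec.NodeDrainTimeout.Duration)
}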

Additional notes:

#431
https://gitlab.com/sylva-projects/sylva-core/-/issues/1595
