
Restored backup results in wrong control plane node IP #3498

Open
7oku opened this issue Dec 18, 2024 · 0 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management.

Comments


7oku commented Dec 18, 2024

What happened?

I'm trying to build a recovery procedure for our clusters (3 control plane nodes at Hetzner), following the manual cluster recovery guide at https://docs.kubermatic.com/kubeone/v1.9/guides/manual-cluster-recovery/

Every attempt ends with one control plane node unable to run pods. The node overview looks like this (some columns removed for readability):

# kubectl get nodes -o wide
NAME                     STATUS   ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP  
control-plane-1          Ready    control-plane   20h     v1.31.3   10.0.0.5      167.xxx.xx.xx    
control-plane-2          Ready    control-plane   5m41s   v1.31.3   10.0.0.5      188.xxx.xxx.xxx   
control-plane-3          Ready    control-plane   4m55s   v1.31.3   10.0.0.6      157.xx.xxx.xxx   
pool1-68546f7bb8-27b78   Ready    <none>          5d3h    v1.31.3   10.0.0.7      49.xx.xxx.xxx    
pool1-68546f7bb8-76jgk   Ready    <none>          5d3h    v1.31.3   10.0.0.8      5.xx.xxx.xxx      
pool1-68546f7bb8-x892g   Ready    <none>          5d3h    v1.31.3   10.0.0.9      159.xx.x.xxx      
pool2-6f8589d7c8-nrpzh   Ready    <none>          5d3h    v1.31.3   10.0.0.10     116.xxx.xx.xx     

Notice that the internal IP of cp1 is wrong and clashes with cp2. Since the cloud provider assigns IPs dynamically, they came out differently this time, and 10.0.0.5 is actually the OLD IP that cp1 had before the restore.
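
A quick way to confirm the stale address on the node object (the jsonpath query is shown for illustration only, it is not part of the recovery guide):

$ kubectl get node control-plane-1 -o jsonpath='{.status.addresses}'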

However, in the last task of the etcd restore process, the correct NEW IP 10.0.0.4 for cp1 was provided:

$ sudo ctr run --rm --mount type=bind,src=$HOME/backup,dst=/backup,options=rbind:ro --mount type=bind,src=/var/lib,dst=/var/lib,options=rbind:rw --env ETCDCTL_API=3 registry.k8s.io/etcd:3.5.15-0 sh etcdctl snapshot restore --data-dir=/var/lib/etcd --name=control-plane-1 --initial-advertise-peer-urls=https://10.0.0.4:2380 --initial-cluster=control-plane-1=https://10.0.0.4:2380 /backup/etcd-snapshot.db

time="2024-12-18T13:52:16+01:00" level=warning msg="DEPRECATION: The `mirrors` property of `[plugins.\"io.containerd.grpc.v1.cri\".registry]` is deprecated since containerd v1.5 and will be removed in containerd v2.1. Use `config_path` instead." 
time="2024-12-18T13:52:16+01:00" level=warning msg="DEPRECATION: The `configs` property of `[plugins.\"io.containerd.grpc.v1.cri\".registry]` is deprecated since containerd v1.5 and will be removed in containerd v2.1. Use `config_path` instead." 
Deprecated: Use `etcdutl snapshot restore` instead.
2024-12-18T12:52:16Z info snapshot/v3_snapshot.go:265 restoring snapshot {"path": "/backup/etcd-snapshot.db", "wal-dir": "/var/lib/etcd/member/wal", "data-dir": "/var/lib/etcd", "snap-dir": "/var/lib/etcd/member/snap", "initial-memory-map-size": 0}
2024-12-18T12:52:16Z info membership/store.go:141 Trimming membership information from the backend...
2024-12-18T12:52:16Z info membership/cluster.go:421 added member {"cluster-id": "1375c1fffxxxxd1", "local-member-id": "0", "added-peer-id": "54c89axxxx0017ab", "added-peer-peer-urls": ["https://10.0.0.4:2380"]}
2024-12-18T12:52:16Z info snapshot/v3_snapshot.go:293 restored snapshot {"path": "/backup/etcd-snapshot.db", "wal-dir": "/var/lib/etcd/member/wal", "data-dir": "/var/lib/etcd", "snap-dir": "/var/lib/etcd/member/snap", "initial-memory-map-size": 0}
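
For reference, the peer URL the restored member actually ended up with can be double-checked with etcdctl once etcd is running again; the certificate paths below are the kubeadm defaults and may differ in other setups:

$ sudo ETCDCTL_API=3 etcdctl member list -w table \
    --endpoints=https://10.0.0.4:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key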

The subsequent kubeone apply -y -m kubeone.yaml -t tf.json -c credentials.yml -v then gets stuck indefinitely while waiting for machine-controller to start up:

NAME                                        READY   STATUS              RESTARTS        AGE    IP             NODE
kube-apiserver-control-plane-1              1/1     Running             1 (13m ago)     18m    10.0.0.5       control-plane-1
kube-apiserver-control-plane-2              1/1     Running             0               17m    10.0.0.5       control-plane-2
kube-apiserver-control-plane-3              1/1     Running             0               16m    10.0.0.6       control-plane-3
kube-controller-manager-control-plane-1     1/1     Running             1 (13m ago)     16m    10.0.0.5       control-plane-1
kube-controller-manager-control-plane-2     1/1     Running             0               16m    10.0.0.5       control-plane-2
kube-controller-manager-control-plane-3     1/1     Running             0               16m    10.0.0.6       control-plane-3
kube-scheduler-control-plane-1              1/1     Running             1 (13m ago)     20h    10.0.0.5       control-plane-1
kube-scheduler-control-plane-2              1/1     Running             0               17m    10.0.0.5       control-plane-2
kube-scheduler-control-plane-3              1/1     Running             0               16m    10.0.0.6       control-plane-3
machine-controller-f8b99c57c-rgcqv          0/1     ContainerCreating   1               20h             control-plane-1
machine-controller-webhook-86964ffb74-4z6b4 0/1     ContainerCreating   1               20h             control-plane-1
metrics-server-5b468fbfc8-q7fvr             1/1     Running             0               17m    10.244.3.209   pool1-68546f7bb8-27b78
node-local-dns-4xn8f                        1/1     Running             0               5d3h   10.0.0.9       pool1-68546f7bb8-x892g
node-local-dns-9rptz                        1/1     Running             0               16m    10.0.0.6       control-plane-3

Here as well, all pods on cp1 show the wrong IP.

The only way out of this situation is to throw cp1 away and reprovision it, so that it JOINS the other two nodes instead.
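
In case it helps with debugging: before reprovisioning, the stale 10.0.0.5 can be grepped for on cp1 in the kubeadm-generated files (default kubeadm paths assumed here):

$ sudo grep -r "10.0.0.5" /etc/kubernetes/manifests /var/lib/kubelet/kubeadm-flags.env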

Expected behavior

I would expect KubeOne to initialize the first control plane node with the correct IP address when restoring from an etcd snapshot.

What KubeOne version are you using?

$ kubeone version
{
  "kubeone": {
    "major": "1",
    "minor": "9",
    "gitVersion": "v1.9.0-8-g971c9953",
    "gitCommit": "971c99537cb5ad277eb14a2e9be6b0ca98769102",
    "gitTreeState": "",
    "buildDate": "2024-12-17T11:18:09+00:00",
    "goVersion": "go1.23.0",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "machine_controller": {
    "major": "1",
    "minor": "60",
    "gitVersion": "v1.60.0",
    "gitCommit": "",
    "gitTreeState": "",
    "buildDate": "",
    "goVersion": "",
    "compiler": "",
    "platform": "linux/amd64"
  }
}

Provide your KubeOneCluster manifest here (if applicable)

apiVersion: kubeone.k8c.io/v1beta2
kind: KubeOneCluster
versions:
  kubernetes: '1.31.3'
cloudProvider:
  hetzner: {}
  external: true
clusterNetwork:
  kubeProxy:
    skipInstallation: true
  cni:
    cilium:
      enableHubble: true
      kubeProxyReplacement: strict
addons:
  enable: true
  addons:
  - name: default-storage-class
  - name: backups-restic
    params:
      resticPassword: "<redacted>"
      s3Bucket: "<redacted>"
      awsDefaultRegion: "<redacted>"
controlPlaneComponents:
  apiServer:
    flags:
      event-ttl: '24h0m0s'

What cloud provider are you running on?

Hetzner

What operating system are you running in your cluster?

Ubuntu 24.04

Additional Information

We could also try to reassign the same IP addresses to the nodes via Terraform changes, but there is no guarantee the cloud provider can hand out those IPs again. Keeping the dynamic assignment at the cloud provider level is therefore preferred.
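
For completeness, the private IPs the provider actually handed out can be checked with the hcloud CLI before deciding whether pinning them in Terraform is feasible (server name is just an example):

$ hcloud server describe control-plane-1 | grep -A 3 "Private Net"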

@7oku 7oku added kind/bug Categorizes issue or PR as related to a bug. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management. labels Dec 18, 2024