
Restored backup results in wrong control plane node IP #3498

Open
7oku opened this issue Dec 18, 2024 · 0 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management.

Comments


7oku commented Dec 18, 2024

What happened?

I'm trying to build a recovery procedure for our clusters (3 control plane nodes at Hetzner), following the manual cluster recovery guide at https://docs.kubermatic.com/kubeone/v1.9/guides/manual-cluster-recovery/

Every attempt ends with one control plane node unable to run pods. The node overview looks like this (some columns removed for readability):

# kubectl get nodes -o wide
NAME                     STATUS   ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP  
control-plane-1          Ready    control-plane   20h     v1.31.3   10.0.0.5      167.xxx.xx.xx    
control-plane-2          Ready    control-plane   5m41s   v1.31.3   10.0.0.5      188.xxx.xxx.xxx   
control-plane-3          Ready    control-plane   4m55s   v1.31.3   10.0.0.6      157.xx.xxx.xxx   
pool1-68546f7bb8-27b78   Ready    <none>          5d3h    v1.31.3   10.0.0.7      49.xx.xxx.xxx    
pool1-68546f7bb8-76jgk   Ready    <none>          5d3h    v1.31.3   10.0.0.8      5.xx.xxx.xxx      
pool1-68546f7bb8-x892g   Ready    <none>          5d3h    v1.31.3   10.0.0.9      159.xx.x.xxx      
pool2-6f8589d7c8-nrpzh   Ready    <none>          5d3h    v1.31.3   10.0.0.10     116.xxx.xx.xx     

Notice that the internal IP of cp1 is wrong and clashes with cp2. Since the cloud provider assigns IPs dynamically, they came out differently this time, and 10.0.0.5 is actually the OLD IP that cp1 had before the restore.
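
A quick way to confirm the stale address on the node object (the jsonpath query is shown for illustration only, it is not part of the recovery guide):

$ kubectl get node control-plane-1 -o jsonpath='{.status.addresses}'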

However, in the last task of the etcd restore process, the correct NEW IP 10.0.0.4 for cp1 was provided:

$ sudo ctr run --rm --mount type=bind,src=$HOME/backup,dst=/backup,options=rbind:ro --mount type=bind,src=/var/lib,dst=/var/lib,options=rbind:rw --env ETCDCTL_API=3 registry.k8s.io/etcd:3.5.15-0 sh etcdctl snapshot restore --data-dir=/var/lib/etcd --name=control-plane-1 --initial-advertise-peer-urls=https://10.0.0.4:2380 --initial-cluster=control-plane-1=https://10.0.0.4:2380 /backup/etcd-snapshot.db

time="2024-12-18T13:52:16+01:00" level=warning msg="DEPRECATION: The `mirrors` property of `[plugins.\"io.containerd.grpc.v1.cri\".registry]` is deprecated since containerd v1.5 and will be removed in containerd v2.1. Use `config_path` instead." 
time="2024-12-18T13:52:16+01:00" level=warning msg="DEPRECATION: The `configs` property of `[plugins.\"io.containerd.grpc.v1.cri\".registry]` is deprecated since containerd v1.5 and will be removed in containerd v2.1. Use `config_path` instead." 
Deprecated: Use `etcdutl snapshot restore` instead.
2024-12-18T12:52:16Z info snapshot/v3_snapshot.go:265 restoring snapshot {"path": "/backup/etcd-snapshot.db", "wal-dir": "/var/lib/etcd/member/wal", "data-dir": "/var/lib/etcd", "snap-dir": "/var/lib/etcd/member/snap", "initial-memory-map-size": 0}
2024-12-18T12:52:16Z info membership/store.go:141 Trimming membership information from the backend...
2024-12-18T12:52:16Z info membership/cluster.go:421 added member {"cluster-id": "1375c1fffxxxxd1", "local-member-id": "0", "added-peer-id": "54c89axxxx0017ab", "added-peer-peer-urls": ["https://10.0.0.4:2380"]}
2024-12-18T12:52:16Z info snapshot/v3_snapshot.go:293 restored snapshot {"path": "/backup/etcd-snapshot.db", "wal-dir": "/var/lib/etcd/member/wal", "data-dir": "/var/lib/etcd", "snap-dir": "/var/lib/etcd/member/snap", "initial-memory-map-size": 0}
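
For reference, the peer URL the restored member actually ended up with can be double-checked with etcdctl once etcd is running again; the certificate paths below are the kubeadm defaults and may differ in other setups:

$ sudo ETCDCTL_API=3 etcdctl member list -w table \
    --endpoints=https://10.0.0.4:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key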

The subsequent kubeone apply -y -m kubeone.yaml -t tf.json -c credentials.yml -v then gets stuck indefinitely while waiting for machine-controller to start up:

NAME                                        READY   STATUS              RESTARTS        AGE    IP             NODE
kube-apiserver-control-plane-1              1/1     Running             1 (13m ago)     18m    10.0.0.5       control-plane-1
kube-apiserver-control-plane-2              1/1     Running             0               17m    10.0.0.5       control-plane-2
kube-apiserver-control-plane-3              1/1     Running             0               16m    10.0.0.6       control-plane-3
kube-controller-manager-control-plane-1     1/1     Running             1 (13m ago)     16m    10.0.0.5       control-plane-1
kube-controller-manager-control-plane-2     1/1     Running             0               16m    10.0.0.5       control-plane-2
kube-controller-manager-control-plane-3     1/1     Running             0               16m    10.0.0.6       control-plane-3
kube-scheduler-control-plane-1              1/1     Running             1 (13m ago)     20h    10.0.0.5       control-plane-1
kube-scheduler-control-plane-2              1/1     Running             0               17m    10.0.0.5       control-plane-2
kube-scheduler-control-plane-3              1/1     Running             0               16m    10.0.0.6       control-plane-3
machine-controller-f8b99c57c-rgcqv          0/1     ContainerCreating   1               20h             control-plane-1
machine-controller-webhook-86964ffb74-4z6b4 0/1     ContainerCreating   1               20h             control-plane-1
metrics-server-5b468fbfc8-q7fvr             1/1     Running             0               17m    10.244.3.209   pool1-68546f7bb8-27b78
node-local-dns-4xn8f                        1/1     Running             0               5d3h   10.0.0.9       pool1-68546f7bb8-x892g
node-local-dns-9rptz                        1/1     Running             0               16m    10.0.0.6       control-plane-3

Here as well, all pods on cp1 show the wrong IP.

The only way out of this situation is to throw cp1 away and reprovision it, so that it JOINS the other two nodes instead.
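
In case it helps with debugging: before reprovisioning, the stale 10.0.0.5 can be grepped for on cp1 in the kubeadm-generated files (default kubeadm paths assumed here):

$ sudo grep -r "10.0.0.5" /etc/kubernetes/manifests /var/lib/kubelet/kubeadm-flags.env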

Expected behavior

I would expect KubeOne to initialize the first control plane node with the correct IP address when restoring from an etcd snapshot.

What KubeOne version are you using?

$ kubeone version
{
  "kubeone": {
    "major": "1",
    "minor": "9",
    "gitVersion": "v1.9.0-8-g971c9953",
    "gitCommit": "971c99537cb5ad277eb14a2e9be6b0ca98769102",
    "gitTreeState": "",
    "buildDate": "2024-12-17T11:18:09+00:00",
    "goVersion": "go1.23.0",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "machine_controller": {
    "major": "1",
    "minor": "60",
    "gitVersion": "v1.60.0",
    "gitCommit": "",
    "gitTreeState": "",
    "buildDate": "",
    "goVersion": "",
    "compiler": "",
    "platform": "linux/amd64"
  }
}

Provide your KubeOneCluster manifest here (if applicable)

apiVersion: kubeone.k8c.io/v1beta2
kind: KubeOneCluster
versions:
  kubernetes: '1.31.3'
cloudProvider:
  hetzner: {}
  external: true
clusterNetwork:
  kubeProxy:
    skipInstallation: true
  cni:
    cilium:
      enableHubble: true
      kubeProxyReplacement: strict
addons:
  enable: true
  addons:
  - name: default-storage-class
  - name: backups-restic
    params:
      resticPassword: "<redacted>"
      s3Bucket: "<redacted>"
      awsDefaultRegion: "<redacted>"
controlPlaneComponents:
  apiServer:
    flags:
      event-ttl: '24h0m0s'

What cloud provider are you running on?

Hetzner

What operating system are you running in your cluster?

Ubuntu 24.04

Additional Information

We could also try to reassign the same IP addresses to the nodes via Terraform changes, but there is no guarantee the cloud provider can hand out those IPs again. Keeping the dynamic assignment at the cloud provider level is therefore preferred.
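
For completeness, the private IPs the provider actually handed out can be checked with the hcloud CLI before deciding whether pinning them in Terraform is feasible (server name is just an example):

$ hcloud server describe control-plane-1 | grep -A 3 "Private Net"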

@7oku 7oku added kind/bug Categorizes issue or PR as related to a bug. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management. labels Dec 18, 2024