Please make sure that you have read through the documentation on etcd data restoration before reading this document and attempting to perform a manual restoration of etcd data.
As mentioned previously, automatic restoration is triggered if the etcd data gets corrupted, because the etcd process crashes. If, for some reason, restoration is not triggered automatically when it should have been, you may choose to restore the etcd data manually.
You may choose to follow different methods of restoration, based on your etcd + backup sidecar setup:
- Deploying the provided helm chart, in which etcdbrctl is started in `server` mode
  - Exec into the `etcd` container of the `main-etcd-0` pod and delete the `member` directory under the data directory in order to invalidate it: `rm -rf /var/etcd/data/new.etcd/member`
    - You may choose to rename the `member` directory instead of deleting it, if you wish to retain the old data for debugging
  - This crashes the etcd container. When it restarts, the backup sidecar validates the data directory and, seeing that the data is corrupt, restores the data from the latest backup.

    ⚠️ Keep in mind that the latest backup in the object storage bucket might not be up-to-date with the latest etcd data, so you could lose at most the data written during one delta snapshot interval. For instance, setting `delta-snapshot-interval=5m` could result in a maximum data loss of 5 minutes' worth of writes.
  - If, for some reason, automatic restoration is not triggered even after removing the `member` directory, you may need to temporarily switch the etcdbrctl command to `restore` mode to force a manual restoration of data, and then change the container spec back to its original form once the restoration is successful. Do not change any field in the container spec other than the `command` field, which is detailed below:

        command:
        - etcdbrctl
        - restore
        - --data-dir=<same as previous value>
        - --storage-provider=<same as previous value>
        - --store-prefix=<same as previous value>
        - --embedded-etcd-quota-bytes=<same as previous value>
        - --snapstore-temp-directory=<same as previous value>

    ⚠️ Edit the `etcd-main` statefulset to change the `command` field. After you save it, the statefulset controller would ideally recreate the `etcd-main-0` pod itself. However, since it does not handle the case of an unhealthy pod, delete the `etcd-main-0` pod manually using the command `kubectl -n <namespace> delete pod etcd-main-0`. Then monitor the etcd and backup-restore sidecar logs to make sure that the restoration succeeds. Once restoration is complete, change the container spec back to its previous state and restart the pod. This should clear any previous issues with etcd or the backup sidecar, and snapshotting should start successfully.
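The helm chart steps above can be sketched as a few small shell helpers. This is only an illustration under stated assumptions: the function names are hypothetical, `NAMESPACE` must be supplied by you, the sidecar container name `backup-restore` is a guess that you can override through `SIDECAR_CONTAINER`, and the `etcd` container image is assumed to ship `mv`. Set `POD` to whichever pod name your deployment uses.

```shell
# Sketch only: the defaults below are placeholders to adapt to your setup.
POD="${POD:-main-etcd-0}"
MEMBER_DIR="${MEMBER_DIR:-/var/etcd/data/new.etcd/member}"

# Rename the member directory inside the etcd container instead of deleting it,
# so that the old data stays available for debugging.
invalidate_member_dir() {
  kubectl -n "${NAMESPACE:?set NAMESPACE}" exec "$POD" -c etcd -- \
    mv "$MEMBER_DIR" "${MEMBER_DIR}.bak"
}

# Delete the pod so that the statefulset controller recreates it.
recreate_pod() {
  kubectl -n "${NAMESPACE:?set NAMESPACE}" delete pod "$POD"
}

# Follow the sidecar logs to confirm that the restoration runs.
# The container name is an assumption, not taken from the chart.
follow_sidecar_logs() {
  kubectl -n "${NAMESPACE:?set NAMESPACE}" logs "$POD" \
    -c "${SIDECAR_CONTAINER:-backup-restore}" -f
}
```

After setting `NAMESPACE`, calling `invalidate_member_dir` followed by `follow_sidecar_logs` covers the automatic path. `recreate_pod` is only needed after you have edited the statefulset `command` field by hand.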
- Deploying etcd and etcdbrctl separately, where etcdbrctl is started in `server` mode
  - If you run etcd through etcd-wrapper or the legacy etcd-custom-image, deleting the `member` directory under the etcd data directory should kill the etcd process, after which the etcd-wrapper or etcd-custom-image process finishes execution and exits. You will have to re-run etcd via one of these components and allow it to trigger data validation and restoration by etcdbrctl.
    - If etcd-wrapper or etcd-custom-image runs in Kubernetes pods that are managed by a pod-group such as a statefulset, the statefulset controller takes care of restarting the pod once it crashes, and there is no need to manually restart the pod or the etcd process.
  - If not running etcd via the above-mentioned method, then:
    - Delete the `member` directory and wait for etcd to crash
    - `curl http://localhost:8080/initialization/status`, assuming etcdbrctl is running on port 8080
    - `curl http://localhost:8080/initialization/start`
    - Wait for the restoration to finish, by observing the logs from etcdbrctl
    - Again, `curl http://localhost:8080/initialization/status` to complete the initialization process, and etcdbrctl will resume regular snapshotting after this
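The HTTP sequence for the `server` mode case could be scripted roughly as follows. This is a sketch, not the project's own tooling: the function names and the `BRCTL_URL`, `MAX_ATTEMPTS`, and `POLL_SECONDS` variables are hypothetical, and it assumes that the status endpoint answers with plain strings such as `Successful` and `Failed`, which you should verify against your etcdbrctl version. Instead of watching the logs, it polls the status endpoint until the restoration finishes.

```shell
# Sketch only: assumes etcdbrctl listens on localhost:8080 as in the steps above.
BRCTL_URL="${BRCTL_URL:-http://localhost:8080}"

# Read the current initialization status.
init_status() {
  curl -s "$BRCTL_URL/initialization/status"
}

# Trigger initialization, then poll until it succeeds, fails, or times out.
trigger_restoration() {
  echo "status before start: $(init_status)"
  curl -s "$BRCTL_URL/initialization/start"

  attempts="${MAX_ATTEMPTS:-60}"
  while [ "$attempts" -gt 0 ]; do
    status="$(init_status)"
    case "$status" in
      # The status strings below are assumptions about the API response.
      Successful) echo "restoration finished"; return 0 ;;
      Failed) echo "restoration failed; check the etcdbrctl logs" >&2; return 1 ;;
    esac
    sleep "${POLL_SECONDS:-5}"
    attempts=$((attempts - 1))
  done
  echo "timed out waiting for restoration" >&2
  return 1
}
```

Run `trigger_restoration` only after the `member` directory has been removed and etcd has crashed. The etcdbrctl logs remain the authoritative record of what the restoration did.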
- Deploying etcd and etcdbrctl separately, where etcdbrctl is started in `snapshot` mode
  - Delete the `member` directory and wait for etcd to crash
  - Kill the etcdbrctl process
  - Run etcdbrctl in `restore` mode to perform a restoration of the data
    - It is highly recommended to run etcdbrctl in `initialize` mode rather than `restore` mode, as this performs the necessary validation checks on the data directory before taking the decision to trigger a restoration.
  - Start etcd again
  - Restart etcdbrctl in `snapshot` mode

⚠️ If running etcdbrctl in `snapshot` mode, it is necessary to stop this snapshotter process before triggering a restoration, to avoid data inconsistency.
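For the `snapshot` mode case, the first three steps might look like the sketch below. Everything here is an assumption to adapt: the function name is hypothetical, `DATA_DIR`, `STORAGE_PROVIDER`, and `STORE_PREFIX` must be set to the values your snapshotter already uses, the etcd process is assumed to be named `etcd`, and `initialize` is assumed to accept the same `--data-dir`, `--storage-provider`, and `--store-prefix` flags listed for `restore` earlier. How you start etcd and the snapshotter again depends on your process manager, so the sketch stops before those steps.

```shell
# Sketch only: stop the snapshotter, then validate and restore the data directory.
restore_before_snapshotting() {
  : "${DATA_DIR:?set DATA_DIR}" "${STORAGE_PROVIDER:?set STORAGE_PROVIDER}" "${STORE_PREFIX:?set STORE_PREFIX}"

  # 1. Invalidate the data by moving the member directory aside,
  #    then wait for the etcd process to exit.
  mv "$DATA_DIR/member" "$DATA_DIR/member.bak" || return 1
  while pgrep -x etcd >/dev/null; do sleep 1; done

  # 2. Stop the snapshotter so it cannot run during the restoration.
  pkill -f "etcdbrctl snapshot" || true

  # 3. Validate the data directory and restore it if needed.
  etcdbrctl initialize \
    --data-dir="$DATA_DIR" \
    --storage-provider="$STORAGE_PROVIDER" \
    --store-prefix="$STORE_PREFIX" || return 1

  # 4. and 5. are left to the operator.
  echo "restoration done: start etcd, then restart etcdbrctl in snapshot mode"
}
```

Using `initialize` here follows the recommendation above, since it validates the data directory before deciding whether a restoration is needed.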
[…] `etcdctl` tool before deciding to perform a manual restoration.

[…] `member` directory, else the restoration will fail.