dev-docs: etcd disaster recovery #1544

Open
wants to merge 4 commits into
base: main

Conversation

Nirusu
Contributor

@Nirusu Nirusu commented Mar 29, 2023

Proposed change(s)

  • Add dev-docs documentation on how to do etcd disaster recovery

I am still not sure what the best way to approach the last step is. The restored control plane still holds the manually uploaded VHD, while any VM newly created when you scale up again will have no data disk. The problem is that as long as this one VM holds a disk on LUN 0 (which is hard-coded in the disk-mapper), we cannot add a VMSS-wide disk definition on LUN 0.

To get back to the original state (which honestly might not be super desirable given that these disks disappear), we would somehow have to:

  1. Create a new disk for the whole VMSS on LUN 1 or higher
  2. Upgrade the control plane node to the newest VMSS state
  3. Clone the uploaded VHD onto the new VMSS disk (difficulty: doing this while the cluster is online?)
  4. Turn off the control plane (difficulty: could cause downtime, though worker nodes should stay alive for a short while)
  5. Remove the attached disk
  6. Move the new VMSS disk we cloned the VHD onto from its current LUN to LUN 0 (difficulty: is this possible without losing data and without ending up with a VM-specific disk again?)
  7. Start the control plane again
  8. Perform Constellation recovery once
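
The LUN conflict behind these steps can be illustrated with a toy model. This is only a sketch in plain Python and makes no Azure API calls: the `VM` and `ScaleSet` classes, the disk names, and the rule that a VMSS-wide disk cannot use a LUN already taken by a VM-specific disk are assumptions taken from the description above. In particular, `move_model_disk` simply assumes that step 6 is possible, which is exactly the open question.

```python
from dataclasses import dataclass, field
from typing import Dict, List

DISK_MAPPER_LUN = 0  # per the description above, the disk-mapper hard-codes LUN 0


@dataclass
class VM:
    name: str
    # LUN -> disk name, for disks attached to this one VM only (e.g. the uploaded VHD)
    vm_disks: Dict[int, str] = field(default_factory=dict)


@dataclass
class ScaleSet:
    vms: List[VM] = field(default_factory=list)
    # LUN -> disk name, for disks defined for every VM in the scale set
    model_disks: Dict[int, str] = field(default_factory=dict)

    def add_model_disk(self, lun: int, disk: str) -> None:
        """Add a scale-set-wide disk, refusing a LUN that is already taken."""
        taken_by_vm = any(lun in vm.vm_disks for vm in self.vms)
        if taken_by_vm or lun in self.model_disks:
            raise ValueError(f"LUN {lun} is already in use")
        self.model_disks[lun] = disk

    def move_model_disk(self, src: int, dst: int) -> None:
        """Hypothetical step 6: re-home a scale-set-wide disk to another LUN."""
        disk = self.model_disks.pop(src)
        try:
            self.add_model_disk(dst, disk)
        except ValueError:
            self.model_disks[src] = disk  # roll back if the target LUN is taken
            raise


def lun_zero_source(scale_set: ScaleSet, vm: VM) -> str:
    """Report where the disk the disk-mapper would see on LUN 0 comes from."""
    if DISK_MAPPER_LUN in vm.vm_disks:
        return "vm"
    if DISK_MAPPER_LUN in scale_set.model_disks:
        return "scale set"
    return "none"


# State after the restore: one control plane with the uploaded VHD on LUN 0.
cp = VM("control-plane-0", {0: "restored-vhd"})
ss = ScaleSet(vms=[cp])

try:
    ss.add_model_disk(0, "state-disk")  # the conflict described above
except ValueError as err:
    print(err)  # LUN 0 is already in use

ss.add_model_disk(1, "state-disk")  # step 1: new disk on LUN 1
del cp.vm_disks[0]                  # step 5: remove the attached disk
ss.move_model_disk(1, 0)            # step 6: assumed to be possible
print(lun_zero_source(ss, cp))      # scale set
```

The model leaves out cloning the data (step 3) and the power cycling in steps 4 and 7; it only shows why the order of steps 1, 5, and 6 matters.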

I haven't played this through (and honestly don't want to), but I think this is roughly what it would look like. This isn't really needed if you only need one control plane node. If you have multiple, though, this would likely need to be done in some way without modifying the disk-mapper.

@Nirusu Nirusu added the no changelog Change won't be listed in release changelog label Mar 29, 2023
@Nirusu Nirusu requested a review from m1ghtym0 March 29, 2023 10:31
@edgelesssys edgelesssys deleted a comment from netlify bot Mar 29, 2023
@Nirusu Nirusu force-pushed the docs/dev/etcd-recovery branch from 2900506 to 1e1d360 Compare March 29, 2023 10:37
@Nirusu Nirusu force-pushed the docs/dev/etcd-recovery branch from 1e1d360 to 13db77d Compare March 29, 2023 11:42
@derpsteb
Member

derpsteb commented Apr 3, 2023

Will break the dogfooding cluster tomorrow and play this through. ^^

@derpsteb
Member

derpsteb commented Apr 27, 2023

Held up by other tasks. Will test the documentation asap.
