
Conversation

davidmccormick (Contributor)

This PR targets the master 0.11.x candidate branch and is intended to allow a smoother migration for existing users with 0.10.x clusters, which do not have the new separate etcd stack. When first testing the 0.11.x code I found that the update would always fail and roll back because of CloudFormation dependencies; once those were cleaned up and worked around, the new etcd cluster would come up empty, effectively wiping the state of the existing cluster during the upgrade.

TL;DR: upgrades from legacy etcds by importing a copy of all their keys and then allowing the old servers to be destroyed. You are expected to have updated to the 0.10.x migration release first; otherwise the migration will fail because of CloudFormation dependencies, or your new etcd servers won't be able to connect to the old ones.

The approach for the upgrade is as follows:

  1. Examine existing etcd cluster state to determine if we are upgrading from a legacy cluster.
  2. Refuse to update unless Kubernetes.Networking.SelfHosting is enabled.
  3. If a migration is required, look up and construct an etcdadm connect string for the existing etcd servers by querying the AWS API.
  4. Render the etcd CloudFormation with additional etcd state information in the template context.
  5. During a migration, extra export and import systemd units are created on the new etcds.
  6. A new etcd-specific StackConfig has been created which contains the additional etcd state information, so knowledge of the existing state can be used when rendering the etcd CloudFormation stack JSON.
  7. The Etcd stack creates an extra /var/run/coreos/etcdadm-environment-migration file with details for connecting to the existing etcd cluster.
  8. etcdadm has been enhanced to provide extra 'cluster-is-healthy', 'member-is-leader', 'migration-export-kube-state' and 'migration-import-kube-state' commands.
  9. The etcd cluster leader will export all of the Kubernetes objects under the '/registry' prefix; the others will write an empty file (see the sketch after this list).
  10. On each of the new etcd servers, if its export file is not empty it is imported into the new cluster. The others just pass through with success so that they trigger their cfn-signal, but the leader will only trigger its cfn-signal if the import is successful. This means that if the import fails for some reason, the cluster can roll back to the previous 0.10.x version before any of the old etcds are destroyed.
  11. Nodes (and everything else) lose their connection to the legacy apiservers during the Network phase of the update; this is an unintended yet extremely helpful side effect of the stack change, which effectively disables the existing apiservers during the update.
  12. Once the new apiservers come up, they can see and properly interact with the existing nodepools.
  13. The control plane stack will clean up all existing etcd servers and resources once everything else has been successful.
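
The export/import in steps 8 to 10 is performed by the new etcdadm 'migration-export-kube-state' and 'migration-import-kube-state' commands; the sketch below is only an illustration of the underlying idea, copying every key under the '/registry' prefix from the legacy cluster into the new one with the etcd v3 client. The endpoints are placeholders (the real ones are derived from the AWS API), TLS setup is omitted, and the real flow goes via an export file on disk rather than a direct copy.

```go
// Illustrative only: copy all keys under /registry from a legacy etcd
// cluster into a new one. Endpoints are placeholders, not the values the
// PR derives from the AWS API, and client TLS configuration is omitted.
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func copyRegistry(legacyEndpoints, newEndpoints []string) error {
	src, err := clientv3.New(clientv3.Config{Endpoints: legacyEndpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := clientv3.New(clientv3.Config{Endpoints: newEndpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		return err
	}
	defer dst.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// Read every key/value pair stored under the Kubernetes prefix.
	resp, err := src.Get(ctx, "/registry", clientv3.WithPrefix())
	if err != nil {
		return err
	}

	// Write each pair into the new cluster.
	for _, kv := range resp.Kvs {
		if _, err := dst.Put(ctx, string(kv.Key), string(kv.Value)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := copyRegistry([]string{"https://legacy-etcd:2379"}, []string{"https://new-etcd:2379"}); err != nil {
		log.Fatal(err)
	}
}
```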

The bulk of the changes are in using knowledge of existing state when templating assets, such as cloud-configs and stack templates. I wasn't happy putting the state in the config package because these are not settings that a user can select, so I ended up working things out at the cluster package level and then looking for ways to include my extra state information in the templating contexts. I think some more thought and refactoring could be applied to better model the roles of config and existing state when bringing up a cluster.
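
To make that concrete, here is a minimal sketch of the embedding approach: a state struct (the PR embeds model.EtcdExistingState into the etcd StackConfig) is embedded into the stack config so that templates can reference fields such as EtcdMigrationEnabled, which appears in the cloud-config template further down, without extra plumbing. The type and field names other than EtcdExistingState and EtcdMigrationEnabled are invented for this example, not taken from the PR.

```go
package main

import (
	"os"
	"text/template"
)

// EtcdExistingState mirrors the idea of carrying discovered cluster state
// (not user configuration) into template rendering. The endpoints field
// name is hypothetical.
type EtcdExistingState struct {
	EtcdMigrationEnabled           bool
	EtcdMigrationExistingEndpoints string
}

// EtcdStackConfig embeds the existing state so templates rendered with it
// can use {{ .EtcdMigrationEnabled }} directly via field promotion.
type EtcdStackConfig struct {
	ClusterName string
	EtcdExistingState
}

func main() {
	tmpl := template.Must(template.New("cloud-config").Parse(
		"cluster: {{ .ClusterName }}\n" +
			"{{ if .EtcdMigrationEnabled -}}\n" +
			"migration-endpoints: {{ .EtcdMigrationExistingEndpoints }}\n" +
			"{{- end }}\n"))

	cfg := EtcdStackConfig{
		ClusterName: "example",
		EtcdExistingState: EtcdExistingState{
			EtcdMigrationEnabled:           true,
			EtcdMigrationExistingEndpoints: "https://10.0.0.10:2379,https://10.0.0.11:2379",
		},
	}
	if err := tmpl.Execute(os.Stdout, cfg); err != nil {
		panic(err)
	}
}
```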

…rnetes.Networking.SelfHosting is Enabled.

This is to break the dependency that the nodepool stacks have on etcd stack resources.
…rnetes.Networking.SelfHosting is Enabled. (kubernetes-retired#1367)

This is to break the dependency that the nodepool stacks have on etcd stack resources.

Ref kubernetes-retired#1370
…py state from existing etcd over to the new ones during a migration.
…-aws into 0.11.x/remove-etcd-dependency-on-nodepools-when-selfhosted-networking-enabled
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 29, 2018
@davidmccormick davidmccormick changed the title Enable migration from existing 0.10.x clusters without losing all the cluster state Enable migration from existing 0.10.x clusters without losing cluster state Jun 29, 2018
@@ -158,6 +158,57 @@
}
}
},
"SecurityGroupWorker": {
@davidmccormick (Contributor, Author) Jun 29, 2018

This security group needed returning to the control plane. We can remove it again in a later release, but without it the updated control plane stack will throw an error about it being in use by the nodepools.

Contributor

Makes sense. Fine with reviving this then.

// Wish we had something like reference counting to keep AWS resources only while they're used 😆

controlplaneconfig.StackTemplateOptions
UserDataEtcd model.UserData
ExtraCfnResources map[string]interface{}
model.EtcdExistingState
@davidmccormick (Contributor, Author)

This and func NewStackConfig are the only changes from the controlplane version.

@@ -207,6 +207,63 @@ coreos:
[Install]
WantedBy=multi-user.target
{{end}}
{{ if .EtcdMigrationEnabled -}}
@davidmccormick (Contributor, Author)

The migration units: a path unit triggers the import.
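
For context on how a path unit can trigger the import, here is an illustrative cloud-config fragment in the same format as the diff above. The unit names, trigger file path and etcdadm binary location are assumptions for the sketch, not the exact contents of this diff; the /var/run/coreos/etcdadm-environment-migration file and the migration-import-kube-state subcommand come from the PR description.

```yaml
# Illustrative sketch only; unit names, the trigger file path and the
# etcdadm location are assumptions, not the exact contents of this PR.
coreos:
  units:
    - name: etcdadm-migration-import.path
      enable: true
      content: |
        [Path]
        # Fires once the export of /registry has been written to disk,
        # activating the like-named .service unit below.
        PathExists=/var/run/coreos/etcdadm-migration-export-complete

        [Install]
        WantedBy=multi-user.target
    - name: etcdadm-migration-import.service
      content: |
        [Unit]
        Description=Import exported Kubernetes state into the new etcd cluster

        [Service]
        Type=oneshot
        EnvironmentFile=/var/run/coreos/etcdadm-environment-migration
        ExecStart=/opt/bin/etcdadm migration-import-kube-state
```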

@@ -504,17 +504,18 @@ func (c clusterImpl) LegacyUpdate(targets OperationTargets) (string, error) {

func (c clusterImpl) update(cfSvc *cloudformation.CloudFormation, targets OperationTargets) (string, error) {

// Look at existing state of cloud formation and stacks to determine if we need to take special measures in migrating our etcd
// clusters from the control plane stack to their own Etcd stack.
exists, err := cfnstack.NestedStackExists(cfSvc, c.controlPlane.ClusterName, naming.FromStackToCfnResource(c.etcd.Etcd.LogicalName()))
if err != nil {
logger.Errorf("please check your AWS credentials/permissions")
return "", fmt.Errorf("can't lookup AWS CloudFormation stacks: %s", err)
}
@davidmccormick (Contributor, Author)

Only fail fast if we don't have SelfHosting enabled; otherwise take this as our cue to migrate instead!
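
In other words, the absence of the nested etcd stack is only an error when self-hosted networking is disabled; otherwise it becomes the signal to migrate. Below is a minimal sketch of that decision with invented function and parameter names; only the existence check itself corresponds to the NestedStackExists call in the hunk above.

```go
package cluster // hypothetical package name for this sketch

import "fmt"

// decideEtcdMigration sketches the decision described in the comment above.
// etcdStackExists would come from the NestedStackExists lookup; the
// selfHostedNetworking flag is a stand-in for the real configuration check.
func decideEtcdMigration(etcdStackExists, selfHostedNetworking bool) (bool, error) {
	if etcdStackExists {
		// A separate etcd stack already exists, so no migration is needed.
		return false, nil
	}
	if !selfHostedNetworking {
		// Fail fast: without self-hosted networking the nodepool stacks
		// still depend on etcd resources inside the control plane stack.
		return false, fmt.Errorf("etcd migration requires Kubernetes.Networking.SelfHosting to be enabled")
	}
	// Legacy, control-plane-managed etcd found: take this as the cue to migrate.
	return true, nil
}
```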

… etcdconfig which depends on controlplane config 2) Allow mocks to return nil response and not crash lookupExistingEtcdEndpoints
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 2, 2018
@davidmccormick davidmccormick changed the title Enable migration from existing 0.10.x clusters without losing cluster state 0.11.x migration from existing clusters without losing state Jul 2, 2018
@davidmccormick (Contributor, Author)

Fixes issue #1112

"Description" : "The security group assigned to worker nodes",
"Value" : { "Ref" : "SecurityGroupWorker" },
"Export" : { "Name" : {"Fn::Sub": "${AWS::StackName}-WorkerSecurityGroup" }}
},

@mumoshu (Contributor) left a comment

This took some time for me to get, but you did excellent work! LGTM.

@mumoshu mumoshu merged commit cdceab6 into kubernetes-retired:master Jul 7, 2018
@mumoshu mumoshu added this to the v0.11.0 milestone Jul 7, 2018
@davidmccormick davidmccormick deleted the 0.11.x/remove-etcd-dependency-on-nodepools-when-selfhosted-networking-enabled branch January 2, 2019 11:46