
Conversation

davidmccormick (Contributor)

This PR targets the master 0.11.x candidate branch and is intended to allow a smoother migration for existing users with 0.10.x clusters, which do not have the new separate etcd stack. When first testing the 0.11.x code I found that the update would always fail and roll back because of CloudFormation dependencies; once those were cleaned up and worked around, the new etcd cluster would come up empty, effectively wiping the state of the existing cluster during the upgrade.

TL;DR: upgrades from legacy etcds by importing a copy of all their keys and then allowing the old servers to be destroyed. You are expected to have updated to the 0.10.x migration release first; otherwise the migration will fail because of CloudFormation dependencies, or your new etcd servers won't be able to connect to the old ones.

The approach for the upgrade is as follows:

  1. Examine existing etcd cluster state to determine if we are upgrading from a legacy cluster.
  2. Refuse to update unless Kubernetes.Networking.SelfHosting is enabled.
  3. If a migration is required, look up and construct an etcdadm connect string for the existing etcd servers by querying the AWS API.
  4. Render the etcd CloudFormation with additional etcd state information in the template context.
  5. During a migration, extra export and import systemd units are created on the new etcds.
  6. A new etcd-specific StackConfig has been created which contains the additional etcd state information, so knowledge of the existing state can be used when rendering the etcd CloudFormation stack JSON.
  7. The Etcd stack creates an extra /var/run/coreos/etcdadm-environment-migration file with details for connecting to the existing etcd cluster.
  8. etcdadm has been enhanced to provide extra 'cluster-is-healthy', 'member-is-leader', 'migration-export-kube-state' and 'migration-import-kube-state' commands.
  9. The etcd cluster leader will export all of the Kubernetes objects under the '/registry' prefix; the others will write an empty file (see the sketch after this list).
  10. On each of the new etcd servers, if its export file is not empty it is imported into the new cluster. The others just pass through with success so that they trigger their cfn-signal, but the leader will only trigger its cfn-signal if the import is successful. This means that if the import fails for some reason, the cluster can roll back to the previous 0.10.x version before any of the old etcds are destroyed.
  11. Nodes (and everything else) lose their connection to the legacy apiservers during the Network phase of the update; this is an unintended yet extremely helpful side effect of the stack change, which effectively disables the existing apiservers during the update.
  12. Once the new apiservers come up, they can see and properly interact with the existing nodepools.
  13. The control plane stack will clean up all existing etcd servers and resources once everything else has been successful.
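
The export/import in steps 8 to 10 is performed by the new etcdadm 'migration-export-kube-state' and 'migration-import-kube-state' commands; the sketch below is only an illustration of the underlying idea, copying every key under the '/registry' prefix from the legacy cluster into the new one with the etcd v3 client. The endpoints are placeholders (the real ones are derived from the AWS API), TLS setup is omitted, and the real flow goes via an export file on disk rather than a direct copy.

```go
// Illustrative only: copy all keys under /registry from a legacy etcd
// cluster into a new one. Endpoints are placeholders, not the values the
// PR derives from the AWS API, and client TLS configuration is omitted.
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func copyRegistry(legacyEndpoints, newEndpoints []string) error {
	src, err := clientv3.New(clientv3.Config{Endpoints: legacyEndpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := clientv3.New(clientv3.Config{Endpoints: newEndpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		return err
	}
	defer dst.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// Read every key/value pair stored under the Kubernetes prefix.
	resp, err := src.Get(ctx, "/registry", clientv3.WithPrefix())
	if err != nil {
		return err
	}

	// Write each pair into the new cluster.
	for _, kv := range resp.Kvs {
		if _, err := dst.Put(ctx, string(kv.Key), string(kv.Value)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := copyRegistry([]string{"https://legacy-etcd:2379"}, []string{"https://new-etcd:2379"}); err != nil {
		log.Fatal(err)
	}
}
```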

The bulk of the changes are in using knowledge of existing state when templating assets, such as cloud-configs and stack templates. I wasn't happy putting the state in the config package because these are not settings that a user can select, so I ended up working things out at the cluster package level and then looking for ways to include my extra state information in the templating contexts. I think some more thought and refactoring could be applied to better model the roles of config and existing state when bringing up a cluster.
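
To make that concrete, here is a minimal sketch of the embedding approach: a state struct (the PR embeds model.EtcdExistingState into the etcd StackConfig) is embedded into the stack config so that templates can reference fields such as EtcdMigrationEnabled, which appears in the cloud-config template further down, without extra plumbing. The type and field names other than EtcdExistingState and EtcdMigrationEnabled are invented for this example, not taken from the PR.

```go
package main

import (
	"os"
	"text/template"
)

// EtcdExistingState mirrors the idea of carrying discovered cluster state
// (not user configuration) into template rendering. The endpoints field
// name is hypothetical.
type EtcdExistingState struct {
	EtcdMigrationEnabled           bool
	EtcdMigrationExistingEndpoints string
}

// EtcdStackConfig embeds the existing state so templates rendered with it
// can use {{ .EtcdMigrationEnabled }} directly via field promotion.
type EtcdStackConfig struct {
	ClusterName string
	EtcdExistingState
}

func main() {
	tmpl := template.Must(template.New("cloud-config").Parse(
		"cluster: {{ .ClusterName }}\n" +
			"{{ if .EtcdMigrationEnabled -}}\n" +
			"migration-endpoints: {{ .EtcdMigrationExistingEndpoints }}\n" +
			"{{- end }}\n"))

	cfg := EtcdStackConfig{
		ClusterName: "example",
		EtcdExistingState: EtcdExistingState{
			EtcdMigrationEnabled:           true,
			EtcdMigrationExistingEndpoints: "https://10.0.0.10:2379,https://10.0.0.11:2379",
		},
	}
	if err := tmpl.Execute(os.Stdout, cfg); err != nil {
		panic(err)
	}
}
```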

…rnetes.Networking.SelfHosting is Enabled.

This is to break the dependency that the nodepool stacks have on etcd stack resources.
…rnetes.Networking.SelfHosting is Enabled. (kubernetes-retired#1367)

This is to break the dependency that the nodepool stacks have on etcd stack resources.

Ref kubernetes-retired#1370
…py state from existing etcd over to the new ones during a migration.
…-aws into 0.11.x/remove-etcd-dependency-on-nodepools-when-selfhosted-networking-enabled
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 29, 2018
@davidmccormick davidmccormick changed the title Enable migration from existing 0.10.x clusters without losing all the cluster state Enable migration from existing 0.10.x clusters without losing cluster state Jun 29, 2018
@@ -158,6 +158,57 @@
}
}
},
"SecurityGroupWorker": {
@davidmccormick (Contributor, Author) Jun 29, 2018

This security group needed returning to the control plane. We can remove it again in a later release, but without it the updated control plane stack will throw an error about it being in use by the nodepools.

Contributor

Makes sense. Fine with reviving this then.

// Wish we had something like reference counting to keep AWS resources only while they're used 😆

controlplaneconfig.StackTemplateOptions
UserDataEtcd model.UserData
ExtraCfnResources map[string]interface{}
model.EtcdExistingState
@davidmccormick (Contributor, Author)

This and func NewStackConfig are the only changes from the controlplane version.

@@ -207,6 +207,63 @@ coreos:
[Install]
WantedBy=multi-user.target
{{end}}
{{ if .EtcdMigrationEnabled -}}
@davidmccormick (Contributor, Author)

The migration units: a path unit triggers the import.
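
For context on how a path unit can trigger the import, here is an illustrative cloud-config fragment in the same format as the diff above. The unit names, trigger file path and etcdadm binary location are assumptions for the sketch, not the exact contents of this diff; the /var/run/coreos/etcdadm-environment-migration file and the migration-import-kube-state subcommand come from the PR description.

```yaml
# Illustrative sketch only; unit names, the trigger file path and the
# etcdadm location are assumptions, not the exact contents of this PR.
coreos:
  units:
    - name: etcdadm-migration-import.path
      enable: true
      content: |
        [Path]
        # Fires once the export of /registry has been written to disk,
        # activating the like-named .service unit below.
        PathExists=/var/run/coreos/etcdadm-migration-export-complete

        [Install]
        WantedBy=multi-user.target
    - name: etcdadm-migration-import.service
      content: |
        [Unit]
        Description=Import exported Kubernetes state into the new etcd cluster

        [Service]
        Type=oneshot
        EnvironmentFile=/var/run/coreos/etcdadm-environment-migration
        ExecStart=/opt/bin/etcdadm migration-import-kube-state
```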

@@ -504,17 +504,18 @@ func (c clusterImpl) LegacyUpdate(targets OperationTargets) (string, error) {

func (c clusterImpl) update(cfSvc *cloudformation.CloudFormation, targets OperationTargets) (string, error) {

// Look at existing state of cloud formation and stacks to determine if we need to take special measures in migrating our etcd
// clusters from the control plane stack to their own Etcd stack.
exists, err := cfnstack.NestedStackExists(cfSvc, c.controlPlane.ClusterName, naming.FromStackToCfnResource(c.etcd.Etcd.LogicalName()))
if err != nil {
logger.Errorf("please check your AWS credentials/permissions")
return "", fmt.Errorf("can't lookup AWS CloudFormation stacks: %s", err)
}
@davidmccormick (Contributor, Author)

Only fail fast if we don't have SelfHosting enabled; otherwise take this as our cue to migrate instead!
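
In other words, the absence of the nested etcd stack is only an error when self-hosted networking is disabled; otherwise it becomes the signal to migrate. Below is a minimal sketch of that decision with invented function and parameter names; only the existence check itself corresponds to the NestedStackExists call in the hunk above.

```go
package cluster // hypothetical package name for this sketch

import "fmt"

// decideEtcdMigration sketches the decision described in the comment above.
// etcdStackExists would come from the NestedStackExists lookup; the
// selfHostedNetworking flag is a stand-in for the real configuration check.
func decideEtcdMigration(etcdStackExists, selfHostedNetworking bool) (bool, error) {
	if etcdStackExists {
		// A separate etcd stack already exists, so no migration is needed.
		return false, nil
	}
	if !selfHostedNetworking {
		// Fail fast: without self-hosted networking the nodepool stacks
		// still depend on etcd resources inside the control plane stack.
		return false, fmt.Errorf("etcd migration requires Kubernetes.Networking.SelfHosting to be enabled")
	}
	// Legacy, control-plane-managed etcd found: take this as the cue to migrate.
	return true, nil
}
```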

… etcdconfig which depends on controlplane config 2) Allow mocks to return nil response and not crash lookupExistingEtcdEndpoints
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 2, 2018
@davidmccormick davidmccormick changed the title Enable migration from existing 0.10.x clusters without losing cluster state 0.11.x migration from existing clusters without losing state Jul 2, 2018
@davidmccormick (Contributor, Author)

Fixes issue #1112

"Description" : "The security group assigned to worker nodes",
"Value" : { "Ref" : "SecurityGroupWorker" },
"Export" : { "Name" : {"Fn::Sub": "${AWS::StackName}-WorkerSecurityGroup" }}
},

@mumoshu (Contributor) left a comment

This took some time for me to get, but you did excellent work! LGTM.

@mumoshu mumoshu merged commit cdceab6 into kubernetes-retired:master Jul 7, 2018
@mumoshu mumoshu added this to the v0.11.0 milestone Jul 7, 2018
@davidmccormick davidmccormick deleted the 0.11.x/remove-etcd-dependency-on-nodepools-when-selfhosted-networking-enabled branch January 2, 2019 11:46