callysto: maintenance plan for Wed, Mar 22, 0-4am PDT #2388
Maintenance notes
Upgrading the node pools beyond the k8s api-server wasn't an option like it was in EKS. Looking at these docs from GCP and these from k8s, I conclude that we should always keep the k8s control plane ahead of or equal to the nodes, and let the nodes fall behind by at most 2 minor versions. Previously I understood that it was okay if the nodes were two minor versions ahead of the control plane as well, but that was probably a mixup of mine with a version skew policy for
Upgrading the master must be done one step at a time.
Upgrading a regional GKE cluster (three separate k8s api-servers, done as a rolling update) took ~25 minutes. This seems to reliably take 25-26 minutes.
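The skew rule and the one-step-at-a-time constraint noted above can be sketched as a small pre-flight check. This is only an illustration: the function names are hypothetical, and it assumes that "one step" means one minor version and that versions look like `1.24` or `1.24.9-gke.100`.

```python
from typing import List, Tuple

# Nodes may trail the control plane by at most 2 minor versions (per the notes above).
MAX_NODE_LAG = 2


def parse_minor(version: str) -> Tuple[int, int]:
    """Parse '1.24' or '1.24.9-gke.100' into (major, minor)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def skew_ok(control_plane: str, node: str) -> bool:
    """True if the node version is equal to, or at most MAX_NODE_LAG minors behind,
    the control plane. Nodes ahead of the control plane are rejected."""
    cp_major, cp_minor = parse_minor(control_plane)
    nd_major, nd_minor = parse_minor(node)
    if cp_major != nd_major:
        return False
    lag = cp_minor - nd_minor
    return 0 <= lag <= MAX_NODE_LAG


def control_plane_steps(current: str, target: str) -> List[str]:
    """List the intermediate minor versions to pass through, one step at a time.
    Assumption: one step equals one minor version."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("target must be the same major and not older than current")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]
```

For example, `control_plane_steps("1.22", "1.25")` lists three separate control plane upgrades, and `skew_ok("1.24", "1.25")` is false because the node would be ahead of the control plane.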
Upgrade k8s cluster first, then node pools separately
If terraform changes both the node pools and the master k8s api version, the node pools are destroyed, k8s is upgraded, and then the node pools are added back. Because of this, it's better to do the k8s version bump separately first, as otherwise there is a long downtime while no nodes are available.
Avoid multiple core nodes
If we avoid needing two nodes in the core node pool by following the steps detailed here, with the policy planned here, we can reduce the time it takes to upgrade a k8s version. Upgrading a node pool from one k8s version to another takes ~8 min for the core node pool with two nodes, and causes disruption when pods relocate after the new node has been added (a surge upgrade). Replacing a node pool, for example by also changing its machine type, takes ~4m20s to delete it and ~1m25s to create it, with a disruption of ~5 minutes.
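As a rough planning aid, the timings recorded above can be added up to see whether a set of operations fits the 0-4am window. This is a back-of-the-envelope sketch, not a measurement: the function and constant names are hypothetical, and it assumes operations run sequentially and that the observed durations repeat.

```python
# Durations in seconds, taken from the timings in these notes.
CONTROL_PLANE_STEP_S = 25 * 60      # ~25 min per control plane upgrade (regional cluster)
CORE_POOL_UPGRADE_S = 8 * 60        # ~8 min to upgrade the two-node core node pool
POOL_DELETE_S = 4 * 60 + 20         # ~4m20s to delete a node pool
POOL_CREATE_S = 1 * 60 + 25         # ~1m25s to create a node pool
WINDOW_S = 4 * 60 * 60              # maintenance window of 0-4am, i.e. 4 hours


def estimate_seconds(control_plane_steps: int,
                     pool_upgrades: int = 0,
                     pool_replacements: int = 0) -> int:
    """Sum the durations, assuming every operation runs one after another."""
    return (control_plane_steps * CONTROL_PLANE_STEP_S
            + pool_upgrades * CORE_POOL_UPGRADE_S
            + pool_replacements * (POOL_DELETE_S + POOL_CREATE_S))


def fits_window(total_seconds: int, window_seconds: int = WINDOW_S) -> bool:
    """True if the estimated total fits inside the maintenance window."""
    return total_seconds <= window_seconds
```

Under these assumptions, three control plane steps plus one core node pool upgrade come to `estimate_seconds(3, 1)`, which is 4980 seconds (83 minutes), well inside the four-hour window.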
Maintenance goals
Maintenance steps
n2-highmem-2 machines (from n1-highmem-4)
n2-highmem-4 machines (from n1-highmem-4)
Bonus:
Related
n2-highmem-2 and r5.large machines #2212
Resolved by #2402