callysto: maintenance plan for Wed, Mar 22, 0-4am PDT #2388

Closed · 6 tasks done
consideRatio opened this issue Mar 20, 2023 · 1 comment · Fixed by #2402

consideRatio commented Mar 20, 2023

Maintenance goals

  • Callysto wants to reduce unnecessary running costs
  • I saw an opportunity to:
    • upgrade k8s
    • help ensure stability by giving the proposed core node type and the current user node type a bit more memory
    • reduce costs further using a trick (bonus)

Maintenance steps

  • As requested, make a backup of the NFS storage via the cloud console
  • Update core node pool to use n2-highmem-2 machines (from n1-highmem-4)
  • Update user node pool to use n2-highmem-4 machines (from n1-highmem-4)
  • Upgrade k8s from 1.22 to 1.25
  • As requested, after a successful upgrade and after access to user storage has been verified, remove the backup of the NFS storage via the cloud console (see the gcloud sketch after this list)
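
As a hedged aside, the backup and cleanup steps could also be scripted with the gcloud CLI instead of the cloud console. The disk name, zone, and snapshot name below are hypothetical placeholders, assuming the NFS data lives on a GCE persistent disk:

# Sketch only: disk name, zone, and snapshot name are placeholders.
# Before the maintenance window: snapshot the NFS data disk.
gcloud compute disks snapshot nfs-data-disk \
  --zone=northamerica-northeast1-b \
  --snapshot-names=nfs-backup-2023-03-22 \
  --project=callysto-202316

# After verifying access to user storage post-upgrade: remove the backup snapshot.
gcloud compute snapshots delete nfs-backup-2023-03-22 --project=callysto-202316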

Bonus:

Related

Resolved by #2402

consideRatio commented Mar 22, 2023

Maintenance notes

  • It's important to remember terraform init and terraform workspace select
  • It's good to start by verifying that terraform plan reports no changes before making any edits
  • It's relevant to know which minor versions to step through, which can be listed with terraform output regular_channel_latest_k8s_versions once terraform plan has been run (see the sketch below)
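
A minimal sketch of that command sequence (the workspace name and var-file path are assumptions, not taken from the repo):

# Initialize and select the right terraform workspace before doing anything else.
terraform init
terraform workspace select callysto
# A plan with no local edits should report no changes before the maintenance starts.
terraform plan -var-file=projects/callysto.tfvars
# Once a plan has been run, list the minor versions to step through.
terraform output regular_channel_latest_k8s_versions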

Upgrading the node pools to a version ahead of the k8s api-server wasn't an option, like it was in EKS.

Looking at these docs from GCP and these from k8s, I conclude that we should always keep the k8s control plane at a version ahead of or equal to the nodes, and let the nodes fall behind by at most 2 minor versions. Previously I understood that it was okay if the nodes were two minor versions ahead of the control plane as well, but that was probably a mixup on my part with the version skew policy for kubectl (the CLI) in relation to the control plane.

google_container_node_pool.core: Destruction complete after 4m44s
google_container_node_pool.core: Creating...
╷
│ Error: error creating NodePool: googleapi: Error 400: Node version "1.25.6-gke.1000" must not have a greater minor version than master version "1.23.14-gke.1800"., badRequest

Upgrading the master must be done one step at a time.

google_container_cluster.cluster: Modifying... [id=projects/callysto-202316/locations/northamerica-northeast1/clusters/callysto-cluster]
╷
│ Error: googleapi: Error 400: Master cannot be upgraded to "1.25.6-gke.1000": cannot upgrade the master more than a minor version at a time., badRequest
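
In other words, a 1.22 to 1.25 upgrade needs three separate applies, each bumping the configured control plane version by a single minor version. A sketch, with the edit of the configured version between applies only described in comments:

# Edit the configured k8s control plane version before each apply.
terraform apply   # 1.22 -> 1.23
terraform apply   # 1.23 -> 1.24
terraform apply   # 1.24 -> 1.25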

Upgrading a regional GKE cluster (three separate k8s api-servers, upgraded in a rolling fashion) took ~25 minutes, and seems to reliably take 25-26 minutes.

google_container_cluster.cluster: Modifications complete after 25m2s [id=projects/callysto-202316/locations/northamerica-northeast1/clusters/callysto-cluster]
google_container_cluster.cluster: Modifications complete after 24m54s [id=projects/callysto-202316/locations/northamerica-northeast1/clusters/callysto-cluster]
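
As an aside, the rolling master upgrade can also be followed from the gcloud side while terraform waits; a small sketch, where the filter expression is an assumption:

gcloud container operations list \
  --project=callysto-202316 \
  --location=northamerica-northeast1 \
  --filter="operationType=UPGRADE_MASTER"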

Upgrade k8s cluster first, then node pools separately

If terraform changes both the node pools and the master k8s api version in the same apply, the node pools are destroyed first, then k8s is upgraded, and only then are the node pools re-created. Because of this, it's better to do the k8s version bump separately first, as otherwise there is a long period of downtime while no nodes are available (see the sketch below).
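
A sketch of the two-phase approach, assuming the k8s version and the node pool machine types are both set in the cluster's tfvars file (the file path is a placeholder):

# Phase 1: edit only the k8s version (one minor version at a time, per the note above), then apply.
# The existing node pools keep serving pods during the ~25 min master upgrade.
terraform apply -var-file=projects/callysto.tfvars
# Phase 2: edit the node pool machine types (n1-highmem-4 -> n2-highmem-2 / n2-highmem-4), then apply.
# Only this step destroys and re-creates the node pools, so node downtime is limited to it.
terraform apply -var-file=projects/callysto.tfvars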


Avoid multiple core nodes

If we avoid needing two nodes in the core node pool by doing the steps detailed here, with the policy planned here, we can reduce the time it takes to upgrade a k8s version.


Upgrading a node pool from one k8s version to another takes ~8 min for the core node pool with two nodes, and causes disruption as pods relocate after the new node has been added by the surge upgrade.

Replacing a node pool outright, for example by also changing its machine type, takes ~4m20s to delete it and ~1m25s to create it, with a disruption of ~5 minutes.
