callysto: maintenance plan for Wed, Mar 22, 0-4am PDT #2388

Closed · 6 tasks done
consideRatio opened this issue Mar 20, 2023 · 1 comment · Fixed by #2402

consideRatio commented Mar 20, 2023

Maintenance goals

  • Callysto wants to reduce unnecessary running costs
  • I saw an opportunity to:
    • upgrade k8s
    • help ensure stability by giving the proposed core node type and the current user node type a bit more memory
    • reduce costs further using a trick (bonus)

Maintenance steps

  • As requested, make a backup of the NFS storage via the cloud console
  • Update core node pool to use n2-highmem-2 machines (from n1-highmem-4)
  • Update user node pool to use n2-highmem-4 machines (from n1-highmem-4)
  • Upgrade k8s from 1.22 to 1.25
  • As requested, after a successful upgrade and after access to user storage has been verified, remove the backup of the NFS storage via the cloud console (see the gcloud sketch after this list)
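
As a hedged aside, the backup and cleanup steps could also be scripted with the gcloud CLI instead of the cloud console. The disk name, zone, and snapshot name below are hypothetical placeholders, assuming the NFS data lives on a GCE persistent disk:

# Sketch only: disk name, zone, and snapshot name are placeholders.
# Before the maintenance window: snapshot the NFS data disk.
gcloud compute disks snapshot nfs-data-disk \
  --zone=northamerica-northeast1-b \
  --snapshot-names=nfs-backup-2023-03-22 \
  --project=callysto-202316

# After verifying access to user storage post-upgrade: remove the backup snapshot.
gcloud compute snapshots delete nfs-backup-2023-03-22 --project=callysto-202316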

Bonus:

Related

Resolved by #2402

consideRatio commented Mar 22, 2023

Maintenance notes

  • It's important to remember terraform init and terraform workspace select
  • It's good to start by verifying that terraform plan reports no changes before making any edits
  • It's relevant to know which minor versions to step through, which can be listed with terraform output regular_channel_latest_k8s_versions once terraform plan has been run (see the sketch below)
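
A minimal sketch of that command sequence (the workspace name and var-file path are assumptions, not taken from the repo):

# Initialize and select the right terraform workspace before doing anything else.
terraform init
terraform workspace select callysto
# A plan with no local edits should report no changes before the maintenance starts.
terraform plan -var-file=projects/callysto.tfvars
# Once a plan has been run, list the minor versions to step through.
terraform output regular_channel_latest_k8s_versions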

Upgrading the node pools to a version ahead of the k8s api-server wasn't an option, like it was in EKS.

Looking at these docs from GCP and these from k8s, I conclude that we should always keep the k8s control plane at a version ahead of or equal to the nodes, and let the nodes fall behind by at most 2 minor versions. Previously I understood that it was okay if the nodes were two minor versions ahead of the control plane as well, but that was probably a mixup on my part with the version skew policy for kubectl (the CLI) in relation to the control plane.

google_container_node_pool.core: Destruction complete after 4m44s
google_container_node_pool.core: Creating...
╷
│ Error: error creating NodePool: googleapi: Error 400: Node version "1.25.6-gke.1000" must not have a greater minor version than master version "1.23.14-gke.1800"., badRequest

Upgrading the master must be done one step at a time.

google_container_cluster.cluster: Modifying... [id=projects/callysto-202316/locations/northamerica-northeast1/clusters/callysto-cluster]
╷
│ Error: googleapi: Error 400: Master cannot be upgraded to "1.25.6-gke.1000": cannot upgrade the master more than a minor version at a time., badRequest
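
In other words, a 1.22 to 1.25 upgrade needs three separate applies, each bumping the configured control plane version by a single minor version. A sketch, with the edit of the configured version between applies only described in comments:

# Edit the configured k8s control plane version before each apply.
terraform apply   # 1.22 -> 1.23
terraform apply   # 1.23 -> 1.24
terraform apply   # 1.24 -> 1.25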

Upgrading a regional GKE cluster (three separate k8s api-servers, upgraded in a rolling fashion) took ~25 minutes, and seems to reliably take 25-26 minutes.

google_container_cluster.cluster: Modifications complete after 25m2s [id=projects/callysto-202316/locations/northamerica-northeast1/clusters/callysto-cluster]
google_container_cluster.cluster: Modifications complete after 24m54s [id=projects/callysto-202316/locations/northamerica-northeast1/clusters/callysto-cluster]
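
As an aside, the rolling master upgrade can also be followed from the gcloud side while terraform waits; a small sketch, where the filter expression is an assumption:

gcloud container operations list \
  --project=callysto-202316 \
  --location=northamerica-northeast1 \
  --filter="operationType=UPGRADE_MASTER"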

Upgrade k8s cluster first, then node pools separately

If terraform changes both the node pools and the master k8s api version in the same apply, the node pools are destroyed first, then k8s is upgraded, and only then are the node pools re-created. Because of this, it's better to do the k8s version bump separately first, as otherwise there is a long period of downtime while no nodes are available (see the sketch below).
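
A sketch of the two-phase approach, assuming the k8s version and the node pool machine types are both set in the cluster's tfvars file (the file path is a placeholder):

# Phase 1: edit only the k8s version (one minor version at a time, per the note above), then apply.
# The existing node pools keep serving pods during the ~25 min master upgrade.
terraform apply -var-file=projects/callysto.tfvars
# Phase 2: edit the node pool machine types (n1-highmem-4 -> n2-highmem-2 / n2-highmem-4), then apply.
# Only this step destroys and re-creates the node pools, so node downtime is limited to it.
terraform apply -var-file=projects/callysto.tfvars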


Avoid multiple core nodes

If we avoid needing two nodes in the core node pool by doing the steps detailed here, with the policy planned here, we can reduce the time it takes to upgrade a k8s version.


Upgrading a node pool from one k8s version to another takes ~8 min for the core node pool with two nodes, and causes disruption as pods relocate after the new node has been added by the surge upgrade.

Replacing a node pool outright, for example by also changing its machine type, takes ~4m20s to delete it and ~1m25s to create it, with a disruption of ~5 minutes.
