Administering our production environment

Install tools

  1. Install kubectl: https://kubernetes.io/docs/tasks/tools/install-kubectl/
  2. Install GCloud: https://cloud.google.com/sdk/docs/downloads-interactive
  3. Run gcloud init to authenticate with Google Cloud
  4. Run gcloud container clusters get-credentials workbench --zone us-central1-b --project cj-workbench to make kubectl work
  5. Install jq and make sure it's in your $PATH: https://stedolan.github.io/jq/
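
A quick way to sanity-check the installs (just a sketch; exact version output will vary):

kubectl version --client
gcloud --version
jq --version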

To test that everything is well-wired: kubectl -n production logs -lapp=frontend-app should show you the latest log messages.

Deploy a new feature

  1. Code, test, integration-test, commit and push the feature to master.
  2. Wait for tests to pass -- https://github.com/CJWorkbench/cjworkbench/commits shows latest successes and failures with dots, checkmarks and Xs.
  3. Run deploy/update-staging-to-latest-passing-master to deploy the most recent passing commit on master to staging.
  4. Test the feature on staging: https://staging.workbenchdata.com.
  5. Run deploy/update-production-to-staging to make production match staging.
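
To double-check which commit each environment is running, you can read the image tag off a deployment (a sketch; the deployment name frontend-deployment is assumed from the yaml file names below, and the tag is the git SHA1):

kubectl -n staging get deployment frontend-deployment -o jsonpath='{.spec.template.spec.containers[0].image}'
kubectl -n production get deployment frontend-deployment -o jsonpath='{.spec.template.spec.containers[0].image}'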

Reverting a deployment

In case of disaster, run deploy/advanced-deploy production [SHA1] to revert to a previous version. But we don't revert database migrations, so anticipate chaos if you revert to a SHA1 from before a migration that breaks old code.
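
To find a SHA1 to roll back to, one option is to list recently published image tags (a sketch; any passing commit from git log works too):

gcloud container images list-tags gcr.io/cj-workbench/frontend --limit=10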

List running pods

kubectl -n production get pods or kubectl -n staging get pods
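
Add -w to keep watching as pods come and go, e.g. during a deploy:

kubectl -n production get pods -w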

Reboot a server

Use the provided script:

deploy/restart-because-you-know-what-you-are-doing ENV SERVICE

where ENV is staging or production and SERVICE is cron, fetcher, frontend or renderer.
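
For example, to restart the staging fetcher:

deploy/restart-because-you-know-what-you-are-doing staging fetcher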

To do this manually, go to the Google Cloud console, navigate to the pod (not the deployment) that is having problems, and delete the pod (not the deployment).
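
The same thing from the command line: list the pods, then delete the broken one (the pod name here is just an example; Kubernetes recreates it automatically because it belongs to a deployment):

kubectl -n production get pods
kubectl -n production delete pod frontend-deployment-644866b6d4-fcmh7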

To reboot the Wordpress database, SSH into the instance with gcloud compute ssh [wordpress], then run systemctl restart apache2 and systemctl restart mysql on it.

Viewing logs

Go to the deployment page on Google Cloud Console and click "container logs"

Or, to view them in a terminal, kubectl -n production logs -f [pod name]

To view logs from all the pods that run the same app, e.g. all web servers, kubectl -n production logs -f -lapp=frontend-app
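
kubectl logs also accepts --tail and --since if you only want recent output, e.g.:

kubectl -n production logs --tail=100 --since=1h -lapp=frontend-app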

Changing environment variables

Environment variables are set per-pod in individual yaml files, e.g. frontend-deployment.yaml

To view current values for a particular pod, you can run kubectl -n production exec -it frontend-deployment-644866b6d4-fcmh7 -- env (substitute a current pod name from get pods).
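
If you'd rather not look up a live pod name, you can read the env section straight from the deployment spec instead (a sketch; the deployment name is assumed, and jq is just for readability):

kubectl -n production get deployment frontend-deployment -o json | jq '.spec.template.spec.containers[0].env'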

Many environment variables are secrets, which can be set through kubectl like this: kubectl edit secret cjw-intercom-secret --namespace production, or through the Google Cloud console.
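
Secret values are base64-encoded, so to read (rather than edit) one, something like this works (a sketch; @base64d needs jq 1.6+):

kubectl -n production get secret cjw-intercom-secret -o json | jq '.data | map_values(@base64d)'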

Architecture Notes

  • We use three namespaces: kube-system (no choice on this one), production and staging. We do not use the default namespace.
  • Currently, CJWorkbench depends on a shared filesystem. With much pain, we have deployed NFS to each namespace. Patches welcome: we plan to move importedmodules to the database and saveddata (a.k.a. media) to Minio (an S3-compatible server that saves to object storage in production and filesystem in development).
  • Each namespace has these services: frontend (the website), backend (cron, essentially), database (PostgreSQL, for most data) and redis (which lets backend send messages to users via any frontend, over Websockets). We also run migrate each deploy.
  • Images are stored publicly at gcr.io/cj-workbench/{frontend|backend|migrate}:{sha1}.
  • We only publish images after they pass integration tests. We only deploy images that have been published.
  • "deploy" means:
    1. Run migrate
    2. Roll out new backend by killing old one and starting new one
    3. Roll out new frontend by starting new one, connecting nginx to it, disconnecting nginx from the old one, and killing the old one.
  • There's a race: migrate runs while old versions of frontend and backend are reading and writing the database. Approaches for handling the race:
    1. When deleting columns/tables, try a two-phase deploy:
      1. Commit and deploy code without a migration that will work both before and after the migration is applied. For instance, if the migration deletes a table, deploy code that ignores the table.
      2. Commit and deploy the migration.
    2. When adding columns with NOT NULL, make sure they're optional for a while:
      1. Commit and deploy a migration and code that allow NULL in the column. The old code can ignore the migration; the new code won't write NULLs.
      2. Commit and deploy a migration that rewrites the NULLs in the column (and adds the NOT NULL constraint). The code from the previous step won't misbehave.
    3. Alternatively, test very carefully and plan for the downtime. (It may only last a few seconds.)

RabbitMQ stuck (seen 2019-01-25)

RabbitMQ runs on a high-availability cluster, with three nodes. Soon after we deployed on staging (but not production), one of these nodes became "stuck" (deadlocked) on 2019-01-25.

"Stuck" means:

  • One RabbitMQ node's heartbeat checks continued to succeed.
  • It accepted TCP connections, but it did not complete auth attempts.
  • It did not log any new messages. (As of 2019-02-06, the last log message was from 2019-01-25. It was a long week.)
  • kubectl -n staging exec -it rabbitmq-1-rabbitmq-0 -- rabbitmq-diagnostics maybe_stuck revealed thousands of stuck processes.
  • Workbench did not try to reconnect, because the TCP connection never closed.
  • Deleting the pod (kubectl -n staging delete pod rabbitmq-1-rabbitmq-0) caused it to restart and solved the issue. (Workbench reconnected correctly.)

Should this appear to happen again, diagnose it with the maybe_stuck command above on each of the three rabbitmq nodes (rabbitmq-1-rabbitmq-0, rabbitmq-1-rabbitmq-1 and rabbitmq-1-rabbitmq-2) in the environment in question (production or staging). Only after confirming a node is indeed stuck should you delete its pod with the kubectl delete pod command above.
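
In short (staging shown; substitute the affected node and namespace):

kubectl -n staging exec -it rabbitmq-1-rabbitmq-0 -- rabbitmq-diagnostics maybe_stuck
# only if that reports stuck processes:
kubectl -n staging delete pod rabbitmq-1-rabbitmq-0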