# Administering our production environment
- Install `kubectl`: https://kubernetes.io/docs/tasks/tools/install-kubectl/
- Install the Google Cloud SDK: https://cloud.google.com/sdk/docs/downloads-interactive
- Run `gcloud init` to authenticate with Google Cloud
- Run `gcloud container clusters get-credentials workbench --zone us-central1-b --project cj-workbench` to make `kubectl` work
- Install `jq` and make sure it's in your `$PATH`: https://stedolan.github.io/jq/

To test that everything is well-wired, run `kubectl -n production logs -lapp=frontend-app`; it should show you the latest log messages.
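Assuming `kubectl`, the Cloud SDK and `jq` are installed, the one-time wiring is just two commands plus the sanity check:

```sh
# Authenticate with Google Cloud
gcloud init

# Point kubectl at the workbench cluster
gcloud container clusters get-credentials workbench \
    --zone us-central1-b --project cj-workbench

# Sanity check: recent frontend log messages should print
kubectl -n production logs -lapp=frontend-app
```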
- Code, test, integration-test, commit and push the feature to master.
- Wait for tests to pass -- https://github.com/CJWorkbench/cjworkbench/commits shows the latest successes and failures with dots, checkmarks and Xs.
- Run `deploy/update-staging-to-latest-passing-master` to deploy the most recent passing commit of master.
- Test the feature on staging: https://staging.workbenchdata.com.
- Run `deploy/update-production-to-staging` to make production match staging.
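In shell terms, a routine release looks like this (a sketch of the happy path, using the scripts above):

```sh
git push origin master           # after code review and local tests
# ...wait for CI to mark the commit green...
deploy/update-staging-to-latest-passing-master
# ...verify the feature at https://staging.workbenchdata.com...
deploy/update-production-to-staging
```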
In case of disaster: run `deploy/advanced-deploy production [SHA1]` to revert to a previous version. But we don't revert database migrations, so anticipate chaos if you revert to a SHA1 from before a migration that breaks old code.
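For example (the SHA1 is whichever known-good commit you pick from the history):

```sh
git log --oneline                          # find a known-good, previously-deployed commit
deploy/advanced-deploy production [SHA1]   # substitute the commit you chose
```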
To see what's running: `kubectl -n production get pods` or `kubectl -n staging get pods`.
To restart a misbehaving service, use the provided script: `deploy/restart-because-you-know-what-you-are-doing ENV SERVICE`, where `ENV` is `staging` or `production` and `SERVICE` is `cron`, `fetcher`, `frontend` or `renderer`.

To do this manually, from the Google Cloud project console, navigate to the pod (not the deployment) that is having problems, and delete the pod (not the deployment).
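For example, to bounce the staging fetcher, or to do the same by hand with `kubectl` (the pod name below is hypothetical; copy the real one from `get pods`):

```sh
deploy/restart-because-you-know-what-you-are-doing staging fetcher

# Manual equivalent: delete the pod and let its deployment recreate it
kubectl -n staging get pods
kubectl -n staging delete pod fetcher-deployment-abc123-xyz45   # hypothetical pod name
```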
To reboot the Wordpress database: `gcloud compute ssh [wordpress]`, then `systemctl restart apache2; systemctl restart mysql`.
Go to the deployment page on the Google Cloud Console and click "Container logs". Or, to view logs at the console: `kubectl -n production logs -f [container id]`. To follow logs from every pod matching a label, e.g. all web servers: `kubectl -n production logs -f -lapp=frontend-app`.
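To find the pod name to pass to `logs`, list pods first:

```sh
kubectl -n production get pods
kubectl -n production logs -f frontend-deployment-644866b6d4-fcmh7   # a pod name from get pods
```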
Environment variables are set per-pod in individual YAML files, e.g. `frontend-deployment.yaml`. To view current values for a particular pod: `kubectl -n production exec -it frontend-deployment-644866b6d4-fcmh7 env`. Many environment variables are secrets, which can be set through `kubectl` like this: `kubectl edit secret cjw-intercom-secret --namespace production`, or through the Google Cloud console.
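To inspect a secret without editing it, `jq` (from the setup list above) helps; values are base64-encoded, and the key name below is hypothetical:

```sh
# List the keys in a secret
kubectl -n production get secret cjw-intercom-secret -o json | jq '.data | keys'

# Decode one value (key name is hypothetical; base64 flags vary by OS)
kubectl -n production get secret cjw-intercom-secret -o json \
    | jq -r '.data.ACCESS_TOKEN' | base64 --decode
```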
- We use three namespaces: `kube-system` (no choice on this one), `production` and `staging`. We do not use the `default` namespace.
- Currently, CJWorkbench depends on a shared filesystem. With much pain, we have deployed NFS to each namespace. Patches welcome: we plan to move `importedmodules` to the database and `saveddata` (a.k.a. `media`) to Minio (an S3-compatible server that saves to object storage in production and to the filesystem in development).
- Each namespace has these services: `frontend` (the website), `backend` (cron, essentially), `database` (Postgresql, for most data) and `redis` (which lets `backend` send messages to users via any `frontend`, over Websockets). We also run `migrate` each deploy.
- Images are stored publicly at `gcr.io/cj-workbench/{frontend|backend|migrate}:{sha1}`.
- We only publish images after they pass integration tests. We only deploy images that have been published.
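Since the images are public, you can list and pull them directly (a sketch; the repository names come from the pattern above):

```sh
gcloud container images list-tags gcr.io/cj-workbench/frontend --limit=10
docker pull gcr.io/cj-workbench/frontend:[sha1]   # substitute a published sha1
```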
- "deploy" means:
- Run
migrate
- Roll out new
backend
by killing old one and starting new one - Roll out new
frontend
by starting new one, connecting nginx to it, disconnecting nginx from the old one, and killing the old one.
- Run
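To watch a rollout from the console (the deployment name matches the YAML file mentioned above):

```sh
kubectl -n production rollout status deployment/frontend-deployment
kubectl -n production get pods -lapp=frontend-app   # new pods replace old ones
```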
- There's a race: `migrate` runs while old versions of `frontend` and `backend` are reading and writing the database. Approaches for handling the race:
  - When deleting columns/tables, try a two-phase deploy (see the sketch after this list):
    - Commit and deploy code, without a migration, that will work both before and after the migration is applied. For instance, if the migration deletes a table, deploy code that ignores the table.
    - Commit and deploy the migration.
  - When adding columns with `NOT NULL`, make sure they're optional for a while:
    - Commit and deploy a migration and code that allow `NULL` in the column. The old code can ignore the migration; the new code won't write NULLs.
    - Commit and deploy a migration that rewrites `NULL` in the column. The code from the previous step won't misbehave.
  - Alternatively, test very carefully and plan for the downtime. (It may only last a few seconds.)
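The two-phase column/table deletion, sketched with the deploy scripts above (each phase is its own commit-and-deploy cycle):

```sh
# Phase 1: code that ignores the doomed table, no migration yet
git push origin master
deploy/update-staging-to-latest-passing-master
deploy/update-production-to-staging

# Phase 2: the migration that actually drops the table
git push origin master
deploy/update-staging-to-latest-passing-master
deploy/update-production-to-staging
```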
RabbitMQ runs on a high-availability cluster with three nodes. On 2019-01-25, soon after we deployed on staging (but not production), one of these nodes became "stuck" (deadlocked).

"Stuck" means:

- The node's heartbeat checks continued to succeed.
- It accepted TCP connections, but it did not complete auth attempts.
- It did not log any new messages. (As of 2019-02-06, the last log message was from 2019-01-25. It was a long week.)
- `kubectl -n staging exec -it rabbitmq-1-rabbitmq-0 -- rabbitmq-diagnostics maybe_stuck` revealed thousands of stuck processes.
- Workbench did not try to reconnect, because the TCP connection never closed.
- Deleting the pod (`kubectl -n staging delete pod rabbitmq-1-rabbitmq-0`) caused it to restart and solved the issue. (Workbench reconnected correctly.)

Should this appear to happen again, diagnose it with the `maybe_stuck` command above on any of the three RabbitMQ nodes (`rabbitmq-1-rabbitmq-0`, `rabbitmq-1-rabbitmq-1`, `rabbitmq-1-rabbitmq-2`) in the environment in question (`production` or `staging`); only after confirming the pods are indeed stuck should you delete the pod with the `kubectl delete pod` command above.
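A quick way to check all three nodes at once (staging shown; swap in `-n production` as needed):

```sh
for i in 0 1 2; do
  echo "=== rabbitmq-1-rabbitmq-$i ==="
  kubectl -n staging exec rabbitmq-1-rabbitmq-$i -- rabbitmq-diagnostics maybe_stuck
done

# Only after confirming a node is stuck:
kubectl -n staging delete pod rabbitmq-1-rabbitmq-0
```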