Administering our production environment
- Install kubectl: https://kubernetes.io/docs/tasks/tools/install-kubectl/
- Install GCloud: https://cloud.google.com/sdk/docs/downloads-interactive
- Install `jq` and make sure it's in your `$PATH`: https://stedolan.github.io/jq/
- Run `gcloud init` to authenticate with Google Cloud
- Run `gcloud container clusters get-credentials workbench --zone us-central1-b --project workbenchdata-production` to make `kubectl` work

To test that everything is well-wired: `kubectl logs -lapp=frontend-app` should show you the latest log messages.
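End to end, the setup boils down to a few commands (a sketch, assuming the tools from the install links above are already on your `$PATH`):

```sh
# Authenticate with Google Cloud (opens a browser window)
gcloud init

# Point kubectl at the production cluster
gcloud container clusters get-credentials workbench \
  --zone us-central1-b \
  --project workbenchdata-production

# Sanity check: recent frontend log lines should appear
kubectl logs -lapp=frontend-app
```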
- Code, test, integration-test, commit and push the feature to master.
- Wait for tests to pass and for auto-deploy to staging -- https://github.com/CJWorkbench/cjworkbench/commits shows the latest successes and failures with dots, checkmarks and Xs.
- Test the feature on staging: https://app.workbenchdata-staging.com
- Run `deploy/update-production-to-staging` to make production match staging.

In case of disaster, run `deploy/advanced-deploy production [SHA1]` to revert to a previous version. But we don't revert database migrations, so anticipate chaos if you revert to a SHA1 from before a migration that breaks old code.
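In practice (a sketch; the rollback target is whatever published commit you trust):

```sh
# Normal promotion: make production match what is running on staging
deploy/update-production-to-staging

# Emergency rollback to a previously published image
deploy/advanced-deploy production <sha1-of-known-good-commit>
```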
To list the running pods: `kubectl -n staging get pods` or `kubectl -n production get pods`.
Use the provided script (example below): `deploy/restart-because-you-know-what-you-are-doing ENV SERVICE`, where `ENV` is `staging` or `production` and `SERVICE` is `cron`, `fetcher`, `frontend` or `renderer`.

To do this manually, from the Google Cloud project console, navigate to the pod (not deployment) that is having problems, and delete the pod (not the deployment).
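For example, restarting the fetchers on staging with the script:

```sh
# Restarts every fetcher pod in the staging namespace
deploy/restart-because-you-know-what-you-are-doing staging fetcher
```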
To reboot the Wordpress database: `gcloud compute ssh [wordpress]`, then `systemctl restart apache2; systemctl restart mysql` on the instance.
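Roughly (a sketch; `[wordpress]` is the Wordpress instance name, and the restarts run on that instance, not locally):

```sh
# SSH into the Wordpress instance
gcloud compute ssh [wordpress]

# ...then, on the instance (prefix with sudo if your account needs it):
systemctl restart apache2
systemctl restart mysql
```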
Run `deploy/clear-render-cache` to clear the render cache. This will force a re-execute of every workflow. That can get expensive: users will notice a slowdown for a few minutes.
Or, to view at the console: `kubectl logs -f [pod id] [container name]`.
To view many containers that run in the same pod, e.g. all web servers: `kubectl logs -f -lapp=frontend-app frontend`
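Concretely (a sketch; the pod name is a hypothetical example -- get a real one from `kubectl get pods`):

```sh
# Tail one specific frontend pod's "frontend" container
kubectl -n production logs -f frontend-deployment-644866b6d4-fcmh7 -c frontend

# Tail all frontend pods at once
kubectl -n production logs -f -lapp=frontend-app -c frontend
```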
To open a database shell: `kubectl exec -it frontend-deployment-[tab-complete] -- python ./manage.py dbshell`
Environment variables are set per-pod in individual yaml files, e.g. `frontend-deployment.yaml`.
To view current values for a particular pod: `kubectl -n production exec -it frontend-deployment-644866b6d4-fcmh7 -- env`
Many environment variables are secrets, which can be set through kubectl like this: `kubectl edit secret cjw-intercom-secret --namespace production`, or through the Google Cloud console.
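To inspect (rather than edit) a secret, something like this works (a sketch; `SOME_KEY` is whichever key appears under `data:` -- the values are base64-encoded):

```sh
# Dump the whole secret; each value under "data:" is base64-encoded
kubectl -n production get secret cjw-intercom-secret -o yaml

# Decode a single key
kubectl -n production get secret cjw-intercom-secret \
  -o jsonpath='{.data.SOME_KEY}' | base64 --decode
```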
- Each namespace has these services: `frontend` (the website), `cron`, `fetcher`, `renderer`, `database` (Postgresql, for most data), `rabbitmq` (which powers Websockets and fetch+render queues), and `minio` (which stores all files -- on Google Cloud Storage).
- Images are stored publicly at `gcr.io/workbenchdata-ci` and tagged by Git commit sha1 (see the listing commands below).
- We only publish images after they pass integration tests. We only deploy images that have been published. We auto-deploy to staging.
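To see what's published, the standard gcloud registry commands apply (a sketch; the `frontend` image name is an assumption -- list the repository to see the real names):

```sh
# List the image repositories in the public registry
gcloud container images list --repository=gcr.io/workbenchdata-ci

# List the tags (git sha1s) of one image
gcloud container images list-tags gcr.io/workbenchdata-ci/frontend
```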
- "deploy" means:
- Run
migrate
- Rolling-deploy
frontend
,fetcher
andrenderer
. Kill-deploycron
(because it's a singleton). - Wait for kubernetes rolling deploys to finish.
- Run
- There's a race:
migrate
runs while old versions offrontend
,cron
,renderer
andfetcher
are reading and writing the database. Approaches for handling the race:- When deleting columns/tables, try a two-phase deploy:
- Commit and deploy code without a migration that will work both before and after the migration is applied. For instance, if the migration deletes a table, deploy code that ignores the table.
- Commit and deploy the migration.
- When adding columns with
NOT NULL
, make sure they're optional for a while:- Commit and deploy a migration and code that allow
NULL
in the column. The old code can ignore the migration; the new code won't write NULLs. - Commit and deploy a migration that rewrites
NULL
in the column. The code from the previous step won't misbehave.
- Commit and deploy a migration and code that allow
- Alternatively, test very carefully and plan for the downtime. (It may only last a few seconds.)
- When deleting columns/tables, try a two-phase deploy:
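To watch a rolling deploy finish by hand, `kubectl rollout status` works (a sketch; the deployment names are assumptions based on the pod names above, e.g. `frontend-deployment`):

```sh
# Block until the frontend's rolling deploy completes (or fails)
kubectl -n production rollout status deployment/frontend-deployment

# Same idea for the other rolling-deployed services (names are assumptions)
kubectl -n production rollout status deployment/fetcher-deployment
kubectl -n production rollout status deployment/renderer-deployment
```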
RabbitMQ runs on a high-availability cluster, with three nodes. Soon after we deployed on staging (but not production), one of these nodes became "stuck" (deadlocked) on 2019-01-25.
"Stuck" means:
- One RabbitMQ node's heartbeat checks continued to succeed.
- It accepted TCP connections, but it did not complete auth attempts.
- It did not log any new messages. (As of 2019-02-06, the last log message was from 2019-01-25. It was a long week.)
- `kubectl -n staging exec -it rabbitmq-1-rabbitmq-0 -- rabbitmq-diagnostics maybe_stuck` revealed thousands of stuck processes.
- Workbench did not try to reconnect, because the TCP connection never closed.
- Deleting the pod (`kubectl -n staging delete pod rabbitmq-1-rabbitmq-0`) caused it to restart and solved the issue. (Workbench reconnected correctly.)
Should this appear to happen again, diagnose it with the `maybe_stuck` command above on any of the three rabbitmq nodes -- `rabbitmq-1-rabbitmq-0`, `rabbitmq-1-rabbitmq-1`, `rabbitmq-1-rabbitmq-2` -- in the environment in question (`production` or `staging`). Only after confirming a pod is indeed stuck should you delete it with the `kubectl delete pod` command above (see the sketch below).
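A minimal diagnose-then-fix sequence, assuming staging and that node 0 turns out to be the stuck one:

```sh
# Check each node for stuck (deadlocked) Erlang processes
for i in 0 1 2; do
  kubectl -n staging exec "rabbitmq-1-rabbitmq-$i" -- rabbitmq-diagnostics maybe_stuck
done

# Only if a node reports thousands of stuck processes: delete that pod.
# Kubernetes restarts it, and Workbench reconnects on its own.
kubectl -n staging delete pod rabbitmq-1-rabbitmq-0
```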
- Sign in at https://dashboard.stripe.com
- Create an account, named after the company
- Create a Product (Premium Plan), and a monthly Price.
- Go to Settings -> Customer Portal:
  - Allow customers to view their billing history
  - Allow customers to update their billing address
  - Allow customers to update their payment methods
  - Allow customers to cancel subscriptions -> Cancel Immediately -> Prorate canceled subscriptions
  - Set Headline, and set links to https://workbenchdata.com/terms-of-service and https://workbenchdata.com/privacy
  - Set the default redirect link to https://app.workbenchdata.com/settings/billing (well, https://app.workbenchdata-staging.com/settings/billing on staging)
- Go to Settings -> Branding. Adjust.
- Go to Settings -> Invoice Template. Adjust.
- Go to Settings -> Subscriptions and Emails:
  - Send emails about expiring cards
  - Use Smart Retries
  - Send emails when card payment fails
  - Send a Stripe-hosted link for cardholders to authenticate when required
  - Send reminders after 3, 5, 7 days
  - Don't send invoices to customers
  - Click lots of Save buttons
- Go to Settings -> Emails. Add "workbenchdata.com" and verify it. Email customers about Successful payments and Refunds.
- Copy everything to production.
- Configure staging and production secrets in Kubernetes deployments and redeploy. (Staging secrets are Stripe's "test mode"; Production secrets are its non-test mode.)
  - On https://dashboard.stripe.com/test/webhooks (or non-"test" in production), add the endpoint `https://app.workbenchdata.com/stripe/webhook`. Make the description point to this wiki page. For "Events to send", see the docstrings in `cjworkbench/views/stripe.py`.
  - Look up the signing secret of the webhook. Let's call it `$STRIPE_WEBHOOK_SIGNING_SECRET`.
  - At https://dashboard.stripe.com/test/apikeys (or non-"test" in production), copy/paste `$STRIPE_PUBLIC_API_KEY` and `$STRIPE_API_KEY` (the "Publishable key" and "Secret key", respectively).
  - Run `kubectl --context="$KUBECTL_CONTEXT" create secret generic cjw-stripe-secret --from-literal=STRIPE_PUBLIC_API_KEY="$STRIPE_PUBLIC_API_KEY" --from-literal=STRIPE_API_KEY="$STRIPE_API_KEY" --from-literal=STRIPE_WEBHOOK_SIGNING_SECRET="$STRIPE_WEBHOOK_SIGNING_SECRET"`
  - Restart the `frontend` pods with the new environment variables (derived from the secret).
- Synchronize: `kubectl exec -it SOME_FRONTEND -c frontend -- python ./manage.py import-plans-from-stripe` (see the sketch below for the whole sequence).
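Pulled together, the Kubernetes side looks roughly like this (a sketch, assuming the production environment, that `$KUBECTL_CONTEXT` points at the matching cluster, and that the three `$STRIPE_*` variables hold the values collected above; `SOME_FRONTEND` is any frontend pod name from `kubectl get pods`):

```sh
# Store the three Stripe values as one Kubernetes secret
kubectl --context="$KUBECTL_CONTEXT" create secret generic cjw-stripe-secret \
  --from-literal=STRIPE_PUBLIC_API_KEY="$STRIPE_PUBLIC_API_KEY" \
  --from-literal=STRIPE_API_KEY="$STRIPE_API_KEY" \
  --from-literal=STRIPE_WEBHOOK_SIGNING_SECRET="$STRIPE_WEBHOOK_SIGNING_SECRET"

# Restart the frontend pods so they pick up the new environment variables
deploy/restart-because-you-know-what-you-are-doing production frontend

# Copy the Product/Price configuration from Stripe into Workbench
kubectl exec -it SOME_FRONTEND -c frontend -- python ./manage.py import-plans-from-stripe
```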