Debugging!

This file serves as a collection of commands, queries, and other tricks for debugging our systems in production. The hope is that it will grow organically as we debug various issues, though at some point it may need to be reorganized. For now, make each subheader the class of issue that its section helps to debug.

Too Many ProcessorJobs Were Queued Per OriginalFile

In the scenario where, for some reason, too many ProcessorJobs were queued per OriginalFile, you may want to leave one ProcessorJob per OriginalFile so that it will either run or preserve the record of the work that was done. The following queries, used in concert, will leave one ProcessorJob per OriginalFile and delete the rest. The ProcessorJob with the smallest id will be kept for each OriginalFile.

WARNING: Any time you run a DELETE query, you should first replace DELETE with SELECT * or SELECT COUNT(*) to make sure that you know what you will be deleting and that it makes sense.

DELETE FROM processor_jobs WHERE id NOT IN
(SELECT pj_id FROM
        (SELECT MIN(processorjob_originalfile_associations.processor_job_id) pj_id, original_file_id
         FROM processorjob_originalfile_associations
         GROUP BY original_file_id) AS pjs);
DELETE FROM processorjob_originalfile_associations WHERE processor_job_id NOT IN
(SELECT pj_id FROM
        (SELECT MIN(processorjob_originalfile_associations.processor_job_id) pj_id, original_file_id
         FROM processorjob_originalfile_associations
         GROUP BY original_file_id) AS pjs);
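
Before and after running these, it can help to check how many OriginalFiles actually have more than one ProcessorJob attached. Here is a minimal Django-shell sketch of that check, assuming the association table is exposed as a ProcessorJobOriginalFileAssociation model importable from data_refinery_common.models (adjust the import if it lives elsewhere):

from django.db.models import Count

# NOTE: the model name and import path below are assumptions; check data_refinery_common.models.
from data_refinery_common.models import ProcessorJobOriginalFileAssociation

# Number of OriginalFiles with more than one ProcessorJob attached.
# This should drop to zero once the DELETEs above have been run.
ProcessorJobOriginalFileAssociation.objects.values("original_file_id").annotate(
    num_jobs=Count("processor_job_id")
).filter(num_jobs__gt=1).count()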

Debugging GitHub Actions workflows

CircleCI had a nice button called "Rerun job with SSH" that made debugging simple, but GitHub hasn't added that feature (yet). I was trying to debug our end-to-end tests timing out, which meant I needed access to the Nomad logs, so I looked into alternative ways to SSH into the machine while it was running the tests.

I tried an action called action-tmate, but I think tmate times out after a certain amount of time, so I couldn't stay SSHed in long enough to actually observe what was wrong.

What I settled on was adapted from this blogpost:

  1. Fork the refinebio repo (this lets you set your own secrets on it and run tests without bothering other people).

  2. Create an ngrok account. ngrok is a service that lets you tunnel local connections to a public URL hosted by them, including SSH connections.

  3. Add this step to the workflow you are trying to debug in config.yaml:

- name: Start SSH via Ngrok
  run: curl -sL https://gist.githubusercontent.com/retyui/7115bb6acf151351a143ec8f96a7c561/raw/7099b9db76729dc5761da72aa8525f632d8875c9/debug-github-actions.sh | bash
  env:
    NGROK_TOKEN: ${{ secrets.NGROK_TOKEN }}
    USER_PASS: ${{ secrets.USER_PASS }}
  4. Set the USER_PASS and NGROK_TOKEN secrets in your fork. USER_PASS should be the password you want to use when logging in over SSH, and NGROK_TOKEN should be your ngrok auth token.

  5. Optionally add this step to the end of the workflow you are trying to debug. It keeps the runner alive for an hour after something fails:

- name: Don't kill instance
  if: ${{ failure() }}
  run: sleep 1h # Keep the runner alive for an hour after a failure

Now, when you re-run the action, the Start SSH via Ngrok step of the workflow will print out a URL that you can SSH into using the password set in USER_PASS, and you can start debugging the failed workflow. Our repo is in the work/refinebio/refinebio folder.

Emptying the queue

If you ever need to empty the queue of processor and downloader jobs, you can run this in a Django shell on the foreman:

from data_refinery_common.models.jobs.processor_job import ProcessorJob
from data_refinery_common.models.jobs.downloader_job import DownloaderJob
from django.utils import timezone
from datetime import datetime

JOB_CREATED_AT_CUTOFF = datetime(2021, 6, 23, tzinfo=timezone.utc)

ProcessorJob.failed_objects.filter(created_at__gt=JOB_CREATED_AT_CUTOFF).exclude(
    pipeline_applied="JANITOR"
).update(no_retry=True)
ProcessorJob.lost_objects.filter(created_at__gt=JOB_CREATED_AT_CUTOFF).exclude(
    pipeline_applied="JANITOR"
).update(no_retry=True)
ProcessorJob.hung_objects.filter(created_at__gt=JOB_CREATED_AT_CUTOFF).exclude(
    pipeline_applied="JANITOR"
).update(no_retry=True)

DownloaderJob.failed_objects.filter(created_at__gt=JOB_CREATED_AT_CUTOFF).update(no_retry=True)
DownloaderJob.lost_objects.filter(created_at__gt=JOB_CREATED_AT_CUTOFF).update(no_retry=True)
DownloaderJob.hung_objects.filter(created_at__gt=JOB_CREATED_AT_CUTOFF).update(no_retry=True)

This sets all recent failed, lost, and hung processor and downloader jobs to not retry, excluding Janitor processor jobs.
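
If you want to know how many jobs each of those updates will touch before running them, you can swap .update(no_retry=True) for .count() in the same Django shell session (the imports and JOB_CREATED_AT_CUTOFF from above are reused). For example:

# Failed processor jobs (excluding the Janitor) that would be marked no_retry.
ProcessorJob.failed_objects.filter(created_at__gt=JOB_CREATED_AT_CUTOFF).exclude(
    pipeline_applied="JANITOR"
).count()

# Failed downloader jobs that would be marked no_retry.
DownloaderJob.failed_objects.filter(created_at__gt=JOB_CREATED_AT_CUTOFF).count()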

Summarizing failure reasons

If you want to figure out why processor jobs are failing, you can use this command to summarize the failure reasons for recent jobs (you can tweak what "recent" means; here it means the last 10 hours).

from data_refinery_common.models.jobs.processor_job import ProcessorJob
from django.utils import timezone
from datetime import timedelta
from collections import Counter
t = timezone.now() - timedelta(hours=10)

Counter([j.failure_reason for j in ProcessorJob.objects.filter(created_at__gt=t, success=False)])

or for the 5 most common failure reasons:

Counter([j.failure_reason for j in ProcessorJob.objects.filter(created_at__gt=t, success=False)]).most_common(5)

You can also do the same for downloader jobs:

from data_refinery_common.models.jobs.downloader_job import DownloaderJob
from django.utils import timezone
from datetime import timedelta
from collections import Counter
t = timezone.now() - timedelta(hours=10)

Counter([j.failure_reason for j in DownloaderJob.objects.filter(created_at__gt=t, success=False)])
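
If there are a lot of recent jobs, building a Counter in Python means loading every failure_reason into memory; the same summary can be pushed into the database instead. Here is a sketch of the equivalent query using annotate (shown for ProcessorJob; DownloaderJob works the same way):

from datetime import timedelta

from django.db.models import Count
from django.utils import timezone

from data_refinery_common.models.jobs.processor_job import ProcessorJob

t = timezone.now() - timedelta(hours=10)

# Group by failure_reason in the database and return the 5 most common
# failure reasons with their counts, mirroring most_common(5) above.
ProcessorJob.objects.filter(created_at__gt=t, success=False).values(
    "failure_reason"
).annotate(num_jobs=Count("id")).order_by("-num_jobs")[:5]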