Automate ingest and phylogenetic workflows #52
@@ -0,0 +1,139 @@
name: Ingest to phylogenetic

defaults:
  run:
    # This is the same as GitHub Actions' bash keyword as of 20 June 2023:
    # https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsshell
    #
    # Completely spelling it out here so that GitHub can't change it out from under us
    # and we don't have to refer to the docs to know the expected behavior.
    shell: bash --noprofile --norc -eo pipefail {0}
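    # An added gloss on those flags, for reference: --noprofile/--norc skip
    # bash startup files, -e exits on the first failing command, and -o
    # pipefail makes a pipeline fail if any command in it fails, not just the
    # last one. For example, under this shell a step running
    #
    #   false | wc -l
    #
    # fails because pipefail surfaces the failure of "false", whereas under
    # GitHub's default shell (bash -e {0}, no pipefail) the step would pass
    # since wc -l, the last command in the pipeline, exits 0.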

on:
  schedule:
    # Note times are in UTC, which is 1 or 2 hours behind CET depending on daylight savings.
    #
    # Note the actual runs might be late.
    # Numerous people have been confused about that, including me:
    # - https://github.community/t/scheduled-action-running-consistently-late/138025/11
    # - https://github.com/github/docs/issues/3059
    #
    # Note, '*' is a special character in YAML, so you have to quote this string.
    #
    # Docs:
    # - https://docs.github.com/en/actions/learn-github-actions/events-that-trigger-workflows#schedule
    #
    # Tool that deciphers this particular format of crontab string:
    # - https://crontab.guru/
    #
    # Runs at 4pm UTC (12pm EDT) since curation by NCBI happens on the East Coast.
    - cron: '0 16 * * *'
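    # For reference, the five fields of the string above read as:
    #
    #   0 16 * * *
    #   │ │  │ │ └ day of week (any)
    #   │ │  │ └── month (any)
    #   │ │  └──── day of month (any)
    #   │ └─────── hour (16:00 UTC)
    #   └───────── minute (0)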

  workflow_dispatch:
    inputs:
      ingest_image:
        description: 'Specific container image to use for ingest workflow (will override the default of "nextstrain build")'
        required: false
      phylogenetic_image:
        description: 'Specific container image to use for phylogenetic workflow (will override the default of "nextstrain build")'
        required: false
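    # A manual run can also be kicked off from the command line with the
    # GitHub CLI. A sketch (the workflow file name and image tag below are
    # illustrative assumptions, not taken from this PR):
    #
    #   gh workflow run ingest-to-phylogenetic.yaml \
    #     -f ingest_image=nextstrain/base:branch-new-feature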

jobs:
  ingest:
    permissions:
      id-token: write
    uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master
    secrets: inherit
    with:
      # Starting with the default docker runtime.
      # We can migrate to AWS Batch when/if we need more resources or if
      # the job runs longer than the GH Actions limit of 6 hours.
      runtime: docker
      env: |
        NEXTSTRAIN_DOCKER_IMAGE: ${{ inputs.ingest_image }}
      run: |
        nextstrain build \
          --env AWS_ACCESS_KEY_ID \
          --env AWS_SECRET_ACCESS_KEY \
          ingest \
            upload_all \
            --configfile build-configs/nextstrain-automation/config.yaml
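      # The same targets can be exercised locally before trusting the
      # schedule; a sketch (assumes AWS credentials in the environment and
      # passes Snakemake's --dryrun through to preview the rules to run):
      #
      #   nextstrain build ingest upload_all \
      #     --configfile build-configs/nextstrain-automation/config.yaml \
      #     --dryrun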
      # Specifying artifact name to differentiate ingest build outputs from
      # the phylogenetic build outputs
      artifact-name: ingest-build-output
      artifact-paths: |
        ingest/results/
        ingest/benchmarks/
        ingest/logs/
        ingest/.snakemake/log/

  # Check if ingest results include new data by checking for the cache
  # of the file with the results' Metadata.sha256sum (which should have been added within upload-to-s3).
  # GitHub will remove any cache entries that have not been accessed in over 7 days,
  # so if the workflow has not been run in over 7 days then it will trigger phylogenetic.

[Review thread on this check]

Reviewer: Why upload the data in the previous step and then check against a cache of the just-overwritten S3 files' SHA256? It seems simpler to compute the SHA256 of the local ingest files before they're uploaded, diff against the current latest S3 files' SHA256s, and if they match then the ingest step succeeds with "no new data" status, and later rules don't run. This'd avoid the 7-day-cache limitation, as well as not upload (add) unnecessary data to a versioned S3 bucket.

Author: Good question! This is actually already happening as you've described within upload-to-s3. Using the cache in the GitHub Actions workflow is my work-around to do this check for "no new data" status without designing some sort of "status" JSON/YAML file.

Reviewer: We discussed this topic a little bit in today's meeting. While […]
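To make the reviewer's alternative concrete, a pre-upload check could look roughly like this (editor's sketch, not code from this PR; it assumes, per upload-to-s3's convention, that the recorded sha256sum is computed over the uncompressed content):

    # Compare the local ingest output against the hash recorded on the
    # current S3 object, and bail out with "no new data" before uploading.
    local_hash="$(sha256sum ingest/results/metadata.tsv | cut -d ' ' -f 1)"
    s3_hash="$(aws s3api head-object \
      --bucket nextstrain-data \
      --key files/workflows/zika/metadata.tsv.zst \
      --query Metadata.sha256sum --output text)"
    if [[ "$local_hash" == "$s3_hash" ]]; then
      echo "No new data; skipping upload." >&2
      exit 0
    fi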

  check-new-data:
    needs: [ingest]
    runs-on: ubuntu-latest
    outputs:
      cache-hit: ${{ steps.check-cache.outputs.cache-hit }}
    steps:
      - name: Get sha256sum
        id: get-sha256sum
        run: |
          s3_urls=(
            "s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst"
            "s3://nextstrain-data/files/workflows/zika/sequences.fasta.zst"
          )

          # Code below is modified from ingest/upload-to-s3
          # https://github.com/nextstrain/ingest/blob/c0b4c6bb5e6ccbba86374d2c09b42077768aac23/upload-to-s3#L23-L29

          no_hash=0000000000000000000000000000000000000000000000000000000000000000

          for s3_url in "${s3_urls[@]}"; do
            # Split s3://<bucket>/<key> into its bucket and key parts.
            s3path="${s3_url#s3://}"
            bucket="${s3path%%/*}"
            key="${s3path#*/}"

            # Read the sha256sum recorded in the object's S3 metadata at
            # upload time, falling back to a placeholder if it's missing.
            s3_hash="$(aws s3api head-object --no-sign-request --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")"
            echo "${s3_hash}" >> ingest-output-sha256sum
          done

      - name: Check cache
        id: check-cache
        uses: actions/cache@v4
        with:
          path: ingest-output-sha256sum
          key: ingest-output-sha256sum-${{ hashFiles('ingest-output-sha256sum') }}
          lookup-only: true
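          # Added note on the mechanism here: with lookup-only, this step only
          # checks whether an entry with this key exists and skips the
          # download, so cache-hit is 'true' exactly when these hashes have
          # been seen before (i.e. no new data). On a miss, the job's post
          # step saves the new hash file under the new key, recording the
          # updated state for future runs.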

  phylogenetic:
    needs: [check-new-data]
    if: ${{ needs.check-new-data.outputs.cache-hit != 'true' }}
    permissions:
      id-token: write
    uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master
    secrets: inherit
    with:
      # Starting with the default docker runtime.
      # We can migrate to AWS Batch when/if we need more resources or if
      # the job runs longer than the GH Actions limit of 6 hours.
      runtime: docker
      env: |
        NEXTSTRAIN_DOCKER_IMAGE: ${{ inputs.phylogenetic_image }}
      run: |
        nextstrain build \
          --env AWS_ACCESS_KEY_ID \
          --env AWS_SECRET_ACCESS_KEY \
          phylogenetic \
            deploy_all \
            --configfile build-configs/nextstrain-automation/config.yaml
      # Specifying artifact name to differentiate the phylogenetic build
      # outputs from the ingest build outputs
      artifact-name: phylogenetic-build-output
      artifact-paths: |
        phylogenetic/auspice/
        phylogenetic/results/
        phylogenetic/benchmarks/
        phylogenetic/logs/
        phylogenetic/.snakemake/log/

[Review thread on the sha256sum comparison]

Reviewer: This relies on the ordering of metadata & sequences being identical across runs, right? I don't think any of our ingest steps sort the data, and any (default, multi-threaded) call to nextclade run will change the ordering. I also wouldn't think the NCBI data is guaranteed to be in the same order across different dataset/entrez runs, but maybe?

Author: Yes! We do not do any explicit sorting, so a change in order from upstream data can result in triggering the phylogenetic workflow. However, we've been using this method for ncov-ingest and mpox with dataset and it's worked pretty well. Nextclade can potentially change the ordering of aligned.fasta, but not sequences.fasta. I don't think it will influence the metadata, since the joining of the metadata and Nextclade outputs is usually based on metadata order. If we are worried about Nextclade order, we can use the --in-order flag to guarantee output order is the same as input order. That said, this repo currently does not include nextclade run in the ingest workflow.
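For illustration, pinning Nextclade's output order takes just the flag mentioned above. A sketch (the dataset directory and file paths are placeholders, and as noted this repo does not currently run Nextclade during ingest):

    # Keep output records in the same order as the input FASTA so that
    # downstream sha256sums are stable across runs.
    nextclade run \
      --in-order \
      --input-dataset zika-dataset/ \
      --output-fasta results/aligned.fasta \
      results/sequences.fasta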