Automate ingest and phylogenetic workflows #52
@@ -39,11 +39,49 @@ jobs:
         ingest/logs/
         ingest/.snakemake/log/

-  # TKTK check if ingest results include new data
-  # potentially use actions/cache to store Metadata.sha256sum of S3 files
+  # Check if ingest results include new data by checking for the cache
+  # of the file with the results' Metadata.sha256sum (which should have been added within upload-to-s3)
+  # GitHub will remove any cache entries that have not been accessed in over 7 days,
Review comment: Why upload the data in the previous step and then check against a cache of the just-overwritten S3 files' SHA256? It seems simpler to compute the SHA256 of the local ingest files before they're uploaded, diff against the current latest S3 files' SHA256s, and if they match then the ingest step succeeds with a "no new data" status, and later rules don't run. This would avoid the 7-day-cache limitation, as well as not upload (add) unnecessary data to a versioned S3 bucket.

Reply: Good question! This is actually already happening as you've described within upload-to-s3. Using the cache in the GitHub Actions workflow is my work-around to do this check for the "no new data" status without designing some sort of "status" JSON/YAML file.

Reply: We discussed this topic a little bit in today's meeting. While
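For illustration, a minimal sketch of the comparison described in this thread: hash the local ingest outputs before upload and compare them against the sha256sum metadata stored on the current S3 objects. The local file paths and the "no new data" handling are assumptions for the sketch, not part of this PR, and the stored Metadata.sha256sum is assumed here to be computed on the uncompressed files (mirroring upload-to-s3).

    #!/usr/bin/env bash
    # Sketch only: decide whether ingest produced new data by comparing local
    # results against the sha256sum metadata stored on the current S3 objects.
    set -euo pipefail

    s3_urls=(
      "s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst"
      "s3://nextstrain-data/files/workflows/zika/sequences.fasta.zst"
    )
    # Assumed local paths of the uncompressed ingest results
    local_files=(
      "ingest/results/metadata.tsv"
      "ingest/results/sequences.fasta"
    )

    new_data=false
    for i in "${!s3_urls[@]}"; do
      s3path="${s3_urls[$i]#s3://}"
      bucket="${s3path%%/*}"
      key="${s3path#*/}"

      # sha256sum recorded on the S3 object by upload-to-s3; falls back to empty if head-object fails
      s3_hash="$(aws s3api head-object --no-sign-request --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "")"
      local_hash="$(sha256sum "${local_files[$i]}" | cut -d ' ' -f 1)"

      if [[ "$s3_hash" != "$local_hash" ]]; then
        new_data=true
      fi
    done

    if [[ "$new_data" == false ]]; then
      echo "No new data in ingest results; the downstream phylogenetic build could be skipped."
    fi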
+  # so if the workflow has not been run over 7 days then it will trigger phylogenetic.
+  check-new-data:
+    needs: [ingest]
+    runs-on: ubuntu-latest
+    outputs:
+      cache-hit: ${{ steps.check-cache.outputs.cache-hit }}
+    steps:
+      - name: Get sha256sum
+        id: get-sha256sum
+        run: |
+          s3_urls=(
+            "s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst"
+            "s3://nextstrain-data/files/workflows/zika/sequences.fasta.zst"
+          )
+
+          # Code below is modified from ingest/upload-to-s3
+          # https://github.com/nextstrain/ingest/blob/c0b4c6bb5e6ccbba86374d2c09b42077768aac23/upload-to-s3#L23-L29
+
+          no_hash=0000000000000000000000000000000000000000000000000000000000000000
+
+          for s3_url in "${s3_urls[@]}"; do
+            s3path="${s3_url#s3://}"
+            bucket="${s3path%%/*}"
+            key="${s3path#*/}"
+
+            s3_hash="$(aws s3api head-object --no-sign-request --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")"
+            echo "${s3_hash}" >> ingest-output-sha256sum
+          done
+
+      - name: Check cache
+        id: check-cache
+        uses: actions/cache@v4
+        with:
+          path: ingest-output-sha256sum
+          key: ingest-output-sha256sum-${{ hashFiles('ingest-output-sha256sum') }}
+          lookup-only: true
+
   phylogenetic:
-    needs: [ingest]
+    needs: [check-new-data]
+    if: ${{ needs.check-new-data.outputs.cache-hit != 'true' }}
     permissions:
       id-token: write
     uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master
Review comment: This relies on the ordering of metadata & sequences being identical across runs, right? I don't think any of our ingest steps sort the data, and any (default, multi-threaded) call to nextclade run will change the ordering. I also wouldn't think the NCBI data is guaranteed to be in the same order across different dataset/entrez runs, but maybe?

Reply: Yes! We do not do any explicit sorting, so a change in order from the upstream data can result in triggering the phylogenetic workflow. However, we've been using this method for ncov-ingest and mpox with dataset and it's worked pretty well. Nextclade can potentially change the ordering of aligned.fasta, but not sequences.fasta. I don't think it will have an influence on the metadata, since the joining of the metadata and the Nextclade outputs is usually based on metadata order. If we are worried about Nextclade ordering, we can use the --in-order flag to guarantee the output order is the same as the input order. Although, this repo currently does not include nextclade run in the ingest workflow.
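For reference, a hedged sketch of the --in-order flag mentioned in the reply, in case nextclade run is ever added to ingest; the dataset directory and output paths are illustrative, not taken from this repo.

    # Illustrative only: this repo's ingest workflow does not currently run Nextclade.
    # --in-order makes Nextclade write results in the same order as the input sequences,
    # trading some multi-threaded throughput for a stable, reproducible output order.
    nextclade run \
      --input-dataset zika-dataset/ \
      --output-all results/nextclade/ \
      --in-order \
      results/sequences.fasta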