
Automate ingest and phylogenetic workflows #52

Merged 5 commits on Apr 4, 2024

Changes from 1 commit
44 changes: 41 additions & 3 deletions .github/workflows/ingest-to-phylogenetic.yaml
@@ -39,11 +39,49 @@ jobs:
ingest/logs/
ingest/.snakemake/log/

# TKTK check if ingest results include new data
# potentially use actions/cache to store Metadata.sha256sum of S3 files
# Check if ingest results include new data by checking for the cache
Member

This relies on the ordering of metadata & sequences to be identical across runs, right? I don't think any of our ingest steps sort the data, and any (default, multi-threaded) call to nextclade run will change the ordering. I also wouldn't think the NCBI data is guaranteed to be in the same order across different dataset / entrez runs, but maybe?

Contributor Author

> This relies on the ordering of metadata & sequences to be identical across runs, right?

Yes! We do not do any explicit sorting, so a change in the order of the upstream data can trigger the phylogenetic workflow. However, we've been using this method for ncov-ingest and mpox with datasets and it's worked pretty well.

> any (default, multi-threaded) call to nextclade run will change the ordering.

Nextclade can potentially change the ordering of aligned.fasta, but not sequences.fasta. I don't think it will have an influence on metadata, since the join of the metadata with the Nextclade output is usually based on metadata order. If we are worried about Nextclade ordering, we can use the --in-order flag to guarantee the output order is the same as the input order. That said, this repo currently does not include nextclade run in the ingest workflow.
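For reference, a minimal sketch of that flag in use; the dataset name and output paths are illustrative, not taken from this repo:

```bash
# Illustrative only: --in-order buffers results so output records match
# the input order even when nextclade runs multi-threaded.
nextclade run \
  --in-order \
  --input-dataset data/zika \
  --output-fasta results/aligned.fasta \
  --output-tsv results/nextclade.tsv \
  sequences.fasta
```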

# of the file with the results' Metadata.sha256sum (which should have been added within upload-to-s3)
# GitHub will remove any cache entries that have not been accessed in over 7 days,
Member

Why upload the data in the previous step and then check against a cache of the just-overwritten S3 files' SHA256? It seems simpler to compute the SHA256 of the local ingest files before they're uploaded, diff against the current latest S3 files' SHA256s, and if they match then the ingest step succeeds with a "no new data" status and later rules don't run. This'd avoid the 7-day cache limitation, as well as avoid adding unnecessary data to a versioned S3 bucket.
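A rough sketch of that flow, assuming the convention from the linked upload-to-s3 script where Metadata.sha256sum records the hash of the raw (pre-compression) file; the local path here is hypothetical:

```bash
# Hash the local ingest output before upload (path is hypothetical)...
local_hash="$(sha256sum ingest/results/metadata.tsv | cut -d ' ' -f 1)"

# ...and compare it against the hash recorded on the current S3 object.
s3_hash="$(aws s3api head-object \
  --bucket nextstrain-data \
  --key files/workflows/zika/metadata.tsv.zst \
  --query Metadata.sha256sum --output text 2>/dev/null || echo none)"

if [[ "$local_hash" == "$s3_hash" ]]; then
  echo "No new data; skip the upload and downstream rules."
fi
```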

Contributor Author

Good question! This is actually already happening as you've described within upload-to-s3. It will only upload the file if the SHA256 does not match.

Using the cache in the GitHub Actions workflow is my work-around for surfacing the "no new data" status without designing some sort of "status" JSON/YAML file.

Member

We discussed this topic a little bit in today's meeting. While upload-to-s3 knows if the file is different ("new data"), this information is encapsulated within the ingest job, and we are using the check-new-data job as a work-around to avoid extracting this information from the previous job. We'll probably revisit this in the future?
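For the record, a hypothetical sketch of the more direct approach (job and step names are invented; in practice ingest calls a reusable workflow, so the output would also need to be plumbed through that workflow's own outputs):

```yaml
# Hypothetical: surface the "new data" status as a job output instead of
# round-tripping it through actions/cache.
jobs:
  ingest:
    runs-on: ubuntu-latest
    outputs:
      new-data: ${{ steps.compare-hashes.outputs.new-data }}
    steps:
      - id: compare-hashes
        run: echo "new-data=true" >> "$GITHUB_OUTPUT"  # set from the hash comparison

  phylogenetic:
    needs: [ingest]
    if: ${{ needs.ingest.outputs.new-data == 'true' }}
```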

# so if the workflow has not been run in over 7 days, it will trigger phylogenetic.
check-new-data:
needs: [ingest]
runs-on: ubuntu-latest
outputs:
cache-hit: ${{ steps.check-cache.outputs.cache-hit }}
steps:
- name: Get sha256sum
id: get-sha256sum
run: |
s3_urls=(
"s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst"
"s3://nextstrain-data/files/workflows/zika/sequences.fasta.zst"
)

# Code below is modified from ingest/upload-to-s3
# https://github.com/nextstrain/ingest/blob/c0b4c6bb5e6ccbba86374d2c09b42077768aac23/upload-to-s3#L23-L29

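# An all-zero hash acts as a sentinel when the S3 object (or its sha256sum metadata) is missing.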
no_hash=0000000000000000000000000000000000000000000000000000000000000000

for s3_url in "${s3_urls[@]}"; do
s3path="${s3_url#s3://}"
bucket="${s3path%%/*}"
key="${s3path#*/}"

s3_hash="$(aws s3api head-object --no-sign-request --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")"
echo "${s3_hash}" >> ingest-output-sha256sum
done

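# `lookup-only` checks whether a cache entry exists without restoring it;
# a hit means this exact set of hashes has been seen before, i.e. no new data.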
- name: Check cache
id: check-cache
uses: actions/cache@v4
with:
path: ingest-output-sha256sum
key: ingest-output-sha256sum-${{ hashFiles('ingest-output-sha256sum') }}
lookup-only: true

phylogenetic:
needs: [ingest]
needs: [check-new-data]
if: ${{ needs.check-new-data.outputs.cache-hit != 'true' }}
permissions:
id-token: write
uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master