git subrepo clone (merge) https://github.com/nextstrain/ingest ingest…
…/vendored

subrepo:
  subdir:   "ingest/vendored"
  merged:   "a0faef5"
upstream:
  origin:   "https://github.com/nextstrain/ingest"
  branch:   "main"
  commit:   "a0faef5"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"
kimandrews committed Jan 18, 2024
1 parent 9fba460 commit 5eb59ea
Showing 26 changed files with 1,256 additions and 0 deletions.
3 changes: 3 additions & 0 deletions ingest/vendored/.cramrc
@@ -0,0 +1,3 @@
[cram]
shell = /bin/bash
indent = 2
16 changes: 16 additions & 0 deletions ingest/vendored/.github/pull_request_template.md
@@ -0,0 +1,16 @@
### Description of proposed changes

<!-- What is the goal of this pull request? What does this pull request change? -->

### Related issue(s)

<!-- Link any related issues here. -->

### Checklist

<!-- Make sure checks are successful at the bottom of the PR. -->

- [ ] Checks pass
- [ ] If adding a script, add an entry for it in the README.

<!-- 🙌 Thank you for contributing to Nextstrain! ✨ -->
23 changes: 23 additions & 0 deletions ingest/vendored/.github/workflows/ci.yaml
@@ -0,0 +1,23 @@
name: CI

on:
  push:
    branches:
      - main
  pull_request:
  workflow_dispatch:

jobs:
  shellcheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: nextstrain/.github/actions/shellcheck@master

  cram:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
      - run: pip install cram
      - run: cram tests/
12 changes: 12 additions & 0 deletions ingest/vendored/.gitrepo
@@ -0,0 +1,12 @@
; DO NOT EDIT (unless you know what you are doing)
;
; This subdirectory is a git "subrepo", and this file is maintained by the
; git-subrepo command. See https://github.com/ingydotnet/git-subrepo#readme
;
[subrepo]
remote = https://github.com/nextstrain/ingest
branch = main
commit = a0faef53a0c6e7cc4057209454ef0852875dc3a9
parent = 9fba460d1db4d677615d3826cdf061aeffde71a7
method = merge
cmdver = 0.4.6
6 changes: 6 additions & 0 deletions ingest/vendored/.shellcheckrc
@@ -0,0 +1,6 @@
# Use of this file requires Shellcheck v0.7.0 or newer.
#
# SC2064 - We intentionally want variables to expand immediately within traps
# so the trap can not fail due to variable interpolation later.
#
disable=SC2064
140 changes: 140 additions & 0 deletions ingest/vendored/README.md
@@ -0,0 +1,140 @@
# ingest

Shared internal tooling for pathogen data ingest. Used by our individual
pathogen repos which produce Nextstrain builds. Expected to be vendored by
each pathogen repo using `git subrepo`.

Some tools may only live here temporarily before finding a permanent home in
`augur curate` or Nextstrain CLI. Others may happily live out their days here.

## Vendoring

Nextstrain-maintained pathogen repos will use [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to vendor ingest scripts.
(See the discussion of this decision in https://github.com/nextstrain/ingest/issues/3.)

For a list of Nextstrain repos that are currently using this method, use [this
GitHub code search](https://github.com/search?type=code&q=org%3Anextstrain+subrepo+%22remote+%3D+https%3A%2F%2Fgithub.com%2Fnextstrain%2Fingest%22).

If you don't already have `git subrepo` installed, follow the [git subrepo installation instructions](https://github.com/ingydotnet/git-subrepo#installation).
Then add the latest ingest scripts to the pathogen repo by running:

```
git subrepo clone https://github.com/nextstrain/ingest ingest/vendored
```

Any future updates of ingest scripts can be pulled in with:

```
git subrepo pull ingest/vendored
```

If you run into merge conflicts and would like to pull in a fresh copy of the
latest ingest scripts, pull with the `--force` flag:

```
git subrepo pull ingest/vendored --force
```

> **Warning**
> Beware of rebasing/dropping the parent commit of a `git subrepo` update.

`git subrepo` relies on metadata in the `ingest/vendored/.gitrepo` file,
which includes the hash for the parent commit in the pathogen repos.
If this hash no longer exists in the commit history, there will be errors when
running future `git subrepo pull` commands.

If you run into an error similar to the following:
```
$ git subrepo pull ingest/vendored
git-subrepo: Command failed: 'git branch subrepo/ingest/vendored '.
fatal: not a valid object name: ''
```
Check the parent commit hash in the `ingest/vendored/.gitrepo` file and make
sure the commit exists in the commit history. Update to the appropriate parent
commit hash if needed.
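
The check described above could be scripted roughly as follows. This is a minimal sketch, not part of the vendored tooling: `check_subrepo_parent` is a hypothetical helper name, and it assumes the `.gitrepo` file can be read with `git config --file` because it uses git-config syntax.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of git-subrepo or this repo):
# verify that the parent commit recorded in .gitrepo is still in HEAD's history.
check_subrepo_parent() {
  local gitrepo_file="${1:-ingest/vendored/.gitrepo}"
  local parent

  # Read the `parent` key from the [subrepo] section of the file.
  parent="$(git config --file "$gitrepo_file" subrepo.parent)" || {
    echo "No subrepo.parent entry found in $gitrepo_file" >&2
    return 2
  }

  # Exit status 0 means the parent commit is an ancestor of HEAD.
  if git merge-base --is-ancestor "$parent" HEAD 2>/dev/null; then
    echo "OK: parent $parent is in the history of HEAD"
  else
    echo "Missing: parent $parent is not in the history of HEAD" >&2
    return 1
  fi
}
```

If the check fails, one option is to point the entry at a commit that does exist, for example with `git config --file ingest/vendored/.gitrepo subrepo.parent <hash>`, where `<hash>` is a placeholder for the appropriate parent commit.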

## History

Much of this tooling originated in
[ncov-ingest](https://github.com/nextstrain/ncov-ingest) and was passaged thru
[mpox's ingest/](https://github.com/nextstrain/mpox/tree/@/ingest/). It
subsequently proliferated from [mpox][] to other pathogen repos ([rsv][],
[zika][], [dengue][], [hepatitisB][], [forecasts-ncov][]) primarily thru
copying. To [counter that
proliferation](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079),
this repo was made.

[mpox]: https://github.com/nextstrain/mpox
[rsv]: https://github.com/nextstrain/rsv
[zika]: https://github.com/nextstrain/zika/pull/24
[dengue]: https://github.com/nextstrain/dengue/pull/10
[hepatitisB]: https://github.com/nextstrain/hepatitisB
[forecasts-ncov]: https://github.com/nextstrain/forecasts-ncov

## Elsewhere

The creation of this repo, in both the abstract and concrete, and the general
approach to "ingest" has been discussed in various internal places, including:

- https://github.com/nextstrain/private/issues/59
- @joverlee521's [workflows document](https://docs.google.com/document/d/1rLWPvEuj0Ayc8MR0O1lfRJZfj9av53xU38f20g8nU_E/edit#heading=h.4g0d3mjvb89i)
- [5 July 2023 Slack thread](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079)
- [6 July 2023 team meeting](https://docs.google.com/document/d/1FPfx-ON5RdqL2wyvODhkrCcjgOVX3nlXgBwCPhIEsco/edit)
- _…many others_

## Scripts

Scripts for supporting ingest workflow automation that don’t really belong in any of our existing tools.

- [notify-on-diff](notify-on-diff) - Send Slack message with diff of a local file and an S3 object
- [notify-on-job-fail](notify-on-job-fail) - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch
- [notify-on-job-start](notify-on-job-start) - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch
- [notify-on-record-change](notify-on-record-change) - Send Slack message with details about line count changes for a file compared to an S3 object's metadata `recordcount`.
  If the S3 object's metadata does not have `recordcount`, then will attempt to download S3 object to count lines locally, which only supports `xz` compressed S3 objects.
- [notify-slack](notify-slack) - Send message or file to Slack
- [s3-object-exists](s3-object-exists) - Used to prevent 404 errors during S3 file comparisons in the notify-* scripts
- [trigger](trigger) - Triggers downstream GitHub Actions via the GitHub API using repository_dispatch events.
- [trigger-on-new-data](trigger-on-new-data) - Triggers downstream GitHub Actions if the provided `upload-to-s3` outputs do not contain the `identical_file_message`.
  A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated.

NCBI interaction scripts that are useful for fetching public metadata and sequences.

- [fetch-from-ncbi-entrez](fetch-from-ncbi-entrez) - Fetch metadata and nucleotide sequences from [NCBI Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25501/) and output to a GenBank file.
  Useful for pathogens with metadata and annotations in custom fields that are not part of the standard [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/) outputs.

Historically, some pathogen repos used the undocumented NCBI Virus API through [fetch-from-ncbi-virus](https://github.com/nextstrain/ingest/blob/c97df238518171c2b1574bec0349a55855d1e7a7/fetch-from-ncbi-virus) to fetch data. However, we've opted to drop the NCBI Virus scripts due to https://github.com/nextstrain/ingest/issues/18.

Potential Nextstrain CLI scripts

- [sha256sum](sha256sum) - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts.
- [cloudfront-invalidate](cloudfront-invalidate) - CloudFront invalidation is already supported in the [nextstrain remote command for S3 files](https://github.com/nextstrain/cli/blob/a5dda9c0579ece7acbd8e2c32a4bbe95df7c0bce/nextstrain/cli/remote/s3.py#L104).
  This exists as a separate script to support CloudFront invalidation when using the upload-to-s3 script.
- [upload-to-s3](upload-to-s3) - Upload file to AWS S3 bucket with compression based on file extension in S3 URL.
  Skips upload if the local file's hash is identical to the S3 object's metadata `sha256sum`.
  Adds the following user defined metadata to uploaded S3 object:
    - `sha256sum` - hash of the file generated by [sha256sum](sha256sum)
    - `recordcount` - the line count of the file
- [download-from-s3](download-from-s3) - Download file from AWS S3 bucket with decompression based on file extension in S3 URL.
  Skips download if the local file already exists and has a hash identical to the S3 object's metadata `sha256sum`.
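
The "skip when the hashes match" behavior described for upload-to-s3 and download-from-s3 can be illustrated with a small sketch. The function names `local_sha256` and `should_upload` are hypothetical and are not the vendored scripts; the remote hash is passed in as a plain argument so the comparison can be shown without any AWS calls.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the vendored scripts may work differently.

# Print the SHA-256 hex digest of a file.
# Uses coreutils `sha256sum` when available, otherwise `shasum -a 256`.
local_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d' ' -f1
  else
    shasum -a 256 "$1" | cut -d' ' -f1
  fi
}

# Succeed (exit 0) when an upload is needed.
#   $1 = path to the local file
#   $2 = `sha256sum` value from the S3 object's metadata (empty if absent)
should_upload() {
  local local_hash remote_hash="$2"
  local_hash="$(local_sha256 "$1")"
  [[ "$local_hash" != "$remote_hash" ]]
}
```

One way to obtain the remote value, assuming the AWS CLI is installed and configured, is `aws s3api head-object --bucket <bucket> --key <key> --query Metadata.sha256sum --output text`, with `<bucket>` and `<key>` as placeholders.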

Potential augur curate scripts

- [apply-geolocation-rules](apply-geolocation-rules) - Applies user curated geolocation rules to NDJSON records
- [merge-user-metadata](merge-user-metadata) - Merges user annotations with NDJSON records
- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.'
- [transform-field-names](transform-field-names) - Rename fields of NDJSON records
- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/)
- [transform-strain-names](transform-strain-names) - Ordered search for strain names across several fields.
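
As an illustration of the `location` pattern handled by transform-genbank-location, the sketch below splits a value into country, region, and locality. It is not the vendored script: `trim` and `parse_genbank_location` are hypothetical names, the output format (tab-separated) is chosen only for the example, and the sketch assumes a comma is only meaningful after a colon.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the vendored transform-genbank-location script.

# Strip leading and trailing whitespace from a string.
trim() {
  local s="$1"
  s="${s#"${s%%[![:space:]]*}"}"
  s="${s%"${s##*[![:space:]]}"}"
  printf '%s' "$s"
}

# Split "<country_value>[:<region>][, <locality>]" into three
# tab-separated fields: country, region, locality.
parse_genbank_location() {
  local location="$1" country rest region="" locality=""
  country="${location%%:*}"        # text before the first colon
  if [[ "$location" == *:* ]]; then
    rest="${location#*:}"          # text after the first colon
    region="${rest%%,*}"           # text before the first comma
    if [[ "$rest" == *,* ]]; then
      locality="${rest#*,}"        # text after the first comma
    fi
  fi
  printf '%s\t%s\t%s\n' "$(trim "$country")" "$(trim "$region")" "$(trim "$locality")"
}
```

For example, `parse_genbank_location 'USA: California, San Diego'` prints `USA`, `California`, and `San Diego` separated by tabs, while a value with no colon yields only the country field.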

## Software requirements

Some scripts may require Bash ≥4. If you are running these scripts on macOS, the builtin Bash (`/bin/bash`) does not meet this requirement. You can install [Homebrew's Bash](https://formulae.brew.sh/formula/bash) which is more up to date.
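
A script that depends on newer Bash features could guard against an older interpreter with a check along these lines. This is a sketch, and `require_bash_major` is a hypothetical helper name rather than something provided by this repo.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: fail early when the running Bash is too old.
# BASH_VERSINFO[0] holds the major version of the current Bash.
require_bash_major() {
  local required="$1"
  if (( BASH_VERSINFO[0] < required )); then
    echo "Bash >= ${required} is required, found ${BASH_VERSION}" >&2
    return 1
  fi
}
```

Calling `require_bash_major 4 || exit 1` near the top of a script would stop it with a readable message instead of a syntax error further down.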

## Testing

Most scripts are untested within this repo, relying on "testing in production". That is the only practical testing option for some scripts such as the ones interacting with S3 and Slack.

For more locally testable scripts, Cram-style functional tests live in `tests` and are run as part of CI. To run these locally:

1. Download Cram: `pip install cram`
2. Run the tests: `cram tests/`