Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingest: Add build-configs for CI #56

Open
joverlee521 opened this issue Jul 16, 2024 · 8 comments
Open

ingest: Add build-configs for CI #56

joverlee521 opened this issue Jul 16, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@joverlee521
Copy link
Contributor

joverlee521 commented Jul 16, 2024

We already have the build-configs for CI in the phylogenetic workflow, so I think reasonable to add build-configs for CI in the ingest workflow. This will make it simpler for internal team to set up the GH Action workflow using pathogen-repo-ci.

Things to consider

  1. Use flow syntax for a map ({}) of an empty config file (Originally posted by @tsibley in 44f27a5)
  2. Consider adding a standardized way to "subsample" the ingest data since some ingest workflows can run too long for a responsive CI job.
@joverlee521
Copy link
Contributor Author

Thinking out loud on options to "subsample" ingest data.

  1. Instead of fetching from NCBI, start with an example NCBI Dataset ZIP that is a small subset of the data. This is likely not sustainable as we would have to update the example data every time NCBI Datasets updates their schema.
  2. Filter the NCBI Datasets data during fetch using available CLI options, e.g. --geo-location or --released-after. This doesn't guarantee that we will fetch a "small" subset of data, but at least it's not all data.
  3. Fetch the full NCBI Dataset then filter the outputs locally. It's not clear to me that this would reduce run time as the filtering step might take a while?

@victorlin
Copy link
Member

I'm assuming there are two goals of ingest CI:

a. Ensure an update to the ingest workflow works with existing NCBI data
b. Ensure new data from NCBI works with the existing ingest workflow

Re: the 3 options above

  1. I would consider this if (b) is not a high priority. Or if NCBI Datasets only updates there schema occasionally, it might be fine as long as there are other scheduled runs on the full data that would clearly surface the need to update.
  2. The failure modes of (a) should be considered for any filtering. Would it be weird outliers from unknown/new locations? If yes, then --geo-location wouldn't be a good filter.
  3. Is the goal here to apply downsampling rules that aren't available via NCBI Datasets CLI (e.g. random sampling)? It would be good to see where the time bottleneck is. If data download is slow, then this option wouldn't make a difference whereas the CLI options should (?) filter things server-side which would be faster. Also, what tool would be used to do the filtering? I imagine that once all the data is on disk, downsampling should be fast with the right tool.

@joverlee521
Copy link
Contributor Author

Thanks for the thoughts here @victorlin! I agree we care about (a) more to ensure the ingest workflow runs as expected.

@corneliusroemer has raised nextstrain/measles#46 for not relying on external services in CI, which aligns with option [1].

I'll probably port whatever we implement in measles into this repo.

@corneliusroemer
Copy link
Member

Good discussion! I hadn't seen this here before the measles PR.

I'm not sure how often datasets-cli changes their schema. I don't expect this to happen very often, but I might be wrong.

If the zip package/schema changes, we could just start with test files after all ncbi datasets commands to not rely on stability of NCBI somewhat internal API (not sure how internal the downloaded package is, we'll find out).

@joverlee521
Copy link
Contributor Author

Copying @tsibley 's relevant comments from the measles issue:

I think we should not use ncbi datasets in CI, and instead use an archived zip file that is identical to what datasets would produce.

The tradeoff is that we'll have to keep the example NCBI dataset file up-to-date with upstream changes, else we won't be testing what NCBI's actually providing and will drift over time.

@joverlee521
Copy link
Contributor Author

Stepping back to consider the goals of ingest CI that @victorlin astutely pointed out

a. Ensure an update to the ingest workflow works with existing NCBI data
b. Ensure new data from NCBI works with the existing ingest workflow

I think (a) is the more frequent check within a pathogen repo, while (b) is important to test when we update the NCBI datasets version in docker-base/conda-base. I don't thinks there's a simple way to separate those two concerns with the current pathogen-repo-ci workflow.

@joverlee521
Copy link
Contributor Author

joverlee521 commented Jul 19, 2024

From yesterday's dev chat:

Consensus: we are good with caching ingest downloads for CI purposes, particularly when a repo is also automatically trying to rebuild daily and running “live” ingest during that process

@genehack
Copy link
Contributor

Having reviewed this issue again, in light of my recent question in Slack about including Nextclade and geo data into the embedded example-data for ingest CI, I think doing that is most consistent with the reasoning in this issue and the previous comment from Jover about the dev chat summary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants