WIP - allow running workflow from outside of repo #103

jameshadfield · 2024-11-20T01:05:10Z

Following on from nextstrain/measles#55 partly as a learning exercise and partly because I want to use it for avian-flu. The main differences here are

A much more complicated snakemake workflow and config structure
A workflow which expected to be run with --configfile, i.e. there's no "default configuration values". This is how mpox works too, among others.

Tested using two scenarios:

GISAID builds. Running in a separate directory outside of the avian-flu repo. I used a custom-gisaid.yaml config of

extends: gisaid.yaml
segments:
  - pb2

# NOTE - this reveals a (rather big) issue with our config style as this _won't_
# overwrite the builds as we intend it to, it'll merge in (and thus not change anything)
builds:
  h7n9: 
    - all-time

target_sequences_per_tree: 1000 # for testing

And running via

AVIAN_FLU="path/to/avian-flu/repo"
snakemake --snakefile ${AVIAN_FLU}/Snakefile --cores 3 --configfile custom-gisaid.yaml -pf auspice/avian-flu_h7n9_ha_all-time.json

(The explicit target file was needed due to the config structure - see the note in the YAML above - and this makes the config override of segments irrelevant.)

H5N1-cattle-outbreak. Running in a separate directory outside of the avian-flu repo (and different from the GISAID builds above). I used a config.yaml config of

extends: h5n1-cattle-outbreak.yaml
segments:
  - pb2

And ran via snakemake --snakefile ${AVIAN_FLU}/Snakefile --cores 3 -pf

Overall it worked really well. I'll start some threads in the code to continue discussion of various parts.

Snakefile

jameshadfield · 2024-11-27T23:43:11Z

Snakefile

+    Resolve a relative *path* given in a configuration value. Before resolving
+    any '{x}' substrings are replaced by their corresponding wildcards (if the
+    `wildcards` argument is provided).
+
+    Search order (first match returned):
+    1. Relative to the analysis directory
+    2. Relative to the directory the entry snakefile was in. Typically this
+       is not the Snakefile you are looking at now but (e.g.) the one in
+       avian-flu/gisaid
+    3. Relative to where this Snakefile is (i.e. `avian-flu/`)


@joverlee521 if we do implement a search order like this it opens up the possibility of moving the "default" files from (e.g.) ./config/ to ./gisaid/ or ./gisaid/defaults/. Similarly for custom rules. If we go this direction I think it warrants a rethink of how we want to structure repos.

Snakefile

jameshadfield · 2024-12-02T21:18:26Z

Snakefile

+class InvalidConfigError(Exception):
+    pass
+
+def resolve_config_path(original_path, wildcards=None):


Reminder to me: need to resolve any custom rules using this (?) function, currently they're just

for rule_file in config.get('custom_rules', []): include: rule_file

(See https://github.com/nextstrain/measles/blob/6584e68622dd305592b98d8630fa94e191b4f164/ingest/Snakefile#L53-L59)

There's a decision for us to make here which wasn't needed in measles. The h5n1-cattle-outbreak config uses a custom rule which needs to be searched for within the avian-flu repo. Other workflows use a similar approach for CI - e.g. zika.

Approach 1: Use the same resolving approach as resolve_config_path to search for rulefiles.

Approach 2: Use the approach of measles (custom rules must be relative to the working directory) and replace our usage of custom_rules with a "include" directive within h5n1-cattle-outbreak/Snakefile.

There are pros and cons to each I think. Given that we use custom_rules in CI, approach 1 is nice as it'll allow CI to be run in a separate directory as desired. Approach 2 also has the "gotcha" where you develop a workflow with custom-rules within avian-flu, it all works, but then it won't work when you run it in another directory. There's also the bigger picture issue where if multiple configfiles define custom_rules only one of them will be used because they're lists.

Given the above I think using the search order from Approach 1 is the best, but we should also reference cattle-flu.smk within h5n1-cattle-outbreak/Snakefile so that overlays with custom_rules don't have to specify it.

Hmm. It feels strange to be letting "custom" rule files be sourced from the workflow source, i.e. "stock custom rules", which seems contradictory.

This is why, FWIW, in measles I intentionally only resolved custom rules files based on the working analysis directory.

For the current "build-configs" pattern like CI or our own automation, I'd think to convert those to analysis directories. I didn't demo this for measles to keep changes to a minimum, but it's an easy change and keeps our interfaces consistent.

It feels strange to be letting "custom" rule files be sourced from the workflow source, i.e. "stock custom rules", which seems contradictory

Yeah I see that. I guess a clarifying question is "do we want to allow all workflows to be run from an external analysis directory". If that's the case, and I think it probably is, then we either avoid custom_rules entirely, expand the interface to have something like workflow_rules, or shift our usage of custom rules to their own (sub-)workflows (concrete example). Is the latter what you mean by "I'd think to convert those to analysis directories"?

If we use different workflows for CI rather than "custom_rules" as we do now, then a cattle-outbreak CI might look like avian-flu/phylogenetic/ci/h5n1-cattle-outbreak/Snakefile which would include: ../../h5n1-cattle-outbreak/Snakefile which would include: "../Snakefile". I haven't really though through how inheritance like this will work in practice.

cc @joverlee521

we're answering yes, but saying that CI isn't a "workflow" in and of itself so it's ok that it can't be run from an external analysis directory?

I think we have to cut off at some point. The CI seems like a very Nextstrain/workflow developer specific config that would not need to be run by an external user. If someone really wanted to, they can copy the contents of the CI directory to their own analysis directory.

If someone really wanted to

Just for the absence of doubt, here I'm saying that someone is us as it relates to how we run CI (on GitHub actions). Currently we check out the repo and run with --configfile <ci.yaml>. With these changes we would be doing the same, but with a slightly different invocation syntax, e.g. as Tom sketched out above. But we wouldn't be using an approach where we ran CI in a separate analysis directory with nextstrain run ... as if it were just another workflow. As I said, I'm not involved much in the CI space, but just wanted to explain myself fully here.

Ah gotcha. Yeah, for CI I'd think we'd want to test the source code so it'd be running nextstrain build.

Thanks @joverlee521!

I've updated the PR to search for custom rules relative to the analysis (working) directory. The h5n1-cattle-outbreak/Snakemake workflow now imports it's own "custom" rules as expected.

I haven't moved the "base" Snakefile to rules/base.smk, but this PR is probably the place to do it if we want to.

@jameshadfield: But we wouldn't be using an approach where we ran CI in a separate analysis directory with nextstrain run ... as if it were just another workflow.

@joverlee521: Yeah, for CI I'd think we'd want to test the source code so it'd be running nextstrain build.

We could have this either way. It's very doable to use nextstrain run for this, and we might want to do so in order for it to be an instructive example/regularly test that bit of interface.

tsibley

Submitting comments-in-progress as I'd started reviewing this WIP last week, then the holiday happened and I didn't click submit. (I realize the WIP here has changed again since that time too.)

Although, it seems like some comments I remember writing got lost… ugh.

Snakefile

See <https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1732568407123369> for context. Able to be run via a number of different ways: - From the 'avian-flu' repo: - `snakemake -s gisaid/Snakefile ...` - `cd gisaid && snakemake ...` - From a separate analysis directory, where ${AVIAN_FLU} is the path to the (locally checked out) avian-flu repo - without any config overlays: `snakemake -s ${AVIAN_FLU}/gisaid/Snakefile` - with a `config.yaml` overlay: (same as above) - with a `foo.yaml` overlay: `snakemake -s ${AVIAN_FLU}/gisaid/Snakefile --configfile foo.yaml`

Shifts to a concept where custom-rules are only for use in analysis directories, and the custom snakemake file is sourced relative to that working directory. See <#103 (comment)> for more discussion about the benefits and limitations of this. The cattle-outbreak workflow (`h5n1-cattle-outbreak/Snakefile`) now directly imports the rules it needs rather than using the custom-rules machinery.

By having all phylogenetic workflows start from two lists of inputs (`config.inputs`, `config.additional_inputs`) we enable a broad range of uses with a consistent interface. 1. Using local ingest files is trivial (see added docs) and doesn't need a bunch of special-cased logic that is prone to falling out of date (as it had indeed done) 2. Adding extra / private data follows the similar pattern, with an additional config list being used so that we are explicit that the new data is additional and enforce an ordering which is needed for predictable `augur merge` behaviour. The canonical data can be removed / replaced via step (1) if needed. I considered adding additional data after the subtype-filtering step, which would avoid the need to add subtype in the metadata but requires encoding this in the config overlay. I felt the chosen way was simpler and more powerful. Note that this workflow uses an old version of the CI workflow, <https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240> which copies `example_data`. We could upgrade to the latest version and use a config overlay to swap out the canonical inputs with the example data. Note that one of the side effects of the current implementation is that merged inputs use the same filepath irrespective of the workflow. For instance, both gisaid & h5n1-cattle-outbreak use the intermediate path `results/metadata_merged.tsv`, which means it's not possible to maintain runs of both those analysis concurrently if both were to use merged inputs. Using separate analysis directories, e.g. <#103> will help avoid this shortcoming.

Snakefile

genehack · 2024-12-16T20:08:05Z

Snakefile

+    Resolve a relative *path* given in a configuration value. Returns a
+    function which takes a single argument *wildcards* and returns the resolved
+    path with any '{x}' substrings are replaced by their corresponding wildcards
+    filled in


Couple miscellaneous/nitpicky things here:

Might be worth noting in the docstring that the wildcard expansion is via Snakemake's expand()

The docstring and commentary within the method are very specific about paths being relative — might be a good idea to add a guard for that? Right now if you pass in an absolute path, as long as there's a file at that path, this function will pass (the expanded version of) that path right back to you — and maybe that's okay, or even desired! - but then that deserves some sort of acknowledgement in the docs

Might be nice to let path to be a pathlib object in addition to a string; especially because right now the error message you get back is going to be very confusing, given how pathlib objects stringify

tsibley · 2024-12-17T17:20:21Z

Snakefile

+    Search order (first match returned):
+    1. Relative to the analysis directory
+    2. Relative to the directory the entry snakefile was in. Typically this is
+       not the Snakefile you are looking at now but (e.g.) the one in
+       avian-flu/gisaid
+    3. Relative to where this Snakefile is (i.e. `avian-flu/`)


I've casually skimmed thru parts of this PR but haven't actually reviewed it. I wanted to start discussion early though on this search order. I think it would be better to have a single fallback, e.g. a search order with only two locations: the working analysis directory and a fixed directory based on the workflow source. That is, we'd pick one of workflow.basedir or workflow.current_basedir and use that one consistently, not allow either. The more search paths, the harder the behaviour is to explain, and the easier it is to get tripped up in a tangle you didn't expect as paths in different parts of the repo collide.

jameshadfield commented Nov 20, 2024

View reviewed changes

Snakefile Outdated Show resolved Hide resolved

jameshadfield mentioned this pull request Nov 20, 2024

wip! workflows as programs prototyping nextstrain/measles#55

Draft

1 task

jameshadfield force-pushed the james/run-outside-of-repo branch 2 times, most recently from e6b070d to 275f585 Compare November 22, 2024 02:39

This was referenced Nov 22, 2024

Rewrite config syntax #104

Draft

work needed to allow users to run analyses separate to pathogen workflows nextstrain/pathogen-repo-guide#74

Open

jameshadfield commented Nov 27, 2024

View reviewed changes

jameshadfield force-pushed the james/run-outside-of-repo branch 3 times, most recently from ad0f15c to 90b994c Compare November 28, 2024 03:12

jameshadfield commented Dec 2, 2024

View reviewed changes

Snakefile Outdated Show resolved Hide resolved

jameshadfield commented Dec 2, 2024

View reviewed changes

tsibley reviewed Dec 3, 2024

View reviewed changes

Snakefile Outdated Show resolved Hide resolved

Snakefile Outdated Show resolved Hide resolved

tsibley reviewed Dec 3, 2024

View reviewed changes

Snakefile Outdated Show resolved Hide resolved

jameshadfield force-pushed the james/run-outside-of-repo branch from 90b994c to 5c81b87 Compare December 8, 2024 22:57

jameshadfield added 3 commits December 12, 2024 16:16

WIP - allow running workflow from outside of repo

3858810

jameshadfield force-pushed the james/run-outside-of-repo branch from 5c81b87 to 0bc582d Compare December 12, 2024 03:55

jameshadfield mentioned this pull request Dec 16, 2024

Multiple inputs / overriding inputs #112

Open

genehack reviewed Dec 16, 2024

View reviewed changes

tsibley reviewed Dec 17, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP - allow running workflow from outside of repo #103

WIP - allow running workflow from outside of repo #103

jameshadfield commented Nov 20, 2024

jameshadfield Nov 27, 2024 •

edited

Loading

jameshadfield Dec 2, 2024

jameshadfield Dec 3, 2024

jameshadfield Dec 8, 2024

tsibley Dec 10, 2024

jameshadfield Dec 10, 2024

joverlee521 Dec 11, 2024

jameshadfield Dec 12, 2024

joverlee521 Dec 12, 2024

jameshadfield Dec 12, 2024

tsibley Dec 16, 2024

tsibley left a comment •

edited

Loading

genehack Dec 16, 2024

tsibley Dec 17, 2024

WIP - allow running workflow from outside of repo #103

Are you sure you want to change the base?

WIP - allow running workflow from outside of repo #103

Conversation

jameshadfield commented Nov 20, 2024

jameshadfield Nov 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tsibley left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jameshadfield Nov 27, 2024 •

edited

Loading

tsibley left a comment •

edited

Loading