-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snakemake improvements #70
Comments
In principle I agree with this, but want to be explicit that is will be a significant change in how builds are run. Instead of the current single invocation of As an alternative, we could potentially follow seasonal-flu's example of one config for all 4 subtypes, but that makes the config file more complex...
Unfortunately, it's not possible to have an empty wildcard in Snakemake rule all:
input: expand("output{test_wildcard}.txt", test_wildcard=["_a", "_b", ""])
rule a:
output: touch("output{test_wildcard}.txt") $ nextstrain build snakemake/empty-wildcard/ --forceall
Building DAG of jobs...
MissingInputException in rule all in file /nextstrain/build/Snakefile, line 1:
Missing input files for rule all:
affected files:
output.txt Since we are adding segment builds for the cattle-outbreak, I wonder if the cattle-outbreak builds should be nested under the canonical h5n1 builds. Then |
We can do it, but I haven't thought through all the ramifications. Add the following to your minimal Snakefile:
Oh! Yeah, that may be a dealbreaker |
Part of simplifications described in <#70> This approach avoids using an empty wildcard for time but introduces the need to have a separate rule to map auspice datasets (which won't have 'time' for h5n1-cattle-flu) to a filename which does include a time wildcard.
Part of simplifications described in <#70> This approach avoids using an empty wildcard for time but introduces the need to have a separate rule to map the target Auspice dataset JSON filename (e.g. `auspice/avian-flu_h5n1-cattle-outbreak_pb2.json`) to a results file with a time wildcard (e.g. `results/avian-flu_h5n1-cattle-outbreak_pb2_default.json`). The rest of the pipeline is unchanged.
Part of simplifications described in <#70> This approach avoids using an empty wildcard for time but introduces the need to have a separate rule to map the target Auspice dataset JSON filename (e.g. `auspice/avian-flu_h5n1-cattle-outbreak_pb2.json`) to a results file with a time wildcard (e.g. `results/avian-flu_h5n1-cattle-outbreak_pb2_default.json`). The rest of the pipeline is unchanged.
This is a series of observations about the current situation of the workflow and the complexities we've recently introduced to it as we've adapted it to handle the ongoing cattle-flu outbreak. This includes complexities added in (currently unmerged) Segment builds for H5N1 cattle flu outbreak. The suggestions here are roughly ordered as each builds on previous ones.
Config files
We've probably reached the point where config files would be clarifying rather than confusing. As @joverlee521 notes:
I'm imagining here we would have config files for:
Optional
time
wildcard.The per-segment cattle-flu builds don't have this wildcard, which is why the filename produced by the pipeline doesn't match the desired URL. We can maintain the current ruse of using
time=all-time
and add a final rule to rename the files or somesuch, but the cleaner way would be to make the wildcard optional. In practice (I think, haven't prototyped) this would mean that the three values of time would be"", "_all-time", "_2y"
.Bring genome build into canonical snakemake pipeline
With the above changes, merging these pipelines would be a nice simplification. While they use different parameters, all the rules from tree onwards are the same between the pipelines. I think this could be nicely organised if the input to
align
becomes a function and whenw.subtype=h5n1-cattle-outbreak
that function refers to outputs of arules/h5n1-cattle-outbreak-genome.smk
ruleset.Directory structure within results & config
This is rather minor, but the files counts are starting to become hard to manage! I'd think splitting
config/
into one directory per config file (as described above) would work nicely. Similarly forresults/
.Data files include source in filename
Currently no matter the upstream source (local ingest, GISIAD data, NCBI data) we get the following data files:
data/metadata.tsv
data/{segment}/sequences.fasta
data/metadata_{subtype}.tsv
data/sequences_{subtype}_{segment}fasta
This means when switching between a GISAID build (H5Nx, H5N1, H7N9, H9N2) to a NCBI build (h5n1-cattle-outbreak) these files will be overwritten¹, and then when you switch back to a GISAID build the entire pipeline needs to rerun.
I think it'd be clarifying to have the per-subtype files (3,4) in
results/{subtype}/{metadata.tsv,sequences.fasta}
. The rule that produces these would itself have inputs parameterised by the upstream source. Some pseudocode to explain what I'm thinking:¹ It looks like Snakemake is smart enough to detect when a invocation uses different config params (
s3_src
etc) than those which produced the existing files indata
and regenerates them accordingly. This was a nice surprise to me!The text was updated successfully, but these errors were encountered: