Make tree for 450bp of the N gene ("N450") #20
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
""" | ||
This part of the workflow prepares sequences for constructing the phylogenetic tree for 450bp of the N gene. | ||
|
||
See Augur's usage docs for these commands for more details. | ||
""" | ||
|
||
rule align_and_extract_N450: | ||
input: | ||
sequences = "data/sequences.fasta", | ||
reference = config["files"]["reference_N450_fasta"] | ||
output: | ||
sequences = "results/sequences_N450.fasta" | ||
Review comment: I think it's preferable to organise builds as directories within results, e.g. "results/genome/sequences.fasta" and "results/N450/sequences.fasta". As it's an implementation detail, this change doesn't have to be made in this PR. (This comment applies throughout the snakemake files added in this PR.) Reply: done in a9d2644 |
||
params: | ||
min_length = config['filter_N450']['min_length'] | ||
shell: | ||
""" | ||
nextclade run \ | ||
-j 1 \ | ||
--input-ref {input.reference} \ | ||
--output-fasta {output.sequences} \ | ||
--min-seed-cover 0.01 \ | ||
--min-length {params.min_length} \ | ||
--silent \ | ||
{input.sequences} | ||
""" | ||
rule filter_N450: | ||
""" | ||
Filtering to | ||
- up to {params.subsample_max_sequences} sequence(s) in total, grouped by {params.group_by!s} | ||
- excluding strains in {input.exclude} | ||
- minimum sequence length of {params.min_length} | ||
- dated on or after {params.min_date} | ||
- excluding strains with missing region, country or date metadata | ||
""" | ||
input: | ||
sequences = "results/sequences_N450.fasta", | ||
metadata = "data/metadata.tsv", | ||
exclude = config["files"]["exclude"] | ||
output: | ||
sequences = "results/aligned_N450.fasta" | ||
params: | ||
group_by = config['filter_N450']['group_by'], | ||
subsample_max_sequences = config["filter_N450"]["subsample_max_sequences"], | ||
min_date = config["filter_N450"]["min_date"], | ||
min_length = config['filter_N450']['min_length'], | ||
strain_id = config["strain_id_field"] | ||
shell: | ||
""" | ||
augur filter \ | ||
--sequences {input.sequences} \ | ||
--metadata {input.metadata} \ | ||
--metadata-id-columns {params.strain_id} \ | ||
--exclude {input.exclude} \ | ||
--output {output.sequences} \ | ||
--group-by {params.group_by} \ | ||
--subsample-max-sequences {params.subsample_max_sequences} \ | ||
--min-date {params.min_date} \ | ||
--min-length {params.min_length} | ||
""" |
[minor, not blocking]
@joverlee521 do you have a canonical way we should structure build-specific rule parameters (e.g. `filter` vs `filter_N450`) when we want to use a single config file for both/all builds?
I do not (yet?); the pathogen-repo-guide has been based on one build per config file.
Existing workflows use three different patterns:
This would look like:
This would look like:
This would look like:
[1] is the most flexible, allowing each build to define its own parameters. This makes it very easy to scan the config file for one build's parameters in a single place. However, since each param has to be defined per build, this can result in very long config files, which is why seasonal-flu has complex array-builds configs to programmatically create the configs during the workflow.
[2] is also pretty flexible, since each build can configure each parameter within a rule grouping. There's less repetition of parameters, so config files won't be as long, but a single build's config is spread across the rule groupings. It is also not very clear which rule configs can be set per build and which are shared among builds.
[3] is the least flexible, as it only allows each build to change specific configs. It can also be unclear why some configs are nested while others are not.
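To make the trade-offs concrete, here is a rough Python sketch of how the three layouts might look once the config is loaded, along with the lookup each one implies inside a rule. Everything in it is an assumption for illustration: the key names (`builds`, `filter`, `min_length`, `group_by`), the build names, the values, and the helper functions are hypothetical and are inferred from the descriptions of [1] to [3], not copied from seasonal-flu or any other existing workflow.

```python
# Hypothetical config layouts; names and values are illustrative only.

# [1] Every parameter is nested under its build.
config_per_build = {
    "builds": {
        "genome": {"filter": {"min_length": 5000, "group_by": "country year"}},
        "N450": {"filter": {"min_length": 400, "group_by": "country year"}},
    }
}

# [2] Rule groupings sit at the top level; each parameter is keyed by build.
config_per_rule = {
    "filter": {
        "min_length": {"genome": 5000, "N450": 400},
        "group_by": {"genome": "country year", "N450": "country year"},
    }
}

# [3] Most values are shared; only selected parameters are nested per build.
config_mixed = {
    "filter": {
        "group_by": "country year",                    # shared by all builds
        "min_length": {"genome": 5000, "N450": 400},   # build-specific
    }
}


def param_per_build(config, build, rule, name):
    """Lookup for layout [1]: build first, then rule, then parameter."""
    return config["builds"][build][rule][name]


def param_per_rule(config, build, rule, name):
    """Lookup for layout [2]: rule first, then parameter, then build."""
    return config[rule][name][build]


def param_mixed(config, build, rule, name):
    """Lookup for layout [3]: a dict value is treated as a per-build mapping,
    anything else is treated as shared across builds."""
    value = config[rule][name]
    if isinstance(value, dict):
        return value[build]
    return value
```

In a Snakemake rule, a lookup like these would typically be wrapped in a `params` function that receives the build name as a wildcard. The sketch also shows why [3] can be confusing: the reader cannot tell from the lookup alone which values are build-specific and which are shared.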