-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make tree for 450bp of the N gene ("N450") #20
Conversation
Reference is comprised of 450bp of the N gene from the same sample that is used for the genome tree (NCBI Accession NC_001498.1).
* Add rule to align sequences to N450 reference using nextclade * Add rule to filter by length, date, country
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
colors: "defaults/colors.tsv" | ||
auspice_config: "defaults/auspice_config.json" | ||
filter: | ||
group_by: "country year month" | ||
sequences_per_group: 20 | ||
min_date: 1950 | ||
min_length: 5000 | ||
filter_N450: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[minor, not blocking]
@joverlee521 do you have a canonical way we should structure build-specific rule parameters (e.g. filter
vs filter_N450
) when we want to use a single config file for both/all builds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not (yet?), the pathogen-repo-guide has been based on one build per config file.
Existing workflows use three different patterns
- seasonal flu has top level per build configs.
This would look like:
builds:
genome:
filter:
group_by: ...
sequences_per_group: ...
[...]
N450:
filter:
group_by: ...
sequences_per_group: ...
[...]
This would look like:
filter:
genome:
group_by: ...
sequences_per_group: ...
[...]
N450:
group_by: ...
sequences_per_group: ...
[...]
This would look like:
filter:
group_by:
genome: ...
N450: ...
sequences_per_group:
genome: ...
N450: ...
[...]
[1] is the most flexible, allowing each build to define its own parameters. This makes it very easy to scan the config file for one build's parameters in a single place. However, since each param has to be defined per build, this can result in very long config files, which is why seasonal-flu has complex array-builds configs to programmatically create the configs during the workflow.
[2] is also pretty flexible, where each build can configure each parameter per rule grouping. There's less repetition of parameters so config files won't be as long, but a single build's config is spread throughout the rules groupings. It is also not very clear which rule configs can be configured per build and which rule configs are shared among builds.
[3] is the least flexible, as it only allows each build to change specific configs. This can also be confusing why some configs are nested while others are not.
sequences = "data/sequences.fasta", | ||
reference = config["files"]["reference_N450_fasta"] | ||
output: | ||
sequences = "results/sequences_N450.fasta" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's preferable to organise builds as directories within results, e.g. "results/genome/sequences.fasta" and "results/N450/sequences.fasta". As it's an implementation detail, this change doesn't have to be made in this PR.
(This comment applies throughout the snakemake files added in this PR.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in a9d2644
tree = "results/tree.nwk", | ||
node_data = "results/branch_lengths.json" | ||
tree = "results/tree_{gene}.nwk", | ||
node_data = "results/branch_lengths_{gene}.json" | ||
params: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following deprecation notice is present when running refine:
DEPRECATION WARNING. TreeTime.resolve_polytomies: You are resolving
polytomies using the old 'greedy' mode. This is not well suited for large
polytomies. Stochastic resolution will become the default in future
versions. To switch now, rerun with the flag `--stochastic-resolve`. To
keep using the greedy method in the future, run with `--greedy-resolve`
Especially for the N450 build where we will have large polytomies we should use --stochastic-resolve
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in f39ba8b
aa_muts = "results/aa_muts.json", | ||
branch_lengths = "results/branch_lengths_{gene}.json", | ||
nt_muts = "results/nt_muts_{gene}.json", | ||
aa_muts = "results/aa_muts_{gene}.json", | ||
colors = config["files"]["colors"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest removing all the "country" colours from the defaults/colors.tsv
in this PR simply because a number of countries are missing colours. Auspice will pick a better set of colours than the current situation of some colours + some greys. We can then add nicer colours in a subsequent PR, as desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 119fbcf
tree = "results/tree_raw.nwk", | ||
alignment = "results/aligned.fasta", | ||
tree = "results/tree_raw_{gene}.nwk", | ||
alignment = "results/aligned_{gene}.fasta", | ||
metadata = "data/metadata.tsv" | ||
output: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The temporal signal is much better in this build (and the build Trevor showed me) than the one we looked at mid-week - do you know what changed? This gives me more confidence that the (temporal) rooting is working ok for the N450 build - it's broadly similar to the genome build - and so there's no longer a need to start in "unrooted" view if you'd prefer to go back to the more typical rectangular view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lower temporal signal for the tree I showed you last week seems to be related to the subsampling method I had used. Here is the subsampling method I used for that tree:
filter_N450:
group_by: "country year month"
sequences_per_group: 2
min_date: 1950
min_length: 400
And here is the clock view for a tree I generated today using that subsampling method:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed default display back to rooted time-tree in 862dbaf
"type": "continuous" | ||
}, | ||
{ | ||
"key": "author", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Not blocking this PR]
We used to add "author" as a color-by as a workaround so that it appeared as a filtering option in Auspice, however a couple of recent changes have rendered this pattern unnecessary. We can now use "metadata_columns" in the auspice-config JSON (or --metadata-columns
) to export "author" and Auspice will automatically add it as a filtering option. This is nicer than exposing author as a coloring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 8bea320
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest changes look good to me!
I think this is good to merge and deploy the latest genome/N450 builds to nextstrain.org
Description of proposed changes
The goal of this PR is to create a tree using a 450bp region of the N gene ("N450") that is highly represented on NCBI for measles. Addresses github issue #13
General steps include:
Many of these changes follow those made for E gene trees in the dengue repo
Related issue(s)
#13
Checklist