-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make tree for 450bp of the N gene ("N450") #20
Changes from 5 commits
55ea0ce
edbefd5
8343876
d55342b
ab92a1b
f39ba8b
119fbcf
862dbaf
8bea320
a9d2644
bf83a42
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
{ | ||
"title": "Real-time tracking of measles virus evolution", | ||
"maintainers": [ | ||
{"name": "Kim Andrews", "url": "https://bedford.io/team/kim-andrews/"}, | ||
{"name": "the Nextstrain team", "url": "https://nextstrain.org/team"} | ||
], | ||
"build_url": "https://github.com/nextstrain/measles", | ||
"colorings": [ | ||
{ | ||
"key": "gt", | ||
"title": "Genotype", | ||
"type": "categorical" | ||
}, | ||
{ | ||
"key": "num_date", | ||
"title": "Date", | ||
"type": "continuous" | ||
}, | ||
{ | ||
"key": "author", | ||
"title": "Author", | ||
"type": "categorical" | ||
}, | ||
{ | ||
"key": "country", | ||
"title": "Country", | ||
"type": "categorical" | ||
}, | ||
{ | ||
"key": "region", | ||
"title": "Region", | ||
"type": "categorical" | ||
} | ||
], | ||
"geo_resolutions": [ | ||
"country", | ||
"region" | ||
], | ||
"display_defaults": { | ||
"map_triplicate": true, | ||
"distance_measure": "div", | ||
"layout": "unrooted" | ||
}, | ||
"filters": [ | ||
"country", | ||
"region", | ||
"author" | ||
] | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,13 +2,21 @@ strain_id_field: "accession" | |
files: | ||
exclude: "defaults/dropped_strains.txt" | ||
reference: "defaults/measles_reference.gb" | ||
reference_N450: "defaults/measles_reference_N450.gb" | ||
reference_N450_fasta: "defaults/measles_reference_N450.fasta" | ||
colors: "defaults/colors.tsv" | ||
auspice_config: "defaults/auspice_config.json" | ||
auspice_config_N450: "defaults/auspice_config_N450.json" | ||
filter: | ||
group_by: "country year month" | ||
sequences_per_group: 20 | ||
min_date: 1950 | ||
min_length: 5000 | ||
filter_N450: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [minor, not blocking] @joverlee521 do you have a canonical way we should structure build-specific rule parameters (e.g. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I do not (yet?), the pathogen-repo-guide has been based on one build per config file. Existing workflows use three different patterns
This would look like:builds:
genome:
filter:
group_by: ...
sequences_per_group: ...
[...]
N450:
filter:
group_by: ...
sequences_per_group: ...
[...] This would look like:filter:
genome:
group_by: ...
sequences_per_group: ...
[...]
N450:
group_by: ...
sequences_per_group: ...
[...] This would look like:filter:
group_by:
genome: ...
N450: ...
sequences_per_group:
genome: ...
N450: ...
[...] [1] is the most flexible, allowing each build to define its own parameters. This makes it very easy to scan the config file for one build's parameters in a single place. However, since each param has to be defined per build, this can result in very long config files, which is why seasonal-flu has complex array-builds configs to programmatically create the configs during the workflow. [2] is also pretty flexible, where each build can configure each parameter per rule grouping. There's less repetition of parameters so config files won't be as long, but a single build's config is spread throughout the rules groupings. It is also not very clear which rule configs can be configured per build and which rule configs are shared among builds. [3] is the least flexible, as it only allows each build to change specific configs. This can also be confusing why some configs are nested while others are not. |
||
group_by: "country year" | ||
subsample_max_sequences: 3000 | ||
min_date: 1950 | ||
min_length: 400 | ||
refine: | ||
coalescent: "opt" | ||
date_inference: "marginal" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
>lcl|NC_001498.1_cds_NP_056918.1_1 [gene=N] [locus_tag=MeVgp1] [db_xref=GeneID:1489804] [protein=nucleocapsid protein] [protein_id=NP_056918.1] [location=1233..1682] [gbkey=CDS] | ||
GTCAGTTCCACATTGGCATCCGAACTCGGTATCACTGCCGAGGATGCAAGGCTTGTTTCAGAGAT | ||
TGCAATGCATACTACTGAGGACAGGATCAGTAGAGCGGTCGGACCCAGACAAGCCCAAGTGTCATTTCTA | ||
CACGGTGATCAAAGTGAGAATGAGCTACCAGGATTGGGGGGCAAGGAAGATAGGAGGGTCAAACAGGGTC | ||
GGGGAGAAGCCAGGGAGAGCTACAGAGAAACCGGGTCCAGCAGAGCAAGTGATGCGAGAGCTGCCCATCC | ||
TCCAACCAGCATGCCCCTAGACATTGACACTGCATCGGAGTCAGGCCAAGATCCGCAGGACAGTCGAAGG | ||
TCAGCTGACGCCCTGCTCAGGCTGCAAGCCATGGCAGGAATCTTGGAAGAACAAGGCTCAGACACGGACA | ||
CCCCTAGGGTATACAATGACAGAGATCTTCTAGAC |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
LOCUS NC_001498 450 bp cRNA linear VRL 13-AUG-2018 | ||
DEFINITION Measles virus, complete genome. | ||
ACCESSION NC_001498 REGION: 1233..1682 | ||
VERSION NC_001498.1 | ||
DBLINK Project: 15025 | ||
BioProject: PRJNA485481 | ||
KEYWORDS RefSeq. | ||
SOURCE Measles morbillivirus | ||
ORGANISM Measles morbillivirus | ||
Viruses; Riboviria; Orthornavirae; Negarnaviricota; | ||
Haploviricotina; Monjiviricetes; Mononegavirales; Paramyxoviridae; | ||
Orthoparamyxovirinae; Morbillivirus; Morbillivirus hominis. | ||
REFERENCE 1 (sites) | ||
AUTHORS Rima,B.K. and Duprex,W.P. | ||
TITLE The measles virus replication cycle | ||
JOURNAL Curr. Top. Microbiol. Immunol. 329, 77-102 (2009) | ||
PUBMED 19198563 | ||
REFERENCE 2 | ||
AUTHORS Takeuchi,K., Miyajima,N., Kobune,F. and Tashiro,M. | ||
TITLE Comparative nucleotide sequence analyses of the entire genomes of | ||
B95a cell-isolated and vero cell-isolated measles viruses from the | ||
same patient | ||
JOURNAL Virus Genes 20 (3), 253-257 (2000) | ||
PUBMED 10949953 | ||
REFERENCE 3 (bases 1 to 450) | ||
CONSRTM NCBI Genome Project | ||
TITLE Direct Submission | ||
JOURNAL Submitted (01-AUG-2000) National Center for Biotechnology | ||
Information, NIH, Bethesda, MD 20894, USA | ||
REFERENCE 4 (bases 1 to 450) | ||
AUTHORS Takeuchi,K., Tanabayashi,K. and Tashiro,M. | ||
TITLE Direct Submission | ||
JOURNAL Submitted (10-JUL-1998) Kaoru Takeuchi, National Institute of | ||
Infectious Diseases, Viral Disease and Vaccine Contorol; 4-7-1 | ||
Gakuen, Musashi-murayama, Tokyo 208-0011, Japan | ||
(E-mail:[email protected], Tel:81-42-561-0771(ex.530), | ||
Fax:81-42-567-5631) | ||
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The | ||
reference sequence was derived from AB016162. | ||
Sequence updated (21-Jul-1998) | ||
Sequence updated (11-Dec-1998). | ||
COMPLETENESS: full length. | ||
FEATURES Location/Qualifiers | ||
source 1..450 | ||
/organism="Measles morbillivirus" | ||
/mol_type="viral cRNA" | ||
/strain="Ichinose-B95a" | ||
/db_xref="taxon:11234" | ||
CDS <1..>450 | ||
jameshadfield marked this conversation as resolved.
Show resolved
Hide resolved
|
||
/gene="N" | ||
/codon_start=1 | ||
/product="nucleocapsid protein" | ||
/protein_id="NP_056918.1" | ||
/db_xref="GeneID:1489804" | ||
/translation="VSSTLASELGITAEDAR | ||
LVSEIAMHTTEDRISRAVGPRQAQVSFLHGDQSENELPGLGGKEDRRVKQGRGEARES | ||
YRETGSSRASDARAAHPPTSMPLDIDTASESGQDPQDSRRSADALLRLQAMAGILEEQ | ||
GSDTDTPRVYNDRDLLD" | ||
ORIGIN | ||
1 gtcagttcca cattggcatc cgaactcggt atcactgccg aggatgcaag gcttgtttca | ||
61 gagattgcaa tgcatactac tgaggacagg atcagtagag cggtcggacc cagacaagcc | ||
121 caagtgtcat ttctacacgg tgatcaaagt gagaatgagc taccaggatt ggggggcaag | ||
181 gaagatagga gggtcaaaca gggtcgggga gaagccaggg agagctacag agaaaccggg | ||
241 tccagcagag caagtgatgc gagagctgcc catcctccaa ccagcatgcc cctagacatt | ||
301 gacactgcat cggagtcagg ccaagatccg caggacagtc gaaggtcagc tgacgccctg | ||
361 ctcaggctgc aagccatggc aggaatcttg gaagaacaag gctcagacac ggacacccct | ||
421 agggtataca atgacagaga tcttctagac | ||
// | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,9 +7,9 @@ See Augur's usage docs for these commands for more details. | |
rule tree: | ||
"""Building tree""" | ||
input: | ||
alignment = "results/aligned.fasta" | ||
alignment = "results/aligned_{gene}.fasta" | ||
output: | ||
tree = "results/tree_raw.nwk" | ||
tree = "results/tree_raw_{gene}.nwk" | ||
shell: | ||
""" | ||
augur tree \ | ||
|
@@ -26,12 +26,12 @@ rule refine: | |
- filter tips more than {params.clock_filter_iqd} IQDs from clock expectation | ||
""" | ||
input: | ||
tree = "results/tree_raw.nwk", | ||
alignment = "results/aligned.fasta", | ||
tree = "results/tree_raw_{gene}.nwk", | ||
alignment = "results/aligned_{gene}.fasta", | ||
metadata = "data/metadata.tsv" | ||
output: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The temporal signal is much better in this build (and the build Trevor showed me) than the one we looked at mid-week - do you know what changed? This gives me more confidence that the (temporal) rooting is working ok for the N450 build - it's broadly similar to the genome build - and so there's no longer a need to start in "unrooted" view if you'd prefer to go back to the more typical rectangular view. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The lower temporal signal for the tree I showed you last week seems to be related to the subsampling method I had used. Here is the subsampling method I used for that tree:
And here is the clock view for a tree I generated today using that subsampling method: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Changed default display back to rooted time-tree in 862dbaf |
||
tree = "results/tree.nwk", | ||
node_data = "results/branch_lengths.json" | ||
tree = "results/tree_{gene}.nwk", | ||
node_data = "results/branch_lengths_{gene}.json" | ||
params: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The following deprecation notice is present when running refine: DEPRECATION WARNING. TreeTime.resolve_polytomies: You are resolving
polytomies using the old 'greedy' mode. This is not well suited for large
polytomies. Stochastic resolution will become the default in future
versions. To switch now, rerun with the flag `--stochastic-resolve`. To
keep using the greedy method in the future, run with `--greedy-resolve` Especially for the N450 build where we will have large polytomies we should use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done in f39ba8b |
||
coalescent = config["refine"]["coalescent"], | ||
date_inference = config["refine"]["date_inference"], | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,15 +8,17 @@ See Augur's usage docs for these commands for more details. | |
rule export: | ||
"""Exporting data files for for auspice""" | ||
input: | ||
tree = "results/tree.nwk", | ||
tree = "results/tree_{gene}.nwk", | ||
metadata = "data/metadata.tsv", | ||
branch_lengths = "results/branch_lengths.json", | ||
nt_muts = "results/nt_muts.json", | ||
aa_muts = "results/aa_muts.json", | ||
branch_lengths = "results/branch_lengths_{gene}.json", | ||
nt_muts = "results/nt_muts_{gene}.json", | ||
aa_muts = "results/aa_muts_{gene}.json", | ||
colors = config["files"]["colors"], | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd suggest removing all the "country" colours from the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done in 119fbcf |
||
auspice_config = config["files"]["auspice_config"] | ||
auspice_config = lambda wildcard: "defaults/auspice_config.json" if wildcard.gene in ["genome"] else "defaults/auspice_config_N450.json" | ||
|
||
output: | ||
auspice_json = rules.all.input.auspice_json | ||
auspice_json = "auspice/measles_{gene}.json", | ||
root_sequence = "auspice/measles_{gene}_root-sequence.json" | ||
params: | ||
strain_id = config["strain_id_field"] | ||
shell: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
""" | ||
This part of the workflow prepares sequences for constructing the phylogenetic tree for 450bp of the N gene. | ||
|
||
See Augur's usage docs for these commands for more details. | ||
""" | ||
|
||
rule align_and_extract_N450: | ||
input: | ||
sequences = "data/sequences.fasta", | ||
reference = config["files"]["reference_N450_fasta"] | ||
output: | ||
sequences = "results/sequences_N450.fasta" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's preferable to organise builds as directories within results, e.g. "results/genome/sequences.fasta" and "results/N450/sequences.fasta". As it's an implementation detail, this change doesn't have to be made in this PR. (This comment applies throughout the snakemake files added in this PR.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done in a9d2644 |
||
params: | ||
min_length = config['filter_N450']['min_length'] | ||
shell: | ||
""" | ||
nextclade run \ | ||
-j 1 \ | ||
--input-ref {input.reference} \ | ||
--output-fasta {output.sequences} \ | ||
--min-seed-cover 0.01 \ | ||
--min-length {params.min_length} \ | ||
--silent \ | ||
{input.sequences} | ||
""" | ||
rule filter_N450: | ||
""" | ||
Filtering to | ||
- {params.sequences_per_group} sequence(s) per {params.group_by!s} | ||
- excluding strains in {input.exclude} | ||
- minimum genome length of {params.min_length} | ||
- excluding strains with missing region, country or date metadata | ||
""" | ||
input: | ||
sequences = "results/sequences_N450.fasta", | ||
metadata = "data/metadata.tsv", | ||
exclude = config["files"]["exclude"] | ||
output: | ||
sequences = "results/aligned_N450.fasta" | ||
params: | ||
group_by = config['filter_N450']['group_by'], | ||
subsample_max_sequences = config["filter_N450"]["subsample_max_sequences"], | ||
min_date = config["filter_N450"]["min_date"], | ||
min_length = config['filter_N450']['min_length'], | ||
strain_id = config["strain_id_field"] | ||
shell: | ||
""" | ||
augur filter \ | ||
--sequences {input.sequences} \ | ||
--metadata {input.metadata} \ | ||
--metadata-id-columns {params.strain_id} \ | ||
--exclude {input.exclude} \ | ||
--output {output.sequences} \ | ||
--group-by {params.group_by} \ | ||
--subsample-max-sequences {params.subsample_max_sequences} \ | ||
--min-date {params.min_date} \ | ||
--min-length {params.min_length} | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Not blocking this PR]
We used to add "author" as a color-by as a workaround so that it appeared as a filtering option in Auspice, however a couple of recent changes have rendered this pattern unnecessary. We can now use "metadata_columns" in the auspice-config JSON (or
--metadata-columns
) to export "author" and Auspice will automatically add it as a filtering option. This is nicer than exposing author as a coloring.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 8bea320