Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make tree for 450bp of the N gene ("N450") #20

Merged
merged 11 commits into from
Apr 1, 2024
1 change: 1 addition & 0 deletions CHANGES.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
# CHANGELOG
* 1 April 2024: Create a tree using the 450 nucleotides encoding the carboxyl-terminal 150 amino acids of the nucleoprotein (N450), which is highly represented on NCBI for measles. [PR #20](https://github.com/nextstrain/measles/pull/20)
* 15 March 2024: Connect ingest and phylogenetic workflows to follow the pathogen-repo-guide by uploading ingest output to S3, downloading ingest output from S3 to phylogenetic directory, using "accession" column as the ID column, and using a color scheme that matches the new region name format. [PR #19](https://github.com/nextstrain/measles/pull/19)
* 1 March 2024: Add phylogenetic directory to follow the pathogen-repo-guide, and update the CI workflow to match the new file structure. [PR #18](https://github.com/nextstrain/measles/pull/18)
* 14 February 2024: Add ingest directory from pathogen-repo-guide and make measles-specific modifications. [PR #10](https://github.com/nextstrain/measles/pull/10)
Expand Down
5 changes: 4 additions & 1 deletion phylogenetic/Snakefile
Original file line number Diff line number Diff line change
@@ -1,10 +1,13 @@
genes = ['N450', 'genome']

configfile: "defaults/config.yaml"

rule all:
input:
auspice_json = "auspice/measles.json",
auspice_json = expand("auspice/measles_{gene}.json", gene=genes)

include: "rules/prepare_sequences.smk"
include: "rules/prepare_sequences_N450.smk"
include: "rules/construct_phylogeny.smk"
include: "rules/annotate_phylogeny.smk"
include: "rules/export.smk"
Expand Down
8 changes: 3 additions & 5 deletions phylogenetic/defaults/auspice_config.json
Original file line number Diff line number Diff line change
Expand Up @@ -16,11 +16,6 @@
"title": "Date",
"type": "continuous"
},
{
"key": "author",
"title": "Author",
"type": "categorical"
},
{
"key": "country",
"title": "Country",
Expand All @@ -43,5 +38,8 @@
"country",
"region",
"author"
],
"metadata_columns": [
"author"
]
}
45 changes: 45 additions & 0 deletions phylogenetic/defaults/auspice_config_N450.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
{
"title": "Real-time tracking of measles virus evolution",
"maintainers": [
{"name": "Kim Andrews", "url": "https://bedford.io/team/kim-andrews/"},
{"name": "the Nextstrain team", "url": "https://nextstrain.org/team"}
],
"build_url": "https://github.com/nextstrain/measles",
"colorings": [
{
"key": "gt",
"title": "Genotype",
"type": "categorical"
},
{
"key": "num_date",
"title": "Date",
"type": "continuous"
},
{
"key": "country",
"title": "Country",
"type": "categorical"
},
{
"key": "region",
"title": "Region",
"type": "categorical"
}
],
"geo_resolutions": [
"country",
"region"
],
"display_defaults": {
"map_triplicate": true
},
"filters": [
"country",
"region",
"author"
],
"metadata_columns": [
"author"
]
}
20 changes: 0 additions & 20 deletions phylogenetic/defaults/colors.tsv
Original file line number Diff line number Diff line change
Expand Up @@ -4,23 +4,3 @@ region Africa #8ABB6A
region Europe #BEBB48
region South America #E29E39
region North America #E2562B

country india #511EA8
country china #4333BE
country vietnam #3F4ECB
country south korea #4169CF
country japan #4682C9
country australia #4F96BB
country new zealand #5AA5A8
country russia #68AF92
country gambia #78B77D
country sudan #8BBB6A
country morocco #9EBE59
country italy #B3BD4D
country germany #C5B945
country france #D5B03F
country netherlands #E0A23A
country united kingdom #E68D36
country brazil #E67231
country usa #E1502A
country canada #DC2F24
8 changes: 8 additions & 0 deletions phylogenetic/defaults/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,21 @@ strain_id_field: "accession"
files:
exclude: "defaults/dropped_strains.txt"
reference: "defaults/measles_reference.gb"
reference_N450: "defaults/measles_reference_N450.gb"
reference_N450_fasta: "defaults/measles_reference_N450.fasta"
colors: "defaults/colors.tsv"
auspice_config: "defaults/auspice_config.json"
auspice_config_N450: "defaults/auspice_config_N450.json"
filter:
group_by: "country year month"
sequences_per_group: 20
min_date: 1950
min_length: 5000
filter_N450:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor, not blocking]

@joverlee521 do you have a canonical way we should structure build-specific rule parameters (e.g. filter vs filter_N450) when we want to use a single config file for both/all builds?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not (yet?), the pathogen-repo-guide has been based on one build per config file.

Existing workflows use three different patterns

  1. seasonal flu has top level per build configs.
This would look like:
builds:
    genome:
        filter:
            group_by: ...
            sequences_per_group: ...
            [...]
    N450:
        filter:
            group_by: ...
            sequences_per_group: ...
            [...]
  1. ncov has build configs nested within rule groupings.
This would look like:
filter:
    genome:
        group_by: ...
        sequences_per_group: ...
        [...]
    N450:
        group_by: ...
        sequences_per_group: ...
        [...]
  1. rsv nests build names within specific config parameters.
This would look like:
filter:
    group_by:
        genome: ...
        N450: ...
    sequences_per_group:
        genome: ...
        N450: ...
    [...]

[1] is the most flexible, allowing each build to define its own parameters. This makes it very easy to scan the config file for one build's parameters in a single place. However, since each param has to be defined per build, this can result in very long config files, which is why seasonal-flu has complex array-builds configs to programmatically create the configs during the workflow.

[2] is also pretty flexible, where each build can configure each parameter per rule grouping. There's less repetition of parameters so config files won't be as long, but a single build's config is spread throughout the rules groupings. It is also not very clear which rule configs can be configured per build and which rule configs are shared among builds.

[3] is the least flexible, as it only allows each build to change specific configs. This can also be confusing why some configs are nested while others are not.

group_by: "country year"
subsample_max_sequences: 3000
min_date: 1950
min_length: 400
refine:
coalescent: "opt"
date_inference: "marginal"
Expand Down
8 changes: 8 additions & 0 deletions phylogenetic/defaults/measles_reference_N450.fasta
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
>lcl|NC_001498.1_cds_NP_056918.1_1 [gene=N] [locus_tag=MeVgp1] [db_xref=GeneID:1489804] [protein=nucleocapsid protein] [protein_id=NP_056918.1] [location=1233..1682] [gbkey=CDS]
GTCAGTTCCACATTGGCATCCGAACTCGGTATCACTGCCGAGGATGCAAGGCTTGTTTCAGAGAT
TGCAATGCATACTACTGAGGACAGGATCAGTAGAGCGGTCGGACCCAGACAAGCCCAAGTGTCATTTCTA
CACGGTGATCAAAGTGAGAATGAGCTACCAGGATTGGGGGGCAAGGAAGATAGGAGGGTCAAACAGGGTC
GGGGAGAAGCCAGGGAGAGCTACAGAGAAACCGGGTCCAGCAGAGCAAGTGATGCGAGAGCTGCCCATCC
TCCAACCAGCATGCCCCTAGACATTGACACTGCATCGGAGTCAGGCCAAGATCCGCAGGACAGTCGAAGG
TCAGCTGACGCCCTGCTCAGGCTGCAAGCCATGGCAGGAATCTTGGAAGAACAAGGCTCAGACACGGACA
CCCCTAGGGTATACAATGACAGAGATCTTCTAGAC
69 changes: 69 additions & 0 deletions phylogenetic/defaults/measles_reference_N450.gb
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
LOCUS NC_001498 450 bp cRNA linear VRL 13-AUG-2018
DEFINITION Measles virus, complete genome.
ACCESSION NC_001498 REGION: 1233..1682
VERSION NC_001498.1
DBLINK Project: 15025
BioProject: PRJNA485481
KEYWORDS RefSeq.
SOURCE Measles morbillivirus
ORGANISM Measles morbillivirus
Viruses; Riboviria; Orthornavirae; Negarnaviricota;
Haploviricotina; Monjiviricetes; Mononegavirales; Paramyxoviridae;
Orthoparamyxovirinae; Morbillivirus; Morbillivirus hominis.
REFERENCE 1 (sites)
AUTHORS Rima,B.K. and Duprex,W.P.
TITLE The measles virus replication cycle
JOURNAL Curr. Top. Microbiol. Immunol. 329, 77-102 (2009)
PUBMED 19198563
REFERENCE 2
AUTHORS Takeuchi,K., Miyajima,N., Kobune,F. and Tashiro,M.
TITLE Comparative nucleotide sequence analyses of the entire genomes of
B95a cell-isolated and vero cell-isolated measles viruses from the
same patient
JOURNAL Virus Genes 20 (3), 253-257 (2000)
PUBMED 10949953
REFERENCE 3 (bases 1 to 450)
CONSRTM NCBI Genome Project
TITLE Direct Submission
JOURNAL Submitted (01-AUG-2000) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA
REFERENCE 4 (bases 1 to 450)
AUTHORS Takeuchi,K., Tanabayashi,K. and Tashiro,M.
TITLE Direct Submission
JOURNAL Submitted (10-JUL-1998) Kaoru Takeuchi, National Institute of
Infectious Diseases, Viral Disease and Vaccine Contorol; 4-7-1
Gakuen, Musashi-murayama, Tokyo 208-0011, Japan
(E-mail:[email protected], Tel:81-42-561-0771(ex.530),
Fax:81-42-567-5631)
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from AB016162.
Sequence updated (21-Jul-1998)
Sequence updated (11-Dec-1998).
COMPLETENESS: full length.
FEATURES Location/Qualifiers
source 1..450
/organism="Measles morbillivirus"
/mol_type="viral cRNA"
/strain="Ichinose-B95a"
/db_xref="taxon:11234"
CDS <1..>450
jameshadfield marked this conversation as resolved.
Show resolved Hide resolved
/gene="N"
/codon_start=1
/product="nucleocapsid protein"
/protein_id="NP_056918.1"
/db_xref="GeneID:1489804"
/translation="VSSTLASELGITAEDAR
LVSEIAMHTTEDRISRAVGPRQAQVSFLHGDQSENELPGLGGKEDRRVKQGRGEARES
YRETGSSRASDARAAHPPTSMPLDIDTASESGQDPQDSRRSADALLRLQAMAGILEEQ
GSDTDTPRVYNDRDLLD"
ORIGIN
1 gtcagttcca cattggcatc cgaactcggt atcactgccg aggatgcaag gcttgtttca
61 gagattgcaa tgcatactac tgaggacagg atcagtagag cggtcggacc cagacaagcc
121 caagtgtcat ttctacacgg tgatcaaagt gagaatgagc taccaggatt ggggggcaag
181 gaagatagga gggtcaaaca gggtcgggga gaagccaggg agagctacag agaaaccggg
241 tccagcagag caagtgatgc gagagctgcc catcctccaa ccagcatgcc cctagacatt
301 gacactgcat cggagtcagg ccaagatccg caggacagtc gaaggtcagc tgacgccctg
361 ctcaggctgc aagccatggc aggaatcttg gaagaacaag gctcagacac ggacacccct
421 agggtataca atgacagaga tcttctagac
//

14 changes: 7 additions & 7 deletions phylogenetic/rules/annotate_phylogeny.smk
Original file line number Diff line number Diff line change
Expand Up @@ -8,10 +8,10 @@ See Augur's usage docs for these commands for more details.
rule ancestral:
"""Reconstructing ancestral sequences and mutations"""
input:
tree = "results/tree.nwk",
alignment = "results/aligned.fasta"
tree = "results/{gene}/tree.nwk",
alignment = "results/{gene}/aligned.fasta"
output:
node_data = "results/nt_muts.json"
node_data = "results/{gene}/nt_muts.json"
params:
inference = config["ancestral"]["inference"]
shell:
Expand All @@ -26,11 +26,11 @@ rule ancestral:
rule translate:
"""Translating amino acid sequences"""
input:
tree = "results/tree.nwk",
node_data = "results/nt_muts.json",
reference = config["files"]["reference"]
tree = "results/{gene}/tree.nwk",
node_data = "results/{gene}/nt_muts.json",
reference = lambda wildcard: "defaults/measles_reference.gb" if wildcard.gene in ["genome"] else "defaults/measles_reference_{gene}.gb"
output:
node_data = "results/aa_muts.json"
node_data = "results/{gene}/aa_muts.json"
shell:
"""
augur translate \
Expand Down
15 changes: 8 additions & 7 deletions phylogenetic/rules/construct_phylogeny.smk
Original file line number Diff line number Diff line change
Expand Up @@ -7,9 +7,9 @@ See Augur's usage docs for these commands for more details.
rule tree:
"""Building tree"""
input:
alignment = "results/aligned.fasta"
alignment = "results/{gene}/aligned.fasta"
output:
tree = "results/tree_raw.nwk"
tree = "results/{gene}/tree_raw.nwk"
shell:
"""
augur tree \
Expand All @@ -26,12 +26,12 @@ rule refine:
- filter tips more than {params.clock_filter_iqd} IQDs from clock expectation
"""
input:
tree = "results/tree_raw.nwk",
alignment = "results/aligned.fasta",
tree = "results/{gene}/tree_raw.nwk",
alignment = "results/{gene}/aligned.fasta",
metadata = "data/metadata.tsv"
output:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The temporal signal is much better in this build (and the build Trevor showed me) than the one we looked at mid-week - do you know what changed? This gives me more confidence that the (temporal) rooting is working ok for the N450 build - it's broadly similar to the genome build - and so there's no longer a need to start in "unrooted" view if you'd prefer to go back to the more typical rectangular view.

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lower temporal signal for the tree I showed you last week seems to be related to the subsampling method I had used. Here is the subsampling method I used for that tree:

filter_N450:
group_by: "country year month"
sequences_per_group: 2
min_date: 1950
min_length: 400

And here is the clock view for a tree I generated today using that subsampling method:

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed default display back to rooted time-tree in 862dbaf

tree = "results/tree.nwk",
node_data = "results/branch_lengths.json"
tree = "results/{gene}/tree.nwk",
node_data = "results/{gene}/branch_lengths.json"
params:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following deprecation notice is present when running refine:

        DEPRECATION WARNING. TreeTime.resolve_polytomies: You are resolving
        polytomies using the old 'greedy' mode. This is not well suited for large
        polytomies. Stochastic resolution will become the default in future
        versions. To switch now, rerun with the flag `--stochastic-resolve`. To
        keep using the greedy method in the future, run with `--greedy-resolve`

Especially for the N450 build where we will have large polytomies we should use --stochastic-resolve.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f39ba8b

coalescent = config["refine"]["coalescent"],
date_inference = config["refine"]["date_inference"],
Expand All @@ -50,6 +50,7 @@ rule refine:
--coalescent {params.coalescent} \
--date-confidence \
--date-inference {params.date_inference} \
--clock-filter-iqd {params.clock_filter_iqd}
--clock-filter-iqd {params.clock_filter_iqd} \
--stochastic-resolve
"""

14 changes: 8 additions & 6 deletions phylogenetic/rules/export.smk
Original file line number Diff line number Diff line change
Expand Up @@ -8,15 +8,17 @@ See Augur's usage docs for these commands for more details.
rule export:
"""Exporting data files for for auspice"""
input:
tree = "results/tree.nwk",
tree = "results/{gene}/tree.nwk",
metadata = "data/metadata.tsv",
branch_lengths = "results/branch_lengths.json",
nt_muts = "results/nt_muts.json",
aa_muts = "results/aa_muts.json",
branch_lengths = "results/{gene}/branch_lengths.json",
nt_muts = "results/{gene}/nt_muts.json",
aa_muts = "results/{gene}/aa_muts.json",
colors = config["files"]["colors"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest removing all the "country" colours from the defaults/colors.tsv in this PR simply because a number of countries are missing colours. Auspice will pick a better set of colours than the current situation of some colours + some greys. We can then add nicer colours in a subsequent PR, as desired.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 119fbcf

auspice_config = config["files"]["auspice_config"]
auspice_config = lambda wildcard: "defaults/auspice_config.json" if wildcard.gene in ["genome"] else "defaults/auspice_config_N450.json"

output:
auspice_json = rules.all.input.auspice_json
auspice_json = "auspice/measles_{gene}.json",
root_sequence = "auspice/measles_{gene}_root-sequence.json"
params:
strain_id = config["strain_id_field"]
shell:
Expand Down
6 changes: 3 additions & 3 deletions phylogenetic/rules/prepare_sequences.smk
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,7 @@ rule filter:
metadata = "data/metadata.tsv",
exclude = config["files"]["exclude"]
output:
sequences = "results/filtered.fasta"
sequences = "results/genome/filtered.fasta"
params:
group_by = config["filter"]["group_by"],
sequences_per_group = config["filter"]["sequences_per_group"],
Expand All @@ -71,10 +71,10 @@ rule align:
- filling gaps with N
"""
input:
sequences = "results/filtered.fasta",
sequences = "results/genome/filtered.fasta",
reference = config["files"]["reference"]
output:
alignment = "results/aligned.fasta"
alignment = "results/genome/aligned.fasta"
shell:
"""
augur align \
Expand Down
58 changes: 58 additions & 0 deletions phylogenetic/rules/prepare_sequences_N450.smk
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
"""
This part of the workflow prepares sequences for constructing the phylogenetic tree for 450bp of the N gene.

See Augur's usage docs for these commands for more details.
"""

rule align_and_extract_N450:
input:
sequences = "data/sequences.fasta",
reference = config["files"]["reference_N450_fasta"]
output:
sequences = "results/N450/sequences.fasta"
params:
min_length = config['filter_N450']['min_length']
shell:
"""
nextclade run \
-j 1 \
--input-ref {input.reference} \
--output-fasta {output.sequences} \
--min-seed-cover 0.01 \
--min-length {params.min_length} \
--silent \
{input.sequences}
"""
rule filter_N450:
"""
Filtering to
- {params.sequences_per_group} sequence(s) per {params.group_by!s}
- excluding strains in {input.exclude}
- minimum genome length of {params.min_length}
- excluding strains with missing region, country or date metadata
"""
input:
sequences = "results/N450/sequences.fasta",
metadata = "data/metadata.tsv",
exclude = config["files"]["exclude"]
output:
sequences = "results/N450/aligned.fasta"
params:
group_by = config['filter_N450']['group_by'],
subsample_max_sequences = config["filter_N450"]["subsample_max_sequences"],
min_date = config["filter_N450"]["min_date"],
min_length = config['filter_N450']['min_length'],
strain_id = config["strain_id_field"]
shell:
"""
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
--metadata-id-columns {params.strain_id} \
--exclude {input.exclude} \
--output {output.sequences} \
--group-by {params.group_by} \
--subsample-max-sequences {params.subsample_max_sequences} \
--min-date {params.min_date} \
--min-length {params.min_length}
"""