Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make tree for 450bp of the N gene ("N450") #20

Merged
merged 11 commits into from
Apr 1, 2024
5 changes: 4 additions & 1 deletion phylogenetic/Snakefile
Original file line number Diff line number Diff line change
@@ -1,10 +1,13 @@
genes = ['N450', 'genome']

configfile: "defaults/config.yaml"

rule all:
input:
auspice_json = "auspice/measles.json",
auspice_json = expand("auspice/measles_{gene}.json", gene=genes)

include: "rules/prepare_sequences.smk"
include: "rules/prepare_sequences_N450.smk"
include: "rules/construct_phylogeny.smk"
include: "rules/annotate_phylogeny.smk"
include: "rules/export.smk"
Expand Down
49 changes: 49 additions & 0 deletions phylogenetic/defaults/auspice_config_N450.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
{
"title": "Real-time tracking of measles virus evolution",
"maintainers": [
{"name": "Kim Andrews", "url": "https://bedford.io/team/kim-andrews/"},
{"name": "the Nextstrain team", "url": "https://nextstrain.org/team"}
],
"build_url": "https://github.com/nextstrain/measles",
"colorings": [
{
"key": "gt",
"title": "Genotype",
"type": "categorical"
},
{
"key": "num_date",
"title": "Date",
"type": "continuous"
},
{
"key": "author",
Copy link
Member

@jameshadfield jameshadfield Mar 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Not blocking this PR]

We used to add "author" as a color-by as a workaround so that it appeared as a filtering option in Auspice, however a couple of recent changes have rendered this pattern unnecessary. We can now use "metadata_columns" in the auspice-config JSON (or --metadata-columns) to export "author" and Auspice will automatically add it as a filtering option. This is nicer than exposing author as a coloring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 8bea320

"title": "Author",
"type": "categorical"
},
{
"key": "country",
"title": "Country",
"type": "categorical"
},
{
"key": "region",
"title": "Region",
"type": "categorical"
}
],
"geo_resolutions": [
"country",
"region"
],
"display_defaults": {
"map_triplicate": true,
"distance_measure": "div",
"layout": "unrooted"
},
"filters": [
"country",
"region",
"author"
]
}
8 changes: 8 additions & 0 deletions phylogenetic/defaults/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,13 +2,21 @@ strain_id_field: "accession"
files:
exclude: "defaults/dropped_strains.txt"
reference: "defaults/measles_reference.gb"
reference_N450: "defaults/measles_reference_N450.gb"
reference_N450_fasta: "defaults/measles_reference_N450.fasta"
colors: "defaults/colors.tsv"
auspice_config: "defaults/auspice_config.json"
auspice_config_N450: "defaults/auspice_config_N450.json"
filter:
group_by: "country year month"
sequences_per_group: 20
min_date: 1950
min_length: 5000
filter_N450:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor, not blocking]

@joverlee521 do you have a canonical way we should structure build-specific rule parameters (e.g. filter vs filter_N450) when we want to use a single config file for both/all builds?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not (yet?), the pathogen-repo-guide has been based on one build per config file.

Existing workflows use three different patterns

  1. seasonal flu has top level per build configs.
This would look like:
builds:
    genome:
        filter:
            group_by: ...
            sequences_per_group: ...
            [...]
    N450:
        filter:
            group_by: ...
            sequences_per_group: ...
            [...]
  1. ncov has build configs nested within rule groupings.
This would look like:
filter:
    genome:
        group_by: ...
        sequences_per_group: ...
        [...]
    N450:
        group_by: ...
        sequences_per_group: ...
        [...]
  1. rsv nests build names within specific config parameters.
This would look like:
filter:
    group_by:
        genome: ...
        N450: ...
    sequences_per_group:
        genome: ...
        N450: ...
    [...]

[1] is the most flexible, allowing each build to define its own parameters. This makes it very easy to scan the config file for one build's parameters in a single place. However, since each param has to be defined per build, this can result in very long config files, which is why seasonal-flu has complex array-builds configs to programmatically create the configs during the workflow.

[2] is also pretty flexible, where each build can configure each parameter per rule grouping. There's less repetition of parameters so config files won't be as long, but a single build's config is spread throughout the rules groupings. It is also not very clear which rule configs can be configured per build and which rule configs are shared among builds.

[3] is the least flexible, as it only allows each build to change specific configs. This can also be confusing why some configs are nested while others are not.

group_by: "country year"
subsample_max_sequences: 3000
min_date: 1950
min_length: 400
refine:
coalescent: "opt"
date_inference: "marginal"
Expand Down
8 changes: 8 additions & 0 deletions phylogenetic/defaults/measles_reference_N450.fasta
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
>lcl|NC_001498.1_cds_NP_056918.1_1 [gene=N] [locus_tag=MeVgp1] [db_xref=GeneID:1489804] [protein=nucleocapsid protein] [protein_id=NP_056918.1] [location=1233..1682] [gbkey=CDS]
GTCAGTTCCACATTGGCATCCGAACTCGGTATCACTGCCGAGGATGCAAGGCTTGTTTCAGAGAT
TGCAATGCATACTACTGAGGACAGGATCAGTAGAGCGGTCGGACCCAGACAAGCCCAAGTGTCATTTCTA
CACGGTGATCAAAGTGAGAATGAGCTACCAGGATTGGGGGGCAAGGAAGATAGGAGGGTCAAACAGGGTC
GGGGAGAAGCCAGGGAGAGCTACAGAGAAACCGGGTCCAGCAGAGCAAGTGATGCGAGAGCTGCCCATCC
TCCAACCAGCATGCCCCTAGACATTGACACTGCATCGGAGTCAGGCCAAGATCCGCAGGACAGTCGAAGG
TCAGCTGACGCCCTGCTCAGGCTGCAAGCCATGGCAGGAATCTTGGAAGAACAAGGCTCAGACACGGACA
CCCCTAGGGTATACAATGACAGAGATCTTCTAGAC
69 changes: 69 additions & 0 deletions phylogenetic/defaults/measles_reference_N450.gb
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
LOCUS NC_001498 450 bp cRNA linear VRL 13-AUG-2018
DEFINITION Measles virus, complete genome.
ACCESSION NC_001498 REGION: 1233..1682
VERSION NC_001498.1
DBLINK Project: 15025
BioProject: PRJNA485481
KEYWORDS RefSeq.
SOURCE Measles morbillivirus
ORGANISM Measles morbillivirus
Viruses; Riboviria; Orthornavirae; Negarnaviricota;
Haploviricotina; Monjiviricetes; Mononegavirales; Paramyxoviridae;
Orthoparamyxovirinae; Morbillivirus; Morbillivirus hominis.
REFERENCE 1 (sites)
AUTHORS Rima,B.K. and Duprex,W.P.
TITLE The measles virus replication cycle
JOURNAL Curr. Top. Microbiol. Immunol. 329, 77-102 (2009)
PUBMED 19198563
REFERENCE 2
AUTHORS Takeuchi,K., Miyajima,N., Kobune,F. and Tashiro,M.
TITLE Comparative nucleotide sequence analyses of the entire genomes of
B95a cell-isolated and vero cell-isolated measles viruses from the
same patient
JOURNAL Virus Genes 20 (3), 253-257 (2000)
PUBMED 10949953
REFERENCE 3 (bases 1 to 450)
CONSRTM NCBI Genome Project
TITLE Direct Submission
JOURNAL Submitted (01-AUG-2000) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA
REFERENCE 4 (bases 1 to 450)
AUTHORS Takeuchi,K., Tanabayashi,K. and Tashiro,M.
TITLE Direct Submission
JOURNAL Submitted (10-JUL-1998) Kaoru Takeuchi, National Institute of
Infectious Diseases, Viral Disease and Vaccine Contorol; 4-7-1
Gakuen, Musashi-murayama, Tokyo 208-0011, Japan
(E-mail:[email protected], Tel:81-42-561-0771(ex.530),
Fax:81-42-567-5631)
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from AB016162.
Sequence updated (21-Jul-1998)
Sequence updated (11-Dec-1998).
COMPLETENESS: full length.
FEATURES Location/Qualifiers
source 1..450
/organism="Measles morbillivirus"
/mol_type="viral cRNA"
/strain="Ichinose-B95a"
/db_xref="taxon:11234"
CDS <1..>450
jameshadfield marked this conversation as resolved.
Show resolved Hide resolved
/gene="N"
/codon_start=1
/product="nucleocapsid protein"
/protein_id="NP_056918.1"
/db_xref="GeneID:1489804"
/translation="VSSTLASELGITAEDAR
LVSEIAMHTTEDRISRAVGPRQAQVSFLHGDQSENELPGLGGKEDRRVKQGRGEARES
YRETGSSRASDARAAHPPTSMPLDIDTASESGQDPQDSRRSADALLRLQAMAGILEEQ
GSDTDTPRVYNDRDLLD"
ORIGIN
1 gtcagttcca cattggcatc cgaactcggt atcactgccg aggatgcaag gcttgtttca
61 gagattgcaa tgcatactac tgaggacagg atcagtagag cggtcggacc cagacaagcc
121 caagtgtcat ttctacacgg tgatcaaagt gagaatgagc taccaggatt ggggggcaag
181 gaagatagga gggtcaaaca gggtcgggga gaagccaggg agagctacag agaaaccggg
241 tccagcagag caagtgatgc gagagctgcc catcctccaa ccagcatgcc cctagacatt
301 gacactgcat cggagtcagg ccaagatccg caggacagtc gaaggtcagc tgacgccctg
361 ctcaggctgc aagccatggc aggaatcttg gaagaacaag gctcagacac ggacacccct
421 agggtataca atgacagaga tcttctagac
//

14 changes: 7 additions & 7 deletions phylogenetic/rules/annotate_phylogeny.smk
Original file line number Diff line number Diff line change
Expand Up @@ -8,10 +8,10 @@ See Augur's usage docs for these commands for more details.
rule ancestral:
"""Reconstructing ancestral sequences and mutations"""
input:
tree = "results/tree.nwk",
alignment = "results/aligned.fasta"
tree = "results/tree_{gene}.nwk",
alignment = "results/aligned_{gene}.fasta"
output:
node_data = "results/nt_muts.json"
node_data = "results/nt_muts_{gene}.json"
params:
inference = config["ancestral"]["inference"]
shell:
Expand All @@ -26,11 +26,11 @@ rule ancestral:
rule translate:
"""Translating amino acid sequences"""
input:
tree = "results/tree.nwk",
node_data = "results/nt_muts.json",
reference = config["files"]["reference"]
tree = "results/tree_{gene}.nwk",
node_data = "results/nt_muts_{gene}.json",
reference = lambda wildcard: "defaults/measles_reference.gb" if wildcard.gene in ["genome"] else "defaults/measles_reference_{gene}.gb"
output:
node_data = "results/aa_muts.json"
node_data = "results/aa_muts_{gene}.json"
shell:
"""
augur translate \
Expand Down
12 changes: 6 additions & 6 deletions phylogenetic/rules/construct_phylogeny.smk
Original file line number Diff line number Diff line change
Expand Up @@ -7,9 +7,9 @@ See Augur's usage docs for these commands for more details.
rule tree:
"""Building tree"""
input:
alignment = "results/aligned.fasta"
alignment = "results/aligned_{gene}.fasta"
output:
tree = "results/tree_raw.nwk"
tree = "results/tree_raw_{gene}.nwk"
shell:
"""
augur tree \
Expand All @@ -26,12 +26,12 @@ rule refine:
- filter tips more than {params.clock_filter_iqd} IQDs from clock expectation
"""
input:
tree = "results/tree_raw.nwk",
alignment = "results/aligned.fasta",
tree = "results/tree_raw_{gene}.nwk",
alignment = "results/aligned_{gene}.fasta",
metadata = "data/metadata.tsv"
output:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The temporal signal is much better in this build (and the build Trevor showed me) than the one we looked at mid-week - do you know what changed? This gives me more confidence that the (temporal) rooting is working ok for the N450 build - it's broadly similar to the genome build - and so there's no longer a need to start in "unrooted" view if you'd prefer to go back to the more typical rectangular view.

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lower temporal signal for the tree I showed you last week seems to be related to the subsampling method I had used. Here is the subsampling method I used for that tree:

filter_N450:
group_by: "country year month"
sequences_per_group: 2
min_date: 1950
min_length: 400

And here is the clock view for a tree I generated today using that subsampling method:

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed default display back to rooted time-tree in 862dbaf

tree = "results/tree.nwk",
node_data = "results/branch_lengths.json"
tree = "results/tree_{gene}.nwk",
node_data = "results/branch_lengths_{gene}.json"
params:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following deprecation notice is present when running refine:

        DEPRECATION WARNING. TreeTime.resolve_polytomies: You are resolving
        polytomies using the old 'greedy' mode. This is not well suited for large
        polytomies. Stochastic resolution will become the default in future
        versions. To switch now, rerun with the flag `--stochastic-resolve`. To
        keep using the greedy method in the future, run with `--greedy-resolve`

Especially for the N450 build where we will have large polytomies we should use --stochastic-resolve.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f39ba8b

coalescent = config["refine"]["coalescent"],
date_inference = config["refine"]["date_inference"],
Expand Down
14 changes: 8 additions & 6 deletions phylogenetic/rules/export.smk
Original file line number Diff line number Diff line change
Expand Up @@ -8,15 +8,17 @@ See Augur's usage docs for these commands for more details.
rule export:
"""Exporting data files for for auspice"""
input:
tree = "results/tree.nwk",
tree = "results/tree_{gene}.nwk",
metadata = "data/metadata.tsv",
branch_lengths = "results/branch_lengths.json",
nt_muts = "results/nt_muts.json",
aa_muts = "results/aa_muts.json",
branch_lengths = "results/branch_lengths_{gene}.json",
nt_muts = "results/nt_muts_{gene}.json",
aa_muts = "results/aa_muts_{gene}.json",
colors = config["files"]["colors"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest removing all the "country" colours from the defaults/colors.tsv in this PR simply because a number of countries are missing colours. Auspice will pick a better set of colours than the current situation of some colours + some greys. We can then add nicer colours in a subsequent PR, as desired.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 119fbcf

auspice_config = config["files"]["auspice_config"]
auspice_config = lambda wildcard: "defaults/auspice_config.json" if wildcard.gene in ["genome"] else "defaults/auspice_config_N450.json"

output:
auspice_json = rules.all.input.auspice_json
auspice_json = "auspice/measles_{gene}.json",
root_sequence = "auspice/measles_{gene}_root-sequence.json"
params:
strain_id = config["strain_id_field"]
shell:
Expand Down
2 changes: 1 addition & 1 deletion phylogenetic/rules/prepare_sequences.smk
Original file line number Diff line number Diff line change
Expand Up @@ -74,7 +74,7 @@ rule align:
sequences = "results/filtered.fasta",
reference = config["files"]["reference"]
output:
alignment = "results/aligned.fasta"
alignment = "results/aligned_genome.fasta"
shell:
"""
augur align \
Expand Down
58 changes: 58 additions & 0 deletions phylogenetic/rules/prepare_sequences_N450.smk
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
"""
This part of the workflow prepares sequences for constructing the phylogenetic tree for 450bp of the N gene.

See Augur's usage docs for these commands for more details.
"""

rule align_and_extract_N450:
input:
sequences = "data/sequences.fasta",
reference = config["files"]["reference_N450_fasta"]
output:
sequences = "results/sequences_N450.fasta"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's preferable to organise builds as directories within results, e.g. "results/genome/sequences.fasta" and "results/N450/sequences.fasta". As it's an implementation detail, this change doesn't have to be made in this PR.

(This comment applies throughout the snakemake files added in this PR.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in a9d2644

params:
min_length = config['filter_N450']['min_length']
shell:
"""
nextclade run \
-j 1 \
--input-ref {input.reference} \
--output-fasta {output.sequences} \
--min-seed-cover 0.01 \
--min-length {params.min_length} \
--silent \
{input.sequences}
"""
rule filter_N450:
"""
Filtering to
- {params.sequences_per_group} sequence(s) per {params.group_by!s}
- excluding strains in {input.exclude}
- minimum genome length of {params.min_length}
- excluding strains with missing region, country or date metadata
"""
input:
sequences = "results/sequences_N450.fasta",
metadata = "data/metadata.tsv",
exclude = config["files"]["exclude"]
output:
sequences = "results/aligned_N450.fasta"
params:
group_by = config['filter_N450']['group_by'],
subsample_max_sequences = config["filter_N450"]["subsample_max_sequences"],
min_date = config["filter_N450"]["min_date"],
min_length = config['filter_N450']['min_length'],
strain_id = config["strain_id_field"]
shell:
"""
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
--metadata-id-columns {params.strain_id} \
--exclude {input.exclude} \
--output {output.sequences} \
--group-by {params.group_by} \
--subsample-max-sequences {params.subsample_max_sequences} \
--min-date {params.min_date} \
--min-length {params.min_length}
"""