Make tree for 450bp of the N gene ("N450") #20

kimandrews · 2024-03-22T22:22:32Z

Description of proposed changes

The goal of this PR is to create a tree using a 450bp region of the N gene ("N450") that is highly represented on NCBI for measles. Addresses GitHub issue #13.

General steps include:

* Add a reference sequence that is 450bp of the N gene from the reference that is used for the whole-genome tree
* Align all sequences against the N450 reference using Nextclade
* Filter by alignment length
* Subsample by date and geography
* Use Snakemake wildcards to generate and export the phylogeny
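The filter and subsample steps can be sketched in plain Python as below. This is only an illustration of the idea, not how augur filter is implemented; the record fields (`strain`, `country`, `date`, `aligned_length`) and the function name are hypothetical.

```python
import random
from collections import defaultdict


def filter_and_subsample(records, min_length, sequences_per_group, seed=0):
    """Drop short alignments, then keep at most N records per (country, year, month).

    `records` is a list of dicts with hypothetical keys:
    'strain', 'country', 'date' (YYYY-MM-DD, possibly partial), 'aligned_length'.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    groups = defaultdict(list)
    for rec in records:
        if rec["aligned_length"] < min_length:
            continue  # filter by alignment length
        # Pad partial dates such as "2019" so year and month always exist.
        year, month = (rec["date"].split("-") + ["XX", "XX"])[:2]
        groups[(rec["country"], year, month)].append(rec)

    kept = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) > sequences_per_group:
            members = rng.sample(members, sequences_per_group)
        kept.extend(members)
    return kept


example_records = [
    {"strain": "a", "country": "USA", "date": "2019-03-01", "aligned_length": 450},
    {"strain": "b", "country": "USA", "date": "2019-03-15", "aligned_length": 450},
    {"strain": "c", "country": "USA", "date": "2019-04-02", "aligned_length": 450},
    {"strain": "d", "country": "Kenya", "date": "2019-03-09", "aligned_length": 120},
]
kept_records = filter_and_subsample(example_records, min_length=400, sequences_per_group=1)
```

With one sequence per group, the two USA records from March 2019 compete for a single slot, the April record is kept, and the short Kenya record is dropped before grouping.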

Many of these changes follow those made for E gene trees in the dengue repo

Related issue(s)

#13

Checklist

Checks pass

Reference is comprised of 450bp of the N gene from the same sample that is used for the genome tree (NCBI Accession NC_001498.1).

* Add rule to align sequences to N450 reference using nextclade
* Add rule to filter by length, date, country

j23414

Worked on my computer:

git clone https://github.com/nextstrain/measles.git
cd measles
git checkout add-N450-tree
nextstrain build phylogenetic
nextstrain view phylogenetic/auspice

And I can see many more geolocations represented in the N450 tree.

jameshadfield · 2024-03-25T00:20:04Z

phylogenetic/defaults/config.yaml

    colors: "defaults/colors.tsv"
    auspice_config: "defaults/auspice_config.json"
 filter: 
    group_by: "country year month"
    sequences_per_group: 20
    min_date: 1950
    min_length: 5000
+filter_N450:


[minor, not blocking]

@joverlee521 do you have a canonical way we should structure build-specific rule parameters (e.g. filter vs filter_N450) when we want to use a single config file for both/all builds?

I do not (yet?); the pathogen-repo-guide has been based on one build per config file.

Existing workflows use three different patterns:

1. seasonal flu has top level per build configs.

This would look like:

    builds:
      genome:
        filter:
          group_by: ...
          sequences_per_group: ...
        [...]
      N450:
        filter:
          group_by: ...
          sequences_per_group: ...
        [...]

2. ncov has build configs nested within rule groupings.

This would look like:

    filter:
      genome:
        group_by: ...
        sequences_per_group: ...
        [...]
      N450:
        group_by: ...
        sequences_per_group: ...
        [...]

3. rsv nests build names within specific config parameters.

This would look like:

    filter:
      group_by:
        genome: ...
        N450: ...
      sequences_per_group:
        genome: ...
        N450: ...
      [...]

[1] is the most flexible, allowing each build to define its own parameters. This makes it very easy to scan the config file for one build's parameters in a single place. However, since each param has to be defined per build, this can result in very long config files, which is why seasonal-flu has complex array-builds configs to programmatically create the configs during the workflow.

[2] is also pretty flexible, where each build can configure each parameter per rule grouping. There's less repetition of parameters so config files won't be as long, but a single build's config is spread throughout the rules groupings. It is also not very clear which rule configs can be configured per build and which rule configs are shared among builds.

[3] is the least flexible, as it only allows each build to change specific configs. It can also be unclear why some configs are nested while others are not.
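To make the trade-off in [2] concrete, here is a rough sketch of how a workflow might look up parameters when build names are nested inside rule groupings. The helper name `rule_params` and the fallback behaviour are assumptions for illustration, not code from ncov or this repo; the filter values are copied from snippets in this thread, and the `refine` entry is invented to stand in for a shared rule config.

```python
def rule_params(config, rule, build):
    """Return the parameters for `rule`, preferring a build-specific section.

    Hypothetical helper: if the rule section has a nested dict keyed by the
    build name, use it; otherwise treat the section as shared by all builds.
    """
    section = config[rule]
    nested = section.get(build)
    if isinstance(nested, dict):
        return nested
    return section


# Illustrative config in the style of pattern [2].
EXAMPLE_CONFIG = {
    "filter": {
        "genome": {
            "group_by": "country year month",
            "sequences_per_group": 20,
            "min_date": 1950,
            "min_length": 5000,
        },
        "N450": {
            "group_by": "country year month",
            "sequences_per_group": 2,
            "min_date": 1950,
            "min_length": 400,
        },
    },
    # A rule section shared by every build (no per-build nesting); made up for the sketch.
    "refine": {"coalescent": "opt"},
}

n450_filter = rule_params(EXAMPLE_CONFIG, "filter", "N450")
shared_refine = rule_params(EXAMPLE_CONFIG, "refine", "N450")
```

The fallback branch is exactly where the ambiguity mentioned above shows up: nothing in the config itself says whether `refine` is shared on purpose or simply has not been split per build yet.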

jameshadfield · 2024-03-25T00:21:58Z

phylogenetic/rules/prepare_sequences_N450.smk

+        sequences = "data/sequences.fasta",
+        reference = config["files"]["reference_N450_fasta"]
+    output:
+        sequences = "results/sequences_N450.fasta"


I think it's preferable to organise builds as directories within results, e.g. "results/genome/sequences.fasta" and "results/N450/sequences.fasta". As it's an implementation detail, this change doesn't have to be made in this PR.

(This comment applies throughout the snakemake files added in this PR.)

+1 from https://github.com/nextstrain/private/issues/102#issuecomment-1981727993

done in a9d2644
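For reference, the directory-per-build layout suggested above amounts to something like the following sketch. The template and the placeholder name `build` are hypothetical; the wildcard actually used in the workflow may be named differently.

```python
# Hypothetical path template mirroring a Snakemake-style wildcard pattern.
RESULT_TEMPLATE = "results/{build}/{filename}"


def result_path(build, filename):
    """Build the per-build output path, e.g. results/N450/sequences.fasta."""
    return RESULT_TEMPLATE.format(build=build, filename=filename)


# Every build gets the same file names, separated by directory.
sequence_paths = [result_path(b, "sequences.fasta") for b in ("genome", "N450")]
```

Keeping file names identical across builds means a rule only needs the build name to locate any intermediate file.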

jameshadfield · 2024-03-25T00:39:35Z

phylogenetic/rules/construct_phylogeny.smk

-        tree = "results/tree.nwk",
-        node_data = "results/branch_lengths.json"
+        tree = "results/tree_{gene}.nwk",
+        node_data = "results/branch_lengths_{gene}.json"
    params:


The following deprecation notice is present when running refine:

DEPRECATION WARNING. TreeTime.resolve_polytomies: You are resolving polytomies using the old 'greedy' mode. This is not well suited for large polytomies. Stochastic resolution will become the default in future versions. To switch now, rerun with the flag `--stochastic-resolve`. To keep using the greedy method in the future, run with `--greedy-resolve`

Especially for the N450 build where we will have large polytomies we should use --stochastic-resolve.

Done in f39ba8b
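As a sketch of what the change means for the refine call, the command can be assembled like this. The two resolve flags come from the deprecation notice above; the file paths and the use of `--timetree` are assumptions for illustration and may not match the actual rule.

```python
def refine_command(build, stochastic_resolve=True):
    """Assemble an `augur refine` argument list for one build (illustrative paths)."""
    cmd = [
        "augur", "refine",
        "--tree", f"results/{build}/tree_raw.nwk",
        "--alignment", f"results/{build}/aligned.fasta",
        "--metadata", "data/metadata.tsv",
        "--output-tree", f"results/{build}/tree.nwk",
        "--output-node-data", f"results/{build}/branch_lengths.json",
        "--timetree",  # assumed here; not confirmed by this thread
    ]
    # Pick the polytomy resolution mode explicitly so the deprecation
    # warning no longer applies whichever mode is chosen.
    cmd.append("--stochastic-resolve" if stochastic_resolve else "--greedy-resolve")
    return cmd


n450_refine = refine_command("N450")
```

Passing either flag explicitly keeps the behaviour stable if the default changes in a later TreeTime release.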

jameshadfield · 2024-03-25T00:47:30Z

phylogenetic/rules/export.smk

-        aa_muts = "results/aa_muts.json",
+        branch_lengths = "results/branch_lengths_{gene}.json",
+        nt_muts = "results/nt_muts_{gene}.json",
+        aa_muts = "results/aa_muts_{gene}.json",
        colors = config["files"]["colors"],


I'd suggest removing all the "country" colours from the defaults/colors.tsv in this PR simply because a number of countries are missing colours. Auspice will pick a better set of colours than the current situation of some colours + some greys. We can then add nicer colours in a subsequent PR, as desired.

Done in 119fbcf
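For anyone repeating this on another colors file, a minimal sketch of the removal is below. It assumes the usual tab-separated layout of trait, value, colour per line; the function name and sample rows are made up for illustration.

```python
def drop_trait(lines, trait="country"):
    """Return the lines of a colors TSV without rows for the given trait.

    Assumes each data row is `trait<TAB>value<TAB>color`; blank lines and
    comments are passed through untouched.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == trait:
            continue  # skip every row that colours this trait
        kept.append(line)
    return kept


sample_lines = [
    "country\tUSA\t#4C90C0\n",
    "region\tAfrica\t#DC2F24\n",
    "\n",
    "# a comment line\n",
]
remaining = drop_trait(sample_lines)
```

With no country rows left in the file, Auspice assigns colours for every country itself rather than mixing provided colours with greys.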

jameshadfield · 2024-03-25T00:57:16Z

phylogenetic/rules/construct_phylogeny.smk

-        tree = "results/tree_raw.nwk",
-        alignment = "results/aligned.fasta",
+        tree = "results/tree_raw_{gene}.nwk",
+        alignment = "results/aligned_{gene}.fasta",
        metadata = "data/metadata.tsv"
    output:


The temporal signal is much better in this build (and the build Trevor showed me) than the one we looked at mid-week - do you know what changed? This gives me more confidence that the (temporal) rooting is working ok for the N450 build - it's broadly similar to the genome build - and so there's no longer a need to start in "unrooted" view if you'd prefer to go back to the more typical rectangular view.

The lower temporal signal for the tree I showed you last week seems to be related to the subsampling method I had used. Here is the subsampling method I used for that tree:

filter_N450:
    group_by: "country year month"
    sequences_per_group: 2
    min_date: 1950
    min_length: 400

And here is the clock view for a tree I generated today using that subsampling method:

Changed default display back to rooted time-tree in 862dbaf

phylogenetic/defaults/measles_reference_N450.gb

jameshadfield · 2024-03-25T01:17:14Z

phylogenetic/defaults/auspice_config_N450.json

+      "type": "continuous"
+    },
+    {
+      "key": "author",


[Not blocking this PR]

We used to add "author" as a color-by as a workaround so that it appeared as a filtering option in Auspice; however, a couple of recent changes have rendered this pattern unnecessary. We can now use "metadata_columns" in the auspice-config JSON (or --metadata-columns) to export "author", and Auspice will automatically add it as a filtering option. This is nicer than exposing author as a coloring.

Done in 8bea320
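The config change described above can be sketched as a small transformation on the auspice config dictionary. The helper name is hypothetical and the sample config is trimmed to the two keys involved ("colorings" and "metadata_columns").

```python
import copy


def move_coloring_to_metadata_columns(config, key="author"):
    """Remove `key` from "colorings" and list it under "metadata_columns" instead.

    Returns a new dict; the input config is left unmodified.
    """
    updated = copy.deepcopy(config)
    updated["colorings"] = [
        coloring for coloring in updated.get("colorings", [])
        if coloring.get("key") != key
    ]
    columns = updated.setdefault("metadata_columns", [])
    if key not in columns:
        columns.append(key)
    return updated


old_config = {
    "colorings": [
        {"key": "country", "title": "Country", "type": "categorical"},
        {"key": "author", "title": "Author", "type": "categorical"},
    ]
}
new_config = move_coloring_to_metadata_columns(old_config)
```

After the move, "author" is still exported with the dataset and available as a filter, but it no longer shows up in the colour-by dropdown.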

joverlee521

Latest changes look good to me!

I think this is good to merge and deploy the latest genome/N450 builds to nextstrain.org

kimandrews added 5 commits March 21, 2024 15:55

Add reference files for N450 region

55ea0ce

Reference is comprised of 450bp of the N gene from the same sample that is used for the genome tree (NCBI Accession NC_001498.1).

Prepare N450 sequences for phylogenetic analysis

edbefd5

* Add rule to align sequences to N450 reference using nextclade
* Add rule to filter by length, date, country

Construct phylogeny for N450 region

8343876

Annotate phylogeny for N450 region

d55342b

Export phylogeny for N450 region

ab92a1b

kimandrews requested a review from a team March 22, 2024 22:22

j23414 approved these changes Mar 23, 2024


jameshadfield reviewed Mar 25, 2024


kimandrews added 5 commits March 28, 2024 11:44

Use --stochastic-resolve option for augur refine

f39ba8b

Remove "country" colors from defaults/colors.tsv

119fbcf

Change default display to rooted time-tree

862dbaf

Use --metadata-columns to export "author" in auspice_config.json

8bea320

Organize builds as directories within results

a9d2644

kimandrews requested review from jameshadfield and joverlee521 March 28, 2024 21:42

joverlee521 approved these changes Apr 1, 2024


Update Changelog

bf83a42

kimandrews merged commit 9c1fea2 into main Apr 1, 2024
32 checks passed

kimandrews deleted the add-N450-tree branch April 1, 2024 23:16

kimandrews mentioned this pull request Apr 25, 2024

Consider building gene-specific phylogenies #13

Closed

j23414 mentioned this pull request May 13, 2024

Use gene reference files to generate E gene trees nextstrain/dengue#48

Merged
