Skip to content

Commit

Permalink
Add ref tree workflow
Browse files Browse the repository at this point in the history
  • Loading branch information
corneliusroemer committed Jan 12, 2024
1 parent ea03721 commit bc42d3b
Show file tree
Hide file tree
Showing 23 changed files with 21,376 additions and 64 deletions.
3 changes: 0 additions & 3 deletions docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,6 @@
Version 2 dataset are not compatible with Nextclade v3 and need migrating.
See our [dataset migration guide](migration-guide-v3.md) for tips and tools on how to create v3 dataset from your existing v2 datasets.


## Create new datasets
You can create your own NextClade datasets!
We provide a [dataset creation tutorial](dataset-creation-guide.md) to help you assemble datasets for your virus of interest. To help you to get going, there is a [minimal dataset](minimal-dataset) and a [script](example-workflow/scripts/generate_from_genbank.py) to walk you through the creation of a suitable annotation using data from RefSeq.
Expand All @@ -13,5 +12,3 @@ We provide a [dataset creation tutorial](dataset-creation-guide.md) to help you

If you want your dataset to be included in those served by the Nextclade Web application, you can create a pull request in this repository.
The [dataset curation guide](dataset-curation-guide.md) describes the relevant steps.


140 changes: 97 additions & 43 deletions docs/dataset-creation-guide.md

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions docs/example-workflow/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
.snakemake/
data/
results/
bin/
test_out/
dataset.zip
tmp/
17 changes: 17 additions & 0 deletions docs/example-workflow/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Nextclade dataset creation example workflow

This repository contains a simple example workflow for creating a Nextclade dataset, in this case for Zika virus.

For a more detailed guide on dataset creation, see the [dataset creation tutorial](../dataset-creation-guide.md).

The quickest way to run this workflow is to [install the Nextstrain toolchain](https://docs.nextstrain.org/en/latest/install.html) and then run the following commands:

```bash
nextstrain build . test
```

Basic configuration is supported by editing the constants at the top of the `Snakefile`.

You may also want to edit `resources/auspice_config.json` and `resources/pathogen.json`.

If you struggle with customizing the workflow, please post your questions in the [Nextstrain discussion forum](https://discussion.nextstrain.org/).
290 changes: 290 additions & 0 deletions docs/example-workflow/Snakefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,290 @@
REFERENCE_ACCESSION = "NC_035889.1"
TAXON_ID = 64320
GENES = ["CA", "PRO", "MP", "ENV", "NS1", "NS2A", "NS2B", "NS3", "NS4A", "NS4B", "NS5"]
ALLOWED_DIVERGENCE = "500"
MIN_DATE = "2010-01-01"
MIN_LENGTH = "8000"
MAX_SEQS = "100"
GFF_PATH = "../minimal-dataset/genome_annotation.gff3"
GENBANK_PATH = "resources/reference.gb"
REFERENCE_PATH = "../minimal-dataset/reference.fasta"
README_PATH = "../minimal-dataset/README.md"
CHANGELOG_PATH = "../minimal-dataset/CHANGELOG.md"

FETCH_SEQUENCES = True

if FETCH_SEQUENCES == True:

include: "rules/fetch_from_ncbi.smk"


rule add_reference_to_include:
"""
Create an include file for augur filter
"""
input:
include="resources/include.txt",
output:
"results/include.txt",
shell:
"""
echo "{REFERENCE_ACCESSION}" >> results/include.txt
"""


rule filter:
"""
Exclude sequences from before 2015 and subsample to 100 sequences
"""
input:
sequences="data/sequences.fasta",
metadata="data/metadata.tsv",
include="results/include.txt",
output:
filtered_sequences="results/filtered_sequences_raw.fasta",
filtered_metadata="results/filtered_metadata_raw.tsv",
params:
min_date="" if MIN_DATE == "" else "--min-date " + MIN_DATE,
min_length="" if MIN_LENGTH == "" else "--min-length " + MIN_LENGTH,
max_seqs=MAX_SEQS,
shell:
"""
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
{params.min_length} \
{params.min_date} \
--include {input.include} \
--subsample-max-sequences {params.max_seqs} \
--output {output.filtered_sequences} \
--output-metadata {output.filtered_metadata}
"""


rule align:
input:
sequences="results/filtered_sequences_raw.fasta",
reference=REFERENCE_PATH,
annotation=GFF_PATH,
output:
alignment="results/aligned.fasta",
tsv="results/nextclade.tsv",
params:
translation_template=lambda w: "results/translations/cds_{cds}.translation.fasta",
shell:
"""
nextclade3 run \
{input.sequences} \
--input-ref {input.reference} \
--input-annotation {input.annotation} \
--output-translations {params.translation_template} \
--output-tsv {output.tsv} \
--output-fasta {output.alignment}
"""


rule get_outliers:
"""
Automatically identify sequences with >150 substitutions
(likely to be sequencing errors) and put them in outliers.txt
"""
input:
nextclade="results/nextclade.tsv",
output:
outliers="results/outliers.txt",
tmp="tmp/outliers.txt",
params:
allowed_divergence=lambda w: ALLOWED_DIVERGENCE,
shell:
"""
tsv-filter -H -v --is-numeric totalSubstitutions {input.nextclade} \
> {output.tmp}
tsv-filter -H \
--is-numeric totalSubstitutions \
--gt totalSubstitutions:{params.allowed_divergence} \
{input.nextclade} \
| tail -n +2 >> {output.tmp}
cat {output.tmp} \
| tsv-select -H -f seqName \
| tail -n +2 > {output.outliers}
"""


rule exclude:
"""
Rule to allow for manual and automatic exclusion of sequences
without triggering a new subsampling that could
surface new bad sequences resulting in an infinite loop
"""
input:
sequences="results/aligned.fasta",
metadata="data/metadata.tsv",
exclude="resources/exclude.txt",
outliers="results/outliers.txt",
output:
filtered_sequences="results/filtered_aligned.fasta",
filtered_metadata="results/filtered_metadata.tsv",
strains="results/tree_strains.txt",
shell:
"""
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
--exclude {input.exclude} {input.outliers} \
--output {output.filtered_sequences} \
--output-metadata {output.filtered_metadata} \
--output-strains {output.strains}
"""


rule tree:
input:
alignment="results/filtered_aligned.fasta",
output:
tree="results/tree_raw.nwk",
shell:
"""
augur tree \
--alignment {input.alignment} \
--output {output.tree} \
"""


rule refine:
input:
tree="results/tree_raw.nwk",
alignment="results/filtered_aligned.fasta",
output:
tree="results/tree.nwk",
node_data="results/branch_lengths.json",
shell:
"""
augur refine \
--tree {input.tree} \
--alignment {input.alignment} \
--root {REFERENCE_ACCESSION} \
--keep-polytomies \
--divergence-units mutations \
--output-node-data {output.node_data} \
--output-tree {output.tree}
"""


rule ancestral:
input:
tree="results/tree.nwk",
alignment="results/filtered_aligned.fasta",
annotation=GENBANK_PATH,
output:
node_data="results/muts.json",
params:
translation_template=r"results/translations/cds_%GENE.translation.fasta",
output_translation_template=r"results/translations/cds_%GENE.ancestral.fasta",
genes=" ".join(GENES),
shell:
"""
augur ancestral \
--tree {input.tree} \
--alignment {input.alignment} \
--annotation {input.annotation} \
--genes {params.genes} \
--translations {params.translation_template} \
--output-node-data {output.node_data} \
--output-translations {params.output_translation_template}
"""


rule dummy_clades:
"""
Nextclade requires clade membership to be specified for each node
in the tree. This rule creates a dummy clade membership for each node
"""
input:
"results/branch_lengths.json",
output:
"results/dummy_clades.json",
shell:
"""
jq '.nodes |= map_values({{"clade_membership": "dummy"}})' {input} > {output}
"""


rule export:
input:
tree="results/tree.nwk",
metadata="results/filtered_metadata.tsv",
mutations="results/muts.json",
branch_lengths="results/branch_lengths.json",
clades="results/dummy_clades.json",
auspice_config="resources/auspice_config.json",
output:
auspice="results/auspice.json",
shell:
"""
augur export v2 \
--tree {input.tree} \
--metadata {input.metadata} \
--auspice-config {input.auspice_config} \
--node-data {input.mutations} {input.branch_lengths} {input.clades} \
--output {output.auspice}
"""


rule subsample_example_sequences:
input:
all_sequences="data/sequences.fasta",
tree_strains="results/tree_strains.txt",
output:
example_sequences="results/example_sequences.fasta",
shell:
"""
# Exclude tree sequences from all sequences
seqkit grep -v -f {input.tree_strains} {input.all_sequences} \
| seqkit sample -n 100 -s 42 > {output.example_sequences}
"""


rule assemble_dataset:
input:
tree="results/auspice.json",
reference=REFERENCE_PATH,
annotation=GFF_PATH,
sequences="results/example_sequences.fasta",
pathogen="resources/pathogen.json",
readme=README_PATH,
changelog=CHANGELOG_PATH,
output:
tree="dataset/tree.json",
reference="dataset/reference.fasta",
annotation="dataset/genome_annotation.gff3",
sequences="dataset/sequences.fasta",
pathogen="dataset/pathogen.json",
readme="dataset/README.md",
changelog="dataset/CHANGELOG.md",
dataset_zip="dataset.zip",
shell:
"""
cp {input.tree} {output.tree}
cp {input.reference} {output.reference}
cp {input.annotation} {output.annotation}
cp {input.sequences} {output.sequences}
cp {input.pathogen} {output.pathogen}
cp {input.readme} {output.readme}
cp {input.changelog} {output.changelog}
zip -rj dataset.zip dataset/*
"""


rule test:
input:
dataset="dataset.zip",
sequences="dataset/sequences.fasta",
output:
output=directory("test_out"),
shell:
"""
nextclade3 run \
--input-dataset {input.dataset} \
--output-all {output.output} \
{input.sequences}
"""
3 changes: 3 additions & 0 deletions docs/example-workflow/dataset/CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
## Unreleased

Initial release.
3 changes: 3 additions & 0 deletions docs/example-workflow/dataset/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# Example dataset for Zika virus

This is a minimal example dataset for demonstration purposes.
19 changes: 19 additions & 0 deletions docs/example-workflow/dataset/genome_annotation.gff3
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region NC_035889.1 1 10808
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=64320
NC_035889.1 RefSeq region 1 10808 . + . ID=NC_035889.1:1..10808;Dbxref=taxon:64320;collection-date=2015;country=Brazil: Rio Grande do Norte%2C Natal;gbkey=Src;genome=genomic;isolation-source=fetus' brain autopsy;mol_type=genomic RNA;nat-host=Homo sapiens;strain=Natal RGN
NC_035889.1 RefSeq region 1 10808 . + . ID=id-NC_035889.1:1..10808;Note=Mature peptides were annotated by RefSeq staff using homoloy to NC_012532.1 and the cleavage sites reported in Kuno and Chang%2C 2007 (PMID 17195954). Questions about the annotation of this sequence should be directed to [email protected].;gbkey=Comment
NC_035889.1 RefSeq CDS 108 473 . + . Name=CA;gbkey=Prot;protein_id=YP_009430295.1;ID=id-YP_009428568.1:1..122;product=anchored capsid protein C
NC_035889.1 RefSeq CDS 474 752 . + . Name=PRO;gbkey=Prot;product=protein pr;protein_id=YP_009430298.1;ID=id-YP_009428568.1:123..215
NC_035889.1 RefSeq CDS 753 977 . + . Name=MP;gbkey=Prot;protein_id=YP_009430299.1;product=membrane glycoprotein M;ID=id-YP_009428568.1:216..290
NC_035889.1 RefSeq CDS 978 2489 . + . Name=ENV;gbkey=Prot;protein_id=YP_009430300.1;product=envelope protein E;ID=id-YP_009428568.1:291..794
NC_035889.1 RefSeq CDS 2490 3545 . + . Name=NS1;gbkey=Prot;protein_id=YP_009430301.1;product=nonstructural protein NS1;ID=id-YP_009428568.1:795..1146
NC_035889.1 RefSeq CDS 3546 4223 . + . Name=NS2A;gbkey=Prot;protein_id=YP_009430302.1;product=nonstructural protein NS2A;ID=id-YP_009428568.1:1147..1372
NC_035889.1 RefSeq CDS 4224 4613 . + . Name=NS2B;gbkey=Prot;protein_id=YP_009430303.1;product=nonstructural protein NS2B;ID=id-YP_009428568.1:1373..1502
NC_035889.1 RefSeq CDS 4614 6464 . + . Name=NS3;gbkey=Prot;protein_id=YP_009430304.1;product=nonstructural protein NS3;ID=id-YP_009428568.1:1503..2119
NC_035889.1 RefSeq CDS 6465 6845 . + . Name=NS4A;gbkey=Prot;protein_id=YP_009430305.1;product=nonstructural protein NS4A;ID=id-YP_009428568.1:2120..2246
NC_035889.1 RefSeq CDS 6846 6914 . + . Name=2K;gbkey=Prot;product=protein 2K;protein_id=YP_009430306.1;ID=id-YP_009428568.1:2247..2269
NC_035889.1 RefSeq CDS 6915 7667 . + . Name=NS4B;gbkey=Prot;protein_id=YP_009430307.1;product=nonstructural protein NS4B;ID=id-YP_009428568.1:2270..2520
NC_035889.1 RefSeq CDS 7668 10376 . + . Name=NS5;gbkey=Prot;protein_id=YP_009430308.1;ID=id-YP_009428568.1:2521..3423;product=RNA-dependent RNA polymerase NS5
Loading

0 comments on commit bc42d3b

Please sign in to comment.