Merge pull request #100 from nextstrain/test-v3-flu
flu: update h3n2 ha dataset
rneher authored Nov 12, 2023
2 parents 1c48275 + 4ec183d commit aa5dc0b
Showing 13 changed files with 1,626 additions and 89,387 deletions.
8 changes: 3 additions & 5 deletions data/nextstrain/flu/h3n2/ha/EPI1857216/CHANGELOG.md
@@ -1,7 +1,5 @@
## Unreleased
# 2023-11-09

Initial release for Nextclade v3!
- Aliasing of G.1.3.1.1 as subclade H

This dataset is converted from the corresponding older dataset for Nextclade v2. You can find old versions of datasets here: https://github.com/nextstrain/nextclade_data/tree/2023-08-17--15-51-24--UTC/data/datasets

Read more about Nextclade datasets in the documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html
# 2023-08-28: Initial definition of subclades
21 changes: 21 additions & 0 deletions data/nextstrain/flu/h3n2/ha/EPI1857216/README.md
@@ -1,5 +1,6 @@
# Nextclade dataset for "Influenza A H3N2 HA" based on reference "A/Darwin/6/2021" (flu_h3n2_ha/EPI1857216)

This dataset uses a recent reference sequence (A/Darwin/6/2021) and is suitable for the analysis of circulating viruses.

## Dataset attributes

@@ -9,6 +10,26 @@
| reference | EPI1857216 | A/Darwin/6/2021 |


## Features
This dataset supports

* Assignment to clades and subclades based on the nomenclature defined in [github.com/influenza-clade-nomenclature/seasonal_A-H3N2_HA/](https://github.com/influenza-clade-nomenclature/seasonal_A-H3N2_HA/)
* Identification of glycosylation motifs
* Counting of mutations in the RBD
* Sequence QC
* Phylogenetic placement
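
As an illustration of the glycosylation feature, the `aaMotifs` entry in this dataset's `pathogen.json` uses the pattern `N[^P][ST]` (N-X-S/T with X any amino acid other than P). The sketch below applies that pattern to a translated peptide in Python. The function name and the example peptide are hypothetical, and reporting overlapping matches via a lookahead is an assumption of this sketch, not a statement about how Nextclade itself scans; the gene range restriction configured for HA2 is not handled here.

```python
import re

# Pattern taken from the "motifs" field of pathogen.json.
# Wrapping it in a lookahead reports overlapping matches too
# (an assumption of this sketch, not confirmed Nextclade behavior).
GLYCOSYLATION_MOTIF = re.compile(r"(?=(N[^P][ST]))")


def find_glycosylation_motifs(peptide):
    """Return (0-based start, matched triplet) for each N-X-S/T motif."""
    return [
        (match.start(), match.group(1))
        for match in GLYCOSYLATION_MOTIF.finditer(peptide.upper())
    ]


# Made-up peptide: "NPS" is skipped because X must not be proline.
print(find_glycosylation_motifs("ANGTNPSNNST"))
```
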

## Clades of seasonal influenza viruses

The WHO Collaborating Centers define "clades" as genetic groups of viruses with signature mutations, to facilitate discussion of the circulating diversity of the viruses.
Clade demarcations do not always coincide with significantly different antigenic properties of the viruses.
Clade names are structured as _Number-Letter_ binomials (with exceptions) separated by periods, as in `3C.2a1b.2a.2a.1a`. These are sometimes shortened by omitting leading binomials, as in `2a.1`.

In addition to these clades, "subclades" are defined to break down diversity at higher resolution and allow following the spread of different viral groups.
These follow a Pango-like nomenclature consisting of a letter followed by numbers separated by periods, as in `G.1.3.1`.
The leading letter is an alias of a previous name.
Details of the nomenclature system can be found at [github.com/influenza-clade-nomenclature/seasonal_A-H3N2_HA/](https://github.com/influenza-clade-nomenclature/seasonal_A-H3N2_HA/).
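
The aliasing rule can be sketched in Python as follows. The only alias taken from this commit is the changelog entry "Aliasing of G.1.3.1.1 as subclade H"; the function name, the alias table layout, and the example name `H.2` are hypothetical and serve only to show how a leading letter expands to the name it abbreviates. The full alias table lives in the nomenclature repository and is not reproduced here.

```python
# Alias table: only "H" is taken from this dataset's changelog.
# Other letters (including "G") are themselves aliases, but their
# expansions are not stated in this commit, so they are left out.
KNOWN_ALIASES = {"H": "G.1.3.1.1"}


def expand_alias(name, aliases=KNOWN_ALIASES):
    """Replace a leading alias letter with the longer name it stands for."""
    seen = set()
    while True:
        head, sep, rest = name.partition(".")
        # Stop when the leading letter has no known expansion
        # (or was already expanded, to guard against cyclic tables).
        if head not in aliases or head in seen:
            return name
        seen.add(head)
        name = aliases[head] + sep + rest


print(expand_alias("H"))    # G.1.3.1.1
print(expand_alias("H.2"))  # G.1.3.1.1.2 ("H.2" is a made-up name)
```
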

## What is a Nextclade dataset

Read more about Nextclade datasets in Nextclade documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html
203 changes: 106 additions & 97 deletions data/nextstrain/flu/h3n2/ha/EPI1857216/pathogen.json
@@ -1,29 +1,12 @@
{
"aaMotifs": [
{
"description": "N-linked glycosylation motifs (N-X-S/T with X any amino acid other than P)",
"includeGenes": [
{
"gene": "HA1"
},
{
"gene": "HA2",
"ranges": [
{
"begin": 0,
"end": 186
}
]
}
],
"motifs": [
"N[^P][ST]"
],
"name": "glycosylation",
"nameFriendly": "Glycosylation",
"nameShort": "Glyc."
}
],
"schemaVersion": "3.0.0",
"alignmentParams": {
"excessBandwidth": 9,
"terminalBandwidth": 100,
"allowedMismatches": 4,
"gapAlignmentSide": "right",
"minSeedCover": 0.1
},
"compatibility": {
"cli": "3.0.0-alpha.0",
"web": "3.0.0-alpha.0"
@@ -38,18 +21,84 @@
"reference": "reference.fasta",
"treeJson": "tree.json"
},
"qc": {
"privateMutations": {
"enabled": true,
"typical": 5,
"cutoff": 15,
"weightLabeledSubstitutions": 2,
"weightReversionSubstitutions": 1,
"weightUnlabeledSubstitutions": 1
},
"missingData": {
"enabled": false,
"missingDataThreshold": 100,
"scoreBias": 10
},
"snpClusters": {
"enabled": false,
"windowSize": 100,
"clusterCutOff": 5,
"scoreWeight": 50
},
"mixedSites": {
"enabled": true,
"mixedSitesThreshold": 4
},
"frameShifts": {
"enabled": true
},
"stopCodons": {
"enabled": true,
"ignoredStopCodons": []
}
},
"geneOrderPreference": [
"HA1",
"HA2"
],
"maintenance": {
"website": [
"https://nextstrain.org",
"https://clades.nextstrain.org"
],
"documentation": [
"https://github.com/nextstrain/seasonal-flu"
],
"source code": [
"https://github.com/nextstrain/seasonal_flu"
],
"issues": [
"https://github.com/nextstrain/seasonal_flu/issues"
],
"organizations": [
"Nextstrain"
],
"authors": [
"Nextstrain team <https://nextstrain.org>"
]
},
"nucMutLabelMap": {},
"nucMutLabelMapReverse": {},
"phenotypeData": [
{
"name": "RBD",
"nameFriendly": "RBD mutations",
"description": "This column displays the number of differences between the sequence and the reference at positions identified by Koel et al. (145, 155, 156, 158, 159, 189, and 193 in HA1)",
"gene": "HA1",
"aaRange": {
"begin": 100,
"end": 200
},
"ignore": {
"clades": [
"outgroup"
]
},
"data": [
{
"name": "differences",
"weight": 1,
"locations": {
"145": {
"default": 1
@@ -72,84 +121,44 @@
"193": {
"default": 1
}
},
"name": "differences",
"weight": 1
}
}
],
"description": "This column displays the number of differences between the sequence and the reference at positions identified by Koel et al. (145, 155, 156, 158, 159, 189, and 193 in HA1)",
"gene": "HA1",
"ignore": {
"clades": [
"outgroup"
]
},
"name": "RBD",
"nameFriendly": "RBD mutations"
]
}
],
"qc": {
"frameShifts": {
"enabled": true
},
"missingData": {
"enabled": false,
"missingDataThreshold": 100,
"scoreBias": 10
},
"mixedSites": {
"enabled": true,
"mixedSitesThreshold": 4
},
"privateMutations": {
"cutoff": 15,
"enabled": true,
"typical": 5,
"weightLabeledSubstitutions": 2,
"weightReversionSubstitutions": 1,
"weightUnlabeledSubstitutions": 1
},
"snpClusters": {
"clusterCutOff": 5,
"enabled": false,
"scoreWeight": 50,
"windowSize": 100
},
"stopCodons": {
"enabled": true
"aaMotifs": [
{
"name": "glycosylation",
"nameShort": "Glyc.",
"nameFriendly": "Glycosylation",
"description": "N-linked glycosylation motifs (N-X-S/T with X any amino acid other than P)",
"includeGenes": [
{
"gene": "HA1",
"ranges": []
},
{
"gene": "HA2",
"ranges": [
{
"begin": 0,
"end": 186
}
]
}
],
"motifs": [
"N[^P][ST]"
]
}
},
"schemaVersion": "3.0.0",
"version": {
"tag": "unreleased"
},
],
"attributes": {
"name": "Influenza A H3N2 HA",
"reference name": "A/Darwin/6/2021",
"reference accession": "EPI1857216"
"segment": "ha",
"reference accession": "EPI1857216",
"reference name": "A/Darwin/6/2021"
},
"maintenance": {
"website": [
"https://nextstrain.org",
"https://clades.nextstrain.org"
],
"documentation": [
"https://github.com/nextstrain/nextclade_data",
"https://docs.nextstrain.org/projects/nextclade"
],
"source code": [
"https://github.com/nextstrain/nextclade_data",
"https://github.com/neherlab/nextclade_data_workflows"
],
"issues": [
"https://github.com/nextstrain/nextclade_data",
"https://github.com/nextstrain/nextclade_data/issues"
],
"organizations": [
"Nextstrain"
],
"authors": [
"Nextstrain team <https://nextstrain.org>"
]
"version": {
"tag": "unreleased"
}
}
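
The `phenotypeData` entry named `RBD` above is described as the number of differences between a sequence and the reference at the HA1 positions identified by Koel et al. (145, 155, 156, 158, 159, 189, and 193). A minimal Python sketch of that count is given below. It is a simplified illustration, not Nextclade's implementation: the function name is hypothetical, the positions are assumed to be 1-based as in the description text, both inputs are assumed to be aligned HA1 peptides of equal numbering, and the `ignore` and `aaRange` settings are not modeled. Because every location has `default: 1` and the data entry has `weight: 1`, the score reduces to a plain count here.

```python
# Positions listed in the "description" field of the RBD phenotype entry.
KOEL_POSITIONS = (145, 155, 156, 158, 159, 189, 193)


def count_rbd_differences(query_ha1, reference_ha1, positions=KOEL_POSITIONS):
    """Count positions where the query differs from the reference.

    Assumes 1-based positions and pre-aligned HA1 amino acid strings
    (both assumptions of this sketch).
    """
    differences = 0
    for position in positions:
        index = position - 1  # convert 1-based position to 0-based index
        if index >= len(query_ha1) or index >= len(reference_ha1):
            continue  # position not covered by one of the sequences
        if query_ha1[index] != reference_ha1[index]:
            differences += 1
    return differences


# Made-up example: a poly-A "reference" with two changes at listed
# positions and one change elsewhere, which is not counted.
reference = "A" * 200
query = list(reference)
query[145 - 1] = "K"
query[193 - 1] = "S"
query[10 - 1] = "T"
print(count_rbd_differences("".join(query), reference))
```
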
44 changes: 22 additions & 22 deletions data/nextstrain/flu/h3n2/ha/EPI1857216/reference.fasta
@@ -1,23 +1,23 @@
>EPI_ISL_1563628 | A/Darwin/6/2021 | A / H3N2 | | 2021-03-16
atgaagactatcattgctttgagcaacattctatgtcttgttttcgctcaaaaaatacctggaaatgacaatagcacggc
aacgctgtgccttgggcaccatgcagtaccaaacggaacgatagtgaaaacaatcacaaatgaccgaattgaagttacta
atgctactgagttggttcagaattcatcaataggtgaaatatgcggcagtcctcatcagatccttgatggagggaactgc
acactaatagatgctctattgggggaccctcagtgtgacggctttcaaaataaggaatgggacctttttgttgaaagaag
cagagccaacagcaactgttacccttatgatgtgccggattatgcctcccttaggtcactagttgcctcatccggcacac
tggagtttaaaaatgaaagcttcaattggactggagtcaaacaaaacggaacaagttctgcgtgcataaggggatctagt
agtagtttttttagtagattaaattggttgaccagcttaaacaacatatatccagcacagaacgtgactatgccaaacaa
ggaacaatttgacaaattgtacatttggggggttcaccacccggatacggacaagaaccaaatctccctgtttgctcaat
catcaggaagaatcacagtatctaccaaaagaagccaacaagctgtaatcccaaatatcggatctagacccagaataagg
gatatccctagcagaataagcatctattggacaatagtaaaaccgggagacatacttttgattaacagcacagggaatct
aattgctcctaggggttacttcaaaatacgaagtgggaaaagctcaataatgagatcagatgcacccattggcaaatgta
agtctgaatgcatcactccaaatggaagcattcccaatgacaaaccgttccaaaatgtaaacaggatcacatacggggcc
tgtcccagatatgttaagcaaagcaccctgaaattggcaacaggaatgcgaaatgtaccagagaaacaaaccagaggcat
atttggcgcaatagcgggtttcatagaaaatggatgggagggaatggtggatggttggtacggtttcaggcatcaaaatt
ctgagggaagaggacaagcagcagatctcaaaagcactcaagcagcaatcgatcaaatcaatgggaagctgaatcgattg
atcggaaaaaccaacgagaaattccatcagattgaaaaagaattctcagaagtagaaggaagagttcaagaccttgagaa
atatgttgaggacactaaaatagatctctggtcatacaacgcggagcttcttgttgccctggagaaccaacatacgattg
acctaactgactcagaaatgaacaaactgtttgaaaaaacaaagaagcaactgagggaaaatgctgaggatatgggaaat
ggttgtttcaaaatataccacaaatgtgacaatgcctgcataggatcaataagaaatgaaacttatgaccacaatgtgta
cagggatgaagcattaaacaaccggttccagatcaagggagttgagctgaagtcagggtacaaagattggatcctatgga
tttcctttgccatgtcatgttttttgctttgtattgctttgttggggttcatcatgtgggcctgccaaaagggcaacatt
agatgcaacatttgcatttgagtgcattaattaaaaac
ATGAAGACTATCATTGCTTTGAGCAACATTCTATGTCTTGTTTTCGCTCAAAAAATACCTGGAAATGACAATAGCACGGC
AACGCTGTGCCTTGGGCACCATGCAGTACCAAACGGAACGATAGTGAAAACAATCACAAATGACCGAATTGAAGTTACTA
ATGCTACTGAGTTGGTTCAGAATTCATCAATAGGTGAAATATGCGGCAGTCCTCATCAGATCCTTGATGGAGGGAACTGC
ACACTAATAGATGCTCTATTGGGGGACCCTCAGTGTGACGGCTTTCAAAATAAGGAATGGGACCTTTTTGTTGAAAGAAG
CAGAGCCAACAGCAACTGTTACCCTTATGATGTGCCGGATTATGCCTCCCTTAGGTCACTAGTTGCCTCATCCGGCACAC
TGGAGTTTAAAAATGAAAGCTTCAATTGGACTGGAGTCAAACAAAACGGAACAAGTTCTGCGTGCATAAGGGGATCTAGT
AGTAGTTTTTTTAGTAGATTAAATTGGTTGACCAGCTTAAACAACATATATCCAGCACAGAACGTGACTATGCCAAACAA
GGAACAATTTGACAAATTGTACATTTGGGGGGTTCACCACCCGGATACGGACAAGAACCAAATCTCCCTGTTTGCTCAAT
CATCAGGAAGAATCACAGTATCTACCAAAAGAAGCCAACAAGCTGTAATCCCAAATATCGGATCTAGACCCAGAATAAGG
GATATCCCTAGCAGAATAAGCATCTATTGGACAATAGTAAAACCGGGAGACATACTTTTGATTAACAGCACAGGGAATCT
AATTGCTCCTAGGGGTTACTTCAAAATACGAAGTGGGAAAAGCTCAATAATGAGATCAGATGCACCCATTGGCAAATGTA
AGTCTGAATGCATCACTCCAAATGGAAGCATTCCCAATGACAAACCGTTCCAAAATGTAAACAGGATCACATACGGGGCC
TGTCCCAGATATGTTAAGCAAAGCACCCTGAAATTGGCAACAGGAATGCGAAATGTACCAGAGAAACAAACCAGAGGCAT
ATTTGGCGCAATAGCGGGTTTCATAGAAAATGGATGGGAGGGAATGGTGGATGGTTGGTACGGTTTCAGGCATCAAAATT
CTGAGGGAAGAGGACAAGCAGCAGATCTCAAAAGCACTCAAGCAGCAATCGATCAAATCAATGGGAAGCTGAATCGATTG
ATCGGAAAAACCAACGAGAAATTCCATCAGATTGAAAAAGAATTCTCAGAAGTAGAAGGAAGAGTTCAAGACCTTGAGAA
ATATGTTGAGGACACTAAAATAGATCTCTGGTCATACAACGCGGAGCTTCTTGTTGCCCTGGAGAACCAACATACGATTG
ACCTAACTGACTCAGAAATGAACAAACTGTTTGAAAAAACAAAGAAGCAACTGAGGGAAAATGCTGAGGATATGGGAAAT
GGTTGTTTCAAAATATACCACAAATGTGACAATGCCTGCATAGGATCAATAAGAAATGAAACTTATGACCACAATGTGTA
CAGGGATGAAGCATTAAACAACCGGTTCCAGATCAAGGGAGTTGAGCTGAAGTCAGGGTACAAAGATTGGATCCTATGGA
TTTCCTTTGCCATGTCATGTTTTTTGCTTTGTATTGCTTTGTTGGGGTTCATCATGTGGGCCTGCCAAAAGGGCAACATT
AGATGCAACATTTGCATTTGAGTGCATTAATTAAAAAC