Add v3 mpxv datasets, still some wip (#104)
* Add v3 mpxv datasets, still some wip

* fix: Update nextstrain collection with new mpox dataset paths

* chore: rebuild [skip ci]

* Update mpox datasets

* chore: rebuild [skip ci]

---------

Co-authored-by: nextstrain-bot <[email protected]>
corneliusroemer and nextstrain-bot authored Nov 21, 2023
1 parent 5e2742c commit 2008684
Showing 81 changed files with 5,661 additions and 352,298 deletions.
6 changes: 3 additions & 3 deletions data/nextstrain/collection.json
@@ -34,9 +34,9 @@
 "nextstrain/flu/yam/ha/JN993010",
 "nextstrain/rsv/a/EPI_ISL_412866",
 "nextstrain/rsv/b/EPI_ISL_1653999",
-"nextstrain/mpx/hmpxv-b1/pseudo_ON563414",
-"nextstrain/mpx/hmpxv/NC_063383.1",
-"nextstrain/mpx/mpxv/ancestral",
+"nextstrain/mpox/all-clades",
+"nextstrain/mpox/clade-iib",
+"nextstrain/mpox/lineage-b.1",
 "nextstrain/ebola/zaire",
 "nextstrain/enterovirus/d68/fermon",
 "nextstrain/hiv/1",
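
Scripts that still reference the old `nextstrain/mpx/...` collection entries need to switch to the new `nextstrain/mpox/...` paths. A minimal Python sketch of such a lookup follows. The pairing of old and new paths is an assumption: two pairs are inferred from the dataset changelogs in this commit, and the lineage B.1 pair is inferred from the names alone. The helper name `migrate_dataset_path` is hypothetical.

```python
# Assumed pairing of old and new collection paths (not stated explicitly in the commit).
OLD_TO_NEW = {
    "nextstrain/mpx/mpxv/ancestral": "nextstrain/mpox/all-clades",
    "nextstrain/mpx/hmpxv/NC_063383.1": "nextstrain/mpox/clade-iib",
    "nextstrain/mpx/hmpxv-b1/pseudo_ON563414": "nextstrain/mpox/lineage-b.1",
}


def migrate_dataset_path(path: str) -> str:
    """Return the new mpox path for an old one; other paths pass through unchanged."""
    return OLD_TO_NEW.get(path, path)


if __name__ == "__main__":
    for old in ("nextstrain/mpx/mpxv/ancestral", "nextstrain/ebola/zaire"):
        print(old, "->", migrate_dataset_path(old))
```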
13 changes: 13 additions & 0 deletions data/nextstrain/mpox/all-clades/CHANGELOG.md
@@ -0,0 +1,13 @@
## Unreleased

Initial release of this dataset. This dataset is similar to the v2 dataset [`MPXV/ancestral`](https://github.com/nextstrain/nextclade_data/tree/2023-08-17--15-51-24--UTC/data/datasets/MPXV/references/ancestral/versions/2023-08-01T12%3A00%3A00Z/files) with some differences.

### New and changed gene names

Some genes have been renamed and one has been added. The new annotation is based on the NCBI RefSeq annotation released in November 2022, which the v2 dataset predates:

- The 4 genes in the inverted terminal repeat (ITR) segments at both ends of the genome (OPG001, OPG002, OPG003, OPG015) are now all included. The copies at the 3' end (approximately positions 190000-197000) carry a `_dup` suffix to distinguish them.
- The gene previously named `NBT03_gp052` is now called `OPG073`
- The gene previously named `NBT03_gp174` is now called `OPG016`
- The gene previously named `NBT03_gp175` is now called `OPG015_dup`
- Gene `OPG166` has been added
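
Results produced with the v2 dataset use the old gene names, so comparing them with v3 output needs a translation step. The sketch below encodes the three renames listed above as a Python dictionary. The function names are hypothetical, and the `GENE:MUTATION` string format is an assumption about how mutations are written in downstream tables.

```python
# Renames taken from the changelog above (v2 name -> v3 name).
GENE_RENAMES = {
    "NBT03_gp052": "OPG073",
    "NBT03_gp174": "OPG016",
    "NBT03_gp175": "OPG015_dup",
}


def to_v3_gene_name(v2_name: str) -> str:
    """Translate a v2 gene name to its v3 name; unknown names pass through."""
    return GENE_RENAMES.get(v2_name, v2_name)


def rename_mutations(mutations: list[str]) -> list[str]:
    """Rewrite 'GENE:MUTATION' strings so the gene part uses the v3 name."""
    renamed = []
    for mutation in mutations:
        gene, sep, rest = mutation.partition(":")
        # Strings without a ':' separator are left untouched.
        renamed.append(f"{to_v3_gene_name(gene)}:{rest}" if sep else mutation)
    return renamed


if __name__ == "__main__":
    print(rename_mutations(["NBT03_gp052:A10T", "OPG105:S5L"]))
```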
23 changes: 23 additions & 0 deletions data/nextstrain/mpox/all-clades/README.md
@@ -0,0 +1,23 @@
# Nextclade dataset for "Mpox virus (All Clades)"

| Key | Value |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------- |
| authors | [Cornelius Roemer](https://neherlab.org), [Richard Neher](https://neherlab.org), [Nextstrain](https://nextstrain.org) |
| data source | Genbank |
| workflow | [github.com/nextstrain/mpox/nextclade](https://github.com/nextstrain/mpox/nextclade) |
| nextclade dataset path | nextstrain/mpox/all-clades |
| annotation | [NC_063383.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_063383) |
| clade definitions | [github.com/mpxv-lineages/lineage-designation](https://github.com/mpxv-lineages/lineage-designation) |
| related datasets | Mpox virus (Clade IIb): `nextstrain/mpox/clade-iib`<br> Mpox virus (Lineage B.1): `nextstrain/mpox/lineage-b.1` |

This dataset is for Mpox viruses of all clades (I, IIa and IIb). For a focused analysis of clade IIb sequences, consider the more specific "Clade IIb" dataset (`nextstrain/mpox/clade-iib`). For a still narrower analysis of 2022-2023 outbreak sequences (lineage B.1 and sublineages), consider the "Lineage B.1" dataset (`nextstrain/mpox/lineage-b.1`).
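
The choice between the three datasets can be sketched as a small Python helper. This is only an illustration of the guidance above: the function name `pick_mpox_dataset` is hypothetical, and the sublineage test is a simple prefix match that does not resolve aliased lineage names.

```python
from typing import Optional


def pick_mpox_dataset(clade: str, lineage: Optional[str] = None) -> str:
    """Return the most specific dataset path for a known clade and lineage."""
    # Lineage B.1 and its dotted sublineages (B.1.1, B.1.3, ...) get the narrowest dataset.
    if lineage is not None and (lineage == "B.1" or lineage.startswith("B.1.")):
        return "nextstrain/mpox/lineage-b.1"
    # Other clade IIb sequences use the clade-level dataset.
    if clade == "IIb":
        return "nextstrain/mpox/clade-iib"
    # Clades I and IIa, or anything unknown, fall back to the broad dataset.
    return "nextstrain/mpox/all-clades"


if __name__ == "__main__":
    print(pick_mpox_dataset("IIb", "B.1.3"))
    print(pick_mpox_dataset("IIb", "A.2"))
    print(pick_mpox_dataset("I"))
```

When the clade of a sample is not known in advance, the broad `all-clades` dataset is the safe starting point.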

The lineage system used is defined in [Happi et al. (2022)](https://doi.org/10.1371/journal.pbio.3001769). Lineage definitions are available at [github.com/mpxv-lineages/lineage-designation](https://github.com/mpxv-lineages/lineage-designation).

The reference used in this dataset is the clade IIb NCBI refseq `NC_063383.1` (Isolate `MPXV-M5312_HM12_Rivers`).

The reference tree consists of around 500 sequences with representatives from all clades and lineages.

## Further reading

Read more about Nextclade datasets in the Nextclade documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html
391 changes: 391 additions & 0 deletions data/nextstrain/mpox/all-clades/genome_annotation.gff3

Large diffs are not rendered by default.

212 changes: 212 additions & 0 deletions data/nextstrain/mpox/all-clades/pathogen.json
@@ -0,0 +1,212 @@
{
  "alignmentParams": {
    "excessBandwidth": 100,
    "terminalBandwidth": 300,
    "allowedMismatches": 8,
    "windowSize": 40,
    "minSeedCover": 0.1,
    "gapAlignmentSide": "left"
  },
  "attributes": {
    "name": "Mpox virus (All clades)",
    "reference accession": "NC_063383.1",
    "reference name": "MPXV-M5312_HM12_Rivers"
  },
  "compatibility": {
    "cli": "3.0.0-alpha.0",
    "web": "3.0.0-alpha.0"
  },
  "deprecated": false,
  "enabled": true,
  "experimental": false,
  "files": {
    "changelog": "CHANGELOG.md",
    "examples": "sequences.fasta",
    "genomeAnnotation": "genome_annotation.gff3",
    "pathogenJson": "pathogen.json",
    "readme": "README.md",
    "reference": "reference.fasta",
    "treeJson": "tree.json"
  },
  "official": true,
  "qc": {
    "frameShifts": {
      "enabled": true,
      "ignoredFrameShifts": [
        {
          "codonRange": {
            "begin": 3,
            "end": 589
          },
          "geneName": "OPG003"
        },
        {
          "codonRange": {
            "begin": 3,
            "end": 589
          },
          "geneName": "OPG003_dup"
        },
        {
          "codonRange": {
            "begin": 281,
            "end": 285
          },
          "geneName": "OPG105"
        },
        {
          "codonRange": {
            "begin": 227,
            "end": 229
          },
          "geneName": "OPG164"
        },
        {
          "codonRange": {
            "begin": 554,
            "end": 560
          },
          "geneName": "OPG180"
        },
        {
          "codonRange": {
            "begin": 42,
            "end": 101
          },
          "geneName": "OPG197"
        },
        {
          "codonRange": {
            "begin": 443,
            "end": 443
          },
          "geneName": "OPG037"
        },
        {
          "codonRange": {
            "begin": 57,
            "end": 443
          },
          "geneName": "OPG037"
        },
        {
          "codonRange": {
            "begin": 481,
            "end": 483
          },
          "geneName": "OPG047"
        },
        {
          "codonRange": {
            "begin": 482,
            "end": 483
          },
          "geneName": "OPG047"
        },
        {
          "codonRange": {
            "begin": 72,
            "end": 76
          },
          "geneName": "OPG050"
        },
        {
          "codonRange": {
            "begin": 369,
            "end": 371
          },
          "geneName": "OPG153"
        },
        {
          "codonRange": {
            "begin": 370,
            "end": 371
          },
          "geneName": "OPG153"
        },
        {
          "codonRange": {
            "begin": 166,
            "end": 169
          },
          "geneName": "OPG191"
        },
        {
          "codonRange": {
            "begin": 72,
            "end": 222
          },
          "geneName": "OPG195"
        },
        {
          "codonRange": {
            "begin": 208,
            "end": 222
          },
          "geneName": "OPG195"
        },
        {
          "codonRange": {
            "begin": 289,
            "end": 346
          },
          "geneName": "OPG174"
        }
      ],
      "scoreWeight": 20
    },
    "missingData": {
      "enabled": true,
      "missingDataThreshold": 20000,
      "scoreBias": 1000
    },
    "mixedSites": {
      "enabled": true,
      "mixedSitesThreshold": 40
    },
    "privateMutations": {
      "cutoff": 300,
      "enabled": true,
      "typical": 50,
      "weightLabeledSubstitutions": 6,
      "weightReversionSubstitutions": 6,
      "weightUnlabeledSubstitutions": 1
    },
    "snpClusters": {
      "clusterCutOff": 10,
      "enabled": false,
      "scoreWeight": 10,
      "windowSize": 100
    },
    "stopCodons": {
      "enabled": true,
      "ignoredStopCodons": [
        {
          "codon": 187,
          "geneName": "OPG015_dup"
        },
        {
          "codon": 187,
          "geneName": "OPG015"
        },
        {
          "codon": 21,
          "geneName": "OPG176"
        },
        {
          "codon": 299,
          "geneName": "OPG187"
        },
        {
          "codon": 48,
          "geneName": "OPG059"
        }
      ],
      "scoreWeight": 20
    }
  },
  "schemaVersion": "3.0.0",
  "version": {
    "tag": "unreleased"
  }
}
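
The `ignoredFrameShifts` list in the file above pairs a gene name with a codon range. A minimal Python sketch of how a detected frame shift could be checked against that list follows. It assumes that a frame shift is ignored only when the gene and both range bounds match an entry exactly; that matching rule is an assumption, not something this commit confirms. If matching is indeed exact, that could explain why some genes (for example `OPG047`) appear with several overlapping ranges. The function names and the embedded excerpt are illustrative only.

```python
import json

# Small excerpt in the same shape as pathogen.json above (not the full list).
PATHOGEN_JSON_EXCERPT = """
{"qc": {"frameShifts": {"ignoredFrameShifts": [
  {"codonRange": {"begin": 481, "end": 483}, "geneName": "OPG047"},
  {"codonRange": {"begin": 482, "end": 483}, "geneName": "OPG047"},
  {"codonRange": {"begin": 3, "end": 589}, "geneName": "OPG003"}
]}}}
"""


def load_ignored_frame_shifts(pathogen: dict) -> set[tuple[str, int, int]]:
    """Collect (gene, begin, end) triples from a parsed pathogen.json."""
    entries = pathogen["qc"]["frameShifts"]["ignoredFrameShifts"]
    return {
        (e["geneName"], e["codonRange"]["begin"], e["codonRange"]["end"])
        for e in entries
    }


def is_ignored(gene: str, begin: int, end: int,
               ignored: set[tuple[str, int, int]]) -> bool:
    """Assumed rule: ignore a frame shift only on an exact gene and range match."""
    return (gene, begin, end) in ignored


ignored = load_ignored_frame_shifts(json.loads(PATHOGEN_JSON_EXCERPT))

if __name__ == "__main__":
    print(is_ignored("OPG047", 482, 483, ignored))
    print(is_ignored("OPG047", 480, 483, ignored))
```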
2 changes: 2 additions & 0 deletions data/nextstrain/mpox/all-clades/reference.fasta

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/nextstrain/mpox/all-clades/tree.json

Large diffs are not rendered by default.

13 changes: 13 additions & 0 deletions data/nextstrain/mpox/clade-iib/CHANGELOG.md
@@ -0,0 +1,13 @@
## Unreleased

Initial release of this dataset. This dataset is similar to the v2 dataset [`hMPXV/NC_063383.1`](https://github.com/nextstrain/nextclade_data/tree/2023-08-17--15-51-24--UTC/data/datasets/hMPXV/references/NC_063383.1/versions/2023-08-01T12%3A00%3A00Z/files) with some differences.

### New and changed gene names

Some genes have been renamed and one has been added. The new annotation is based on the NCBI RefSeq annotation released in November 2022, which the v2 dataset predates:

- The 4 genes in the inverted terminal repeat (ITR) segments at both ends of the genome (OPG001, OPG002, OPG003, OPG015) are now all included. The copies at the 3' end (approximately positions 190000-197000) carry a `_dup` suffix to distinguish them.
- The gene previously named `NBT03_gp052` is now called `OPG073`
- The gene previously named `NBT03_gp174` is now called `OPG016`
- The gene previously named `NBT03_gp175` is now called `OPG015_dup`
- Gene `OPG166` has been added
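
Because the 3' ITR copies carry a `_dup` suffix, analyses that want to treat both copies of an ITR gene as one need to strip it. The following Python sketch shows one way to do that. The function names are hypothetical, and it assumes `_dup` only ever appears as a trailing suffix on gene names.

```python
from collections import Counter

DUP_SUFFIX = "_dup"


def split_itr_copy(gene_name: str) -> tuple[str, bool]:
    """Return (base gene name, True if the name denotes the 3' ITR copy)."""
    if gene_name.endswith(DUP_SUFFIX):
        return gene_name[: -len(DUP_SUFFIX)], True
    return gene_name, False


def count_by_base_gene(gene_names: list[str]) -> dict[str, int]:
    """Count occurrences per base gene, merging each gene with its _dup copy."""
    return dict(Counter(split_itr_copy(name)[0] for name in gene_names))


if __name__ == "__main__":
    print(split_itr_copy("OPG015_dup"))
    print(count_by_base_gene(["OPG003", "OPG003_dup", "OPG105"]))
```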
23 changes: 23 additions & 0 deletions data/nextstrain/mpox/clade-iib/README.md
@@ -0,0 +1,23 @@
# Nextclade dataset for "Mpox virus (Clade IIb)"

| Key | Value |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------- |
| authors | [Cornelius Roemer](https://neherlab.org), [Richard Neher](https://neherlab.org), [Nextstrain](https://nextstrain.org) |
| data source | Genbank |
| workflow | [github.com/nextstrain/mpox/nextclade](https://github.com/nextstrain/mpox/nextclade) |
| nextclade dataset path | nextstrain/mpox/clade-iib |
| annotation | [NC_063383.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_063383) |
| clade definitions | [github.com/mpxv-lineages/lineage-designation](https://github.com/mpxv-lineages/lineage-designation) |
| related datasets | Mpox virus (All clades): `nextstrain/mpox/all-clades`<br> Mpox virus (Lineage B.1): `nextstrain/mpox/lineage-b.1` |

This dataset is for Mpox viruses of clade IIb. A more specific dataset for outbreak lineage B.1 only is available as `nextstrain/mpox/lineage-b.1`. There is also a broader dataset covering all clades (I, IIa and IIb) under `nextstrain/mpox/all-clades`.

The lineage system used is defined in [Happi et al. (2022)](https://doi.org/10.1371/journal.pbio.3001769). Lineage definitions are available at [github.com/mpxv-lineages/lineage-designation](https://github.com/mpxv-lineages/lineage-designation).

The reference used in this dataset is the clade IIb NCBI refseq `NC_063383.1` (Isolate `MPXV-M5312_HM12_Rivers`).

The reference tree consists of around 5000 sequences with representatives from all clade IIb lineages.

## Further reading

Read more about Nextclade datasets in the Nextclade documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html
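
As a rough illustration of how a dataset path like the one in the table is used, the Python sketch below builds the two Nextclade CLI invocations for downloading a dataset and running an analysis. The flags shown (`dataset get --name/--output-dir`, `run --input-dataset/--output-all`) reflect the Nextclade v3 CLI as understood here and should be checked against `nextclade --help`. The directory and file names are placeholders, and since this commit tags the dataset as unreleased, it may not be downloadable by name at this revision.

```python
import shlex


def nextclade_commands(dataset_name: str, dataset_dir: str,
                       input_fasta: str, output_dir: str) -> list[list[str]]:
    """Build argv lists for fetching a dataset and running Nextclade on a FASTA file."""
    get_cmd = [
        "nextclade", "dataset", "get",
        "--name", dataset_name,
        "--output-dir", dataset_dir,
    ]
    run_cmd = [
        "nextclade", "run",
        "--input-dataset", dataset_dir,
        "--output-all", output_dir,
        input_fasta,
    ]
    return [get_cmd, run_cmd]


if __name__ == "__main__":
    # Placeholder paths; print the commands instead of executing them.
    for cmd in nextclade_commands("nextstrain/mpox/clade-iib", "data/mpox-clade-iib",
                                  "sequences.fasta", "output"):
        print(shlex.join(cmd))
```

The commands are only printed here; passing each list to `subprocess.run(cmd, check=True)` would execute them on a machine where the `nextclade` binary is installed.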