Update sequence count dataset
Switch to https://github.com/nextstrain/forecasts-ncov to provision data. Limit dataset to March 1, 2022 to March 1, 2023.
trvrb committed Jun 20, 2023
1 parent d314ba2 commit 8bc0da6
Showing 10 changed files with 49,284 additions and 33,449 deletions.
58 changes: 43 additions & 15 deletions mlr-fitness/data/README.md
@@ -1,25 +1,53 @@
## Pango lineages dataset

This dataset only includes Pango lineages with >150 sequences in the US dataset. Rarer Pango lineages are collapsed into parental lineages, e.g. BM.1 was collapsed into BA.2.75.3.
This dataset only includes Pango lineages with >150 sequences in the US dataset. Rarer Pango lineages are collapsed into parental lineages, e.g. BM.1 was collapsed into BA.2.75.3. It covers sequences collected between 2022-03-01 and 2023-03-01.
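
As a rough illustration of the collapsing rule described above, the sketch below folds any lineage at or below the threshold into its parent, visiting the deepest lineages first. It is not the code used by forecasts-ncov: the function name `collapse_rare_lineages`, the `parent_of` mapping, and the example counts are all hypothetical, and only the threshold of 150 and the BM.1 → BA.2.75.3 example come from this README.

```python
def collapse_rare_lineages(counts, parent_of, threshold=150):
    """Fold lineages with <= threshold sequences into their parent lineage.

    counts:    dict of lineage -> sequence count (illustrative input)
    parent_of: dict of lineage -> parent lineage (assumed to be acyclic)
    """
    def depth(lineage):
        # Number of steps from this lineage up to a lineage with no parent.
        steps = 0
        while lineage in parent_of:
            lineage = parent_of[lineage]
            steps += 1
        return steps

    all_lineages = set(counts) | set(parent_of) | set(parent_of.values())
    collapsed = dict(counts)
    # Deepest first, so a child's sequences reach its parent before the
    # parent itself is compared against the threshold.
    for lineage in sorted(all_lineages, key=depth, reverse=True):
        parent = parent_of.get(lineage)
        n = collapsed.get(lineage, 0)
        if parent is None or n == 0 or n > threshold:
            continue
        collapsed[parent] = collapsed.get(parent, 0) + collapsed.pop(lineage)
    return collapsed


# Made-up counts, chosen only to mirror the BM.1 example in the text.
example_counts = {"BA.2": 1000, "BA.2.75": 400, "BA.2.75.3": 120, "BM.1": 40}
example_parents = {"BM.1": "BA.2.75.3", "BA.2.75.3": "BA.2.75", "BA.2.75": "BA.2"}
example_result = collapse_rare_lineages(example_counts, example_parents)
```

With these made-up numbers, BM.1 is folded into BA.2.75.3, which then exceeds the threshold and is kept as its own lineage.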

Data preparation followed:
Dataset was provisioned by running https://github.com/nextstrain/forecasts-ncov with the following modified config:

1. Nextstrain-curated metadata TSV of GISAID database was downloaded. Uncompressing and renaming this file resulted in `gisaid_metadata.tsv` via:
```
nextstrain remote download s3://nextstrain-ncov-private/metadata.tsv.gz
gzip -d metadata.tsv.gz -c > gisaid_metadata.tsv
data_provenances:
  - gisaid
variant_classifications:
  - pango_lineages
geo_resolutions:
  - global
# Params for the prepare data scripts
# Define params for each data_provenance / variant_classification / geo_resolution combination
# Include max_date if you don't want to use today as the max date
prepare_data:
  gisaid:
    pango_lineages:
      global:
        max_date: 2023-03-01
        included_days: 365
        location_min_seq: 500000
        location_min_seq_days: 365
        excluded_locations: "defaults/global_excluded_locations.txt"
        prune_seq_days: 12
        clade_min_seq: 1
        clade_min_seq_days: 365
        collapse_threshold: 150
# Params for the model run scripts
models:
  pango_lineages:
    global:
      pivot: "BA.2"
# Model configs
mlr_config: "config/mlr-config.yaml"
```

2. The metadata file was pruned to only relevant columns via:
```
tsv-select -H -f strain,date,country,division,QC_overall_status,Nextclade_pango gisaid_metadata.tsv > gisaid_metadata_pruned.tsv
```
I also added `United Kingdom` to `defaults/global_excluded_locations.txt`.

This produces the file `data/gisaid/pango_lineages/global/collapsed_seq_counts.tsv` that can be used in subsequent analyses. This file was renamed to `pango_seq_counts.tsv` and versioned here.

3. This `gisaid_metadata_pruned.tsv` is processed in Mathematica by running the notebook `pango_data-prep.nb`. This results in the files `pango_location-variant-sequence-counts.tsv` and `pango_variant-relationships.tsv` versioned here. These files represent heavily derived GISAID data and are equivalent to downloadable results from [outbreak.info](https://outbreak.info), [cov-spectrum.org](https://cov-spectrum.org) and [covariants.org](https://covariants.org). This use is allowable under the [GISAID Terms of Use](https://www.gisaid.org/registration/terms-of-use/).
These files represent heavily derived GISAID data and are equivalent to downloadable results from [outbreak.info](https://outbreak.info), [cov-spectrum.org](https://cov-spectrum.org) and [covariants.org](https://covariants.org). This use is allowable under the [GISAID Terms of Use](https://www.gisaid.org/registration/terms-of-use/).

There will be dates that are missing sequence counts. These should be assumed to be 0.
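
A minimal sketch of treating missing dates as zero counts, as described above. The row layout `(location, variant, date, sequences)` and the function name `fill_missing_dates` are assumptions made for illustration; check the actual column names in the TSV before adapting this.

```python
from datetime import date, timedelta


def fill_missing_dates(rows, start, end):
    """Return one row per (location, variant, day) in [start, end],
    using 0 wherever no sequence count was recorded.

    rows: iterable of (location, variant, day, sequences) tuples
          (assumed layout, not taken from the dataset itself)
    """
    observed = {(loc, var, day): n for loc, var, day, n in rows}
    pairs = sorted({(loc, var) for loc, var, _ in observed})
    n_days = (end - start).days + 1
    filled = []
    for loc, var in pairs:
        for offset in range(n_days):
            day = start + timedelta(days=offset)
            filled.append((loc, var, day, observed.get((loc, var, day), 0)))
    return filled


# Illustrative rows only: 2022-03-02 is missing and should become 0.
example_rows = [
    ("USA", "BA.2", date(2022, 3, 1), 5),
    ("USA", "BA.2", date(2022, 3, 3), 7),
]
example_filled = fill_missing_dates(example_rows, date(2022, 3, 1), date(2022, 3, 3))
```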
------------------------------------------

The file `pango_variant-relationships.tsv` looks like
The file `pango_variant_relationships.tsv` is produced by running `pango-relationships.nb` and looks like:
```
BG.4 BA.2.12.1
BG.5 BA.2.12.1
@@ -28,8 +56,8 @@ BQ.1 BE.1.1.1
BQ.1.1 BQ.1
BU.2 BA.2
```
and contains a mapping of each collapsed Pango lineage to its parent lineage. I've set the reference here as BA.2, so that we have BA.2 → BA.1, BA.2 → BA.4 and BA.2 → BA.5 contrasts even if this was not the historical evolutionary path. I've chosen this because we have immune escape and ACE2 binding measured relative to BA.2, so it makes the most sense to have BA.2 as the baseline in the MLR model. Pictorial view below:
This contains a mapping of each collapsed Pango lineage to its parent lineage. I've set the reference here as BA.2, so that we have BA.2 → BA.1, BA.2 → BA.4 and BA.2 → BA.5 contrasts even if this was not the historical evolutionary path. I've chosen this because we have immune escape and ACE2 binding measured relative to BA.2, so it makes the most sense to have BA.2 as the baseline in the MLR model. Pictorial view below:

![](pango_variant-tree.png)
![](pango_variant_tree.png)

![](pango_variant-graph.png)
![](pango_variant_graph.png)
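
The lineage-to-parent mapping described above can be read into a dictionary and walked upward from any lineage. This is a sketch only: it assumes the file is two tab-separated columns (child, parent) with no header, and the helper names `load_parent_map` and `path_to_root` are hypothetical. The rows in `EXCERPT` are the ones visible in the README excerpt, so the walk stops wherever the excerpt has no further parent.

```python
import csv
import io

# Rows copied from the README excerpt; the tab delimiter is an assumption.
EXCERPT = (
    "BG.4\tBA.2.12.1\n"
    "BG.5\tBA.2.12.1\n"
    "BQ.1\tBE.1.1.1\n"
    "BQ.1.1\tBQ.1\n"
    "BU.2\tBA.2\n"
)


def load_parent_map(handle):
    """Read child<TAB>parent rows into a dict of child -> parent."""
    reader = csv.reader(handle, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def path_to_root(lineage, parent_of):
    """Follow parent links from a lineage until no parent is listed."""
    path = [lineage]
    seen = {lineage}
    while path[-1] in parent_of:
        parent = parent_of[path[-1]]
        if parent in seen:
            raise ValueError(f"cycle detected at {parent}")
        path.append(parent)
        seen.add(parent)
    return path


parent_of = load_parent_map(io.StringIO(EXCERPT))
bq11_path = path_to_root("BQ.1.1", parent_of)
bu2_path = path_to_root("BU.2", parent_of)
```

To use the full file instead of the excerpt, open it and pass the file handle to `load_parent_map`.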