Update sequence count dataset

Switch to https://github.com/nextstrain/forecasts-ncov to provision data. Limit dataset to March 1, 2022 to March 1, 2023.
blab · Jun 20, 2023 · 8bc0da6
1 parent d314ba2
commit 8bc0da6

Showing 10 changed files with 49,284 additions and 33,449 deletions.
diff --git a/mlr-fitness/data/README.md b/mlr-fitness/data/README.md
@@ -1,25 +1,53 @@
 ## Pango lineages dataset
 
-This dataset only includes Pango lineages with >150 sequences in US dataset. Rarer Pango lineages are collapsed into parental lineages, ie BM.1 was collapsed into BA.2.75.3.
+This dataset only includes Pango lineages with >150 sequences in the US dataset. Rarer Pango lineages are collapsed into parental lineages, e.g. BM.1 was collapsed into BA.2.75.3. It covers sequences collected between 2022-03-01 and 2023-03-01.
 
-Data preparation followed:
+The dataset was provisioned by running https://github.com/nextstrain/forecasts-ncov with the following modified config:
 
-1. Nextstrain-curated metadata TSV of GISAID database was downloaded. Uncompressing and renaming this file resulted in `gisaid_metadata.tsv` via:
 ```
-nextstrain remote download s3://nextstrain-ncov-private/metadata.tsv.gz
-gzip -d metadata.tsv.gz -c > gisaid_metadata.tsv
+data_provenances:
+  - gisaid
+variant_classifications:
+  - pango_lineages
+geo_resolutions:
+  - global
+
+# Params for the prepare data scripts
+# Define params for each data_provenance / variant_classification / geo_resolution combination
+# Include max_date if you don't want to use today as the max date
+prepare_data:
+  gisaid:
+    pango_lineages:
+      global:
+        max_date: 2023-03-01
+        included_days: 365
+        location_min_seq: 500000
+        location_min_seq_days: 365
+        excluded_locations: "defaults/global_excluded_locations.txt"
+        prune_seq_days: 12
+        clade_min_seq: 1
+        clade_min_seq_days: 365
+        collapse_threshold: 150
+
+# Params for the model run scripts
+models:
+    pango_lineages:
+      global:
+        pivot: "BA.2"
+
+# Model configs
+mlr_config: "config/mlr-config.yaml"
 ```
 
-2. The metadata file was pruned to only relevant columns via:
-```
-tsv-select -H -f strain,date,country,division,QC_overall_status,Nextclade_pango gisaid_metadata.tsv > gisaid_metadata_pruned.tsv
-```
+I also added `United Kingdom` to `defaults/global_excluded_locations.txt`.
+
+This produces the file `data/gisaid/pango_lineages/global/collapsed_seq_counts.tsv` that can be used in subsequent analyses. This file was renamed to `pango_seq_counts.tsv` and versioned here.
 
-3. This `gisaid_metadata_pruned.tsv` is processed in Mathematica by running the notebook `pango_data-prep.nb`. This results in the files `pango_location-variant-sequence-counts.tsv` and `pango_variant-relationships.tsv` versioned here. These files represent heavily derived GISAID data and are equivalent to downloadable results from [outbreak.info](https://outbreak.info), [cov-spectrum.org](https://cov-spectrum.org) and [covariants.org](https://covariants.org). This use is allowable under the [GISAID Terms of Use](https://www.gisaid.org/registration/terms-of-use/).
+These files represent heavily derived GISAID data and are equivalent to downloadable results from [outbreak.info](https://outbreak.info), [cov-spectrum.org](https://cov-spectrum.org) and [covariants.org](https://covariants.org). This use is allowable under the [GISAID Terms of Use](https://www.gisaid.org/registration/terms-of-use/).
 
-There will be dates that are missing sequence counts. These should be assumed to be 0.
+------------------------------------------
 
-The file `pango_variant-relationships.tsv` looks like
+The file `pango_variant_relationships.tsv` is produced by running `pango-relationships.nb` and looks like:
 ```
 BG.4	 BA.2.12.1
 BG.5	 BA.2.12.1
@@ -28,8 +56,8 @@ BQ.1	 BE.1.1.1
 BQ.1.1 BQ.1
 BU.2	 BA.2
 ```
-and contains a mapping of each collapsed Pango lineage to its parent lineage. I've set the reference here as BA.2, so that we have BA.2 → BA.1, BA.2 → BA.4 and BA.2 → BA.5 contrasts even if this is was not the historical evolutionary path. I've chosen this because we have immune escape and ACE-2 binding measured relative to BA.2 and so it makes the most sense to have BA.2 as baseline in the MLR model. Pictorial view below:
+This contains a mapping of each collapsed Pango lineage to its parent lineage. I've set the reference here as BA.2, so that we have BA.2 → BA.1, BA.2 → BA.4 and BA.2 → BA.5 contrasts even if this was not the historical evolutionary path. I've chosen this because we have immune escape and ACE-2 binding measured relative to BA.2, so it makes the most sense to have BA.2 as the baseline in the MLR model. Pictorial view below:
 
-![](pango_variant-tree.png)
+![](pango_variant_tree.png)
 
-![](pango_variant-graph.png)
+![](pango_variant_graph.png)
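
The child-to-parent mapping described for `pango_variant_relationships.tsv` can be read back into a lookup table and walked toward the reference lineage. The following is a minimal sketch, not the method used in `pango-relationships.nb`. It assumes the file has no header row and that each row is a whitespace-separated (tab and/or space) pair of variant and parent, as in the excerpt above. The helper names `read_relationships` and `path_to_root` are hypothetical, and the sample rows are copied from the excerpt rather than from the full file.

```python
import io

# Rows copied from the README excerpt; the real file contains more lineages.
SAMPLE = (
    "BG.4\t BA.2.12.1\n"
    "BG.5\t BA.2.12.1\n"
    "BQ.1\t BE.1.1.1\n"
    "BQ.1.1 BQ.1\n"
    "BU.2\t BA.2\n"
)


def read_relationships(handle):
    """Return a dict mapping each variant to its parent lineage.

    Assumes headerless rows of two whitespace-separated fields;
    rows with any other number of fields are skipped.
    """
    parents = {}
    for line in handle:
        fields = line.split()  # splits on any run of tabs or spaces
        if len(fields) != 2:
            continue
        variant, parent = fields
        parents[variant] = parent
    return parents


def path_to_root(variant, parents):
    """Follow parent links from `variant` until no further parent is listed."""
    path = [variant]
    seen = {variant}
    while path[-1] in parents:
        parent = parents[path[-1]]
        if parent in seen:
            # Guard against a malformed file that contains a cycle.
            raise ValueError(f"cycle detected at {parent}")
        path.append(parent)
        seen.add(parent)
    return path


parents = read_relationships(io.StringIO(SAMPLE))
print(path_to_root("BU.2", parents))
print(path_to_root("BQ.1.1", parents))
```

With the full file, every path should end at BA.2, since the README sets BA.2 as the reference. With the five sample rows the walk for `BQ.1.1` stops at `BE.1.1.1`, because that lineage's own parent is not part of the excerpt. To read the versioned file instead of the sample, pass an open file handle, e.g. `open("pango_variant_relationships.tsv")`, to `read_relationships`.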