Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First pass at making snakemake workflow for innovation model #11

Merged
merged 54 commits into from
Dec 24, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
54 commits
Select commit Hold shift + click to select a range
a9a2134
Title and introduction
marlinfiggins Dec 21, 2023
a93002d
First pass snakemake workflow
marlinfiggins Feb 13, 2024
80049d1
Update description of `run-innovation-model.py`.
marlinfiggins Feb 14, 2024
b262d4e
Refactoring `compare_natural.ipynb` just to generate phenotypes. Upda…
marlinfiggins Feb 14, 2024
766dcda
Fixing config
marlinfiggins Feb 14, 2024
3be1176
Fixing syntax errors
marlinfiggins Feb 15, 2024
7d3c186
Passing dry-run
marlinfiggins Feb 15, 2024
613a8eb
Typo in `prepare-pango-relationships.py`
marlinfiggins Feb 15, 2024
c2b02df
Adding in pango_variant_relationships for generating phenotype differ…
marlinfiggins Feb 15, 2024
6345511
Adding windowed observation script
marlinfiggins Jul 26, 2024
39020a5
Allowing users to specify start and end of windows
marlinfiggins Jul 26, 2024
398d100
Letting model run script save posteriors.
marlinfiggins Jul 26, 2024
e6157e5
Updating `run_models.smk`.
marlinfiggins Jul 26, 2024
097c29e
Updating snakemake workflow
marlinfiggins Jul 26, 2024
afd8d78
Variant-Parents are the same by location
marlinfiggins Aug 8, 2024
ae8079e
Adding mlr_innovation-regression
marlinfiggins Aug 9, 2024
d731268
Innovation Regression example
marlinfiggins Aug 9, 2024
5bb0675
Adding predictors argument
marlinfiggins Aug 9, 2024
c3b9caa
Innovation Regression example
marlinfiggins Aug 9, 2024
20543f5
Fix small bugs in Snakemake
trvrb Oct 6, 2024
0ffb1e8
Manuscript + Methods updates
marlinfiggins Oct 24, 2024
8465ecf
Merge branch 'rewrite' of https://github.com/blab/ncov-escape into re…
marlinfiggins Oct 24, 2024
882d54b
Updated methods section and additional section on PGLS.
marlinfiggins Oct 24, 2024
71d5173
Include instructions for provisioning data locally
trvrb Nov 4, 2024
1f034c8
Improve sequence count prep and drop case data
trvrb Nov 4, 2024
80c0d38
Fix bug in Pango relationships script
trvrb Nov 4, 2024
64371af
Fix wildcard reference
trvrb Nov 4, 2024
1561ac0
Adding script to provision files from `forecasts-ncov`
marlinfiggins Nov 6, 2024
8d6e5e7
Working snakemake pipeline
marlinfiggins Nov 7, 2024
80494f1
Adding ability to select predictors by name. Saving processed data as…
marlinfiggins Nov 7, 2024
3392519
Clearly specifying min-date, max-date, and pivot for single analyses
marlinfiggins Nov 7, 2024
107f4a6
Starting windowed analysis
marlinfiggins Nov 7, 2024
b69bb66
Adding informed model
marlinfiggins Nov 7, 2024
d9b63d0
Fixing syntax error
marlinfiggins Nov 7, 2024
3b32fd8
Adding provisioning rule
marlinfiggins Nov 7, 2024
3091e7f
Adding provisioning phenotype script
marlinfiggins Nov 7, 2024
7a7f698
Updating provision phenotypes
marlinfiggins Nov 23, 2024
8b87b37
Saving and retrieving phenotypes
marlinfiggins Nov 23, 2024
74930cb
JSON string magic for fixing predictor specification
marlinfiggins Nov 23, 2024
5e502ef
Working informed model
marlinfiggins Nov 23, 2024
a753036
Working windowed analyses
marlinfiggins Nov 24, 2024
1410e00
Adding index=False
marlinfiggins Nov 24, 2024
b7f9c9c
Actually compute `delta`
marlinfiggins Nov 24, 2024
39d641c
Cleaning up analysis; starting figure making
marlinfiggins Nov 24, 2024
32f19a8
Notebooks for main analyis
marlinfiggins Nov 26, 2024
9730017
Retrieving phenotypes from main
marlinfiggins Nov 26, 2024
12bafdc
Adding out of sample validation
marlinfiggins Nov 27, 2024
efc0eb7
First pass at instructions
marlinfiggins Dec 17, 2024
297b9b5
Include brief installation instructions
trvrb Dec 19, 2024
bdc866a
Switching to `lineage` -> `clade` to be compatiable with SARS-CoV-2-s…
marlinfiggins Dec 19, 2024
944a624
Removing windowed analysis for now
marlinfiggins Dec 19, 2024
f4d524d
Fixing provisioning and prep for `gisaid_metadata.tsv.gz`
marlinfiggins Dec 19, 2024
a61b301
Adding windowed analysis back
marlinfiggins Dec 19, 2024
09b7ac6
Small updates to the readme
trvrb Dec 24, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
# (or at least out of _this_ git repo).
benchmarks/
logs/
data/
results/
build/
auspice/
Expand Down
63 changes: 61 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,64 @@
# Analysis of SARS-CoV-2 antigenic drift and evolutionary fitness

## Nextstrain workflow
## Installation

There is a Nextstrain workflow forked from v12 of the canonical Nextstrain [ncov](https://github.com/nextstrain/ncov) workflow in the [`ncov-workflow`](ncov-workflow/) directory.
All dependencies can be met by running through the Nextstrain runtime.
See [docs.nextstrain.org](https://docs.nextstrain.org/en/latest/install.html) for installation instructions.

## Provision metadata locally

Windowed analyses require access to SARS-CoV-2 metadata.
This can be acquired via

```bash
aws s3 cp s3://nextstrain-ncov-private/metadata.tsv.gz data/gisaid_metadata.tsv.gz
zstd -c -d data/gisaid_metadata.tsv.gz \
| tsv-select -H -f strain,date,date_submitted,country,clade_nextstrain,Nextclade_pango,QC_overall_status \
| gzip -c > data/gisaid_metadata_filtered.tsv.gz
```

Non-windowed analyses will provision sequence counts from [forecasts-ncov](https://github.com/nextstrain/forecasts-ncov).


## Workflow

Once metadata is provision locally, run the workflow with

```
nextstrain build . all
```

This will generate sequence count files for all analyses specificied in `./config/config.yaml`.
Under `analysis_period` in the config file, you can specify an analysis name as well as minimum and maximum dates for the analysis, the pivot variant of interest, lineages to forcibly include, and predictors to use for the regression-prior model e.g.

```
analysis_period:
xbb15:
min_date: "2023-01-01"
max_date: "2023-12-01"
pivot: "XBB.1.5"
force_include: "defaults/xbb15/force_include_lineages.txt"
predictor_names:
- "spike pseudovirus DMS human sera escape relative to XBB.1.5"
- "spike pseudovirus DMS ACE2 binding relative to XBB.1.5"
- "RBD yeast-display DMS ACE2 affinity relative to XBB.1.5"
- "RBD yeast-display DMS RBD expression relative to XBB.1.5"
- "RBD yeast-display DMS escape relative to XBB.1.5"
```

### Sequence counts and variant relationships

It produces sequence count files for windowed analyses, non-windowed analyses, collapsing lineages into their parents based on count thresholds specified in `./config/config.yaml`, as well as generating variant relationship files, so that the innovation model can be fit.

Collapsed sequence counts follow the form `data/{analysis_period}/collapsed_seq_counts.tsv` and variant relationships follow the form `data/{analysis_period}/pango_variant_relationships.tsv`.

### MLR innovation model

The collapsed sequence count files and variant relationships are used to estimate relative fitness using the uninformed (normal-prior) innovation model.

For each analysis period, this produces `results/{analysis_period}/growth_advantage.tsv` and `results/{analysis_periods}/growth_advantage_delta.tsv`.

The model posteriors will also be saved under `results/{analysis_period}/posteriors/data_{location}.pkl` and `results/{analysis_period}/posteriors/samples_{location}.pkl`.

If predictor names are provided, they are used to fit the regression prior innovation model.
These predictor-informed results are stored under `results/{analysis_period}/informed/growth_advantages.tsv` and `results/{analysis_period}/informed/growth_advantages.tsv`.
59 changes: 59 additions & 0 deletions Snakefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
import os
import pandas as pd

if not config:
configfile: "config/config.yaml"

wildcard_constraints:
data_provenance="[A-Za-z0-9_-]+", # Allow letters, numbers, underscores, and dashes
analysis_period="[A-Za-z0-9_-]+", # Allow letters, numbers, underscores, and dashes
geo_resolution="[A-Za-z0-9_-]+", # Allow letters, numbers, underscores, and dashes
variant_classifications="[A-Za-z0-9_-]+" # Allow letters, numbers, underscores, and dashes

def _get_date_range(ap):
obs_date_min = config["analysis_period"].get(ap, {}).get('obs_date_min')
obs_date_max = config["analysis_period"].get(ap, {}).get('obs_date_max')
obs_date_interval = config["analysis_period"].get(ap, {}).get('interval')
date_range = pd.date_range(
start=obs_date_min,
end=obs_date_max,
freq=obs_date_interval,
).strftime("%Y-%m-%d").tolist()

return date_range

def _get_all_input(w):
data_provenances = config["data_provenances"] if isinstance(config["data_provenances"], list) else [config["data_provenances"]]
variant_classifications = config["variant_classifications"] if isinstance(config["variant_classifications"], list) else [config["variant_classifications"]]
geo_resolutions = config["geo_resolutions"] if isinstance(config["geo_resolutions"], list) else [config["geo_resolutions"]]
analysis_periods = config["analysis_period"]

# date_ranges = {analysis: generate_dates(analysis_periods[analysis]) for analysis in analysis_periods}
all_input = [
# Non-windowed analyses
*expand(
"results/{analysis_period}/growth_advantages.tsv",
analysis_period=[ap for ap, cfg in analysis_periods.items() if not cfg.get("windowed", False)],
),
*expand(
"results/{analysis_period}/informed/growth_advantages.tsv",
analysis_period=[ap for ap, cfg in analysis_periods.items() if (not cfg.get("windowed", False)) and (cfg.get("predictor_names", False))],
)
]
# Windowed analyses
for ap, cfg in analysis_periods.items():
if cfg.get("windowed", False):
all_input += expand(
"results/{analysis_period}/growth_advantages_{obs_date}.tsv",
analysis_period=[ap],
obs_date=_get_date_range(ap)
)
return all_input

rule all:
input: _get_all_input

include: "workflow/snakemake_rules/prepare_data.smk"
include: "workflow/snakemake_rules/retrieve_phenotypes.smk"
include: "workflow/snakemake_rules/run_models.smk"
include: "workflow/snakemake_rules/run_models_over_period.smk"
54 changes: 54 additions & 0 deletions config/config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
data_provenances:
- gisaid
variant_classifications:
- pango_lineages
geo_resolutions:
- global
analysis_periods:
- xbb15

# Params for the prepare data scripts
# Define params for each data_provenance / variant_classification / geo_resolution combination
# Include `max_date` if you don't want to use today as the max date
prepare_data:
gisaid:
pango_lineages:
global:
location_min_seq: 1000
excluded_locations: "defaults/global_excluded_locations.txt"
clade_min_seq: 1
collapse_threshold: 200

# In case we want to run models for various different pandemic periods
analysis_period:
ba2:
min_date: "2021-11-01"
max_date: "2022-11-01"
pivot: "BA.2"
xbb15:
min_date: "2023-01-01"
max_date: "2023-12-01"
pivot: "XBB.1.5"
force_include: "defaults/xbb15/force_include_lineages.txt"
predictor_names:
- "spike pseudovirus DMS human sera escape relative to XBB.1.5"
- "spike pseudovirus DMS ACE2 binding relative to XBB.1.5"
- "RBD yeast-display DMS ACE2 affinity relative to XBB.1.5"
- "RBD yeast-display DMS RBD expression relative to XBB.1.5"
- "RBD yeast-display DMS escape relative to XBB.1.5"
xbb15_test:
min_date: "2024-01-01"
max_date: "2024-6-30"
pivot: "XBB.1.5"
force_include: "defaults/xbb15/force_include_lineages.txt"
jn1:
min_date: "2024-01-01"
max_date: "2024-10-01"
pivot: "JN.1"
xbb15_windowed:
windowed: true
obs_date_min: "2023-01-01"
obs_date_max: "2023-12-01"
interval: "2M"
pivot: "XBB.1.5"
num_days_context: 90
5 changes: 5 additions & 0 deletions defaults/global_excluded_locations.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
Austria
Czech Republic
Lithuania
Luxembourg
Slovakia
3 changes: 3 additions & 0 deletions defaults/xbb15/force_include_lineages.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
XBB
XBB.1.5
XBB.2
13 changes: 11 additions & 2 deletions manuscript/ncov-escape.bib
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,6 @@

%% Saved with string encoding Unicode (UTF-8)



@article{hadfield2018nextstrain,
author = {Hadfield, James and Megill, Colin and Bell, Sidney M and Huddleston, John and Potter, Barney and Callender, Charlton and Sagulenko, Pavel and Bedford, Trevor and Neher, Richard A},
date-added = {2022-11-22 14:10:43 -0800},
Expand All @@ -18,3 +16,14 @@ @article{hadfield2018nextstrain
title = {Nextstrain: real-time tracking of pathogen evolution},
volume = {34},
year = {2018}}

@article{khare2021gisaid,
author = {Khare, Shruti and Gurry, C{\'e}line and Freitas, Lucas and Schultz, Mark B and Bach, Gunter and Diallo, Amadou and Akite, Nancy and Ho, Joses and Lee, Raphael TC and Yeo, Winston and others},
date-added = {2023-11-30 14:24:46 +0100},
date-modified = {2023-11-30 14:24:59 +0100},
journal = {China CDC weekly},
pages = {1049},
title = {GISAID's role in pandemic response},
volume = {3},
year = {2021}}

Loading