
First pass at making snakemake workflow for innovation model #11

Merged: 54 commits into master from the rewrite branch on Dec 24, 2024
Conversation

@marlinfiggins commented Feb 13, 2024

This PR implements a Snakemake workflow for provisioning sequence counts, using methods similar to those in forecasts-ncov. This is my first time working with Snakemake, so any suggestions, comments, or questions are appreciated.

Major changes include:

  • Specifying several analysis periods, each with a different pivot variant relative to which growth advantages are estimated. (See config.yaml for the specification; a sketch appears after this list.)
  • Provisioning data sets for each analysis period (See workflow/snakemake_rules/prepare_data.smk)
  • Running the innovation model with the specified period for all qualifying locations (See scripts/run-innovation-model)
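
For example, an analysis-period entry has roughly this shape (abbreviated; the full set of per-period fields lives in config.yaml):

analysis_period:
  xbb15:
    pivot: "XBB.1.5"
    # per-period fields such as the analysis date range also live here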

Still to be accomplished:

  • Checking the Python implementation of mlr-fitness/data/pango-relationships.nb for correctness
  • Testing the workflow in general
  • Implementing phenotype prediction using DMS data / Snakemake

Note: I've borrowed the files scripts/prepare-data.py and scripts/collapse-lineages.py directly from forecasts-ncov. Let me know if there's a better way of doing this.

@marlinfiggins
Copy link
Author

marlinfiggins commented Feb 14, 2024

I've tried converting the compare_natural notebook into a Python script, scripts/compute-phenotype, which can be used to generate the phenotype files specified in prepare_data.smk (the corresponding rule is compute_phenotype; a sketch follows the list below).

There are a couple of remaining tasks for this section:

  • Ensuring pairs for predictor differences are generated with the parent-variant relationships in pango_variant_relationships
  • Figuring out the best way of specifying which phenotypes to generate. I imagine a section in config.yaml would be ideal?
  • Processing phenotypes into a single predictors file (or a collection of them)
  • Testing in general
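
For reference, a minimal sketch of the shape of the compute_phenotype rule (the paths and flag names here are illustrative, not the actual definition):

rule compute_phenotype:
    input:
        relationships = "data/{analysis_period}/pango_variant_relationships.tsv",
    output:
        phenotypes = "data/{analysis_period}/phenotypes.tsv",
    shell:
        """
        python ./scripts/compute-phenotype \
            --pango-relationships {input.relationships} \
            --output-path {output.phenotypes}
        """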

@trvrb (Member) commented Oct 6, 2024

Thanks so much for diving in here @marlinfiggins. I'm sorry that I didn't notice this PR before you pointed it out to me last month. I just tried working from the rewrite branch and managed to fix some initial Snakemake bugs that were throwing errors. However, I now encounter:

nextstrain build --cpus 1 . data/gisaid/pango_lineages/global/xbb15/prepared_cases.tsv
Building DAG of jobs...
MissingInputException in rule prepare_clade_data in file /nextstrain/build/workflow/snakemake_rules/prepare_data.smk, line 43:
Missing input files for rule prepare_clade_data:
    output: data/gisaid/pango_lineages/global/xbb15/prepared_cases.tsv, data/gisaid/pango_lineages/global/xbb15/prepared_seq_counts.tsv
    wildcards: data_provenance=gisaid, variant_classification=pango_lineages, geo_resolution=global, analysis_period=xbb15
    affected files:
        data/cases/global.tsv.gz
        data/gisaid/pango_lineages/global.tsv.gz

I think you're assuming that local files like data/gisaid/pango_lineages/global.tsv.gz exist? Can you please add to the README.md the steps needed to provision data before running Snakemake?

I just did almost exactly this over here: https://github.com/blab/fitness-dynamics?tab=readme-ov-file#provision-metadata-locally.

I can continue review once I know how to provision local data.
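
For instance, something along these lines would presumably work here (an untested sketch; I believe forecasts-ncov publishes prepared sequence counts under data.nextstrain.org/files/workflows/forecasts-ncov/, with the URL depending on data provenance, variant classification, and geo resolution):

mkdir -p data/gisaid/pango_lineages
curl -fsSL https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/pango_lineages/global.tsv.gz \
    -o data/gisaid/pango_lineages/global.tsv.gz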


Also, separate question: can we just drop data/{data_provenance}/{variant_classification}/{geo_resolution}/{analysis_period}/prepared_cases.tsv? I don't think cases feed into any of the MLR analyses.

marlinfiggins and others added 5 commits October 24, 2024 15:23
This copies over logic from https://github.com/blab/fitness-dynamics where the prepare_clade_data rule that calls scripts/prepare-data.py is based on a defined analysis window of min_date and max_date rather than defining included_days. This is significantly cleaner for performing historical analyses.

Additionally, this drops references to cases (and the requirement of passing cases to prepare-data.py). We're not using cases in the MLR analysis, and they just add unused overhead.
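
A sketch of the resulting parameterization (the lambda and flag spellings here are illustrative, not the actual rule):

rule prepare_clade_data:
    params:
        min_date = lambda w: config["analysis_period"][w.analysis_period]["min_date"],
        max_date = lambda w: config["analysis_period"][w.analysis_period]["max_date"],
    shell:
        """
        python ./scripts/prepare-data.py \
            --min-date {params.min_date} \
            --max-date {params.max_date} \
            ...
        """
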
@trvrb (Member) left a comment

@marlinfiggins ---

Please update README.md with examples of how to actually run this workflow. I'd like to be able to work through the readme to understand how to interact with the workflow. See https://github.com/blab/fitness-flux for a pretty minimal example of what I mean.

@trvrb (Member) commented Dec 17, 2024

@marlinfiggins: Thanks for providing instructions on how to run the workflow. However, I'm running into errors. If I reduce analysis_period in config.yaml to just:

analysis_period:
  xbb15:
    min_date: "2023-01-01"
    max_date: "2023-12-01"
    pivot: "XBB.1.5"
    force_include: "defaults/xbb15/force_include_lineages.txt"
    predictor_names:
      - "spike pseudovirus DMS human sera escape relative to XBB.1.5"
      - "spike pseudovirus DMS ACE2 binding relative to XBB.1.5"
      - "RBD yeast-display DMS ACE2 affinity relative to XBB.1.5"
      - "RBD yeast-display DMS RBD expression relative to XBB.1.5"
      - "RBD yeast-display DMS escape relative to XBB.1.5"

update to the latest Docker image with nextstrain update, and run nextstrain build . -j 1 all, I get the following error:

python ./scripts/run-innovation-model.py \
    --seq-counts data/xbb15/collapsed_seq_counts.tsv \
    --pango-relationships data/xbb15/pango_variant_relationships.tsv \
    --growth-advantage-path results/xbb15/growth_advantages.tsv \
    --growth-advantage-delta-path results/xbb15/growth_advantages_delta.tsv \
    --posterior-path results/xbb15/posteriors \
    --pivot XBB.1.5
Traceback (most recent call last):
  File "/nextstrain/build/./scripts/run-innovation-model.py", line 6, in <module>
    import evofr as ef
  File "/usr/local/lib/python3.10/site-packages/evofr/__init__.py", line 13, in <module>
    from .infer import InferFullRank  # , BlackJaxHandler; InferBlackJax,
  File "/usr/local/lib/python3.10/site-packages/evofr/infer/__init__.py", line 2, in <module>
    from .InferMCMC import InferMCMC, InferNUTS
  File "/usr/local/lib/python3.10/site-packages/evofr/infer/InferMCMC.py", line 10, in <module>
    from .MCMC_handler import MCMCHandler
  File "/usr/local/lib/python3.10/site-packages/evofr/infer/MCMC_handler.py", line 5, in <module>
    from jax._src.random import KeyArray
ImportError: cannot import name 'KeyArray' from 'jax._src.random' (/usr/local/lib/python3.10/site-packages/jax/_src/random.py)

Are there dependencies that need updating for this to run?

@trvrb (Member) commented Dec 17, 2024

This can be resolved via blab/evofr#44.
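
For context, newer jax releases deprecated and then removed KeyArray, which is why the import above fails. The change needed in evofr is roughly of this shape (a sketch of the kind of fix, not the actual diff in that PR):

# before: breaks on recent jax, which no longer exposes KeyArray
from jax._src.random import KeyArray

# after: PRNG keys are ordinary jax.Array values in recent jax
from jax import Array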

I suggest updating the ncov-escape README.md with current installation instructions. Until the fix cascades down to the default Docker image, this means running nextstrain update, then nextstrain shell, and updating evofr to the latest version inside the shell. The readme should have installation instructions in any case.
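
Concretely, something along these lines (evofr is published on PyPI):

nextstrain update
nextstrain shell .

# then, inside the Nextstrain shell:
pip install --upgrade evofr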

I've made this revision in the below commit.

@trvrb (Member) commented Dec 19, 2024

@marlinfiggins ---

After the update to the Nextstrain Docker image, I'm no longer getting the above jax error. However, if I attempt to run the repo in its current state without touching config.yaml, I get the following error:

MissingInputException in rule process_metadata in file /nextstrain/build/workflow/snakemake_rules/run_models_over_period.smk, line 2:
Missing input files for rule process_metadata:
    output: data/xbb15_windowed/sequence_counts_by_submission.tsv
    wildcards: analysis_period=xbb15_windowed
    affected files:
        data/gisaid_metadata.tsv.gz

The config looks like it should just use what's in analysis_period.xbb15. However, if I take config.yaml and delete the other entries in analysis_period, keeping only xbb15, I get different behavior, with a different error after it runs through a number of steps:

rule download_phenotypes:
    output: data/xbb15/phenotypes/mutation_phenotypes.csv, data/xbb15/phenotypes/mutation_phenotypes_randomized.csv, data/xbb15/phenotypes/lineage_phenotypes.csv, data/xbb15/phenotypes/lineage_phenotypes_randomized.csv
    jobid: 7
    reason: Missing output files: data/xbb15/phenotypes/lineage_phenotypes.csv
    wildcards: analysis_period=xbb15
    resources: tmpdir=/tmp


python ./scripts/provision-phenotypes.py \
    --config-path config.yaml \
    --base-url https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main \
    --output-path data/xbb15/phenotypes
No `output` key in config. Falling back to default files.
Downloading https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/mutation_phenotypes.csv...
Saved to data/xbb15/phenotypes/mutation_phenotypes.csv
Downloading https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/mutation_phenotypes_randomized.csv...
Saved to data/xbb15/phenotypes/mutation_phenotypes_randomized.csv
Downloading https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/lineage_phenotypes.csv...
Failed to download https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/lineage_phenotypes.csv: 404 Client Error: Not Found for url: https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/lineage_phenotypes.csv
Downloading https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/lineage_phenotypes_randomized.csv...
Failed to download https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/lineage_phenotypes_randomized.csv: 404 Client Error: Not Found for url: https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/lineage_phenotypes_randomized.csv

Marlin, this doesn't have to be a correct / complete analysis, but I really would like the default analysis and what's specified in the readme to at least run. Can you take another look at this? Please try working from a fresh clone of the repo to make sure things run.

Noting here that if I swap xbb15 to the provided jn1 analysis period, the workflow completes successfully. I'd be 100% okay merging this PR if you just reduce the scope to something like this, and then we'll get phenotypic predictors working in a subsequent PR.

@trvrb (Member) left a comment

Blocking for now: the default config.yaml and the instructions in the readme need to complete without error. See the comment above.

@marlinfiggins (Author) commented Dec 19, 2024

Sorry about this, I've made some changes.

The bug you were running into was in running the regression prior model: it was attempting to fetch the phenotypes from the wrong files.

The issue is that I had to switch the code to use the old version of the predictors, clade_phenotypes.csv, present on the main branch of SARS2-spike-predictor-phenos. There were several differences between that version of the phenotypes and the one that lives on the lineage phenotypes PR, which is named lineage_phenotypes.csv.

Previously, I just grabbed this by hand from the main branch and renamed the relevant columns to be consistent since I expected this PR to be merged by the time we tested this.

For now, I've changed this and the relevant code to use the phenotypes from the main branch, but I'll just revert these changes once the lineage phenotypes PR is properly merged.
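
For the record, the by-hand step amounted to something like this (a sketch with hypothetical column names; the real schemas differ between the two branches):

import pandas as pd

# Hypothetical sketch: align the main-branch clade_phenotypes.csv with the
# column names expected from the lineage phenotypes PR. The "clade" ->
# "lineage" rename is illustrative, not the actual schema.
phenos = pd.read_csv("clade_phenotypes.csv")
phenos = phenos.rename(columns={"clade": "lineage"})
phenos.to_csv("lineage_phenotypes.csv", index=False)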

@marlinfiggins (Author) commented

I've also removed the window analysis for now, until I fix the location of the gisaid_metadata file it depends on.

@marlinfiggins (Author) commented

I've also added the window analysis back and fixed the instructions for provisioning the metadata.

@trvrb (Member) commented Dec 24, 2024

@marlinfiggins --- Thanks so much for humoring me with the detailed instructions. I was able to run this workflow for different analysis periods and everything worked as expected. I like the setup of data/{analysis_period}/ for sequence counts and variant relationships and the corresponding results/{analysis_period}/ for model outputs.
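
Concretely, from the xbb15 run above:

data/xbb15/collapsed_seq_counts.tsv
data/xbb15/pango_variant_relationships.tsv
results/xbb15/growth_advantages.tsv
results/xbb15/growth_advantages_delta.tsv
results/xbb15/posteriors/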

@trvrb self-requested a review December 24, 2024 00:04
@trvrb (Member) left a comment

Everything looks great at this point.

@trvrb merged commit 6abc677 into master on Dec 24, 2024
@trvrb deleted the rewrite branch December 24, 2024 00:05