
First pass at making snakemake workflow for innovation model #11

Merged: 54 commits into master from the rewrite branch on Dec 24, 2024
Conversation

@marlinfiggins commented Feb 13, 2024

This PR implements a Snakemake workflow for provisioning sequence counts, using methods similar to those in forecasts-ncov. This is my first time working with Snakemake, so any suggestions, comments, or questions are appreciated.

Major changes include:

  • Specifying several analysis periods, each with a different pivot variant relative to which growth advantages are estimated. (See config.yaml for the specification; a sketch appears after this list.)
  • Provisioning data sets for each analysis period (See workflow/snakemake_rules/prepare_data.smk)
  • Running the innovation model with the specified period for all qualifying locations (See scripts/run-innovation-model)
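
For example, an analysis-period entry has roughly this shape (abbreviated; the full set of per-period fields lives in config.yaml):

analysis_period:
  xbb15:
    pivot: "XBB.1.5"
    # per-period fields such as the analysis date range also live here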

Still to be accomplished:

  • Checking the Python implementation of mlr-fitness/data/pango-relationships.nb for correctness
  • Testing the workflow in general
  • Implementing phenotype prediction using DMS data / Snakemake

Note: I've borrowed the files scripts/prepare-data.py and scripts/collapse-lineages.py directly from forecasts-ncov. Let me know if there's a better way of doing this.

@marlinfiggins
Copy link
Author

marlinfiggins commented Feb 14, 2024

I've tried converting the compare_natural notebook into a Python script, scripts/compute-phenotype, which can be used to generate the phenotype files specified in prepare_data.smk (the corresponding rule is compute_phenotype; a sketch follows the list below).

There are a couple of remaining tasks for this section:

  • Ensuring pairs for predictor differences are generated with the parent-variant relationships in pango_variant_relationships
  • Figuring out the best way of specifying which phenotypes to generate. I imagine a section in config.yaml would be ideal?
  • Processing phenotypes into a single predictors file (or a collection of them)
  • Testing in general
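
For reference, a minimal sketch of the shape of the compute_phenotype rule (the paths and flag names here are illustrative, not the actual definition):

rule compute_phenotype:
    input:
        relationships = "data/{analysis_period}/pango_variant_relationships.tsv",
    output:
        phenotypes = "data/{analysis_period}/phenotypes.tsv",
    shell:
        """
        python ./scripts/compute-phenotype \
            --pango-relationships {input.relationships} \
            --output-path {output.phenotypes}
        """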

@trvrb (Member) commented Oct 6, 2024

Thanks so much for diving in here @marlinfiggins. I'm sorry that I didn't notice this PR before you pointed it out to me last month. I just tried working from the rewrite branch and managed to fix some initial Snakemake bugs that were throwing errors. However, I now encounter:

nextstrain build --cpus 1 . data/gisaid/pango_lineages/global/xbb15/prepared_cases.tsv
Building DAG of jobs...
MissingInputException in rule prepare_clade_data in file /nextstrain/build/workflow/snakemake_rules/prepare_data.smk, line 43:
Missing input files for rule prepare_clade_data:
    output: data/gisaid/pango_lineages/global/xbb15/prepared_cases.tsv, data/gisaid/pango_lineages/global/xbb15/prepared_seq_counts.tsv
    wildcards: data_provenance=gisaid, variant_classification=pango_lineages, geo_resolution=global, analysis_period=xbb15
    affected files:
        data/cases/global.tsv.gz
        data/gisaid/pango_lineages/global.tsv.gz

I think you're assuming that local files like data/gisaid/pango_lineages/global.tsv.gz exist? Can you please add to the README.md the steps needed to provision data before running Snakemake?

I just did almost exactly this over here: https://github.com/blab/fitness-dynamics?tab=readme-ov-file#provision-metadata-locally.

I can continue review once I know how to provision local data.
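
For instance, something along these lines would presumably work here (an untested sketch; I believe forecasts-ncov publishes prepared sequence counts under data.nextstrain.org/files/workflows/forecasts-ncov/, with the URL depending on data provenance, variant classification, and geo resolution):

mkdir -p data/gisaid/pango_lineages
curl -fsSL https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/pango_lineages/global.tsv.gz \
    -o data/gisaid/pango_lineages/global.tsv.gz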


Also, separate question: can we just drop data/{data_provenance}/{variant_classification}/{geo_resolution}/{analysis_period}/prepared_cases.tsv? I don't think cases feed into any of the MLR analyses.

marlinfiggins and others added 5 commits October 24, 2024 15:23
This copies over logic from https://github.com/blab/fitness-dynamics where the prepare_clade_data rule that calls scripts/prepare-data.py is based on a defined analysis window of min_date and max_date rather than defining included_days. This is significantly cleaner for performing historical analyses.

Additionally, this drops references to cases (and the requirement of passing cases to prepare-data.py). We're not using cases in the MLR analysis, and they just add unused overhead.
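
A sketch of the resulting parameterization (the lambda and flag spellings here are illustrative, not the actual rule):

rule prepare_clade_data:
    params:
        min_date = lambda w: config["analysis_period"][w.analysis_period]["min_date"],
        max_date = lambda w: config["analysis_period"][w.analysis_period]["max_date"],
    shell:
        """
        python ./scripts/prepare-data.py \
            --min-date {params.min_date} \
            --max-date {params.max_date} \
            ...
        """
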
@trvrb (Member) left a comment

@marlinfiggins ---

Please update README.md with examples of how to actually run this workflow. I'd like to be able to work through the readme to understand how to interact with the workflow. See https://github.com/blab/fitness-flux for a pretty minimal example of what I mean.

@trvrb (Member) commented Dec 17, 2024

@marlinfiggins: Thanks for providing instructions on how to run the workflow. However, I'm running into errors. If I reduce analysis_period in config.yaml to just:

analysis_period:
  xbb15:
    min_date: "2023-01-01"
    max_date: "2023-12-01"
    pivot: "XBB.1.5"
    force_include: "defaults/xbb15/force_include_lineages.txt"
    predictor_names:
      - "spike pseudovirus DMS human sera escape relative to XBB.1.5"
      - "spike pseudovirus DMS ACE2 binding relative to XBB.1.5"
      - "RBD yeast-display DMS ACE2 affinity relative to XBB.1.5"
      - "RBD yeast-display DMS RBD expression relative to XBB.1.5"
      - "RBD yeast-display DMS escape relative to XBB.1.5"

update to the latest Docker image with nextstrain update, and run nextstrain build . -j 1 all, I get the following error:

python ./scripts/run-innovation-model.py \
    --seq-counts data/xbb15/collapsed_seq_counts.tsv \
    --pango-relationships data/xbb15/pango_variant_relationships.tsv \
    --growth-advantage-path results/xbb15/growth_advantages.tsv \
    --growth-advantage-delta-path results/xbb15/growth_advantages_delta.tsv \
    --posterior-path results/xbb15/posteriors \
    --pivot XBB.1.5
Traceback (most recent call last):
  File "/nextstrain/build/./scripts/run-innovation-model.py", line 6, in <module>
    import evofr as ef
  File "/usr/local/lib/python3.10/site-packages/evofr/__init__.py", line 13, in <module>
    from .infer import InferFullRank  # , BlackJaxHandler; InferBlackJax,
  File "/usr/local/lib/python3.10/site-packages/evofr/infer/__init__.py", line 2, in <module>
    from .InferMCMC import InferMCMC, InferNUTS
  File "/usr/local/lib/python3.10/site-packages/evofr/infer/InferMCMC.py", line 10, in <module>
    from .MCMC_handler import MCMCHandler
  File "/usr/local/lib/python3.10/site-packages/evofr/infer/MCMC_handler.py", line 5, in <module>
    from jax._src.random import KeyArray
ImportError: cannot import name 'KeyArray' from 'jax._src.random' (/usr/local/lib/python3.10/site-packages/jax/_src/random.py)

Are there dependencies that need updating for this to run?

@trvrb (Member) commented Dec 17, 2024

This can be resolved via blab/evofr#44.
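
For context, newer jax releases deprecated and then removed KeyArray, which is why the import above fails. The change needed in evofr is roughly of this shape (a sketch of the kind of fix, not the actual diff in that PR):

# before: breaks on recent jax, which no longer exposes KeyArray
from jax._src.random import KeyArray

# after: PRNG keys are ordinary jax.Array values in recent jax
from jax import Array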

I suggest updating the ncov-escape README.md with current installation instructions. Until the fix cascades down to the default Docker image, this means running nextstrain update, then nextstrain shell, and updating evofr to the latest version inside the shell. The readme should have installation instructions in any case.
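
Concretely, something along these lines (evofr is published on PyPI):

nextstrain update
nextstrain shell .

# then, inside the Nextstrain shell:
pip install --upgrade evofr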

I've made this revision in the below commit.

@trvrb (Member) commented Dec 19, 2024

@marlinfiggins ---

After the update to the Nextstrain Docker image, I'm no longer getting the above jax error. However, if I attempt to run the repo in its current state without touching config.yaml, I get the following error:

MissingInputException in rule process_metadata in file /nextstrain/build/workflow/snakemake_rules/run_models_over_period.smk, line 2:
Missing input files for rule process_metadata:
    output: data/xbb15_windowed/sequence_counts_by_submission.tsv
    wildcards: analysis_period=xbb15_windowed
    affected files:
        data/gisaid_metadata.tsv.gz

The config looks like it should just use what's in analysis_period.xbb15. However, if I take config.yaml and delete the other entries in analysis_period, keeping only xbb15, I get different behavior, with a different error after it runs through a number of steps:

rule download_phenotypes:
    output: data/xbb15/phenotypes/mutation_phenotypes.csv, data/xbb15/phenotypes/mutation_phenotypes_randomized.csv, data/xbb15/phenotypes/lineage_phenotypes.csv, data/xbb15/phenotypes/lineage_phenotypes_randomized.csv
    jobid: 7
    reason: Missing output files: data/xbb15/phenotypes/lineage_phenotypes.csv
    wildcards: analysis_period=xbb15
    resources: tmpdir=/tmp


python ./scripts/provision-phenotypes.py \
    --config-path config.yaml \
    --base-url https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main \
    --output-path data/xbb15/phenotypes
No `output` key in config. Falling back to default files.
Downloading https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/mutation_phenotypes.csv...
Saved to data/xbb15/phenotypes/mutation_phenotypes.csv
Downloading https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/mutation_phenotypes_randomized.csv...
Saved to data/xbb15/phenotypes/mutation_phenotypes_randomized.csv
Downloading https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/lineage_phenotypes.csv...
Failed to download https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/lineage_phenotypes.csv: 404 Client Error: Not Found for url: https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/lineage_phenotypes.csv
Downloading https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/lineage_phenotypes_randomized.csv...
Failed to download https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/lineage_phenotypes_randomized.csv: 404 Client Error: Not Found for url: https://raw.githubusercontent.com/jbloomlab/SARS2-spike-predictor-phenos/refs/heads/main/results/lineage_phenotypes_randomized.csv

Marlin, this doesn't have to be a correct / complete analysis, but I really would like the default analysis and what's specified in the readme to at least run. Can you take another look at this? Please try working from a fresh clone of the repo to make sure things run.

Noting here that if I swap xbb15 to the provided jn1 analysis period, the workflow completes successfully. I'd be 100% okay merging this PR if you just reduce the scope to something like this, and then we'll get phenotypic predictors working in a subsequent PR.

@trvrb (Member) left a comment

Blocking for now: the default config.yaml and the instructions in the readme need to complete without error. See the comment above.

@marlinfiggins (Author) commented Dec 19, 2024

Sorry about this, I've made some changes.

The bug you were running into was in running the regression prior model: it was attempting to fetch the phenotypes from the wrong files.

The issue is that I had to switch the code to use the old version of the predictors, clade_phenotypes.csv, present on the main branch of SARS2-spike-predictor-phenos. There were several differences between that version of the phenotypes and the one that lives on the lineage phenotypes PR, which is named lineage_phenotypes.csv.

Previously, I just grabbed this by hand from the main branch and renamed the relevant columns to be consistent since I expected this PR to be merged by the time we tested this.

For now, I've changed this and the relevant code to use the phenotypes from the main branch, but I'll just revert these changes once the lineage phenotypes PR is properly merged.
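
For the record, the by-hand step amounted to something like this (a sketch with hypothetical column names; the real schemas differ between the two branches):

import pandas as pd

# Hypothetical sketch: align the main-branch clade_phenotypes.csv with the
# column names expected from the lineage phenotypes PR. The "clade" ->
# "lineage" rename is illustrative, not the actual schema.
phenos = pd.read_csv("clade_phenotypes.csv")
phenos = phenos.rename(columns={"clade": "lineage"})
phenos.to_csv("lineage_phenotypes.csv", index=False)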

@marlinfiggins (Author) commented

I've also removed the window analysis for now, until I fix the location of the gisaid_metadata file it depends on.

@marlinfiggins (Author) commented

I've also added the window analysis back and fixed the instructions for provisioning the metadata.

@trvrb (Member) commented Dec 24, 2024

@marlinfiggins --- Thanks so much for humoring me with the detailed instructions. I was able to run this workflow for different analysis periods and everything worked as expected. I like the setup of data/{analysis_period}/ for sequence counts and variant relationships and the corresponding results/{analysis_period}/ for model outputs.
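
Concretely, from the xbb15 run above:

data/xbb15/collapsed_seq_counts.tsv
data/xbb15/pango_variant_relationships.tsv
results/xbb15/growth_advantages.tsv
results/xbb15/growth_advantages_delta.tsv
results/xbb15/posteriors/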

@trvrb self-requested a review December 24, 2024 00:04
@trvrb (Member) left a comment

Everything looks great at this point.

@trvrb merged commit 6abc677 into master on Dec 24, 2024
@trvrb deleted the rewrite branch December 24, 2024 00:05