set up basic README, add more via mkdocs etc (#54)
* set up basics, add more via mkdocs etc

* change coverage and pycov version to fix testing

* started writing stuff

* set up basics
leokim-l authored Nov 14, 2024
1 parent 90b99aa commit 62f563b
Showing 13 changed files with 2,497 additions and 2,000 deletions.
92 changes: 24 additions & 68 deletions README.md
@@ -1,79 +1,35 @@
# MALCO
# pheval.llm

Multilingual Analysis of LLMs for Clinical Observations
![Contributors](https://img.shields.io/github/contributors/monarch-initiative/pheval.llm?style=plastic)
![Stars](https://img.shields.io/github/stars/monarch-initiative/pheval.llm)
![Licence](https://img.shields.io/github/license/monarch-initiative/pheval.llm)
![Issues](https://img.shields.io/github/issues/monarch-initiative/pheval.llm)

Built using the PhEval runner template (see instructions below).
## Evaluate LLMs' ability to perform differential diagnosis for rare genetic diseases via medical-vignette-like prompts created with [phenopacket2prompt](https://github.com/monarch-initiative/phenopacket2prompt).

# Usage
Let us start by documenting how to run the current version in a new folder (this is subject to change!).
```shell
poetry install
poetry shell
mkdir myinputdirectory
mkdir myoutputdirectory
cp -r /path/to/promptdir myinputdirectory/
cp inputdir/config.yaml myinputdirectory
pheval run -i myinputdirectory -r "malcorunner" -o myoutputdirectory -t tests
```
### Description
To systematically assess and evaluate an LLM's ability to perform differential diagnostic tasks, we employed prompts programmatically created with [phenopacket2prompt](https://github.com/monarch-initiative/phenopacket2prompt), thereby avoiding any patient privacy issues. The original data are phenopackets located at [phenopacket-store](https://github.com/monarch-initiative/phenopacket-store/). We also developed a programmatic approach for scoring and grounding results, made possible by the ontological structure of the [Mondo Disease Ontology](https://mondo.monarchinitiative.org/).

Two main analyses are carried out:
- A benchmark of several OpenAI GPT models against a state-of-the-art tool for differential diagnosis, [Exomiser](https://github.com/exomiser/Exomiser). The bottom line: Exomiser [clearly outperforms the LLMs](https://github.com/monarch-initiative/pheval.llm/blob/short_letter/notebooks/plot_exomiser_o1MINI_o1PREVIEW_4o.ipynb).
- A comparison of gpt-4o's ability to carry out differential diagnosis when prompted in different languages.

## Template Runner for PhEval
Formerly MALCO, Multilingual Analysis of LLMs for Clinical Observations.
Built using the [PhEval](https://github.com/monarch-initiative/pheval) runner template.

This serves as a template repository for crafting a personalised PhEval runner. At present, the runner executes a mock predictor found in `src/pheval_template/run/fake_predictor.py`. The primary objective, however, is to use this repository as a starting point for developing a runner for your own tool: it already contains all the necessary setup for integration with PhEval, so you can customise and override the existing methods as needed. Exemplary methods throughout the runner illustrate how things could be implemented.

## Installation
# Usage
Before starting a run, take care to edit the [run parameters](inputdir/run_parameters.csv) as follows:
- The first line contains a non-empty, comma-separated list of (supported) language codes, each between double quotation marks, in which one wishes to prompt.
- The second line contains a non-empty, comma-separated list of (supported) model names, each between double quotation marks, which one wishes to prompt.
- The third line contains two comma-separated binary flags, written as 0 (false) or 1 (true). The first, when true, runs the prompting and grounding (the run step); the second executes the scoring and the rest of the analysis (the post-processing step).
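For example, the run parameters file shipped at [inputdir/run_parameters.csv](inputdir/run_parameters.csv) reads:

```csv
"en"
"gpt-4","gpt-3.5-turbo","gpt-4o","gpt-4-turbo"
0,1
```

These values select English and the four listed models, and execute only the post-processing step (the run flag is 0).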

```bash
git clone https://github.com/yaseminbridges/pheval.template.git
cd pheval.template
```

At this point one can install and run the code by doing:

```shell
poetry install
poetry shell
mkdir outputdirectory
cp -r /path/to/promptdir inputdir/
pheval run -i inputdir -r "malcorunner" -o outputdirectory -t tests
```

## Configuring a run with the template runner

A `config.yaml` should be located in the input directory and formatted like so:

```yaml
tool: template
tool_version: 1.0.0
variant_analysis: False
gene_analysis: True
disease_analysis: False
tool_specific_configuration_options:
```
The testdata directory should include a subdirectory named `phenopackets`, containing the phenopackets.

## Run command

```bash
pheval run --input-dir /path/to/input_dir \
--runner templatephevalrunner \
--output-dir /path/to/output_dir \
--testdata-dir /path/to/testdata_dir
```

## Benchmark

You can benchmark the run with the `pheval-utils benchmark` command:

```bash
pheval-utils benchmark --directory /path/to/output_directory \
--phenopacket-dir /path/to/phenopacket_dir \
--output-prefix OUTPUT_PREFIX \
--gene-analysis \
--plot-type bar_cumulative
```

The path provided to the `--directory` parameter should be the same as the one provided to `--output-dir` in the `pheval run` command.

## Personalising to your own tool

If you are overriding this template to create your own runner implementation, there are key files that should be changed to fit your implementation.

1. The name of the Runner class in `src/pheval_template/runner.py` should be changed.
2. Once the name of the Runner class has been customised, line 15 in `pyproject.toml` should be changed to match the class name; then run `poetry lock` and `poetry install`.

The runner you give on the CLI will then change to the name of the runner class.
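As an illustration only (assuming the template registers its runner as a Poetry plugin under PhEval's plugin group; the exact group name and module path come from your own `pyproject.toml`), the entry to edit might look like:

```toml
# hypothetical example: point the plugin entry at your renamed runner class
[tool.poetry.plugins."pheval.plugins"]
template = "pheval_template.runner:TemplatePhEvalRunner"
```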

You should also remove `src/pheval_template/run/fake_predictor.py` and implement the running of your own tool. Methods in the post-processing can also be altered to process your own tool's output.
7 changes: 7 additions & 0 deletions docs/analysis.md
@@ -0,0 +1,7 @@
# Scoring
In order to fairly score clinically accurate diagnoses - considering we are only using phenotypic data - we needed to match the answers grounded by an LLM (or by Exomiser) against the correct result present in the phenopacket, which consists of an OMIM identifier. This is illustrated in the image below.
![figure](images/mondo_grouping.png)
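The matching idea can be sketched as follows. This is a hypothetical illustration, not the project's actual code (which lives in `src.malco.post_process.mondo_score_utils`): the ontology is mocked with a toy child-to-parents map, and the partial-credit value is an assumption.

```python
# Hypothetical sketch of ontology-aware scoring: full credit for an exact
# match between the grounded term and the correct disease, partial credit
# when the grounded term is a subtype of it. The real pipeline walks the
# Mondo Disease Ontology; here a toy parent map stands in for it.

TOY_PARENTS = {
    "MONDO:toy_subtype": ["MONDO:toy_disease"],
    "MONDO:toy_disease": [],
}

def ancestors(term, parents):
    """Collect every ancestor of `term` by walking the parent map."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def score(grounded, correct, parents):
    if grounded == correct:
        return 1.0  # exact hit
    if correct in ancestors(grounded, parents):
        return 0.5  # grounded term is a subtype of the correct disease
    return 0.0      # miss

print(score("MONDO:toy_disease", "MONDO:toy_disease", TOY_PARENTS))  # 1.0
print(score("MONDO:toy_subtype", "MONDO:toy_disease", TOY_PARENTS))  # 0.5
```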

# Statistics

# More TBD
Binary file added docs/images/mondo_grouping.png
Binary file added docs/images/ppkt2score.png
10 changes: 10 additions & 0 deletions docs/index.md
@@ -0,0 +1,10 @@
# Welcome to pheval.llm, formerly MALCO

To systematically assess and evaluate an LLM's ability to perform differential diagnostic tasks, we employed prompts programmatically created with [phenopacket2prompt](https://github.com/monarch-initiative/phenopacket2prompt), thereby avoiding any patient privacy issues. The original data are phenopackets located at [phenopacket-store](https://github.com/monarch-initiative/phenopacket-store/). We also developed a programmatic approach for scoring and grounding results, made possible by the ontological structure of the [Mondo Disease Ontology](https://mondo.monarchinitiative.org/).

Two main analyses are carried out:
- A benchmark of several OpenAI GPT models against a state-of-the-art tool for differential diagnosis, [Exomiser](https://github.com/exomiser/Exomiser). The bottom line: Exomiser [clearly outperforms the LLMs](https://github.com/monarch-initiative/pheval.llm/blob/short_letter/notebooks/plot_exomiser_o1MINI_o1PREVIEW_4o.ipynb).
- A comparison of gpt-4o's ability to carry out differential diagnosis when prompted in different languages.

## Project layout
A description of the steps we take is found in the figure below.
![figure](images/ppkt2score.png)
7 changes: 7 additions & 0 deletions docs/layout.md
@@ -0,0 +1,7 @@
The code is organised into the following steps:

### Prepare step

### Run step

### Post process step
3 changes: 3 additions & 0 deletions docs/reference.md
@@ -0,0 +1,3 @@
The grounding happens via:

::: src.malco.post_process.mondo_score_utils
10 changes: 10 additions & 0 deletions docs/run.md
@@ -0,0 +1,10 @@
# Grounding
As of November 2024, LLMs show little ability to precisely and reliably return the unique identifier of an entity in a database, so we need to deal with this issue. To transform a human-language disease name such as "cystic fibrosis" into its corresponding [OMIM identifier OMIM:219700](https://omim.org/entry/219700), we use the following approach:

<!--- Add links to files as soon as they are merged--->
1. First, we try exact lexical matching between the LLM's reply and the OMIM disease labels.
2. Then we run [CurateGPT](https://github.com/monarch-initiative/curategpt) on the remaining replies that have not been grounded.

Note that we ground to Mondo identifiers.
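The two-step approach above can be sketched as follows. This is a hypothetical illustration: `LABEL_TO_ID` and `curategpt_fallback` are stand-ins, not the project's actual API, and the real fallback calls CurateGPT rather than returning `None`.

```python
# Hypothetical sketch of two-step grounding: exact lexical matching against
# disease labels first, then a semantic fallback (CurateGPT in the real
# pipeline, stubbed out here).

LABEL_TO_ID = {
    "cystic fibrosis": "OMIM:219700",  # example from the text above
    "marfan syndrome": "OMIM:154700",
}

def curategpt_fallback(text):
    """Stand-in for the CurateGPT call used on replies that fail step 1."""
    return None  # the real fallback performs an ontology-aware search

def ground(reply):
    key = reply.strip().lower()        # normalise before the lexical lookup
    if key in LABEL_TO_ID:
        return LABEL_TO_ID[key]        # step 1: exact lexical match
    return curategpt_fallback(reply)   # step 2: semantic fallback

print(ground("Cystic Fibrosis"))  # OMIM:219700
```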

# OntoGPT
3 changes: 3 additions & 0 deletions docs/run_parameters.csv
@@ -0,0 +1,3 @@
"en"
"gpt-4","gpt-3.5-turbo","gpt-4o","gpt-4-turbo"
0,1
16 changes: 16 additions & 0 deletions docs/setup.md
@@ -0,0 +1,16 @@
Before starting a run, take care to edit the [run parameters](inputdir/run_parameters.csv) as follows:

- The first line contains a non-empty, comma-separated list of (supported) language codes, each between double quotation marks, in which one wishes to prompt.
- The second line contains a non-empty, comma-separated list of (supported) model names, each between double quotation marks, which one wishes to prompt.
- The third line contains two comma-separated binary flags, written as 0 (false) or 1 (true). The first, when true, runs the prompting and grounding (the run step); the second executes the scoring and the rest of the analysis (the post-processing step).

At this point one can install and run the code by doing:
```shell
poetry install
poetry shell
mkdir outputdirectory
cp -r /path/to/promptdir inputdir/
pheval run -i inputdir -r "malcorunner" -o outputdirectory -t tests
```

As an example, the [input file](https://github.com/monarch-initiative/pheval.llm/tree/main/docs/run_parameters.csv) shown will execute only the post-processing block for English, for the models gpt-4, gpt-3.5-turbo, gpt-4o, and gpt-4-turbo.
3 changes: 3 additions & 0 deletions inputdir/run_parameters.csv
@@ -0,0 +1,3 @@
"en"
"gpt-4","gpt-3.5-turbo","gpt-4o","gpt-4-turbo"
0,1