Add litgpt evaluate command #1177

Merged 49 commits on Apr 4, 2024

Changes from 10 commits

Commits
375e99e
`litgpt evaluate` command
rasbt Mar 21, 2024
0c53da1
update package dependench
rasbt Mar 21, 2024
669ce22
add llm-eval dependency
rasbt Mar 21, 2024
d161d12
move imports
rasbt Mar 21, 2024
e7ebfbc
update cli test
rasbt Mar 21, 2024
4660507
cleanup
rasbt Mar 21, 2024
018cc89
eval unit test
rasbt Mar 22, 2024
98130f9
run tests on cpu
rasbt Mar 22, 2024
a549535
Add lm-eval to test dependencies
rasbt Mar 22, 2024
7ff3ff2
bump version
rasbt Mar 22, 2024
c47e764
Update litgpt/scripts/evaluate.py
rasbt Mar 25, 2024
042d2a5
Update litgpt/scripts/evaluate.py
rasbt Mar 25, 2024
359dad5
Update litgpt/scripts/evaluate.py
rasbt Mar 25, 2024
9bbc5cc
Merge branch 'main' into litgpt-eval
rasbt Mar 25, 2024
f7147c4
make args required
rasbt Mar 25, 2024
0786285
automatically infer repo_id
rasbt Mar 25, 2024
b54095d
check out_dir defaults
rasbt Mar 25, 2024
4c77a6a
move evaluate.py
rasbt Mar 25, 2024
223eb95
Merge branch 'main' into litgpt-eval
rasbt Mar 25, 2024
96d8229
Deps
carmocca Mar 26, 2024
9d9ef7c
Extra file
carmocca Mar 26, 2024
5abec5a
fix import
awaelchli Mar 27, 2024
966ff3e
fix evaluate reference
rasbt Mar 27, 2024
9b2ae7d
fix doc formatting
rasbt Mar 27, 2024
bb4ea30
prototype
rasbt Mar 27, 2024
8988dda
Add batch size
rasbt Mar 28, 2024
f7a46f1
Merge branch 'main' into litgpt-eval
rasbt Mar 28, 2024
45968da
revert to saving temp file and fix output print
rasbt Mar 29, 2024
4c712a2
Merge branch 'main' into litgpt-eval
rasbt Mar 29, 2024
b3b693e
run test on cpu
rasbt Mar 29, 2024
17d4aa2
update tests and docs
rasbt Mar 29, 2024
296101d
update
rasbt Mar 29, 2024
bacd1d6
fix test
rasbt Mar 30, 2024
afaee75
fix test
rasbt Mar 30, 2024
687a382
fix test
rasbt Mar 30, 2024
cdb06c6
Merge branch 'main' into litgpt-eval
rasbt Mar 30, 2024
5faa293
fix tests
rasbt Mar 30, 2024
1c3686c
extend tests
rasbt Mar 30, 2024
a881630
finally fixed
rasbt Mar 30, 2024
1ca218b
Merge branch 'main' into litgpt-eval
rasbt Apr 1, 2024
012ad9b
add new pretrain image
rasbt Apr 2, 2024
8c55ca1
Merge branch 'main' into litgpt-eval
rasbt Apr 2, 2024
9b381c1
Parametrize CLI test
carmocca Apr 3, 2024
b53b688
Minor fixes
carmocca Apr 3, 2024
6cc84ab
Merge branch 'main' into litgpt-eval
carmocca Apr 3, 2024
887ff61
Update evaluation.md
rasbt Apr 3, 2024
6e9e238
Merge branch 'main' into litgpt-eval
rasbt Apr 3, 2024
5a944d2
Apply suggestions from code review
carmocca Apr 4, 2024
efb6ca4
Update tutorials/evaluation.md
carmocca Apr 4, 2024
2 changes: 2 additions & 0 deletions litgpt/__main__.py
@@ -23,6 +23,7 @@
)
from litgpt.scripts.download import download_from_hub as download_fn
from litgpt.scripts.merge_lora import merge_lora as merge_lora_fn
from litgpt.scripts.evaluate import convert_and_evaluate as evaluate_fn

if TYPE_CHECKING:
    from jsonargparse import ArgumentParser
@@ -78,6 +79,7 @@ def main() -> None:
            },
        },
        "merge_lora": {"help": "Merges the LoRA weights with the base model.", "fn": merge_lora_fn},
        "evaluate": {"help": "Evaluate a model with the LM Evaluation Harness.", "fn": evaluate_fn},
    }

    from jsonargparse import set_config_read_mode, set_docstring_parse_options
116 changes: 116 additions & 0 deletions litgpt/scripts/evaluate.py
@@ -0,0 +1,116 @@
# Copyright Lightning AI. Licensed under the Apache License 2.0, see LICENSE file.

import json
import os
from pathlib import Path
from typing import Optional

import torch

from litgpt.scripts.convert_lit_checkpoint import convert_lit_checkpoint
from litgpt.utils import CLI, copy_config_files


def safe_safetensors(out_dir, repo_id):
    from transformers import AutoModel

    state_dict = torch.load(out_dir / "model.pth")
    model = AutoModel.from_pretrained(repo_id, state_dict=state_dict)
    model.save_pretrained(out_dir)


def prepare_results(results, save_filepath, print_results=True):
    from lm_eval.utils import make_table

    if print_results:
        print(make_table(results))
        if "groups" in results:
            print(make_table(results, "groups"))

    json_result = json.dumps(results, indent=2, ensure_ascii=False)
    save_filepath.open("w", encoding="utf-8").write(json_result)


def convert_and_evaluate(
    checkpoint_dir: Optional[str] = None,
    out_dir: Optional[str] = None,
    repo_id: Optional[str] = None,
    skip_conversion: bool = False,
    tasks: Optional[str] = "hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge",
    num_fewshot: Optional[int] = None,
    batch_size: int = 1,
    device: Optional[str] = None,
    limit: Optional[float] = None,
    seed: int = 1234,
    save_filepath: Optional[str] = None,
) -> None:
    """Convert a LitGPT model and run the LM Evaluation Harness.

    Arguments:
        checkpoint_dir: Directory where the `lit_model.pth` and tokenizer files are located.
        out_dir: Directory in which to save the converted checkpoints for evaluation.
        repo_id: The original repo ID the model was derived from.
        skip_conversion: Set to `True` to skip the model conversion,
            assuming the model has already been converted and the
            model.pth and .safetensor files exist.
        tasks: Comma-separated list of task names to evaluate.
            By default, the Open LM Leaderboard tasks are used:
            "hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge"
        num_fewshot: Number of examples in the few-shot context.
        batch_size: Batch size to use for evaluation.
        device: Device to use for evaluation, for example, "cuda" or "cuda:0".
        limit: Limit on the number of examples per task.
        seed: Random seed.
        save_filepath: The file where the results will be saved.
            Saves to `out_dir`/results.json by default.
    """

    from lm_eval import evaluator

    if checkpoint_dir is None:
        raise ValueError("Provide a checkpoint_dir argument.")
    if out_dir is None:
        raise ValueError("Provide an out_dir argument.")
    if repo_id is None:
        raise ValueError("Provide a repo_id argument.")

    checkpoint_dir, out_dir = Path(checkpoint_dir), Path(out_dir)

    if save_filepath is None:
        save_filepath = "results.json"
        save_filepath = out_dir / Path(save_filepath)
    else:
        save_filepath = Path(save_filepath)

    out_dir.mkdir(parents=True, exist_ok=True)

    copy_config_files(source_dir=checkpoint_dir, out_dir=out_dir)

    if not skip_conversion:
        convert_lit_checkpoint(checkpoint_dir=checkpoint_dir, output_dir=out_dir)
        safe_safetensors(out_dir, repo_id)

    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    results = evaluator.simple_evaluate(
        model="hf",
        model_args=f"pretrained={out_dir}",
        tasks=tasks.split(","),
        num_fewshot=num_fewshot,
        batch_size=batch_size,
        device=device,
        limit=limit,
        random_seed=seed,
        numpy_random_seed=seed,
        torch_random_seed=seed,
    )

    print("results", results)
    prepare_results(results, save_filepath)


if __name__ == "__main__":
    CLI(convert_and_evaluate)
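
For reference, the new `evaluate` subcommand maps to `convert_and_evaluate` above, so the same run can also be started from Python. The sketch below is a minimal example based only on the signature shown in this diff; the checkpoint and output paths are placeholders, and the reduced task list and `limit` are only there to keep a test run short.

```python
# Minimal sketch (assumed paths); mirrors `litgpt evaluate` but called from Python.
from litgpt.scripts.evaluate import convert_and_evaluate

convert_and_evaluate(
    checkpoint_dir="checkpoints/microsoft/phi-2",  # directory with lit_model.pth and tokenizer files
    out_dir="evaluate_model",                      # converted checkpoint and results.json land here
    repo_id="microsoft/phi-2",                     # original Hugging Face repo the checkpoint derives from
    tasks="hellaswag",                             # smaller than the default Open LM Leaderboard task set
    limit=10,                                      # cap examples per task for a quick smoke test
    device="cpu",                                  # or "cuda" / "cuda:0"
)
```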
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -28,6 +28,7 @@ test = [
"pytest-timeout",
"transformers>=4.38.0",
"einops",
"lm-eval>=0.42.0",
"protobuf",
"lightning-thunder; python_version >= '3.10'",
]
@@ -43,6 +44,8 @@ all = [
"pyarrow", # litgpt.data.prepare_starcoder.py
"tensorboard", # litgpt.pretrain
"torchmetrics", # litgpt.pretrain
"transformers>=4.38.0", # litgpt.evaluate
"lm-eval>=0.42.0", # litgpt.evaluate
"safetensors", # download
"huggingface_hub[hf_transfer]>=0.21.0" # download
]
3 changes: 2 additions & 1 deletion tests/test_cli.py
@@ -15,14 +15,15 @@ def test_cli():
        main()
    out = out.getvalue()
    assert "usage: litgpt" in out
-    assert "{download,chat,finetune,pretrain,generate,convert,merge_lora}" in out
+    assert "{download,chat,finetune,pretrain,generate,convert,merge_lora,evaluate}" in out
    assert (
        """Available subcommands:
download Download weights or tokenizer data from the Hugging
Face Hub.
chat Chat with a model."""
        in out
    )
+    assert ("""evaluate Evaluate a model with the LM Evaluation Harness.""") in out

    out = StringIO()
    with pytest.raises(SystemExit), redirect_stdout(out), mock.patch("sys.argv", ["litgpt", "finetune", "-h"]):
62 changes: 62 additions & 0 deletions tests/test_evaluate.py
@@ -0,0 +1,62 @@
# Copyright Lightning AI. Licensed under the Apache License 2.0, see LICENSE file.

import sys
from pathlib import Path

import datasets
import pytest

from litgpt.scripts.download import download_from_hub
from litgpt.scripts.evaluate import safe_safetensors, prepare_results
from litgpt.scripts.convert_lit_checkpoint import convert_lit_checkpoint
from lm_eval import evaluator

# support running without installing as a package
wd = Path(__file__).parent.parent.resolve()
sys.path.append(str(wd))


@pytest.mark.xfail(
    raises=(datasets.builder.DatasetGenerationError, NotImplementedError),
    strict=False,
    match="Loading a dataset cached in a LocalFileSystem is not supported",
)
def test_run_eval(tmp_path, float_like):
    repo_id = "EleutherAI/pythia-14m"
    download_from_hub(repo_id=repo_id, checkpoint_dir=tmp_path)

    checkpoint_path = Path(tmp_path) / Path(repo_id)

    convert_lit_checkpoint(checkpoint_dir=checkpoint_path, output_dir=checkpoint_path)
    safe_safetensors(out_dir=checkpoint_path, repo_id=repo_id)

    eval_tasks = "coqa,hellaswag"
    results = evaluator.simple_evaluate(
        model="hf",
        model_args=f"pretrained={checkpoint_path}",
        tasks=eval_tasks.split(","),
        limit=2,
        device="cpu"
    )

    save_path = checkpoint_path / "results.json"
    prepare_results(results, save_path, print_results=False)

    print(checkpoint_path / "dump.txt")
    assert save_path.is_file()
    assert results["results"] == {
        'coqa': {
            'alias': 'coqa',
            'em,none': 0.0,
            'em_stderr,none': 0.0,
            'f1,none': 0.0,
            'f1_stderr,none': 0.0
        },
        'hellaswag': {
            'acc,none': 0.0,
            'acc_stderr,none': 0.0,
            'acc_norm,none': 0.5,
            'acc_norm_stderr,none': 0.5,
            'alias': 'hellaswag'
        }
    }
77 changes: 46 additions & 31 deletions tutorials/evaluation.md
@@ -9,59 +9,74 @@ You can evaluate LitGPT using [EleutherAI's lm-eval](https://github.com/Eleuther
You need to install the `lm-eval` framework first:

```bash
-pip install 'lm_eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@115206dc89dad67b8b'
+pip install lm_eval
```

&nbsp;

### Evaluating LitGPT base models

-Use the following command to evaluate LitGPT models on all tasks in Eleuther AI's Evaluation Harness.
+Suppose you downloaded a base model that we want to evaluate. Here, we use the `microsoft/phi-2` model:

```bash
-python eval/lm_eval_harness.py \
-    --checkpoint_dir "checkpoints/meta-llama/Llama-2-7b-hf" \
-    --precision "bf16-true" \
-    --save_filepath "results.json"
+litgpt download --repo_id microsoft/phi-2
```

-To evaluate on LLMs on specific tasks, for example, TruthfulQA and HellaSwag, you can use the `--eval_task` flag as follows:
+The download command above will save the model to the `checkpoints/microsoft/phi-2` directory, which we can
+specify in the following evaluation command:

```bash
-python eval/lm_eval_harness.py \
-    --checkpoint_dir "checkpoints/meta-llama/Llama-2-7b-hf" \
-    --eval_tasks "[truthfulqa_mc,hellaswag]" \
-    --precision "bf16-true" \
-    --save_filepath "results.json"
+litgpt evaluate \
+    --checkpoint_dir checkpoints/microsoft/phi-2/ \
+    --out_dir evaluate_model/ \
+    --repo_id microsoft/phi-2
```

-A list of supported tasks can be found [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md).
+Please note that the `litgpt evaluate` command runs an internal model conversion.
+This conversion is only necessary the first time you evaluate a model. To skip it when
+you evaluate the same model a second time, you can pass the `--skip_conversion true` argument:

+```bash
+litgpt evaluate \
+    --checkpoint_dir checkpoints/microsoft/phi-2/ \
+    --out_dir evaluate_model/ \
+    --repo_id microsoft/phi-2 \
+    --skip_conversion true
+```

&nbsp;

-### Evaluating LoRA-finetuned LLMs
+> [!TIP]
+> By default, `litgpt evaluate` will evaluate a model on all Open LM Leaderboard tasks, which corresponds
+> to the setting `--tasks "hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge"`.

-The above command can be used to evaluate models that are saved via a single checkpoint file. This includes downloaded checkpoints and base models finetuned via the full and adapter finetuning scripts.
+> [!TIP]
+> The evaluation may take a long time, and for testing purposes, you may want to reduce the number of tasks
+> or set a limit for the number of examples per task, for example, `--limit 10`.

-For LoRA-finetuned models, you need to first merge the LoRA weights with the original checkpoint file as described in the [Merging LoRA Weights](finetune_lora.md#merging-lora-weights) section of the LoRA finetuning documentation.
+A list of supported tasks can be found [here](https://github.com/EleutherAI/lm-evaluation-harness/blob/master/docs/task_table.md).

&nbsp;

-## FAQs
-
-* **How do I evaluate on MMLU?**
-
-MMLU is available as with lm-eval harness but the task name is not MMLU. You can use `hendrycksTest*` as regex to evaluate on MMLU.
-
-```shell
-python eval/lm_eval_harness.py \
-    --checkpoint_dir "checkpoints/meta-llama/Llama-2-7b-hf" \
-    --precision "bf16-true" \
-    --eval_tasks "[hendrycksTest*]" \
-    --num_fewshot 5 \
-    --save_filepath "results.json"
-```
-
-* **Is Truthful MC is not available in lm-eval?**
-
-It is available as `truthfulqa_mc`.
+### Evaluating LoRA-finetuned LLMs
+
+No further conversion is necessary when evaluating LoRA-finetuned models as the `litgpt finetune lora` command already prepares the necessary merged model files:
+
+```bash
+litgpt finetune lora \
+    --checkpoint_dir checkpoints/microsoft/phi-2 \
+    --out_dir lora_model
+```

&nbsp;

+```bash
+litgpt evaluate \
+    --checkpoint_dir lora_model/final \
+    --out_dir evaluate_model/ \
+    --repo_id microsoft/phi-2
+```
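
After `litgpt evaluate` finishes, the metrics are written to `results.json` in the output directory, per the `save_filepath` default in `convert_and_evaluate`. A minimal sketch for reading the results back, assuming the `evaluate_model/` output directory used in the examples above:

```python
# Sketch: inspect the metrics saved by `litgpt evaluate` (assumed out_dir and default filename).
import json
from pathlib import Path

results = json.loads(Path("evaluate_model/results.json").read_text(encoding="utf-8"))
for task, metrics in results["results"].items():  # per-task metric dicts, as in tests/test_evaluate.py
    print(task, metrics)
```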