This repository provides a framework for evaluating open-source models that follow Hugging Face's `AutoModelForCausalLM` API on OpenAI's SimpleQA dataset. It is essentially a trimmed-down version of openai/simple-evals, modified to support HF models.
- Efficient evaluation with separation of response generation and grading
- HF `accelerate` for loading models across all available resources (see the loading sketch below)
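For reference, loading a model this way with `accelerate` typically amounts to the following minimal sketch. This illustrates the Transformers API the framework relies on, not the repository's exact code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" lets accelerate place/shard the model across the
# available GPUs (and CPU memory if needed).
model_name = "google/gemma-2b-it"  # any causal LM on the Hub; this one is gated, so set HF_TOKEN
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
```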
A quick look at the dataset:

```python
>>> import pandas as pd
>>> # Preview the dataset
>>> df = pd.read_csv("https://openaipublic.blob.core.windows.net/simple-evals/simple_qa_test_set.csv")
>>> df.head()
                                            metadata                                            problem               answer
0  {'topic': 'Science and technology', 'answer_ty...  Who received the IEEE Frank Rosenblatt Award i...        Michio Sugeno
1  {'topic': 'Science and technology', 'answer_ty...  Who was awarded the Oceanography Society's Jer...       Annick Bricaud
2  {'topic': 'Geography', 'answer_type': 'Place',...  What's the name of the women's liberal arts co...    Radcliffe College
3  {'topic': 'Sports', 'answer_type': 'Person', '...  In whose honor was the Leipzig 1877 tournament...      Adolf Anderssen
4  {'topic': 'Art', 'answer_type': 'Person', 'url...  According to Karl Küchler, what did Empress El...  Poet Henrich Heine.
```
Evaluation is split into two stages:
- Response generation: generate responses to the questions in the dataset using the specified model.
- Response grading: grade the generated responses using the specified grader model.
Since the grader model is usually much larger, separating the two stages avoids running out of memory on low-resource machines: all responses can be generated first, the generation model freed, and only then the grader loaded (sketched below).
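As a rough illustration of that pattern (this is a sketch of the workflow, not the repository's actual implementation; the model name and output filename are just examples):

```python
import gc
import json

import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

df = pd.read_csv("https://openaipublic.blob.core.windows.net/simple-evals/simple_qa_test_set.csv")
questions = df["problem"].tolist()[:5]  # small slice, just for the sketch

# Stage 1: generate responses with the (smaller) model under evaluation.
gen_name = "google/gemma-2b-it"
tok = AutoTokenizer.from_pretrained(gen_name)
model = AutoModelForCausalLM.from_pretrained(gen_name, device_map="auto")

responses = []
for q in questions:
    inputs = tok(q, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
    answer = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    responses.append({"problem": q, "response": answer})

with open("responses.json", "w") as f:
    json.dump(responses, f)

# Free the generation model before loading the (much larger) grader.
del model
gc.collect()
torch.cuda.empty_cache()

# Stage 2: load the grader model and grade each saved response
# (the grading prompt and parsing are handled by the actual script).
```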
Ensure you have the required packages installed. You can install them using the following command:
```bash
pip install -r requirements.txt
```
To evaluate a model on the entire dataset, use the following command:
```bash
HF_TOKEN=<HF_TOKEN> python simpleqa_eval_hf.py --generate_responses --model_name_hf <model_name_hf> --grade_responses --grader_model_name_hf <grader_model_name_hf> [options]
```
This runs both response generation and grading.
To only generate responses for a model on the dataset questions, use:
```bash
HF_TOKEN=<HF_TOKEN> python simpleqa_eval_hf.py --generate_responses --model_name_hf <model_name_hf> [options]
```
Responses are saved in a JSON file in the `results` directory.
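The exact structure of that file is determined by the script; to peek at it without assuming a particular schema, something like this works (the path follows the naming pattern from the example further below — adjust it to whatever your run produced):

```python
import json

with open("results/simpleqa_gemma-2b-it_100_responses.json") as f:
    responses = json.load(f)

print(type(responses), len(responses))
# Show one record (or the top-level keys if it's a dict).
print(responses[0] if isinstance(responses, list) else list(responses)[:5])
```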
Options:

- `--generate_responses`: Flag to generate responses using the specified model.
- `--model_name_hf`: Name of the model to be evaluated (required if `--generate_responses` is set).
- `--system_message`: Optional system message to include in the prompt.
- `--max_tokens`: Maximum number of tokens for the model's response (default: 1024).
- `--temperature`: Sampling temperature for the model (default: 0.7).
- `--device`: Device to run the model on.
- `--num_examples`: Number of examples to evaluate.
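For intuition, `--max_tokens` and `--temperature` correspond to the usual Transformers generation arguments; a self-contained sketch (the tiny `distilgpt2` model and the short `max_new_tokens` value are only there to keep the example fast — they are not defaults of this script):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "distilgpt2"  # small placeholder model so the sketch runs quickly
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Who received the IEEE Frank Rosenblatt Award in 2010?", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=64,   # the script's --max_tokens plays this role (default 1024)
    temperature=0.7,     # the script's --temperature plays this role
    do_sample=True,      # temperature only has an effect when sampling
)
print(tok.decode(out[0], skip_special_tokens=True))
```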
To grade the responses, use the following command:
```bash
HF_TOKEN=<HF_TOKEN> python simpleqa_eval_hf.py --grade_responses --responses_file <responses_file> --grader_model_name_hf <grader_model_name_hf> [options]
```
Options:

- `--grade_responses`: Flag to grade the responses using the specified grader model.
- `--responses_file`: Path to the file containing the generated responses (needed when grading is run on its own).
- `--grader_model_name_hf`: Name of the model used for grading the responses.
- `--grader_max_tokens`: Maximum number of tokens for the grader model's response (default: 1024).
- `--grader_temperature`: Sampling temperature for the grader model (default: 0.7).
- `--grader_device`: Device to run the grader model on.
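Mechanically, grading is an LLM-as-judge step: the grader is prompted with the question, the gold answer, and the model's response, and asked for a verdict. The original simple-evals SimpleQA grader uses the labels CORRECT / INCORRECT / NOT_ATTEMPTED; assuming this repo keeps that scheme, the pattern looks roughly like this (illustrative only — the real grading prompt and parsing live in the script):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

grader_name = "tiiuae/falcon-180B"  # example grader; any sufficiently strong instruct model works for the sketch
tok = AutoTokenizer.from_pretrained(grader_name)
grader = AutoModelForCausalLM.from_pretrained(grader_name, device_map="auto")

prompt = (
    "Question: Who received the IEEE Frank Rosenblatt Award in 2010?\n"
    "Gold answer: Michio Sugeno\n"
    "Predicted answer: Michio Sugeno\n"
    "Grade the predicted answer as CORRECT, INCORRECT, or NOT_ATTEMPTED:"
)
inputs = tok(prompt, return_tensors="pt").to(grader.device)
out = grader.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```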
To generate responses:

```bash
python simpleqa_eval_hf.py --generate_responses --model_name_hf google/gemma-2b-it --num_examples 100
```

To grade the responses generated above:

```bash
python simpleqa_eval_hf.py --grade_responses --responses_file results/simpleqa_gemma-2b-it_100_responses.json --grader_model_name_hf tiiuae/falcon-180B
```
The script generates an HTML report and a JSON file with the evaluation metrics. The files are saved in the `results` directory with names based on the model and grader model used.
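The metric names in the JSON depend on the script, but with a simple-evals-style grading scheme the typical aggregates can be computed like this (the grades list below is made up for illustration):

```python
# Typical SimpleQA-style aggregates over per-question grades.
grades = ["CORRECT", "INCORRECT", "NOT_ATTEMPTED", "CORRECT", "CORRECT"]

n = len(grades)
correct = grades.count("CORRECT")
attempted = n - grades.count("NOT_ATTEMPTED")

overall_correct = correct / n                   # accuracy over all questions
accuracy_given_attempted = correct / attempted  # accuracy over attempted questions
f_score = (2 * overall_correct * accuracy_given_attempted
           / (overall_correct + accuracy_given_attempted))
print(overall_correct, accuracy_given_attempted, f_score)
```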
This project is licensed under the MIT License. See the LICENSE file for details.