A unified, precise, and extensible toolkit to benchmark LLMs on various mathematical tasks 🧮✨.
🔴🚀 Important Notice: We've observed discrepancies of more than 5% between results produced by different math evaluation frameworks. To ensure fair and standardized comparisons across research, our toolkit strives to harmonize evaluation methods, promoting consistent and reliable math evaluation.
🌟 In Practice: Esteemed projects like ToRA (ICLR'24) and DeepSeek-Coder have leveraged this suite!
🙌 Statement: This project is a fork of math-evaluation-harness, which is no longer actively maintained by the original author. We have forked it to ensure continued updates (e.g., more features, more automation, more benchmarks) and maintenance under GAIR Lab.
- Models: Seamless compatibility with models from Hugging Face 🤗 and vLLM.
- Datasets: An extensive array of datasets, including `minerva_math`, `math`, `math_oai`, `gsm8k`, `gsm_hard`, `svamp`, `asdiv`, `mawps`, `tabmwp`, `finqa`, `theorem_qa`, `bbh`, `mmlu_stem`, `sat_math`, `mathqa`, and `hungarian_exam`.
- Prompts: Diverse prompting paradigms, from Direct to Chain-of-Thought (CoT), Program-of-Thought (PoT/PAL), and Tool-Integrated Reasoning (ToRA).
conda create -n math_eval python=3.10
conda activate math_eval
We suggest using the vLLM Docker image directly:
docker run --network host --cap-add=SYS_ADMIN --privileged -d \
--entrypoint '' --name vllm \
--runtime nvidia --gpus all \
--security-opt apparmor:unconfined \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-v /mnt:/mnt \
-p 8000:8000 \
vllm/vllm-openai:latest \
sleep infinity
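If you follow the Docker route, you can open a shell inside the running container (named `vllm` by the `--name` flag above) and perform the installation steps below from inside it. A minimal sketch:

```sh
# Attach to the vLLM container started above, then run the
# clone/install commands that follow from inside this shell.
docker exec -it vllm bash
```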
git clone https://github.com/GAIR-NLP/math-evaluation-harness.git
cd math-evaluation-harness
pip install -r requirements.txt
- Configure model and data settings in `scripts/run_math_eval.sh`, and set the `PROMPT_TYPE` variable accordingly (see the sketch after this list):
  - For base models, choose from: `direct`, `cot`, `pal`, or `tool-integrated`.
  - For SFT models, your options include: `tora`, `wizard_zs`, `deepseek-math`, etc.
  - To add new models, update the `construct_prompt` function in `utils.py` to include your new prompt template.
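A rough sketch of the settings you would edit in `scripts/run_math_eval.sh`; every variable name here except `PROMPT_TYPE` is hypothetical, so check the script itself for the actual names:

```sh
# Hypothetical variable names -- consult scripts/run_math_eval.sh for the real ones.
PROMPT_TYPE="cot"                        # direct | cot | pal | tool-integrated | tora | wizard_zs | deepseek-math | ...
MODEL_NAME_OR_PATH="path/to/your/model"  # hypothetical: local path or Hugging Face model ID
DATA_NAMES="gsm8k,math"                  # hypothetical: comma-separated dataset names
```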
- Run the script: If you want to evaluate a model on a specific dataset, you can specify the dataset name in the second argument, e.g., `gsm8k` or `svamp`. If you want to evaluate a model on all benchmarks, you can omit the second argument.
# evaluate one model only on one dataset or several datasets
bash scripts/run_eval.sh ${your_model_path} "dataset_name1,dataset_name2"
# evaluate one model only on all benchmarks
bash scripts/run_eval.sh ${your_model_path}
# to evaluate a large model (e.g., 70B) on all benchmarks, you need to specify the TP size
bash scripts/run_eval.sh ${your_model_path} 2
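For instance, a hedged usage example, assuming the script accepts a Hugging Face model ID in place of a local path (the model ID is purely illustrative):

```sh
# Evaluate Mistral-7B on GSM8K and MATH only
bash scripts/run_eval.sh mistralai/Mistral-7B-v0.1 "gsm8k,math"
```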
When you would like to evaluate all intermediate checkpoints saved during language model training, you can use the following commands:
# only eval on one dataset or several datasets
bash auto_dir_run.sh ${your_model_folder_path} "dataset_name1,dataset_name2"
# eval on all benchmarks
bash auto_dir_run.sh ${your_model_folder_path}
# to eval a large model (e.g., 70B) on all benchmarks, you need to specify the TP size
bash auto_dir_run.sh ${your_model_folder_path} 2
Note that `your_model_folder_path` should look like this:
$ ls your_model_folder_path
checkpoint-1000/ checkpoint-2000/ checkpoint-3000/ checkpoint-4000/ checkpoint-5000/ checkpoint-6000/ checkpoint-7000/ checkpoint-8000/ checkpoint-9000/
- Summarize the results automatically
# summarize all results of all intermediate ckpts in your_model_folder_path
python gather_results.py --do_all_ckpts --dir_path outputs/${your_model_folder_path}
# summarize results of one ckpt
python gather_results.py --do_one_ckpt --dir_path outputs/${your_model_folder_path}
This command will automatically generate summary results and plots in your model folder. We provide some example results in the `outputs` folder.
PROMPT_TYPE=cot
Model | Size | Data | Uniq. Token | Train Token | GSM8K | MATH¹ | SVAMP | ASDiv | MAWPS | TAB² | MQA | MMLU STEM | SAT | AVG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1-2B Base Models | ||||||||||||||
Tinyllama | 1.1B | - | - | - | 2.9 | 3.2 | 11.0 | 18.1 | 20.4 | 12.5 | 14.6 | 16.1 | 21.9 | 13.4 |
Phi-1.5 | 1.3B | - | - | - | 32.4 | 4.2 | 43.4 | 53.1 | 66.2 | 24.4 | 14.3 | 21.8 | 18.8 | 31.0 |
Qwen1.5 | 1.8B | - | - | - | 36.1 | 6.8 | 48.5 | 63.6 | 79.0 | 29.2 | 25.1 | 31.3 | 40.6 | 40.0 |
Gemma | 2.0B | - | - | - | 18.8 | 11.4 | 38.0 | 56.6 | 72.5 | 36.9 | 26.8 | 34.4 | 50.0 | 38.4 |
DeepSeekLLM | 1.3B | OWM | 14B | 150B | 11.5 | 8.9 | - | - | - | - | - | 29.6 | 31.3 | - |
DeepSeekMath | 1.3B | - | 120B | 150B | 23.8 | 13.6 | - | - | - | - | - | 33.1 | 56.3 | - |
Rho-Math | 1.1B | OWM | 14B | 30B | 36.2 | 15.6 | 52.1 | 67.0 | 83.9 | 29.0 | 32.5 | 23.3 | 28.1 | 40.9 |
>= 7B Base Models | ||||||||||||||
LLaMA-2 | 7B | - | - | - | 14.0 | 3.6 | 39.5 | 51.7 | 63.5 | 30.9 | 12.4 | 32.7 | 34.4 | 31.4 |
Mistral | 7B | - | - | - | 41.2 | 11.6 | 64.7 | 68.5 | 87.5 | 52.9 | 33.0 | 49.5 | 59.4 | 52.0 |
Minerva | 8B | - | 39B | 164B | 16.2 | 14.1 | - | - | - | - | - | 35.6 | - | - |
Minerva | 62B | - | 39B | 109B | 52.4 | 27.6 | - | - | - | - | - | 53.9 | - | - |
Minerva | 540B | - | 39B | 26B | 58.8 | 33.6 | - | - | - | - | - | 63.9 | - | - |
LLemma | 7B | PPile | 55B | 200B | 38.8 | 17.2 | 56.1 | 69.1 | 82.4 | 48.7 | 41.0 | 45.4 | 59.4 | 50.9 |
LLemma | 34B | PPile | 55B | 50B | 54.2 | 23.0 | 67.9 | 75.7 | 90.1 | 57.0 | 49.8 | 54.7 | 68.8 | 60.1 |
Intern-Math | 7B | - | 31B | 125B | 41.8 | 14.4 | 61.6 | 66.8 | 83.7 | 50.0 | 57.3 | 24.8 | 37.5 | 48.7 |
Intern-Math | 20B | - | 31B | 125B | 65.4 | 30.0 | 75.7 | 79.3 | 94.0 | 50.9 | 38.5 | 53.1 | 71.9 | 62.1 |
DeepSeekMath | 7B | - | 120B | 500B | 64.1 | 34.2 | 74.0 | 83.9 | 92.4 | 63.4 | 62.4 | 56.4 | 84.4 | 68.4 |
Rho-Math | 7B | OWM | 14B | 10.5B | 66.9 | 31.0 | 77.8 | 79.0 | 93.9 | 49.9 | 58.7 | 54.6 | 84.4 | 66.2 |
PROMPT_TYPE=tora
Model | Size | SFT Data | GSM8k | MATH | SVAMP | ASDiv | MAWPS | TAB | GSM-Hard | AVG |
---|---|---|---|---|---|---|---|---|---|---|
GPT4-early (PAL) | - | - | 94.2 | 51.8 | 94.8 | 92.6 | 97.7 | 95.9 | 77.6 | 86.4 |
MAmmoTH | 70B | MI-260k | 76.9 | 41.8 | 82.4 | - | - | - | - | - |
ToRA | 7B | ToRA-69k | 68.8 | 40.1 | 68.2 | 73.9 | 88.8 | 42.4 | 54.6 | 62.4 |
ToRA | 70B | ToRA-69k | 84.3 | 49.7 | 82.7 | 86.8 | 93.8 | 74.0 | 67.2 | 76.9 |
DeepSeekMath | 7B | ToRA-69k | 79.8 | 52.0 | 80.1 | 87.1 | 93.8 | 85.8 | 63.1 | 77.4 |
Rho-Math | 1B | ToRA-69k | 59.4 | 40.6 | 60.7 | 74.2 | 88.6 | 26.7 | 48.1 | 56.9 |
Rho-Math | 7B | ToRA-69k | 81.3 | 51.8 | 80.8 | 85.5 | 94.5 | 70.1 | 63.1 | 75.3 |
PROMPT_TYPE=deepseek-math
Size | Model | GSM8k | MATH | SVAMP | ASDiv | MAWPS | AVG |
---|---|---|---|---|---|---|---|
7B | DeepSeek-Math-Instruct | 82.4 | 45.8 | 83.5 | 90.1 | 95.7 | 79.5 |
7B | DeepSeek-Math-RL | 88.3 | 50.0 | 87.2 | 92.0 | 95.5 | 82.6 |
This project is still under active development. We welcome any contributions, including bug reports, feature requests, and pull requests.
- https://github.com/microsoft/ToRA
- https://github.com/openai/prm800k
- https://github.com/wellecks/lm-evaluation-harness
- https://github.com/deepseek-ai/DeepSeek-Math
Footnotes

1. We suggest utilizing the OpenAI test subset for evaluating MATH performance, since the original MATH test set has already been included in public training sets such as PRM800k. We use the `minerva_math` prompt.
2. Abbreviations: TAB = tabmwp, MQA = mathqa, SAT = sat_math