
# MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs

Data and code for our paper *MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs*.

For more details, please refer to the project page: https://mathhay.github.io/.

[Webpage] [Paper] [Huggingface Dataset] [Leaderboard] [Twitter]

## 💥 News 💥

Overview of the automatic construction of MATHHAY:

*Overview of the framework for the automatic construction of the MATHHAY Benchmark.*

Compared to existing long-context benchmarks (ZeroSCROLLS, L-Eval (Math), LongBench, BAMBOO, InfiniteBench (Math), Loong, NIAH, RULER, FlenQA, SummHay, BABILong, and NeedleBench), MathHay is designed to combine all of the following properties: multi-document input, multi-step reasoning, contamination avoidance, irrelevant (distractor) documents, realistic documents, automatic construction, and mathematical reasoning.

## Leaderboard

Accuracy scores on MathHay V1:


*Performance of selected models on MATHHAY (32K to 128K tokens); the best-performing model is highlighted in bold.*

## Dataset Examples


*Examples of the single-step, single-document tasks.*

## Automatic Generation for MathHay

Run the following commands to install dependencies:

```bash
pip install openai pydantic tavily-python spacy pandas langchain langchain-core nltk tiktoken google boto3

python -m spacy download en_core_web_sm
```
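To confirm the spaCy model downloaded correctly before running the pipeline, a quick sanity check (a sketch, not part of the repository's scripts):

```bash
# Load the model once; this fails with an OSError if the download did not succeed.
python -c "import spacy; spacy.load('en_core_web_sm'); print('spaCy model OK')"
```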

Set up environment variables:

```bash
export TAVILY_API_KEY=""
export OPENAI_API_KEY=""
export PYTHONPATH="."
```
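Since generation calls both APIs, it can help to fail fast if a key is missing. A minimal shell check (our suggestion, not part of the repository's scripts):

```bash
# Abort with a clear message if either API key is unset or empty.
: "${TAVILY_API_KEY:?TAVILY_API_KEY must be set}"
: "${OPENAI_API_KEY:?OPENAI_API_KEY must be set}"
```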

To generate MathHay data, use:

```bash
sh scripts/bench_generation.sh March-2024-to-September-2024 2 2 2
```

where the positional arguments are the time period, the number of topics, the number of subtopics, and the number of queries.
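For example, a larger run over the same time window might look like the following (the argument values here are illustrative; we assume the script takes the same four positional arguments described above):

```bash
# Positional arguments, in order:
#   1. time period  (March-2024-to-September-2024)
#   2. topics       (5)
#   3. subtopics    (3)
#   4. queries      (10)
sh scripts/bench_generation.sh March-2024-to-September-2024 5 3 10
```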

## Evaluations on MathHay

Run the evaluation command:

```bash
sh scripts/evaluation.sh March-2024-to-September-2024 sssd gpt-4o 32000 middle full
```

where the positional arguments are the time period, the task type (e.g., sssd for the single-step, single-document tasks), the model to be evaluated, the input length in tokens, the placement, and the dataset choice.
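To sweep over the input lengths reported above (32K to 128K tokens), a simple loop works, assuming the script accepts the same six positional arguments:

```bash
# Evaluate gpt-4o at several haystack lengths, keeping the other settings fixed.
for len in 32000 64000 128000; do
  sh scripts/evaluation.sh March-2024-to-September-2024 sssd gpt-4o "$len" middle full
done
```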

## License and Usage

Users must make their own assessment of any obligations or responsibilities under the licenses or terms and conditions pertaining to the original datasets and data. This repository is released for research purposes only.

## Citation

If you use our data or method, please cite our paper:

```bibtex
@article{wang2024mathhay,
  title={MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs},
  author={Wang, Lei and Dong, Shan and Xu, Yuhui and Dong, Hanze and Wang, Yalu and Saha, Amrita and Lim, Ee-Peng and Xiong, Caiming and Sahoo, Doyen},
  journal={arXiv preprint arXiv:2410.04698},
  year={2024}
}
```