
# MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs

Data and code for our paper *MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs*.

For more details, please refer to the project page: https://mathhay.github.io/.

[Webpage] [Paper] [Huggingface Dataset] [Leaderboard] [Twitter]

## 💥 News 💥

Overview of the automatic construction of MATHHAY:

*Overview of the framework for the automatic construction of the MATHHAY Benchmark.*

Compared to existing long-context benchmarks (ZeroSCROLLS, L-Eval (Math), LongBench, BAMBOO, InfiniteBench (Math), Loong, NIAH, RULER, FlenQA, SummHay, BABILong, and NeedleBench), MathHay is designed to combine all of the following properties: multi-document input, multi-step reasoning, contamination avoidance, irrelevant (distractor) documents, realistic documents, automatic construction, and mathematical reasoning.

## Leaderboard

Accuracy scores on MathHay V1:


*Performance of selected models on MATHHAY (32K to 128K tokens); the best-performing model is highlighted in bold.*

## Dataset Examples


*Examples of the single-step, single-document tasks.*

## Automatic Generation for MathHay

Run the following commands to install dependencies:

```bash
pip install openai pydantic tavily-python spacy pandas langchain langchain-core nltk tiktoken google boto3

python -m spacy download en_core_web_sm
```
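To confirm the spaCy model downloaded correctly before running the pipeline, a quick sanity check (a sketch, not part of the repository's scripts):

```bash
# Load the model once; this fails with an OSError if the download did not succeed.
python -c "import spacy; spacy.load('en_core_web_sm'); print('spaCy model OK')"
```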

Set up environment variables:

```bash
export TAVILY_API_KEY=""
export OPENAI_API_KEY=""
export PYTHONPATH="."
```
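Since generation calls both APIs, it can help to fail fast if a key is missing. A minimal shell check (our suggestion, not part of the repository's scripts):

```bash
# Abort with a clear message if either API key is unset or empty.
: "${TAVILY_API_KEY:?TAVILY_API_KEY must be set}"
: "${OPENAI_API_KEY:?OPENAI_API_KEY must be set}"
```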

To generate MathHay data, use:

```bash
sh scripts/bench_generation.sh March-2024-to-September-2024 2 2 2
```

where the positional arguments are the time period, the number of topics, the number of subtopics, and the number of queries.
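For example, a larger run over the same time window might look like the following (the argument values here are illustrative; we assume the script takes the same four positional arguments described above):

```bash
# Positional arguments, in order:
#   1. time period  (March-2024-to-September-2024)
#   2. topics       (5)
#   3. subtopics    (3)
#   4. queries      (10)
sh scripts/bench_generation.sh March-2024-to-September-2024 5 3 10
```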

## Evaluations on MathHay

Run the evaluation command:

```bash
sh scripts/evaluation.sh March-2024-to-September-2024 sssd gpt-4o 32000 middle full
```

where the positional arguments are the time period, the task type (e.g., sssd for the single-step, single-document tasks), the model to be evaluated, the input length in tokens, the placement, and the dataset choice.
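To sweep over the input lengths reported above (32K to 128K tokens), a simple loop works, assuming the script accepts the same six positional arguments:

```bash
# Evaluate gpt-4o at several haystack lengths, keeping the other settings fixed.
for len in 32000 64000 128000; do
  sh scripts/evaluation.sh March-2024-to-September-2024 sssd gpt-4o "$len" middle full
done
```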

## License and Usage

Users must make their own assessment of any obligations or responsibilities under the licenses or terms and conditions pertaining to the original datasets and data. This repository is released for research purposes only.

## Citation

If you use our data or method, please cite our paper:

```bibtex
@article{wang2024mathhay,
  title={MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs},
  author={Wang, Lei and Dong, Shan and Xu, Yuhui and Dong, Hanze and Wang, Yalu and Saha, Amrita and Lim, Ee-Peng and Xiong, Caiming and Sahoo, Doyen},
  journal={arXiv preprint arXiv:2410.04698},
  year={2024}
}
```