🚀 A List of Long-Context LLM Benchmarks. A better view is available here.
Dataset | Release Date | Type | Domain | Token Length | Language | Data Released? | Answer Released? |
---|---|---|---|---|---|---|---|
ZeroSCROLLS | 2023-05 | Realistic | Novel Report Meetings TV Wikipedia | Avg ~15k | EN | ✅ | ❌ |
L-Eval (ACL'24 Outstanding) | 2023-07 | Realistic | Math Code Paper etc. | Avg ~15k | ZH | ✅ | ✅ |
LongBench | 2023-08 | Realistic | Code Meeting Wiki Novel | Avg ~13k | ZH EN | ✅ | ✅ |
BAMBOO | 2023-09 | Realistic | Paper TVshows GovReport Code Meeting | Only 4k, 16k | EN | ✅ | ✅ |
LooGLE | 2023-11 | Realistic | Paper Wikipedia TV&Movie | Avg ~24k | EN | ✅ | ✅ |
LVEval | 2024-02 | Realistic | Mixup | 16 32 64 128 256k | ZH EN | ✅ | ✅ |
InfiniteBench | 2024-02 | Realistic | Code Novel Math Dialogue | > 100k | ZH EN | ✅ | ✅ |
DocFinQA | 2024-02 | Realistic | Finance | > 100k | EN | ✅ | ✅ |
Counting-Stars | 2024-03 | Needle | Essay Novel | Any | ZH EN | ✅ | ✅ |
CLongEval | 2024-03 | Realistic | Story News Conversation | < 100k | ZH | ✅ | ✅ |
NovelQA | 2024-03 | Realistic | Novel | > 100k | EN | ✅ | ❌ |
RULER | 2024-04 | Needle | Essays | Any | EN | ✅ | ✅ |
XL2Bench | 2024-04 | Realistic | Novel Paper Law | > 100k | ZH EN | ❌ | ❌ |
BABILong | 2024-06 | Needle | Books | Any | EN | ✅ | ✅ |
MedOdyssey | 2024-06 | Realistic Needle | Medical | 40k-180k | ZH EN | ✅ | ✅ |
Loong | 2024-06 | Realistic | Papers Legal Finance | 40k-230k | ZH EN | ✅ | ✅ |
LongIns | 2024-06 | Other | Multiple QA | 256 - 16k | EN | ❌ | ❌ |
NOCHA | 2024-07 | Realistic | Novel | > 100k | EN | ❌ | ❌ |
[SummaryStack](https://arxiv.org/abs/2407.01370) | 2024-07 | Other | News Conversations | Avg ~92k | EN | ✅ | ✅ |
NeedleBench | 2024-07 | Needle | Essays | Any | ZH EN | ✅ | ✅ |
ML-Needle | 2024-08 | Needle | Wikipedia | 4k-32k | ZH EN SP GR AR VT | ✅ | ✅ |