title

lang

layout

ja_tasks

jamtb_tasks

en_tasks

tools

About evaluation

en

about

title

subtitle

text

metric

setting

link

JCom (JCommonsenseQA)

Q&A regarding commonsense and inference

Five-choice questions created with a knowledge base

Accuracy

4-shot

href	text
https://aclanthology.org/2022.lrec-1.317/	(Kurihara et al., 2022)

title

subtitle

text

metric

setting

link

JEMHopQA

Multi-hop Q&A

Open-ended Q&A to assess the amount of knowledge and reasoning ability

Character F1

4-shot

href	text
https://aclanthology.org/2024.lrec-main.831/	(Ishii et al., 2024)

title

subtitle

text

metric

setting

link

NIILC

Classical Q&A

Open-ended Q&A that can be answered by an encyclopedia

Character F1

4-shot

href	text
https://www.anlp.jp/proceedings/annual_meeting/2003/pdf_dir/C7-6.pdf	(Sekine, 2003)

title

subtitle

text

metric

setting

link

JSQuAD

Reading comprehension

Open-ended Q&A for Wikipedia article

Character F1

4-shot

href	text
https://aclanthology.org/2022.lrec-1.317/	(Kurihara et al., 2022)

title

subtitle

text

metric

setting

link

XL-Sum

Summarization

Task to generate a highlight from a news article of BBC

ROUGE-2

1-shot

href	text
https://aclanthology.org/2021.findings-acl.413/	(Hasan et al., 2021)

title

subtitle

text

metric

setting

link

MGSM

Mathematics

Japanese translation of math word problems (GSM8K)

Accuracy (exact match)

4-shot

href	text
https://openreview.net/forum?id=fR3wGCk-IXp	(Shi et al., 2023)

title

subtitle

text

metric

setting

link

WMT20 (en-ja)

English-Japanese translation

Translation of news articles

BLEU

4-shot

href	text
https://aclanthology.org/2020.wmt-1.1/	(Barrault et al., 2020)

title

subtitle

text

metric

setting

link

WMT20 (ja-en)

Japanese-English translation

Translation of news articles

BLEU

4-shot

href	text
https://aclanthology.org/2020.wmt-1.1/	(Barrault et al., 2020)

title

subtitle

text

metric

setting

link

JMMLU

Multi-task natural language understanding

Japanese translation of four-choice exam questions benchmark MMLU (53 subjects)

Accuracy

5-shot

href	text
https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/A7-5.pdf	(Yin et al, 2024)

title

subtitle

text

metric

setting

link

JHumanEval

Code generation

Japanese translation of HumanEval (code genration benchmark)

pass@1

0-shot, 10 trials

href	text
https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/P10-9.pdf	(Sato et al., 2024)

title
Coding

title
Extraction

title
Humanities

title
Math

title
Reasoning

title
Roleplay

title
STEM

title
Writing

title

subtitle

text

metric

setting

link

OpenBookQA

Q&A based on facts and common sense

Four-choice questions based on scientific knowledge and common sense

Accuracy

4-shot

href	text
https://aclanthology.org/D18-1260/	(Mihaylov et al., 2018)

title

subtitle

text

metric

setting

link

TriviaQA

Q&A based on knowledge

Open-ended Q&A based on trivias

Accuracy (exact match)

4-shot

href	text
https://aclanthology.org/P17-1147/	(Joshi et al., 2017)

title

subtitle

text

metric

setting

link

HellaSwag

Commonsense inference

Four-choice questions to predict the next event

Accuracy

4-shot

href	text
https://aclanthology.org/P19-1472/	(Zellers et al., 2019)

title

subtitle

text

metric

setting

link

SQuAD2

Reading comprehension

Open-ended Q&A developed for the evidence document

Accuracy (exact match)

4-shot

href	text
https://aclanthology.org/P18-2124/	(Rajpurkar et al., 2018)

title

subtitle

text

metric

setting

link

XWINO

Commonsense inference

Two-choice question to predict the antecedent of a pronoun

Accuracy

4-shot

href	text
https://aclanthology.org/2021.findings-acl.310/	(Tikhonov and Ryabinin, 2021)

title

subtitle

text

metric

setting

link

MMLU

Multitask natural language understanding

Four-choice exam questions benchmark MMLU (53 subjects)

Accuracy

5-shot

href	text
https://openreview.net/forum?id=d7KBjmI3GmQ	(Hendrycks et al., 2021)

title

subtitle

text

metric

setting

link

GSM8K

Mathematics

Math word problems

Accuracy (exact match)

4-shot

href	text
https://arxiv.org/abs/2110.14168	(Cobbe et al., 2021)

title

subtitle

text

metric

setting

link

BBH (BIG-Bench-Hard)

Collection of hard-to-solve tasks for LLM

23 tasks that are difficult in BIG-Bench dataset (Srivastava et al., 2023)

Accuracy (exact match)

3-shot, CoT

href	text
https://aclanthology.org/2023.findings-acl.824/	(Suzgun et al., 2023)

title

subtitle

text

metric

setting

link

HumanEval

Code generation

Ability of code generation measured by unit test

pass@1

0-shot, 10 trials

href	text
https://arxiv.org/abs/2107.03374	(Chen et al., 2021)

title

subtitle

link

LLM-jp evaluation script (1.3.0)

Automatic evaluation tool for Japanese LLMs

href	text
https://github.com/llm-jp/llm-jp-eval	(Han et al, 2024)

title

subtitle

link

JP Language Model Evaluation Harness (commit #9b42d41)

An evaluation framework for Japanese LLMs

href
https://github.com/Stability-AI/lm-evaluation-harness/

title

subtitle

link

Language Model Evaluation Harness (0.4.2)

An evaluation framework for LLMs

href	text
https://github.com/EleutherAI/lm-evaluation-harness	(Biderman et al., 2024)

title

subtitle

link

Code Generation LM Evaluation Harness (commit #0261c52)

An evaluation framework for code generation (HumanEval)

href
https://github.com/bigcode-project/bigcode-evaluation-harness

title

subtitle

link

FastChat (commit #e86e70d0)

An automatic evaluation framework by an LLM (MT-Bench)

href
https://github.com/lm-sys/FastChat

title

subtitle

link

swallow-evaluation

An evaluation framework used in Swallow Project (encompassing all the above-mentioned tools)

href
https://github.com/swallow-llm/swallow-evaluation

About evaluation

The Swallow Project is independently conducting evaluation experiments of publicly available LLMs in parallel with the development of LLMs in order to serve as a reference for the development of high-performance large language models (LLMs). By comparing with LLMs developed not only in Japan but also around the world, we can learn the "current level" of the Swallow project. By evaluating each LLM under the fair conditions while taking into account its unique specifications (tokenization, system prompts, etc.) and contrasting them with the development methods of LLMs, we can examine the "recipe" for developing a high-performance LLM. We also realize the challenges in LLM evaluation by experiencing that high or low task evaluation scores are due to not only differences in LLM performance but also trivial specifications in the evaluation (e.g., prompt format). On this site, you can view the results of LLM evaluations conducted within the Swallow project, including bar graphs, radar charts, and scatter plots. We hope that this site will be useful not only as information for selecting the right LLM for your application, but also as reference information for the development of LLMs that are strong in Japanese.

Evaluation tasks

In the 2024 Swallow project, we are conducting LLM evaluation experiments using 10 datasets for the Japanese understanding and generation tasks, MT-Bench for the Japanese multi-turn dialogue task, and 9 datasets for the English understanding and generation tasks. For all tasks, the evaluation scores range from 0 (lowest) to 1 (highest).

Japanese understanding and generation tasks

{% include taskcard.html items="ja_tasks" %}

Japanese multi-turn dialogue tasks (Japanese MT-Bench)

We used Japanese MT-Bench Nejumi Leaderboard Neo version, a Japanese version of MT-Bench, a benchmark for multi-turn dialogue capability. We evaluate instruction-tuned models only. This benchmark automatically rate response sentences on a 10-point scale using GPT-4 (gpt-4-1106-preview). The categories of evaluation are as follows.

{% include taskcard.html items="jamtb_tasks" %}

Note that our Japanese MT-Bench evaluation results are lower than those of the other leaderboards. We think that this difference in scores is caused by the fact that many leaderboards use GPT-4 (gpt-4-0613) to evaluate response texts, while we use GPT-4 (gpt-4-1106-preview). Our investigation revealed that while there are significant differences between our and the other leaderboard's evaluation scores, the relative rankings among the models remain largely unchanged. Therefore, we continued the evaluation without changing the GPT-4 version (since we had already completed many of the evaluations).

nglish understanding and generation tasks

{% include taskcard.html items="en_tasks" %}

Evaluation tools

We used these software packages for evaluation.

{%include card.html items="tools" style="col-sm-6 mb-3" %}

Evaluated models

We list the LLMs in alphabetical order. Some LLMs do not have scores for language understanding and generation but only those for Japanese MT-bench.

{%include models.html %}

Acknowledgements

This website used these software packages.

Bootstrap
Chart.js
DataTables

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

about.en.md

about.en.md

About evaluation

Evaluation tasks

Japanese understanding and generation tasks

Japanese multi-turn dialogue tasks (Japanese MT-Bench)

nglish understanding and generation tasks

Evaluation tools

Evaluated models

Acknowledgements

Files

about.en.md

Latest commit

History

about.en.md

File metadata and controls

About evaluation

Evaluation tasks

Japanese understanding and generation tasks

Japanese multi-turn dialogue tasks (Japanese MT-Bench)

nglish understanding and generation tasks

Evaluation tools

Evaluated models

Acknowledgements