---
title: About evaluation
lang: en
layout: about
ja_tasks:
  - title: JCom (JCommonsenseQA)
    subtitle: Q&A regarding commonsense and inference
    text: Five-choice questions created with a knowledge base
    metric: Accuracy
    setting: 4-shot
    link:
      href:
      text: (Kurihara et al., 2022)
  - title: JEMHopQA
    subtitle: Multi-hop Q&A
    text: Open-ended Q&A to assess the amount of knowledge and reasoning ability
    metric: Character F1
    setting: 4-shot
    link:
      href:
      text: (Ishii et al., 2024)
  - title: NIILC
    subtitle: Classical Q&A
    text: Open-ended Q&A that can be answered with an encyclopedia
    metric: Character F1
    setting: 4-shot
  - title: JSQuAD
    subtitle: Reading comprehension
    text: Open-ended Q&A for Wikipedia articles
    metric: Character F1
    setting: 4-shot
    link:
      href:
      text: (Kurihara et al., 2022)
  - title: XL-Sum
    subtitle: Summarization
    text: Task to generate a highlight from a BBC news article
    metric: ROUGE-2
    setting: 1-shot
    link:
      href:
      text: (Hasan et al., 2021)
  - title: MGSM
    subtitle: Mathematics
    text: Japanese translation of math word problems (GSM8K)
    metric: Accuracy (exact match)
    setting: 4-shot
    link:
      href:
      text: (Shi et al., 2023)
  - title: WMT20 (en-ja)
    subtitle: English-Japanese translation
    text: Translation of news articles
    metric: BLEU
    setting: 4-shot
    link:
      href:
      text: (Barrault et al., 2020)
  - title: WMT20 (ja-en)
    subtitle: Japanese-English translation
    text: Translation of news articles
    metric: BLEU
    setting: 4-shot
    link:
      href:
      text: (Barrault et al., 2020)
  - title: JMMLU
    subtitle: Multi-task natural language understanding
    text: Japanese translation of the four-choice exam benchmark MMLU (53 subjects)
    metric: Accuracy
    setting: 5-shot
  - title: JHumanEval
    subtitle: Code generation
    text: Japanese translation of HumanEval (a code generation benchmark)
    metric: pass@1
    setting: 0-shot, 10 trials
jamtb_tasks:
  - title: Coding
  - title: Extraction
  - title: Humanities
  - title: Math
  - title: Reasoning
  - title: Roleplay
  - title: STEM
  - title: Writing
en_tasks:
  - title: OpenBookQA
    subtitle: Q&A based on facts and common sense
    text: Four-choice questions based on scientific knowledge and common sense
    metric: Accuracy
    setting: 4-shot
    link:
      href:
      text: (Mihaylov et al., 2018)
  - title: TriviaQA
    subtitle: Q&A based on knowledge
    text: Open-ended Q&A based on trivia
    metric: Accuracy (exact match)
    setting: 4-shot
    link:
      href:
      text: (Joshi et al., 2017)
  - title: HellaSwag
    subtitle: Commonsense inference
    text: Four-choice questions to predict the next event
    metric: Accuracy
    setting: 4-shot
    link:
      href:
      text: (Zellers et al., 2019)
  - title: SQuAD2
    subtitle: Reading comprehension
    text: Open-ended Q&A based on evidence documents
    metric: Accuracy (exact match)
    setting: 4-shot
    link:
      href:
      text: (Rajpurkar et al., 2018)
  - title: XWINO
    subtitle: Commonsense inference
    text: Two-choice questions to predict the antecedent of a pronoun
    metric: Accuracy
    setting: 4-shot
    link:
      href:
      text: (Tikhonov and Ryabinin, 2021)
  - title: MMLU
    subtitle: Multi-task natural language understanding
    text: Four-choice exam benchmark MMLU (53 subjects)
    metric: Accuracy
    setting: 5-shot
    link:
      href:
      text: (Hendrycks et al., 2021)
  - title: GSM8K
    subtitle: Mathematics
    text: Math word problems
    metric: Accuracy (exact match)
    setting: 4-shot
    link:
      href:
      text: (Cobbe et al., 2021)
  - title: BBH (BIG-Bench-Hard)
    subtitle: Collection of hard-to-solve tasks for LLMs
    text: 23 difficult tasks from the BIG-Bench dataset (Srivastava et al., 2023)
    metric: Accuracy (exact match)
    setting: 3-shot, CoT
    link:
      href:
      text: (Suzgun et al., 2023)
  - title: HumanEval
    subtitle: Code generation
    text: Code generation ability measured by unit tests
    metric: pass@1
    setting: 0-shot, 10 trials
    link:
      href:
      text: (Chen et al., 2021)
tools:
  - title: LLM-jp evaluation script (1.3.0)
    subtitle: Automatic evaluation tool for Japanese LLMs
    link:
      href:
      text: (Han et al., 2024)
  - title: "JP Language Model Evaluation Harness (commit #9b42d41)"
    subtitle: An evaluation framework for Japanese LLMs
  - title: Language Model Evaluation Harness (0.4.2)
    subtitle: An evaluation framework for LLMs
    link:
      href:
      text: (Biderman et al., 2024)
  - title: "Code Generation LM Evaluation Harness (commit #0261c52)"
    subtitle: An evaluation framework for code generation (HumanEval)
  - title: "FastChat (commit #e86e70d0)"
    subtitle: An automatic evaluation framework that uses an LLM as a judge (MT-Bench)
  - title: swallow-evaluation
    subtitle: An evaluation framework used in the Swallow Project (encompassing all the above-mentioned tools)
---

## About evaluation

In parallel with developing its own models, the Swallow Project independently conducts evaluation experiments on publicly available large language models (LLMs) so that the results can serve as a reference for developing high-performance LLMs. By comparing with LLMs developed not only in Japan but also around the world, we can gauge the "current level" of the Swallow Project. By evaluating each LLM under fair conditions while taking its unique specifications (tokenization, system prompts, etc.) into account, and by contrasting the results with how each LLM was developed, we can examine the "recipe" for building a high-performance LLM. We have also come to appreciate the challenges of LLM evaluation, having observed that high or low task scores arise not only from differences in LLM performance but also from seemingly trivial details of the evaluation setup (e.g., prompt format).

On this site, you can view the results of the LLM evaluations conducted within the Swallow Project as bar graphs, radar charts, and scatter plots. We hope that this site will be useful not only for selecting the right LLM for your application, but also as reference information for developing LLMs that are strong in Japanese.

## Evaluation tasks

In the 2024 Swallow Project, we conduct LLM evaluation experiments using 10 datasets for Japanese understanding and generation tasks, Japanese MT-Bench for the Japanese multi-turn dialogue task, and 9 datasets for English understanding and generation tasks. For all tasks, the evaluation scores range from 0 (lowest) to 1 (highest).

### Japanese understanding and generation tasks

{% include taskcard.html items="ja_tasks" %}
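Several of the Japanese tasks above (JEMHopQA, NIILC, JSQuAD) are scored with character F1, which by construction lies in the 0-to-1 range. Below is a minimal sketch of a character-level (bag-of-characters) F1; the exact implementation in the evaluation tools listed later on this page may differ.

```python
from collections import Counter

def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1 between a predicted and a reference answer.

    This is a bag-of-characters overlap (SQuAD-style token F1 applied to
    characters); it is a sketch, not necessarily the exact formula used by
    the evaluation tools listed below.
    """
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# Example: a partially correct answer still earns partial credit.
print(char_f1("徳川家康", "徳川家康公"))  # ≈ 0.889
```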

### Japanese multi-turn dialogue tasks (Japanese MT-Bench)

We used Japanese MT-Bench (Nejumi Leaderboard Neo version), a Japanese adaptation of MT-Bench, a benchmark for multi-turn dialogue capability. We evaluate instruction-tuned models only. This benchmark automatically rates responses on a 10-point scale using GPT-4 (gpt-4-1106-preview). The evaluation categories are as follows.

{% include taskcard.html items="jamtb_tasks" %}
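To illustrate the single-answer grading, here is a minimal sketch of asking a GPT-4 judge to rate one response on a 10-point scale with the OpenAI Python client. The prompt wording, the rating-extraction regex, and the 0-1 normalization are illustrative assumptions; the actual FastChat (MT-Bench) pipeline uses its own judge prompts and reference answers.

```python
import re
from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative judge prompt; FastChat's llm_judge uses its own templates.
JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's answer to the user's "
    "question on a scale of 1 to 10, then output the rating as [[rating]].\n\n"
    "[Question]\n{question}\n\n[Assistant's Answer]\n{answer}"
)

def judge_score(question: str, answer: str) -> float:
    """Ask gpt-4-1106-preview for a 1-10 rating on a single response."""
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0.0,
    )
    text = response.choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", text)
    rating = float(match.group(1)) if match else 1.0
    # Mapping the 10-point rating onto this site's 0-1 scale is an assumption
    # made here for illustration (rating / 10).
    return rating / 10.0
```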

Note that our Japanese MT-Bench results are lower than those of other leaderboards. We believe this difference is caused by the fact that many leaderboards use GPT-4 (gpt-4-0613) to rate responses, while we use GPT-4 (gpt-4-1106-preview). Our investigation revealed that, although there are significant differences between our evaluation scores and those of other leaderboards, the relative rankings among the models remain largely unchanged. We therefore continued the evaluation without changing the GPT-4 version (since we had already completed many of the evaluations).
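One simple way to check that relative rankings are preserved is to compute a rank correlation between the two sets of scores. The sketch below uses scipy's Spearman correlation on hypothetical per-model scores; the numbers are placeholders, not our measurements.

```python
from scipy.stats import spearmanr

# Hypothetical overall MT-Bench scores for the same models under two judges;
# these numbers are placeholders, not actual leaderboard results.
scores_gpt4_0613 = [7.9, 7.1, 6.4, 5.8, 5.2]
scores_gpt4_1106 = [7.2, 6.5, 5.9, 5.5, 4.6]

rho, p_value = spearmanr(scores_gpt4_0613, scores_gpt4_1106)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# rho close to 1.0 means the model ranking is essentially preserved,
# even though the absolute scores differ between judge versions.
```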

### English understanding and generation tasks

{% include taskcard.html items="en_tasks" %}
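HumanEval and JHumanEval above report pass@1 estimated from 10 trials per problem. Below is a minimal sketch of the unbiased pass@k estimator from Chen et al. (2021), which reduces to the fraction of passing samples when k = 1.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: number of generated samples per problem (e.g. 10 trials)
    c: number of samples that pass all unit tests
    k: the k in pass@k (here k = 1)
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 10 generations pass the unit tests -> pass@1 = 0.3
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```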

## Evaluation tools

We used the following software packages for evaluation.

{% include card.html items="tools" style="col-sm-6 mb-3" %}
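As an example of how such a harness is driven, here is a minimal sketch of evaluating a Hugging Face model on one task with the Language Model Evaluation Harness (0.4.x) Python API. The model name, task, and shot count are illustrative placeholders; the Swallow Project actually orchestrates these tools through the swallow-evaluation framework.

```python
import lm_eval  # pip install lm-eval==0.4.2

# Illustrative settings: model, task, and few-shot count are placeholders,
# not the exact configuration used by swallow-evaluation.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tokyotech-llm/Swallow-7b-hf,dtype=bfloat16",
    tasks=["hellaswag"],
    num_fewshot=4,
    batch_size=8,
)

print(results["results"]["hellaswag"])  # per-task metrics, e.g. acc and acc_norm
```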

## Evaluated models

We list the LLMs in alphabetical order. Some LLMs have scores only for Japanese MT-Bench, not for the language understanding and generation tasks.

{% include models.html %}

## Acknowledgements

This website was built with the following software packages.