The data to be evaluated is in `../../data/test/testdata_pairwise.jsonl`.

Prepare your output in the following format (an example is in `../../data/outputs/pairwise_example_output.jsonl`):

{"output": 0}
{"output": 1}
...
{"output": 2}
{"output": 1}
where 0 and 1 mean that the first and the second response is better, respectively, and 2 means they are equally good (a tie).

If the judgment is obtained by pairwise comparison within a single prompt (e.g. "Given the query: {query} \n\n Compare the two responses \n\n response 1: {response1} response 2: {response2} ..."), which may suffer from position bias, you also need to provide your output on the data with the order of the two responses swapped. Note that you don't need to recover the 0 and 1 labels yourself in this file (see `../../data/outputs/pairwise_exchange_example_output.jsonl` for an example).
If the judgment is obtained by rating the two responses independently and then comparing their ratings, as with a standard reward model, you can skip the step above.
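
To make the expected files concrete, here is a minimal sketch of how they might be produced. `judge_pair` is a placeholder for your own judging logic, and the field names read from `testdata_pairwise.jsonl` are assumptions rather than the guaranteed schema:

```python
import json

def judge_pair(query, resp_a, resp_b):
    """Placeholder: replace with your own judge (prompted LLM, reward model, etc.).
    Must return 0 (first better), 1 (second better), or 2 (tie)."""
    raise NotImplementedError

def write_predictions(data_path, out_path, swap=False):
    # Field names ("prompt", "response 1", "response 2") are illustrative;
    # adapt them to the actual schema of testdata_pairwise.jsonl.
    with open(data_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            item = json.loads(line)
            a, b = item["response 1"], item["response 2"]
            if swap:
                a, b = b, a  # swapped order, used to measure position bias
            label = judge_pair(item["prompt"], a, b)
            fout.write(json.dumps({"output": label}) + "\n")

write_predictions("../../data/test/testdata_pairwise.jsonl", "your/output/file.jsonl")
# Only needed for the one-prompt pairwise setup:
write_predictions("../../data/test/testdata_pairwise.jsonl",
                  "your/output/file/for/response/order/swapped.jsonl", swap=True)
```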
Then calculate the agreement rate of your output(s) with human preferences as follows:

python pairwise_eval.py \
    --source_file_path ../../data/test/testdata_pairwise.jsonl \
    --pred_file_path your/output/file.jsonl \
    --exchange_pred_file_path your/output/file/for/response/order/swapped.jsonl \
    --type "pairwise"  # if "single", you do not need to provide `exchange_pred_file_path`
You will get results like this (using the files `../../data/outputs/pairwise_example_output.jsonl` and `../../data/outputs/pairwise_exchange_example_output.jsonl`):

- type = "pairwise"

| Group Name            | Agreement | Consistency |
|-----------------------|-----------|-------------|
| Summarization         | 45.83     | 73.61       |
| Exam Questions        | 38.89     | 69.44       |
| Code                  | 47.5      | 75.83       |
| Rewriting             | 49.17     | 74.17       |
| Creative Writing      | 59.72     | 87.04       |
| Functional Writing    | 61.67     | 81.67       |
| General Communication | 55.21     | 92.36       |
| NLP Tasks             | 57.58     | 86.36       |
| Overall               | 54.96     | 83.41       |
- type = "single"
Group Name Agreement Consistency
----------------------------
Summarization 56.94 -
Exam Questions 41.67 -
Code 54.17 -
Rewriting 57.5 -
Creative Writing 63.43 -
Functional Writing 68.33 -
General Communication 57.29 -
NLP Tasks 62.12 -
----------------------------
Overall 59.99 -
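
For intuition, here is one plausible reading of the two metrics, sketched in Python: agreement as the fraction of predictions matching the human label, and consistency as the fraction of examples where the original-order and swapped-order predictions point to the same response. This is an assumption about the definitions, not the exact logic of `pairwise_eval.py`:

```python
def flip(label):
    # In the swapped-order file, 0 and 1 refer to swapped positions; ties are unchanged.
    return {0: 1, 1: 0, 2: 2}[label]

def agreement(gold, pred):
    # Percentage of predictions that match the human preference labels.
    return 100 * sum(g == p for g, p in zip(gold, pred)) / len(gold)

def consistency(pred, pred_swapped):
    # Percentage of examples where both presentation orders yield the same effective verdict.
    return 100 * sum(p == flip(q) for p, q in zip(pred, pred_swapped)) / len(pred)
```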
The data to be critiqued is in `../../data/test/testdata_critique.jsonl`.

Prepare your output in the following format; you can find an example (generated by Auto-J) in `../../data/outputs/critique_example_output.jsonl`:

{"output": "the critiques for the first query-response pair"}
{"output": "the critiques for the second query-response pair"}
...
{"output": "the critiques for the 231st query-response pair"}
{"output": "the critiques for the 232nd query-response pair"}
Then use GPT-4 as the judge to compare your critiques with the reference critiques (generated by ChatGPT, i.e., gpt-3.5-turbo-0613). Suppose your critique file is `your/critique/file.jsonl`:

python pairwise_critique_openai_eval.py \
    --source_file ../../data/test/testdata_critique.jsonl \
    --openai_model gpt-4 \
    --critic_file your/critique/file.jsonl \
    --critic_name auto-j \
    --reference_file ../../data/test/reference_chatgpt_critique.jsonl \
    --openai_api "your-openai-key" \
    --openai_org "your-openai-org-code (you can remove this line if you do not need to assign a specific organization)" \
    --batch_size 3 \
    --language "English"  # You can also choose "Chinese"
If the program is interrupted, you can simply rerun the same command to continue the evaluation; it will detect how many comparisons have already been done and continue from there.
After that, the comparison results will be stored in `../../data/outputs/gpt-4-Eval_{critic_name}_vs_chatgpt.jsonl`. We provide an example in `../../data/outputs/gpt-4-turbo-Eval_auto-j_vs_chatgpt.jsonl`, which looks like this:

{"output": "A: Feedback 1 is significantly better. ...", "cost": 0.03564, "finish_reason": "stop", "meta": {"exchange": false}}
{"output": "A: Feedback 1 is significantly better. ...", "cost": 0.0318, "finish_reason": "stop", "meta": {"exchange": false}}
...
{"output": "B: Feedback 2 is significantly better. ...", "cost": 0.04626, "finish_reason": "stop", "meta": {"exchange": true}}
{"output": "A: Feedback 1 is significantly better. ...", "cost": 0.03867, "finish_reason": "stop", "meta": {"exchange": true}}
Sometimes, due to OpenAI API rate limits (or other errors), you may get lines like this:

{"output":"Failed!","cost":0.0,"finish_reason":"fail","meta":{"exchange":false,"idx":77}}

To fix this, just add `--fix_mode` as an extra argument and rerun the command; it will automatically fix the failed lines.

python pairwise_critique_openai_eval.py \
    ... (same arguments as above) ... \
    --fix_mode
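
Before rerunning, you can quickly count how many comparisons still need fixing (a small sketch, assuming the output format shown above; adjust the path to your own `{openai_model}`/`{critic_name}` combination):

```python
import json

# Count comparisons that ended with finish_reason == "fail" and still need --fix_mode.
with open("../../data/outputs/gpt-4-Eval_auto-j_vs_chatgpt.jsonl") as f:
    failed = sum(1 for line in f if json.loads(line)["finish_reason"] == "fail")
print(f"{failed} failed comparisons remaining")
```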
Then, resolve the comparison results given by GPT-4 as follows:
python critique_eval.py \
    --source_file ../../data/test/testdata_critique.jsonl \
    --openai_comparison_file ../../data/outputs/{openai_model}-Eval_{critic_name}_vs_chatgpt.jsonl

You may get results like this (using the file `../../data/outputs/gpt-4-turbo-Eval_auto-j_vs_chatgpt.jsonl`):
| Group                 | Winrate |
|-----------------------|---------|
| Summarization         | 100.0   |
| Exam Questions        | 83.33   |
| Code                  | 80.0    |
| Rewriting             | 70.0    |
| Creative Writing      | 72.22   |
| Functional Writing    | 55.0    |
| General Communication | 75.0    |
| NLP Tasks             | 79.55   |
| Overall               | 73.71   |
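
If you want a rough sense of how the comparison file turns into a win rate, the sketch below parses the verdict letter from each GPT-4 output and un-swaps the exchanged comparisons. It is only an illustration under stated assumptions (that "Feedback 1" is your critique in the non-exchanged order, and that ties count as half a win), not the exact logic of `critique_eval.py`:

```python
import json

wins = ties = total = 0
for line in open("../../data/outputs/gpt-4-turbo-Eval_auto-j_vs_chatgpt.jsonl"):
    rec = json.loads(line)
    if rec.get("finish_reason") != "stop":
        continue                                  # skip failed comparisons
    letter = rec["output"].strip()[0]             # "A", "B", or another letter for a tie
    if rec["meta"]["exchange"]:                   # critiques were presented in swapped order
        letter = {"A": "B", "B": "A"}.get(letter, letter)
    if letter == "A":                             # assumption: "A" means your critique wins
        wins += 1
    elif letter != "B":
        ties += 1
    total += 1

print("approx. overall win rate:", round(100 * (wins + 0.5 * ties) / total, 2))
```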