The data to be evaluated is in `../../data/test/testdata_pairwise.jsonl`.

Prepare your output in the following format (an example is in `../../data/outputs/pairwise_example_output.jsonl`):

{"output": 0}
{"output": 1}
...
{"output": 2}
{"output": 1}
where 0 and 1 mean that the first and the second response is better, respectively, and 2 means they are equally good (a tie).

If the judgment is obtained by pairwise comparison within a single prompt (e.g. "Given the query: {query} \n\n Compare the two responses \n\n response 1: {response1} response 2: {response2} ..."), which may suffer from position bias, you also need to provide your output on the data with the order of the two responses swapped. Note that you don't need to recover the 0 and 1 labels yourself in this file (see `../../data/outputs/pairwise_exchange_example_output.jsonl` for an example).
If the judgment is obtained by rating the two responses independently and then comparing their ratings, as with a standard reward model, you can skip the step above.
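
To make the expected files concrete, here is a minimal sketch of how they might be produced. `judge_pair` is a placeholder for your own judging logic, and the field names read from `testdata_pairwise.jsonl` are assumptions rather than the guaranteed schema:

```python
import json

def judge_pair(query, resp_a, resp_b):
    """Placeholder: replace with your own judge (prompted LLM, reward model, etc.).
    Must return 0 (first better), 1 (second better), or 2 (tie)."""
    raise NotImplementedError

def write_predictions(data_path, out_path, swap=False):
    # Field names ("prompt", "response 1", "response 2") are illustrative;
    # adapt them to the actual schema of testdata_pairwise.jsonl.
    with open(data_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            item = json.loads(line)
            a, b = item["response 1"], item["response 2"]
            if swap:
                a, b = b, a  # swapped order, used to measure position bias
            label = judge_pair(item["prompt"], a, b)
            fout.write(json.dumps({"output": label}) + "\n")

write_predictions("../../data/test/testdata_pairwise.jsonl", "your/output/file.jsonl")
# Only needed for the one-prompt pairwise setup:
write_predictions("../../data/test/testdata_pairwise.jsonl",
                  "your/output/file/for/response/order/swapped.jsonl", swap=True)
```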
Then calculate the agreement rate of your output(s) with human preferences as follows:

python pairwise_eval.py \
    --source_file_path ../../data/test/testdata_pairwise.jsonl \
    --pred_file_path your/output/file.jsonl \
    --exchange_pred_file_path your/output/file/for/response/order/swapped.jsonl \
    --type "pairwise"  # if "single", you do not need to provide `exchange_pred_file_path`
You will get results like this (using the files `../../data/outputs/pairwise_example_output.jsonl` and `../../data/outputs/pairwise_exchange_example_output.jsonl`):

- type = "pairwise"

| Group Name            | Agreement | Consistency |
|-----------------------|-----------|-------------|
| Summarization         | 45.83     | 73.61       |
| Exam Questions        | 38.89     | 69.44       |
| Code                  | 47.5      | 75.83       |
| Rewriting             | 49.17     | 74.17       |
| Creative Writing      | 59.72     | 87.04       |
| Functional Writing    | 61.67     | 81.67       |
| General Communication | 55.21     | 92.36       |
| NLP Tasks             | 57.58     | 86.36       |
| Overall               | 54.96     | 83.41       |
- type = "single"
Group Name Agreement Consistency
----------------------------
Summarization 56.94 -
Exam Questions 41.67 -
Code 54.17 -
Rewriting 57.5 -
Creative Writing 63.43 -
Functional Writing 68.33 -
General Communication 57.29 -
NLP Tasks 62.12 -
----------------------------
Overall 59.99 -
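
For intuition, here is one plausible reading of the two metrics, sketched in Python: agreement as the fraction of predictions matching the human label, and consistency as the fraction of examples where the original-order and swapped-order predictions point to the same response. This is an assumption about the definitions, not the exact logic of `pairwise_eval.py`:

```python
def flip(label):
    # In the swapped-order file, 0 and 1 refer to swapped positions; ties are unchanged.
    return {0: 1, 1: 0, 2: 2}[label]

def agreement(gold, pred):
    # Percentage of predictions that match the human preference labels.
    return 100 * sum(g == p for g, p in zip(gold, pred)) / len(gold)

def consistency(pred, pred_swapped):
    # Percentage of examples where both presentation orders yield the same effective verdict.
    return 100 * sum(p == flip(q) for p, q in zip(pred, pred_swapped)) / len(pred)
```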
The data to be critiqued is in `../../data/test/testdata_critique.jsonl`.

Prepare your output in the following format; you can find an example (generated by Auto-J) in `../../data/outputs/critique_example_output.jsonl`:

{"output": "the critiques for the first query-response pair"}
{"output": "the critiques for the second query-response pair"}
...
{"output": "the critiques for the 231st query-response pair"}
{"output": "the critiques for the 232nd query-response pair"}
Then use GPT-4 as the judge to compare your critiques with the reference critiques (generated by ChatGPT, i.e., gpt-3.5-turbo-0613). Suppose your critique file is `your/critique/file.jsonl`:

python pairwise_critique_openai_eval.py \
    --source_file ../../data/test/testdata_critique.jsonl \
    --openai_model gpt-4 \
    --critic_file your/critique/file.jsonl \
    --critic_name auto-j \
    --reference_file ../../data/test/reference_chatgpt_critique.jsonl \
    --openai_api "your-openai-key" \
    --openai_org "your-openai-org-code (you can remove this line if you do not need to assign a specific organization)" \
    --batch_size 3 \
    --language "English"  # You can also choose "Chinese"
If the program is interrupted, you can simply rerun the same command to continue the evaluation; it will detect how many comparisons have already been done and continue from there.
After that, the comparison results will be stored in `../../data/outputs/gpt-4-Eval_{critic_name}_vs_chatgpt.jsonl`. We provide an example in `../../data/outputs/gpt-4-turbo-Eval_auto-j_vs_chatgpt.jsonl`, which looks like this:

{"output": "A: Feedback 1 is significantly better. ...", "cost": 0.03564, "finish_reason": "stop", "meta": {"exchange": false}}
{"output": "A: Feedback 1 is significantly better. ...", "cost": 0.0318, "finish_reason": "stop", "meta": {"exchange": false}}
...
{"output": "B: Feedback 2 is significantly better. ...", "cost": 0.04626, "finish_reason": "stop", "meta": {"exchange": true}}
{"output": "A: Feedback 1 is significantly better. ...", "cost": 0.03867, "finish_reason": "stop", "meta": {"exchange": true}}
Sometimes, due to OpenAI API rate limits (or other errors), you may get lines like this:

{"output":"Failed!","cost":0.0,"finish_reason":"fail","meta":{"exchange":false,"idx":77}}

To fix this, just add `--fix_mode` as an extra argument and rerun the command; it will automatically fix the failed lines.

python pairwise_critique_openai_eval.py \
    ... (same arguments as above) ... \
    --fix_mode
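
Before rerunning, you can quickly count how many comparisons still need fixing (a small sketch, assuming the output format shown above; adjust the path to your own `{openai_model}`/`{critic_name}` combination):

```python
import json

# Count comparisons that ended with finish_reason == "fail" and still need --fix_mode.
with open("../../data/outputs/gpt-4-Eval_auto-j_vs_chatgpt.jsonl") as f:
    failed = sum(1 for line in f if json.loads(line)["finish_reason"] == "fail")
print(f"{failed} failed comparisons remaining")
```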
Then, resolve the comparison results given by GPT-4 as follows:
python critique_eval.py \
    --source_file ../../data/test/testdata_critique.jsonl \
    --openai_comparison_file ../../data/outputs/{openai_model}-Eval_{critic_name}_vs_chatgpt.jsonl

You may get results like this (using the file `../../data/outputs/gpt-4-turbo-Eval_auto-j_vs_chatgpt.jsonl`):
| Group                 | Winrate |
|-----------------------|---------|
| Summarization         | 100.0   |
| Exam Questions        | 83.33   |
| Code                  | 80.0    |
| Rewriting             | 70.0    |
| Creative Writing      | 72.22   |
| Functional Writing    | 55.0    |
| General Communication | 75.0    |
| NLP Tasks             | 79.55   |
| Overall               | 73.71   |
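
If you want a rough sense of how the comparison file turns into a win rate, the sketch below parses the verdict letter from each GPT-4 output and un-swaps the exchanged comparisons. It is only an illustration under stated assumptions (that "Feedback 1" is your critique in the non-exchanged order, and that ties count as half a win), not the exact logic of `critique_eval.py`:

```python
import json

wins = ties = total = 0
for line in open("../../data/outputs/gpt-4-turbo-Eval_auto-j_vs_chatgpt.jsonl"):
    rec = json.loads(line)
    if rec.get("finish_reason") != "stop":
        continue                                  # skip failed comparisons
    letter = rec["output"].strip()[0]             # "A", "B", or another letter for a tie
    if rec["meta"]["exchange"]:                   # critiques were presented in swapped order
        letter = {"A": "B", "B": "A"}.get(letter, letter)
    if letter == "A":                             # assumption: "A" means your critique wins
        wins += 1
    elif letter != "B":
        ties += 1
    total += 1

print("approx. overall win rate:", round(100 * (wins + 0.5 * ties) / total, 2))
```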