Hi,
I found that in the `evol_instruct.jsonl` dataset, where the principle is 100% helpfulness, some answers with a higher overall_score have a lower helpfulness score.
For example, the scores of the 9th sample in `evol_instruct.jsonl` are as follows:
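| models | helpfulness | honesty | instruction following | truthfulness | overall score |
| --- | --- | --- | --- | --- | --- |
| gpt-3.5-turbo | 4 | 5 | 4 | 5 | 7 |
| llama-2-70b-chat | 4 | 4 | 5 | 5 | 7.5 |
| mpt-30b-chat | 3 | 4 | 3 | 5 | 6.5 |
| vicuna-33b | 5 | 4 | 4 | 5 | 6.5 |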
The answer from vicuna-33b has the highest helpfulness score but the lowest overall score (6.5, tied with mpt-30b-chat).
My question is: should I pick the answer with the highest overall score or the one with the highest helpfulness score as the preferred answer, or should I use the mean of the four principle scores?
Any suggestions would be appreciated, thanks.
The overall and fine-grained scores are annotated under different schemes and thus may not strictly match each other. Specifically, fine-grained scores are annotated according to our hand-written documentation, while overall scores rely entirely on GPT-4 itself, with the textual critique serving as the CoT rationale for the score.
We investigated the effects of both kinds of scores in our paper (see section 4.1) and found that using fine-grained scores was slightly better. But note that the experiments were based on the previous "bugged" version of the overall scores (see this issue), and we are not sure whether the conclusion in the paper still applies to our updated scores.
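To make the three options from the question concrete, here is a minimal sketch that ranks the four completions of the 9th sample by overall score, by helpfulness alone, and by the mean of the four fine-grained scores. The numbers are copied from the table in the question; the dictionary layout, the key names, and the helper functions are illustrative assumptions, not the actual schema of `evol_instruct.jsonl` or code from the paper.

```python
from statistics import mean

# Scores copied from the table in the question (9th sample).
# NOTE: this layout and these key names are assumptions for illustration,
# not the real field names used in evol_instruct.jsonl.
sample = {
    "gpt-3.5-turbo":    {"helpfulness": 4, "honesty": 5, "instruction_following": 4, "truthfulness": 5, "overall": 7.0},
    "llama-2-70b-chat": {"helpfulness": 4, "honesty": 4, "instruction_following": 5, "truthfulness": 5, "overall": 7.5},
    "mpt-30b-chat":     {"helpfulness": 3, "honesty": 4, "instruction_following": 3, "truthfulness": 5, "overall": 6.5},
    "vicuna-33b":       {"helpfulness": 5, "honesty": 4, "instruction_following": 4, "truthfulness": 5, "overall": 6.5},
}

ASPECTS = ("helpfulness", "honesty", "instruction_following", "truthfulness")


def fine_grained_mean(scores):
    """Average of the four fine-grained aspect scores for one completion."""
    return mean(scores[aspect] for aspect in ASPECTS)


def rank(completions, key):
    """Model names sorted from best to worst under the given scoring function.

    sorted() is stable, so tied models keep their original order.
    """
    return sorted(completions, key=lambda model: key(completions[model]), reverse=True)


by_overall = rank(sample, lambda s: s["overall"])
by_helpfulness = rank(sample, lambda s: s["helpfulness"])
by_mean = rank(sample, fine_grained_mean)

print("by overall score:    ", by_overall)
print("by helpfulness only: ", by_helpfulness)
print("by fine-grained mean:", by_mean)
```

On this sample the three criteria disagree: the overall score prefers llama-2-70b-chat, helpfulness alone prefers vicuna-33b, and the fine-grained mean is 4.5 for three of the four models, so it only separates out mpt-30b-chat (3.75). Whichever criterion is used to build preference pairs, a tie-handling rule (for example, skipping tied pairs) would need to be chosen as well.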