Evaluation - Differences between F1 scores #5
Thanks! Much appreciated, and also curious to hear confirmation and more info :)
So, when I run the evaluation script, I don't get the same numbers for the various F1 scores. Example:
So I get 64.6 for (UC) F1, 49.9 for F1, and 64.4 for Mean-F1. The fact that (UC) F1 is a little higher than Mean-F1 makes sense to me because, as you explained, UC means uncased; in fact, in all my evaluations, (UC) F1 >= Mean-F1. So far so good. But what is the F1 that is considerably(!) lower, at just 49.9, supposed to be? I also have some other questions:
SUM (66.1±3.2 | 0±0 | 0±0 | 59.6±4.4 | 0±0 | 0±0 | 0±0 | 0±0) / 8 keys = 15.7125. This doesn't match any of the given F1 scores (Mean-F1: 24.4, F1: 24.67, F1 (UC): 24.67). So how are these F1 scores calculated?
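To make the arithmetic in the question above explicit: the quoted sum is a plain mean over all 8 keys, including the six keys whose F1 is zero. This is just a sketch of that calculation using the numbers quoted in the comment, not the repository's actual code:

```python
# Per-key F1 values as quoted above (means only; the ±std parts are dropped).
# Six of the eight keys scored 0.
per_key_f1 = [66.1, 0.0, 0.0, 59.6, 0.0, 0.0, 0.0, 0.0]

# Plain unweighted mean over all 8 keys, zeros included.
mean_over_all_keys = sum(per_key_f1) / len(per_key_f1)
print(mean_over_all_keys)  # ~15.7125, as in the comment
```

Since this simple mean (15.7125) matches none of the reported scores (24.4 / 24.67 / 24.67), the script presumably excludes empty keys, weights keys somehow, or aggregates at a different level, which is exactly what the question asks the maintainers to clarify.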
I think everyone would really benefit if the evaluation could be explained holistically...
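One hedged guess (not confirmed by the repository) for why a single F1 can sit well below the per-key Mean-F1, as in the 49.9 vs. 64.4 example above, is micro- vs. macro-averaging: if one frequent key is much harder than the rest, a micro-averaged F1 over pooled counts drops while the macro mean over keys stays high. The keys and counts below are invented purely for illustration:

```python
# Invented per-key true-positive / false-positive / false-negative counts:
# one rare, easy key and one frequent, hard key.
per_key = {
    "name": {"tp": 9,  "fp": 1,  "fn": 1},   # rare but easy
    "date": {"tp": 10, "fp": 90, "fn": 90},  # frequent but hard
}

def f1_from_counts(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Macro: average the per-key F1 scores ("Mean-F1"-style aggregation).
macro = sum(f1_from_counts(**c) for c in per_key.values()) / len(per_key)

# Micro: pool all counts first, then compute one global F1.
tp = sum(c["tp"] for c in per_key.values())
fp = sum(c["fp"] for c in per_key.values())
fn = sum(c["fn"] for c in per_key.values())
micro = f1_from_counts(tp, fp, fn)

print(macro, micro)  # macro = 0.5, micro ≈ 0.17: macro well above micro
```

Whether the script's "F1" really is a micro average is exactly the kind of thing a holistic explanation of the evaluation would settle.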
@filipggg thank you for taking the time!
Re 1: So, in a nutshell, you Correct?
Re 2: Yes, that makes sense to me now. Just to reiterate for future readers: However, I noticed that the evaluation seems to have an issue, which I will describe in a new issue.
Re 3: Would be great if you could double-check.
The evaluation script puts out 3 different F1 scores: F1, Mean-F1, and (UC) F1. I have 3 questions:
Thanks again for your work!
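For future readers, the cased vs. uncased (UC) distinction discussed in this thread can be sketched as follows. This is a minimal illustration of set-style matching with an optional normalizer, not the repository's actual evaluation code, and the key values are made up:

```python
# Toy F1 over predicted vs. gold value strings. With no normalizer the
# comparison is case-sensitive; passing str.lower gives an "uncased" F1.
def f1(preds, golds, normalize=lambda s: s):
    preds = [normalize(p) for p in preds]
    golds = [normalize(g) for g in golds]
    tp = sum(1 for p in preds if p in golds)  # exact string matches
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(golds) if golds else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["John Smith", "1905"]
pred = ["john smith", "1905"]

print(f1(pred, gold))             # 0.5 — cased: only "1905" matches
print(f1(pred, gold, str.lower))  # 1.0 — uncased: both values match
```

This also shows why (UC) F1 >= Mean-F1 is the expected pattern: lowercasing can only turn mismatches into matches, never the reverse.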