Evaluation - Differences between F1 scores #5

Open
ivo-1 opened this issue Nov 23, 2022 · 5 comments

ivo-1 commented Nov 23, 2022

The evaluation script puts out 3 different F1 scores:

  1. Column F1 for the (UC) row
  2. F1 score (below all the keys)
  3. Mean F1 score

I have 3 questions:

  1. What does UC mean?
  2. How do those F1 scores compare?
  3. Which F1 score is reported in the accompanying paper?

Thanks again for your work!

@tstanislawek (Contributor)

cc: @filipggg

ad 1) UC -> uncased (we do not check the correctness of the casing; see the sketch below)
ad 2) @filipggg should know the answer
ad 3) F1 (UC) or Mean F1 (both should give the same numbers) -> but @filipggg please confirm
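For illustration, a minimal sketch of what the uncased comparison means for a single key (hypothetical helper, not the actual evaluation code):

```python
def match(expected: str, predicted: str, uncased: bool = False) -> bool:
    """Does a predicted value count as correct for one key?"""
    if uncased:
        # (UC) variant: casing mistakes are forgiven
        return expected.lower() == predicted.lower()
    return expected == predicted

print(match("London", "LONDON"))                # False (cased)
print(match("London", "LONDON", uncased=True))  # True  (uncased / UC)
```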

ivo-1 commented Dec 18, 2022

Thanks! Much appreciated, and I'm also curious to hear the confirmation and more info :)

ivo-1 commented Dec 21, 2022

So, when I run the evaluation script, I don't get the same numbers for the various F1 scores.

Example:

           F1          P           R
(UC)       64.6±1.8    64.5±1.7    64.8±1.9
address    60.0±3.0    58.7±3.0    61.3±3.1
money      46.1±4.3    46.9±4.3    45.2±4.2
town       76.9±3.8    75.6±3.9    78.3±3.9
postcode   67.3±4.4    67.0±4.2    67.8±4.7
street     34.5±4.9    33.0±4.8    36.3±5.0
name       59.8±4.5    59.9±4.5    59.7±4.4
number     87.9±3.0    89.1±2.9    86.7±3.2
income     45.6±4.7    46.2±4.8    44.8±4.4
spending   47.1±4.6    47.8±4.5    46.5±4.7
date       95.7±1.8    95.9±1.8    95.5±1.8
F1         49.9±1.4
Accuracy    4.7±1.9
Mean-F1    64.4±1.8

So I get 64.6 for (UC) F1, 49.9 for F1, and 64.4 for Mean-F1. The fact that (UC) F1 is a little higher than Mean-F1 makes sense to me because, as you explained, UC means uncased. In fact, in all my evaluations, (UC) F1 >= Mean-F1. So far so good. But what is the F1 that is considerably(!) lower, at just 49.9, supposed to be?

I also have some other questions:

  1. Why are there confidence intervals (±), and what do they mean, considering there is always exactly one correct answer? I'm struggling to see how this makes sense.

  2. Is the Mean-F1 a micro- or macro-average? By my calculation (summing the F1 scores of the keys [town, postcode, street, name, number, income, spending, date] (I'm not sure whether these are the (UC) scores or not) and dividing by 8), the macro-average is 64.35, which would align with the Mean-F1 score given in the evaluation (rounded). However, for the hand-crafted run provided at https://kleister.info/challenge/kleister-charity the math doesn't check out:

SUM (66.1±3.2 | 0±0 | 0±0 | 59.6±4.4 | 0±0 | 0±0 | 0±0 | 0±0) / 8 keys = 15.7125

This doesn't match any of the given F1 scores (Mean-F1: 24.4, F1: 24.67, F1 (UC): 24.67).

So how are these F1 scores calculated?

  3. Which of the three F1 scores are you reporting in the paper?

I think everyone would really benefit if the evaluation could be explained holistically...

filipggg commented Jan 2, 2023

@ivo-1

  1. Confidence intervals come from bootstrap sampling, similar to how it is commonly used in machine translation (see e.g. https://aclanthology.org/W04-3250.pdf).

  2. F1 is a micro-average. Mean-F1 is a macro-average, but averaged across documents, not across data point classes (see the sketch below).

  3. AFAIR it was F1, but I'd need to double-check this.
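To make the micro vs. macro distinction in item 2 concrete, here is a hypothetical sketch with made-up toy data (the helper functions are mine, not the actual evaluation script):

```python
from collections import Counter

def counts(expected, predicted):
    """True positives and set sizes for one document's key=value pairs."""
    tp = sum((Counter(expected) & Counter(predicted)).values())
    return tp, len(expected), len(predicted)

def f1(tp, n_expected, n_predicted):
    p = tp / n_predicted if n_predicted else 0.0
    r = tp / n_expected if n_expected else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy (expected, predicted) key=value pairs for two documents
docs = [
    (["date=2020-01-01", "town=London"], ["date=2020-01-01", "town=Londn"]),  # doc 1
    (["date=2019-05-05"], ["date=2019-05-05"]),                               # doc 2
]

# Micro-average ("F1" row): pool the counts over all documents, then one F1
tp = n_exp = n_pred = 0
for expected, predicted in docs:
    t, e, p = counts(expected, predicted)
    tp, n_exp, n_pred = tp + t, n_exp + e, n_pred + p
print("micro F1:", round(f1(tp, n_exp, n_pred), 3))        # 0.667

# Macro-average over documents ("Mean-F1"): one F1 per document, then the mean
per_doc = [f1(*counts(expected, predicted)) for expected, predicted in docs]
print("mean F1 :", round(sum(per_doc) / len(per_doc), 3))  # 0.75
```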

ivo-1 commented Jan 3, 2023

@filipggg thank you for taking the time!

Re 1: So, in a nutshell, you
a.) draw 440 predictions with replacement from the 440 total predictions
b.) evaluate these 440 samples against the gold standard (0 for wrong, 1 for correct, per key)
c.) calculate the sample mean and sample variance accordingly
d.) use Student’s t-distribution to calculate the true mean with probability 0.95 and the respective confidence interval around the true mean
e.) repeat steps a.) to d.) e.g. 1000 times to get 1000 different distributions, then drop the 25 distributions with the lowest true mean and the 25 with the highest true mean, and calculate the average true mean with confidence 0.95 from the remaining 950 distributions.

Correct?
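For reference, a minimal percentile-bootstrap sketch in the spirit of the linked paper (this is the generic technique, not necessarily exactly what the Kleister evaluation script does; the per-document scores and resample count are made up):

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-document scores."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample the documents with replacement and score the resample
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / len(scores), lower, upper

# Made-up 0/1 per-document scores standing in for the 440 predictions
scores = [1] * 280 + [0] * 160
mean, lo, hi = bootstrap_ci(scores)
print(f"{mean:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```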

Re 2: Yes, that makes sense to me now. Just to reiterate for future readers:
(UC) row, F1 column (top-left corner of the table): micro-averaged F1 score (case-insensitive)
F1 (below the table with all the keys): micro-averaged F1 score (case-sensitive)
Mean-F1: macro-averaged F1 score (averaged over documents) (case-insensitive)

However, I noticed that the evaluation seems to have an issue, which I will describe in a new issue.

Re 3: Would be great if you could double-check.
