[BFCL] Use `N/A` in Score Report for Unevaluated Categories #849

HuanzhiMao · 2024-12-20T14:10:43Z

Previously, when a test category was not evaluated, it was represented as a “0” in the score CSV files. This made it difficult to distinguish between a category that was unevaluated and one where the model actually scored a 0. In this PR, unevaluated categories will be marked as “N/A” to clarify this distinction.
Additionally, this PR refactors the code related to the score report output section to reduce duplication and improve maintainability.

Note: This change will not affect leaderboard scores. If a category is unevaluated, it will still be treated as a 0 when calculating the overall accuracy, and the overall accuracy column will report the score with 0 taken into account.
For summary columns that are the average of a few categories (for example, python simple, which consists of simple, java, and javascript), if any of the categories involved is unevaluated, the summary column will be marked as `N/A` to avoid confusion.
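To make the three rules above concrete, here is a minimal sketch. It is not the code in `eval_runner_helper.py`: the function names are hypothetical, an unevaluated category is represented as `None`, and every aggregate is assumed to be a plain unweighted mean, which may differ from how the leaderboard actually weights categories.

```python
from typing import List, Optional

NA = "N/A"


def display(acc: Optional[float]) -> str:
    """Format one CSV cell: N/A when the category was never evaluated."""
    return NA if acc is None else f"{acc:.2%}"


def summary(accs: List[Optional[float]]) -> Optional[float]:
    """Summary column: None (shown as N/A) if any member is unevaluated."""
    if any(a is None for a in accs):
        return None
    return sum(accs) / len(accs)


def overall(accs: List[Optional[float]]) -> float:
    """Overall accuracy: unevaluated categories still count as 0."""
    return sum(0.0 if a is None else a for a in accs) / len(accs)
```

With this shape, a model that genuinely scored 0 renders as `0.00%`, while a skipped category renders as `N/A`, yet both contribute 0 to `overall`.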


This PR improves the behavior of the generation and evaluation pipeline. When executable categories are involved and API keys are not provided in the `.env` file, instead of throwing an error, the affected categories will now be skipped. This enhancement provides a smoother experience for first-time users.

1. What will happen to the overall score? What would be the difference between the score on the official BFCL leaderboard and the score without Executable?
   If the API key is not provided, that category will not be evaluated and will be treated as 0 by default in the overall score calculation. This means the overall score (and the one on the leaderboard) will be hurt if the API keys are not supplied. PR #849 should make things clearer.
2. What percentage of entries are executable?
   310 in total, out of 4751 entries (about 6.5%).
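The skip-instead-of-fail behavior described above could look roughly like the sketch below. The key names, the `exec_` prefix used to detect executable categories, and the function names are all assumptions made for illustration; they are not taken from the BFCL code.

```python
import os

# Hypothetical key names, for illustration only.
REQUIRED_KEYS = ("EXAMPLE_API_KEY_1", "EXAMPLE_API_KEY_2")


def missing_keys(env=os.environ):
    """Return the required keys that are unset or empty in env."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]


def select_categories(categories, env=os.environ):
    """Return (to_run, skipped) instead of raising when keys are missing."""
    if not missing_keys(env):
        return list(categories), []
    # Assumption: executable categories are identified by an "exec_" prefix.
    to_run = [c for c in categories if not c.startswith("exec_")]
    skipped = [c for c in categories if c.startswith("exec_")]
    return to_run, skipped
```

The skipped list can then be reported to the user and rendered as `N/A` in the score CSV, rather than aborting the whole run.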

Fanjia-Yan

Multi-turn or Live won't have N/A in the score CSV. Is that expected?

berkeley-function-call-leaderboard/bfcl/eval_checker/eval_runner_helper.py

HuanzhiMao added 2 commits December 20, 2024 01:17

minor function name change

4bc0b59

use N/A for accuracy, refactor helper function to reduce code duplication

b1b0c12

HuanzhiMao changed the title ~~[BFCL] Use N'A in Score Report for Unevaluated Categories~~ [BFCL] Use N/A in Score Report for Unevaluated Categories Dec 20, 2024

HuanzhiMao mentioned this pull request Dec 21, 2024

[BFCL] Skip Executable Categories When API Keys Missing #848

Merged

HuanzhiMao requested review from ShishirPatil, CharlieJCJ and Fanjia-Yan December 21, 2024 15:05

Fanjia-Yan requested changes Dec 21, 2024

berkeley-function-call-leaderboard/bfcl/eval_checker/eval_runner_helper.py (outdated, resolved)

HuanzhiMao requested a review from Fanjia-Yan December 22, 2024 04:44

HuanzhiMao added 5 commits December 21, 2024 21:47

Merge remote-tracking branch 'upstream/main' into nan-on-csv-output

ea2093c

update change log

ac74caa

Merge remote-tracking branch 'upstream/main' into nan-on-csv-output

110ee13

fix logic for live and multi turn score section

aa91646

better display for summary columns

1010e39
