[BFCL] Leaderboard Update, 10/21/2024 (#672)
This PR updates the leaderboard to reflect the score changes caused by
the following merged PRs:

1. #660 
2. #661
3. #683
4. #679
5. #708 
6. #709
7. #701
8. #657 
9. #658 
10. #640 
11. #653
12. #642 
13. #696 
14. #667

Close #662.

Note: Some models (like `firefunction`, `functionary`,
`microsoft/phi`) are not included in this leaderboard update because we
don't have all of their entries generated yet. We will add them back once
the full results are generated.
HuanzhiMao authored Oct 21, 2024
1 parent 13c46f0 commit 9032355
Showing 7 changed files with 188 additions and 38 deletions.
2 changes: 1 addition & 1 deletion data/forms/formHeaders.json
@@ -1,6 +1,6 @@
{
"forms": {
"combined_Sep_20_2024": {
"overall": {
"rows": [
{
"columns": [
33 changes: 0 additions & 33 deletions data_combined_Sep_20_2024.csv

This file was deleted.

61 changes: 61 additions & 0 deletions data_live.csv
@@ -0,0 +1,61 @@
Rank,Model,Live Overall Acc,AST Summary,Python Simple AST,Python Multiple AST,Python Parallel AST,Python Parallel Multiple AST,Irrelevance Detection,Relevance Detection
1,Mistral-Medium-2312 (Prompt),100.00%,0.00%,0.00%,0.00%,0.00%,0.00%,100.00%,0.00%
2,Gemini-1.5-Flash-002 (Prompt),76.28%,78.20%,77.91%,78.30%,93.75%,66.67%,72.91%,85.37%
3,GPT-4-turbo-2024-04-09 (FC),76.23%,77.45%,77.52%,77.63%,81.25%,66.67%,74.51%,73.17%
4,GPT-4o-2024-08-06 (FC),75.43%,74.98%,74.42%,75.12%,81.25%,70.83%,76.69%,63.41%
5,o1-mini-2024-09-12 (Prompt),75.39%,71.39%,73.26%,71.07%,75.00%,62.50%,82.74%,48.78%
6,ToolACE-8B (FC),74.99%,73.33%,66.67%,74.93%,81.25%,70.83%,77.26%,80.49%
7,GPT-4o-mini-2024-07-18 (Prompt),74.63%,75.51%,79.46%,74.35%,93.75%,70.83%,73.26%,75.61%
8,Gemini-1.5-Pro-002 (Prompt),74.41%,77.00%,77.52%,76.76%,87.50%,75.00%,70.86%,65.85%
9,Gemini-1.5-Pro-001 (Prompt),73.12%,69.14%,67.44%,69.24%,93.75%,66.67%,80.00%,56.10%
10,xLAM-8x22b-r (FC),71.97%,79.40%,78.29%,80.14%,75.00%,62.50%,60.00%,85.37%
11,GPT-4o-mini-2024-07-18 (FC),70.19%,74.23%,72.87%,74.45%,87.50%,70.83%,63.54%,80.49%
12,Hammer2.0-7b (FC),69.79%,76.63%,74.42%,77.15%,81.25%,75.00%,58.17%,95.12%
13,Command-R-Plus (Prompt) (Original),69.75%,69.59%,66.67%,70.30%,68.75%,70.83%,69.83%,73.17%
14,Gemma-2-27b-it (Prompt),69.48%,77.30%,79.46%,77.24%,68.75%,62.50%,56.69%,87.80%
15,Gemma-2-9b-it (Prompt),69.21%,73.11%,73.64%,73.58%,56.25%,58.33%,62.40%,87.80%
16,Gemini-1.5-Flash-001 (Prompt),69.21%,75.21%,74.42%,75.12%,93.75%,75.00%,59.43%,82.93%
17,xLAM-8x7b-r (FC),69.12%,74.53%,68.22%,76.76%,62.50%,54.17%,60.00%,87.80%
18,GPT-4-turbo-2024-04-09 (Prompt),69.04%,84.64%,85.66%,84.57%,87.50%,75.00%,44.57%,82.93%
19,mistral-large-2407 (FC),68.37%,79.55%,81.78%,79.27%,68.75%,75.00%,50.97%,75.61%
20,xLAM-7b-r (FC),67.88%,72.28%,71.32%,73.48%,31.25%,58.33%,59.77%,97.56%
21,GPT-3.5-Turbo-0125 (Prompt),67.48%,64.27%,63.57%,64.61%,68.75%,54.17%,71.77%,80.49%
22,Gorilla-OpenFunctions-v2 (FC),67.44%,61.42%,73.64%,58.73%,68.75%,41.67%,76.34%,73.17%
23,Gemini-1.5-Flash-002 (FC),67.35%,57.98%,58.14%,57.96%,68.75%,50.00%,81.94%,60.98%
24,Meta-Llama-3-70B-Instruct (Prompt),66.15%,79.10%,78.68%,79.65%,68.75%,66.67%,45.14%,92.68%
25,Qwen2.5-7B-Instruct (Prompt),65.97%,72.13%,72.48%,72.32%,62.50%,66.67%,55.31%,92.68%
26,Gemini-1.5-Pro-001 (FC),65.53%,58.05%,57.75%,58.24%,75.00%,41.67%,77.03%,63.41%
27,Claude-3-Haiku-20240307 (Prompt),65.04%,74.53%,77.13%,74.64%,68.75%,45.83%,49.71%,82.93%
28,Gemini-1.5-Flash-001 (FC),64.90%,59.48%,58.14%,60.46%,43.75%,41.67%,73.49%,58.54%
29,Gemini-1.5-Pro-002 (FC),64.59%,61.05%,58.91%,61.33%,81.25%,58.33%,69.71%,70.73%
30,Hammer2.0-1.5b (FC),63.22%,68.76%,70.54%,68.56%,56.25%,66.67%,53.37%,92.68%
31,Open-Mistral-Nemo-2407 (FC),62.37%,68.46%,71.71%,67.79%,62.50%,66.67%,53.14%,60.98%
32,DBRX-Instruct (Prompt),62.33%,72.06%,74.81%,71.65%,75.00%,58.33%,46.29%,87.80%
33,GPT-4o-2024-08-06 (Prompt),62.19%,42.55%,42.64%,42.82%,25.00%,41.67%,93.37%,36.59%
34,Hermes-2-Pro-Llama-3-8B (FC),61.79%,64.57%,67.44%,64.42%,56.25%,45.83%,57.83%,56.10%
35,Qwen2.5-1.5B-Instruct (Prompt),61.71%,60.37%,64.73%,59.88%,50.00%,41.67%,63.09%,75.61%
36,GPT-3.5-Turbo-0125 (FC),61.22%,76.25%,74.42%,77.82%,43.75%,50.00%,36.57%,97.56%
37,Llama-3.1-70B-Instruct (Prompt),61.13%,72.58%,77.13%,71.46%,87.50%,62.50%,42.17%,92.68%
38,Hermes-2-Pro-Llama-3-70B (FC),60.51%,55.28%,63.18%,53.04%,56.25%,66.67%,68.46%,60.98%
39,MiniCPM3-4B (FC),59.88%,50.71%,56.98%,49.47%,56.25%,33.33%,73.94%,58.54%
40,Gemini-1.0-Pro-002 (FC),58.91%,55.81%,58.91%,56.12%,37.50%,20.83%,63.20%,68.29%
41,Llama-3.1-8B-Instruct (Prompt),57.93%,71.31%,71.32%,72.23%,50.00%,45.83%,36.57%,78.05%
42,Granite-20b-FunctionCalling (FC),57.49%,57.08%,65.12%,55.35%,43.75%,54.17%,56.34%,95.12%
43,Command-R-Plus (FC) (Original),57.26%,61.50%,66.67%,60.56%,56.25%,50.00%,49.14%,92.68%
44,Hermes-2-Pro-Mistral-7B (FC),56.46%,59.85%,64.73%,59.40%,43.75%,37.50%,50.40%,75.61%
45,Claude-3.5-Sonnet-20240620 (Prompt),54.24%,31.24%,65.12%,22.66%,37.50%,33.33%,90.97%,19.51%
46,Qwen2-7B-Instruct (Prompt),54.24%,61.57%,59.30%,62.20%,50.00%,66.67%,41.49%,87.80%
47,Nexusflow-Raven-v2 (FC),53.49%,39.03%,39.92%,38.48%,56.25%,41.67%,74.97%,65.85%
48,xLAM-7b-fc-r (FC),53.44%,60.07%,75.58%,57.28%,43.75%,25.00%,42.51%,70.73%
49,Hammer2.0-0.5b (FC),52.42%,45.17%,48.84%,44.07%,62.50%,41.67%,61.94%,85.37%
50,Llama-3.2-3B-Instruct (Prompt),50.91%,44.49%,47.67%,44.74%,0.00%,29.17%,60.11%,63.41%
51,Meta-Llama-3-8B-Instruct (Prompt),50.51%,59.78%,60.85%,60.75%,37.50%,20.83%,35.20%,75.61%
52,Gemini-1.0-Pro-002 (Prompt),45.67%,38.13%,41.47%,36.93%,68.75%,33.33%,55.54%,80.49%
53,Gemma-2-2b-it (Prompt),41.63%,11.46%,11.24%,11.96%,0.00%,0.00%,89.03%,12.20%
54,Llama-3.1-70B-Instruct (FC),39.09%,0.52%,0.39%,0.10%,25.00%,4.17%,99.77%,0.00%
55,Qwen2-1.5B-Instruct (Prompt),39.00%,41.87%,50.39%,40.50%,25.00%,20.83%,32.91%,75.61%
56,Llama-3.2-3B-Instruct (FC),38.92%,0.00%,0.00%,0.00%,0.00%,0.00%,100.00%,2.44%
57,Llama-3.2-1B-Instruct (FC),38.78%,0.00%,0.00%,0.00%,0.00%,0.00%,99.77%,0.00%
58,xLAM-1b-fc-r (FC),38.34%,54.31%,63.18%,54.19%,0.00%,0.00%,11.20%,97.56%
59,Llama-3.1-8B-Instruct (FC),33.10%,47.12%,48.45%,47.16%,37.50%,37.50%,8.91%,92.68%
60,Llama-3.2-1B-Instruct (Prompt),29.85%,8.91%,25.97%,4.82%,6.25%,4.17%,60.91%,48.78%
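
A quick sanity check for a file like `data_live.csv` could look like the sketch below. It is not part of this commit. It assumes the column names shown in the header row above, and the helper names (`pct`, `load_rows`, `ranks_follow_accuracy`) are hypothetical. The sample embeds three rows copied from the diff so the snippet runs on its own; in practice you would open the CSV file instead.

```python
import csv
import io

# Header and three rows copied verbatim from data_live.csv in this diff.
SAMPLE = """\
Rank,Model,Live Overall Acc,AST Summary,Python Simple AST,Python Multiple AST,Python Parallel AST,Python Parallel Multiple AST,Irrelevance Detection,Relevance Detection
2,Gemini-1.5-Flash-002 (Prompt),76.28%,78.20%,77.91%,78.30%,93.75%,66.67%,72.91%,85.37%
3,GPT-4-turbo-2024-04-09 (FC),76.23%,77.45%,77.52%,77.63%,81.25%,66.67%,74.51%,73.17%
4,GPT-4o-2024-08-06 (FC),75.43%,74.98%,74.42%,75.12%,81.25%,70.83%,76.69%,63.41%
"""


def pct(text):
    """Convert a percentage string such as '76.28%' to a float."""
    return float(text.strip().rstrip("%"))


def load_rows(source):
    """Read leaderboard rows from a file-like object into a list of dicts."""
    return list(csv.DictReader(source))


def ranks_follow_accuracy(rows):
    """Return True if 'Live Overall Acc' never increases as Rank increases."""
    ordered = sorted(rows, key=lambda row: int(row["Rank"]))
    accs = [pct(row["Live Overall Acc"]) for row in ordered]
    # Ties are allowed, so compare with >= rather than >.
    return all(a >= b for a, b in zip(accs, accs[1:]))


rows = load_rows(io.StringIO(SAMPLE))
print(ranks_follow_accuracy(rows))
```

The comparison allows ties because the table contains rows with identical overall accuracy (for example ranks 15 and 16).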