[BFCL] Bug with open-source models #861

Open
Aktsvigun opened this issue Dec 30, 2024 · 10 comments
Labels
BFCL-General General BFCL Issue

Comments

@Aktsvigun

Describe the issue
Hey team, I noticed a critical bug in the evaluation of open-source models. Specifically, the tools / response_format / guided_json arguments are never used, even for FC models (models with native function-calling support). As a result, FC models provide lower performance than their prompt analogs, which is absurd: this feature should improve performance, but it in fact decreases it.
If you agree, I'd be happy to make a PR.

Suggestion
Use the tools / response_format arguments for FC models.

ID datapoint

  1. Datapoint / Model Handler permalink:
@Aktsvigun
Author

Aktsvigun commented Dec 30, 2024

Some numbers to support my claim that "FC models provide lower performance than their prompt analogs":

  • Llama-3.1-70B-Instruct (Prompt): Rank 30
  • Llama-3.1-8B-Instruct (Prompt): Rank 44
  • Llama-3.1-8B-Instruct (FC): Rank 75
  • Llama-3.1-70B-Instruct (FC): Rank 76

With FC, the larger Llama (70B) ranks even lower than its smaller 8B counterpart, which is completely absurd imo.

@HuanzhiMao
Collaborator

Hey @Aktsvigun,
Thanks for the issue.
Could you clarify what you mean by the tools / response_format / guided_json arguments?

@Aktsvigun
Author

Hey @HuanzhiMao, sure!
I mean passing the functions to the request as an argument rather than just inserting them into the prompt. For example, you already do this for OpenAI models:


You can see the tools (functions) are passed inside the tools argument.
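For reference, the call pattern I mean looks roughly like this. This is a minimal sketch, not the actual handler code; the function definition is a simplified stand-in for a BFCL entry:

from openai import OpenAI

client = OpenAI()

# The function definitions go into the dedicated `tools` argument,
# not into the prompt text itself.
tools = [
    {
        "type": "function",
        "function": {
            "name": "soccer_get_last_match",
            "description": "Retrieve the details of the last match played by a specified soccer club.",
            "parameters": {
                "type": "object",
                "properties": {
                    "team_name": {"type": "string", "description": "The name of the soccer club."},
                    "include_stats": {"type": "boolean", "description": "If true, include match statistics."},
                },
                "required": ["team_name"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Get me the details of the last game played by Liverpool F.C. Include its statistics."}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)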

For oss models, however, the functions are always simply inserted into the prompt. The current difference between the FC and Prompt versions of a model (e.g. between Llama-3.1-70B-Instruct (FC) and Llama-3.1-70B-Instruct (Prompt)) is only in the formatting of the prompt, while the FC version should instead use the tools argument of the request.

@Aktsvigun
Author

To shed more light: as I said, the FC and Prompt versions of a model currently differ only in how the prompt is formatted. This is also quite odd and most likely explains the gap between the two versions (e.g. Llama-70B-Prompt ranks 30, while Llama-70B-FC ranks 76): the FC version has less "meaningful" information in the prompt (it only has a few instructions) compared to the Prompt version. Comparing them below (the FC version's prompt first, then the Prompt version's):

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Environment: ipython
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.

Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.Do not use variables.

{
    "name": "soccer.get_last_match",
    "description": "Retrieve the details of the last match played by a specified soccer club. Note that the provided function is in Python 3 syntax.",
    "parameters": {
        "type": "dict",
        "properties": {
            "team_name": {
                "type": "string",
                "description": "The name of the soccer club."
            },
            "include_stats": {
                "type": "boolean",
                "description": "If true, include match statistics like possession, shots on target etc. Default is false."
            }
        },
        "required": [
            "team_name"
        ]
    }
}

Get me the details of the last game played by Liverpool F.C. Include its statistics.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in composing functions. You are given a question and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the function can be used, point it out. If the given question lacks the parameters required by the function, also point it out.
You should only return the function calls in your response.

If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.

At each turn, your should try your best to complete the tasks requested by the user within the current turn. Continue to output functions to call until you have fulfilled the user's request to the best of your ability. Once you have no more functions to call, the system will consider the current turn complete and proceed to the next turn or task.

Here is a list of functions in JSON format that you can invoke.
[{'name': 'soccer.get_last_match', 'description': 'Retrieve the details of the last match played by a specified soccer club. Note that the provided function is in Python 3 syntax.', 'parameters': {'type': 'dict', 'properties': {'team_name': {'type': 'string', 'description': 'The name of the soccer club.'}, 'include_stats': {'type': 'boolean', 'description': 'If true, include match statistics like possession, shots on target etc. Default is false.'}}, 'required': ['team_name']}}]<|eot_id|><|start_header_id|>user<|end_header_id|>

Get me the details of the last game played by Liverpool F.C. Include its statistics.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Again, if I understand correctly, the FC version is expected to have fewer instructions, but it is also expected to use the model's native function-calling ability (the tools and tool_choice arguments). Since these arguments are not used, the FC versions end up with pathetic performance.

To support my claim even further, this rank gap between Llama's FC and Prompt versions contrasts with proprietary models: since this feature is correctly implemented for OpenAI models, GPT-4o-2024-08-06 ranks higher with FC than with Prompt (rank 2 vs rank 3).

@HuanzhiMao HuanzhiMao added the BFCL-General General BFCL Issue label Dec 31, 2024
@HuanzhiMao
Collaborator

HuanzhiMao commented Dec 31, 2024

Thanks for bringing this up! Actually, there are two ways to do inference for a locally hosted model using vllm/sglang.

The first approach, which is what you are referring to, is to use the Chat Completions endpoint (i.e., client.chat.completions.create()). You can refer to the vllm manual here. In this approach, the user passes the chat history in the messages field and all the function definitions in the tools field. The inference framework (vllm/sglang) then applies the model's chat template and constructs a well-formatted prompt. The "Tool Calling" section of the vllm manual also refers to this method (https://docs.vllm.ai/en/latest/usage/tool_calling.html).
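For concreteness, here is a minimal sketch of this first approach against a locally hosted vllm OpenAI-compatible server; the server command, URL, model name, and tool definition are illustrative assumptions, not BFCL's handler code:

from openai import OpenAI

# vllm's OpenAI-compatible server, e.g. started with:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-auto-tool-choice --tool-call-parser llama3_json
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "soccer_get_last_match",
            "description": "Retrieve the details of the last match played by a specified soccer club.",
            "parameters": {
                "type": "object",
                "properties": {"team_name": {"type": "string", "description": "The name of the soccer club."}},
                "required": ["team_name"],
            },
        },
    }
]
messages = [{"role": "user", "content": "Get me the details of the last game played by Liverpool F.C."}]

# The server applies the model's chat template (tool section included) and parses any tool calls.
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=messages,
    tools=tools,
)
print(response.choices[0].message.tool_calls)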

The second approach is to use the Completions endpoint (i.e., client.completions.create()). You can refer to the vllm manual here. The user needs to manually apply the chat template and construct the formatted prompt string themselves, and then feed that into the prompt field. You can see that this endpoint only takes the prompt field, with no messages or tools. Assuming the user constructs the prompt correctly, this approach leads to the exact same model output as the first approach.
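And a corresponding sketch of the second approach, where the chat template is applied client-side and only a prompt string is sent. Again this is illustrative: it assumes a recent transformers version whose apply_chat_template accepts a tools argument, which is not necessarily how BFCL constructs its prompts:

from openai import OpenAI
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tokenizer = AutoTokenizer.from_pretrained(model_id)

tools = [
    {
        "type": "function",
        "function": {
            "name": "soccer_get_last_match",
            "description": "Retrieve the details of the last match played by a specified soccer club.",
            "parameters": {
                "type": "object",
                "properties": {"team_name": {"type": "string", "description": "The name of the soccer club."}},
                "required": ["team_name"],
            },
        },
    }
]
messages = [{"role": "user", "content": "Get me the details of the last game played by Liverpool F.C."}]

# Apply the model's chat template (tool definitions included) to build the formatted prompt string.
formatted_prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)

# The plain Completions endpoint takes only `prompt`; there are no messages or tools fields.
response = client.completions.create(
    model=model_id,
    prompt=formatted_prompt,
    temperature=0,
)
print(response.choices[0].text)

With the same chat template, both sketches should yield the same formatted prompt and therefore the same model output.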

We currently use the second approach. It might look like the tools are only inserted into the prompt, but that is by design: this method gives us full control over the final formatted prompt and is generally recommended for advanced use cases. In other words, there is no actual bug here; it is simply a matter of whether you want to leverage the built-in chat template (approach #1) or manage your own prompt template (approach #2). Both approaches work correctly; ours just uses the second for finer control.

On another note, you can verify that our open-source model inference pipeline is correct by printing out the formatted_prompt in the base_oss_handler/_query_prompting method, right before it hits self.client.completions.create. Take Llama 3.1 as an example: the Meta team has an example prompt in their model card (link) for function calling, and below is what our pipeline produces. You can see that our formatted prompt is structurally correct; the only differences are in the function doc and the user question, which is expected.

formatted prompt
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Environment: ipython
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.

Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.Do not use variables.

{
    "name": "calculate_em_force",
    "description": "Calculate the induced electromagnetic force based on Faraday's Law of Electromagnetic Induction, given the magnetic field (in Tesla), change in magnetic field area (in square meters), and the change in time (in seconds). Note that the provided function is in Python 3 syntax.",
    "parameters": {
        "type": "dict",
        "properties": {
            "b_field": {
                "type": "integer",
                "description": "The magnetic field in Tesla."
            },
            "area": {
                "type": "integer",
                "description": "The change in area of magnetic field in square meters."
            },
            "d_time": {
                "type": "integer",
                "description": "The change in time in seconds."
            }
        },
        "required": [
            "b_field",
            "area",
            "d_time"
        ]
    }
}

Calculate the induced electromagnetic force for a magnetic field of 5 Tesla, area of 2 square meters and change in time of 4 seconds, then repeat with a change in time of 10 seconds.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

@Aktsvigun
Author

Huanzhi, thank you for your comprehensive response!
I didn't know about the manual insertion, thanks! However, based on your description, it seems this feature is indeed implemented incorrectly (I don't know whether it's on your side or on the vllm side):

  • FC should work better (this is the feature designed specifically for function calling), but it instead works significantly worse.
  • With FC, a larger model performs worse (Llama-70B-FC is one rank below Llama-8B-FC), which is also strange.
  • Proprietary models, e.g. OpenAI models, which do use the tools argument, behave as expected: their FC versions work better than their Prompt versions.

To prove this point, I'll launch the benchmark for these models using chat.completions with the tools argument. Will keep you updated!

@HuanzhiMao
Collaborator

HuanzhiMao commented Dec 31, 2024

All the model responses we obtained are available here. The error logs there might give you some insights as to why the FC mode sometimes performs worse than the prompting mode.

For Llama models specifically, their FC mode seems to suffer from parameter type errors (e.g., outputting a string when it should be an integer), while their prompting mode has far fewer such issues.

For proprietary models, not all FC modes work better than their prompting modes. For example, the latest o1's prompting mode is significantly better than its FC counterpart (rank #6 vs rank #36).

@Aktsvigun
Author

Thank you for the feedback!

I have just run this (chat.completions with the tools argument) on all non-live tasks with Llama-3.1-8B. The results are much higher than the current ones:

"Model": "Meta Llama-3.1-8B-Instruct (Prompt)",
"Cost ($ Per 1k Function Calls)": NaN,
"Latency Mean (s)": NaN,
"Latency Standard Deviation (s)": NaN,
"Latency 95th Percentile (s)": NaN,
"Non-Live AST Acc": NaN,
"Non-Live Simple AST": NaN,
"Non-Live Multiple AST": 64.00% vs 54.00%
"Non-Live Parallel AST": 60.00% vs 48.5%
"Non-Live Parallel Multiple AST": 47.00% vs 34.5%
"Non-Live Exec Acc": 64.09% vs 50.18%
"Non-Live Simple Exec": 58.36% vs 58.71%
"Non-Live Multiple Exec": 84.00% vs 58%
"Non-Live Parallel Exec": 74.00% vs 54%
"Non-Live Parallel Multiple Exec": 40.00% vs 30%

I'm not 100% sure I correctly modified every component of the benchmark (since the benchmark is currently built around the completions endpoint), so there may be even more room for improvement.

I'll be on vacation until 12.01, so I'll leave you to discuss internally whether you feel this should be fixed (because it does look like a bug). If you want the fix, I'll be happy to help after the vacation. Happy New Year! 🎄

@HuanzhiMao
Collaborator


Where did you get these numbers? The numbers we obtained using the completions endpoint (in #800, the ones currently on the leaderboard) are much higher. For example, Non-Live Multiple AST should be at 93.5% for Llama-3.1-8B-Instruct (Prompt).

@Aktsvigun
Author

Apologies for not making that clear. I compared against Llama-3.1-8B-Instruct (FC), since the bug is in the function-calling implementation (I guess everything is fine with the Prompt versions given their high quality). I took the numbers from the leaderboard (rank 84 as of 01.01).
