[BFCL] Bug with open-source models #861
Comments
Some numbers to support my claim that "FC models provide lower performance than their prompt analogs":
The larger Llama (70B) scores even lower than its smaller 8B analog with FC, which is completely absurd imo.
Hey @Aktsvigun, …
Hey @HuanzhiMao, sure! See gorilla/berkeley-function-call-leaderboard/bfcl/model_handler/proprietary_model/openai.py, line 72 (commit 0cea216): you can see the tools (functions) are passed inside the tools argument.
For OSS models, however, the function(s) are always simply inserted into the prompt; see gorilla/berkeley-function-call-leaderboard/bfcl/model_handler/oss_model/base_oss_handler.py, line 298 (commit 0cea216). The current difference between FC and prompt modes is only in how the prompt is formatted; neither mode passes the functions through the tools argument of the request.
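For context, here is a minimal sketch (not the repository's handler code) of what passing functions natively through the tools argument looks like with the OpenAI Python client; the get_weather schema and the model name are invented for illustration:

```python
# Sketch of native function calling via the `tools` argument
# (illustrative only; the function schema below is made up).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o",  # any FC-capable model; the name is illustrative
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,  # functions are passed natively, not pasted into the prompt
)
print(response.choices[0].message.tool_calls)
```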
To shed more light: as I said, … Again, if I understand it correctly, the … To support my claim even further, such a difference in ranks between Llama's …
Thanks for bringing this up! Actually, there are two ways to do inference for a locally hosted model using vllm/sglang.

The first approach, which is what you are referring to, is using the Chat Completion endpoint, where the server applies the model's built-in chat template.

The second approach is using the Completion endpoint. We currently use the second approach. It might look like the tools are only inserted in the prompt, but that is by design: this method gives us full control over the final formatted prompt and is generally recommended for advanced use cases.

In other words, there's no actual bug here; it's simply a matter of whether you want to leverage the built-in chat template (approach #1) or manage your own prompt template (approach #2). Both approaches work correctly; ours just uses the second for finer control.

On another note, you can verify that our open-source model inference pipeline is correct by printing out the formatted prompt.
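For concreteness, here is a minimal sketch of the two routes, assuming a vLLM OpenAI-compatible server running at localhost:8000 and serving meta-llama/Meta-Llama-3-8B-Instruct (the URL and model name are assumptions, not BFCL configuration):

```python
# Two ways to query a locally hosted model behind an OpenAI-compatible server.
from openai import OpenAI
from transformers import AutoTokenizer

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
messages = [{"role": "user", "content": "What's the weather in Berlin?"}]

# Approach 1: Chat Completion endpoint -- the server applies the model's
# built-in chat template for you.
chat_resp = client.chat.completions.create(model=model_id, messages=messages)
print(chat_resp.choices[0].message.content)

# Approach 2: Completion endpoint -- you format the prompt yourself, which
# gives full control over the final string the model sees.
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
comp_resp = client.completions.create(model=model_id, prompt=prompt, max_tokens=256)
print(comp_resp.choices[0].text)
```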
Huanzhi, thank you for your comprehensive response!
To prove this point, I'll launch the benchmark for these models using the …
All the model responses we obtained are available here. The error logs there might give you some insights as to why the FC mode sometimes performs worse than the prompting mode. For Llama models specifically, their FC mode seems to suffer from parameter type errors (e.g., outputting a string when it should be an integer), while their prompting mode has far fewer such issues. For proprietary models, not all FC modes work better than their prompting modes. For example, the latest …
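To illustrate the kind of parameter type error described above, here is a toy checker (not BFCL's actual evaluator); the schema and the model output are invented for the example:

```python
# Toy illustration of a parameter type error: the schema expects an integer,
# but the model emitted the digits as a string.
schema = {"year": {"type": "integer"}, "city": {"type": "string"}}
model_output = {"year": "2024", "city": "Berlin"}  # "2024" should be 2024

PY_TYPES = {"integer": int, "string": str, "number": float, "boolean": bool}

def type_errors(args: dict, spec: dict) -> list[str]:
    errors = []
    for name, value in args.items():
        expected = PY_TYPES[spec[name]["type"]]
        # bool is a subclass of int, so reject booleans where integers are expected
        if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            errors.append(f"{name}: expected {spec[name]['type']}, got {type(value).__name__}")
    return errors

print(type_errors(model_output, schema))  # ['year: expected integer, got str']
```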
Thank you for the feedback! I have just run the …
I'm not 100% sure I correctly modified each component of the benchmark (since the benchmark is aimed at using the …). I'll be on vacation until 12.01, so I'll leave you to discuss internally whether you feel this should be fixed (because it's apparently a bug). If you need the fix, I'll be happy to help after the vacation. Happy New Year! 🎄
Where did you get these numbers? The number we obtained using the …
Apologies for not making it clear. I compared with …
Describe the issue
Hey team, I noticed a critical bug in the open-source models' evaluation. Precisely, they never use the tools / response_format / guided_json arguments, even for FC (models with native function-calling support). As a result, FC models provide lower performance than their prompt analogs (which is absurd: this feature should improve the performance but in fact decreases it). If you agree, I'd be happy to make a PR.
Suggestion
Use the tools / response_format arguments for the FC models.
ID datapoint
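As a rough sketch of what this suggestion could look like for locally hosted models: recent vLLM OpenAI-compatible servers accept guided-decoding extras such as guided_json via extra_body. The server URL, model name, and schema below are assumptions for illustration, not BFCL code, and support depends on the vLLM version:

```python
# Sketch: constraining a locally served model's output to a JSON schema via
# vLLM's guided decoding (extra_body / guided_json).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

answer_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "arguments": {"type": "object"},
    },
    "required": ["name", "arguments"],
}

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model name
    messages=[{"role": "user", "content": "Call the weather function for Berlin."}],
    extra_body={"guided_json": answer_schema},  # vLLM-specific extra parameter
)
print(resp.choices[0].message.content)
```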