Issue about tool call metrics. #1658
Update: [bug report] I'm now sure this is an existing bug. Below are several absurd samples together with the erroneous evaluation scores they receive.

```python
import asyncio

from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall
from ragas.metrics import ToolCallAccuracy

scorer = ToolCallAccuracy()

# Sample 1: "Shanghai" is called twice and "New York" is never called.
sample = [
    HumanMessage(content="What's the weather like in New York and Shanghai right now?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]),
]
sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0

# Sample 2: the same two calls as the reference, in a different order.
sample = [
    HumanMessage(content="What's the match schedule and weather in LA this weekend?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Los Angeles"}),
        ToolCall(name="match_check", args={"location": "Los Angeles"})
    ]),
]
sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="match_check", args={"location": "Los Angeles"}),
        ToolCall(name="weather_check", args={"location": "Los Angeles"})
    ]
)
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 0.0

# Sample 3: also reordered, but both calls share one tool name.
sample = [
    HumanMessage(content="What's the weather like in New York and Shanghai right now?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"}),
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
]
sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0

# Sample 4: two turns, each answered with the other city's location.
sample = [
    HumanMessage(content="What's the weather like in New York?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]),
    HumanMessage(content="So what about Shanghai?"),
    AIMessage(content="...", tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"})
    ]),
]
sample = MultiTurnSample(
    user_input=sample,
    reference_tool_calls=[
        ToolCall(name="weather_check", args={"location": "New York"}),
        ToolCall(name="weather_check", args={"location": "Shanghai"})
    ]
)
result = asyncio.run(scorer.multi_turn_ascore(sample))
# 1.0
```
Hey @MattZ-99 this is incredibly useful.
Hi all, I was surprised to see the ToolUse metrics in Ragas v0.2.
I had already implemented several useful metrics on top of the base `Metric` class. Now that the ToolUse section has been updated, I'd be glad to contribute them to the open-source project.
As I'm new here, I'd like to have a discussion before opening a pull request.
Parallel function calling is now a common feature of current LLMs, where `parallel` means multiple independent, unordered function calls. For example, the Berkeley Function-Calling Leaderboard provides a dedicated "parallel" category.
Consider a parallel tool call that requests two pieces of information: either ordering of the tool_calls should be acceptable, but the current ToolCallAccuracy only supports ordered tool_calls.
That is, `ToolCallAccuracy` is not suitable for parallel calling.
As for the solution, I have two ideas; which one do you think is more flexible?
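To make the parallel-calling requirement concrete, here is a minimal sketch of an order-insensitive comparison. It is only an illustration under stated assumptions, not Ragas code: tool calls are modelled as plain `(name, args)` tuples, and `call_key` and `unordered_tool_call_accuracy` are hypothetical names.

```python
import json
from collections import Counter


def call_key(name, args):
    """Hashable key for one tool call (hypothetical helper, not a Ragas API)."""
    # Sorting the keys makes the key independent of argument order.
    return (name, json.dumps(args, sort_keys=True))


def unordered_tool_call_accuracy(pred_calls, ref_calls):
    """Fraction of reference calls matched by a predicted call, ignoring order.

    Multiset matching means a duplicated prediction cannot cover two
    different reference calls.
    """
    if not ref_calls:
        return 0.0
    pred = Counter(call_key(name, args) for name, args in pred_calls)
    ref = Counter(call_key(name, args) for name, args in ref_calls)
    matched = sum((pred & ref).values())  # multiset intersection
    return matched / len(ref_calls)


ref = [
    ("weather_check", {"location": "New York"}),
    ("weather_check", {"location": "Shanghai"}),
]
swapped = [ref[1], ref[0]]     # same calls, different order
duplicated = [ref[1], ref[1]]  # "Shanghai" twice, "New York" missing

print(unordered_tool_call_accuracy(swapped, ref))     # 1.0
print(unordered_tool_call_accuracy(duplicated, ref))  # 0.5
```

Note that this sketch does not penalize extra predicted calls; whether it should is a design choice left open here.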
Besides, I also have a small concern about the current `ToolCallAccuracy`.
Since `tool_call_pred` and `tool_call_pred` are already checked for alignment, `reference_tool_calls` and `pred_tool_calls` should not go through the double loop. This results in inconsistent scores.
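The effect of the double loop can be illustrated with a simplified model. This is a sketch reconstructed from the description in this issue, not the Ragas source: calls are plain `(name, args)` tuples, the alignment check is assumed to be a position-by-position comparison of tool names, and `arg_score`, `double_loop_score`, and `pairwise_score` are hypothetical names.

```python
def arg_score(pred_args, ref_args):
    """Fraction of reference arguments reproduced exactly (assumed scoring)."""
    if not ref_args:
        return 0.0
    hits = sum(1 for key, value in ref_args.items() if pred_args.get(key) == value)
    return hits / len(ref_args)


def names_aligned(pred_calls, ref_calls):
    """Assumed alignment check: tool names must match position by position."""
    return [name for name, _ in pred_calls] == [name for name, _ in ref_calls]


def double_loop_score(pred_calls, ref_calls):
    """Compare every reference call against every predicted call."""
    if not ref_calls or not names_aligned(pred_calls, ref_calls):
        return 0.0
    total = 0.0
    for ref_name, ref_args in ref_calls:
        for pred_name, pred_args in pred_calls:
            if ref_name == pred_name:
                total += arg_score(pred_args, ref_args)
    return total / len(ref_calls)


def pairwise_score(pred_calls, ref_calls):
    """Compare each reference call only with the prediction at the same position."""
    if not ref_calls or not names_aligned(pred_calls, ref_calls):
        return 0.0
    total = sum(
        arg_score(pred_args, ref_args)
        for (_, pred_args), (_, ref_args) in zip(pred_calls, ref_calls)
    )
    return total / len(ref_calls)


ref = [
    ("weather_check", {"location": "New York"}),
    ("weather_check", {"location": "Shanghai"}),
]
duplicated = [ref[1], ref[1]]  # "Shanghai" twice, "New York" missing

print(double_loop_score(duplicated, ref))  # 1.0
print(pairwise_score(duplicated, ref))     # 0.5
```

In this model the double loop counts the single correct "Shanghai" prediction once per duplicate, so a missing "New York" call still yields a full score, while the position-wise comparison does not.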