Evaluators can access all columns #1606

Merged
merged 76 commits into main from access-to-all-columns on May 31, 2024
Changes from 5 commits
Commits
7145533
access to all columns
aakrem May 2, 2024
d3d3412
fix correct answer in ai critique
aakrem May 2, 2024
e4a481f
fix tests
aakrem May 2, 2024
6c09e6e
add correct answer to evaluators
aakrem May 3, 2024
2a62d29
get correct answer from evaluator instead of evaluation payload
aakrem May 3, 2024
ae58b0e
access correct answer directly in the evaluators and handle passing m…
aakrem May 3, 2024
d54f20d
add default value for correct_answer and small renaming
aakrem May 3, 2024
c9c212a
adjust schema
aakrem May 9, 2024
be1ee39
add as many correct answer columns as there are in an ES
aakrem May 9, 2024
e18c3a4
adjust correct answer type in frontend
aakrem May 9, 2024
ed2d413
small build fix
aakrem May 9, 2024
2b131d2
fix build
aakrem May 9, 2024
519e99d
handle multiple correct answers
aakrem May 9, 2024
68f626e
revert to single ground truth
aakrem May 9, 2024
ac6bf19
toggle correct answer input visibility
bekossy May 9, 2024
d2e1a36
rename correct_answer to value
aakrem May 9, 2024
07e0e86
migration script
aakrem May 9, 2024
fc30f52
added antd collapse to toggle correct_answer input
bekossy May 9, 2024
7e24866
fix evaluators tests
aakrem May 9, 2024
bab7582
select ground truth to apply diff in eval scenario view
bekossy May 9, 2024
0f67bd3
display only unique correct_answers
aakrem May 9, 2024
c01f6b2
filtered out duplicate keys from correctAnswer array
bekossy May 10, 2024
cb3b08f
added filtercolumns component and improve table headername display
bekossy May 10, 2024
725cf8d
Merge branch 'main' into access-to-all-columns
bekossy May 10, 2024
00529d0
bug fix
bekossy May 10, 2024
398dfe8
bug fix
bekossy May 10, 2024
89faad9
added dropdown diff and cleanup
bekossy May 11, 2024
14e5eb4
made static onClick prop dynamic and improve diff feature
bekossy May 12, 2024
e152c51
added ground truth column to comparison view and improved diff feature
bekossy May 13, 2024
0495376
fixed correct answer output
bekossy May 13, 2024
d879572
added helper to remove correctAnswer prefix and improved dropdown def…
bekossy May 14, 2024
3ebabb3
rename variable
aakrem May 14, 2024
975b01f
Merge pull request #1645 from Agenta-AI/sub-issue/-improve-eval-compa…
aakrem May 14, 2024
fc8ae62
improved diff button text
bekossy May 15, 2024
90f7380
access to all columns
aakrem May 2, 2024
34f2436
Merge branch 'main' into access-to-all-columns
aakrem May 15, 2024
14c8ee2
small refactor for correct answers logic
aakrem May 15, 2024
7424be3
fix errors type
aakrem May 15, 2024
95b82ef
convert correct_answer_keys to list
aakrem May 16, 2024
1d3d9d8
improve type
aakrem May 16, 2024
87c1d81
access to all columns
aakrem May 2, 2024
a706a69
Merge branch 'main' into access-to-all-columns
aakrem May 17, 2024
b5300df
add default correct answer in case its not provided
aakrem May 17, 2024
1002198
advanced settings in a separate component
aakrem May 17, 2024
7624354
bug fix
bekossy May 18, 2024
f4e2d47
Merge pull request #1665 from Agenta-AI/access-to-all-columns-advance…
aakrem May 19, 2024
f610819
fix backend tests
aakrem May 19, 2024
0aa62e1
create direct_use evaluators with default correct answers
aakrem May 19, 2024
27f8a0a
remove not needed code
aakrem May 19, 2024
dc8ed51
Add condition to evaluator card to show action buttons when direct_us…
bekossy May 19, 2024
ffb3c2d
filtered out evaluators when direct_use is true or settings_template …
bekossy May 20, 2024
5811770
Modify the evaluator definition for correct answer key
mmabrouk May 28, 2024
ca309b5
Refactored the evaluator service to use specific correct_answers
mmabrouk May 28, 2024
1f14f0e
Show the advanced settings under a hidden collapse
mmabrouk May 28, 2024
7e11adc
Made the code more secure by removing the global().get which would al…
mmabrouk May 28, 2024
81f9b12
Improved the logic to use a correct_answer as a ground truth column i…
mmabrouk May 28, 2024
20d54ab
rewrote logic for creating ready to use evaluators
mmabrouk May 28, 2024
e6dcfd3
Allow editing ready to use evaluators
mmabrouk May 28, 2024
9faa400
allow the addition of ready to use evaluators
mmabrouk May 28, 2024
df16e88
Fixed evaluators definition
mmabrouk May 28, 2024
87b0c2f
minor fix
mmabrouk May 28, 2024
a0d0420
updated pyproject
mmabrouk May 28, 2024
1a6e4ce
Added auto similarity
mmabrouk May 28, 2024
6b44f5c
formatting
mmabrouk May 28, 2024
3fba85d
updated docker
mmabrouk May 28, 2024
111c2e5
fix levenshtein test
mmabrouk May 28, 2024
7b69bec
t
mmabrouk May 28, 2024
31d7c88
fix the test
mmabrouk May 28, 2024
ea6919b
improved tests
mmabrouk May 28, 2024
f5ca784
remove comment
mmabrouk May 29, 2024
c78856a
improved label
mmabrouk May 29, 2024
7bffe15
fixed correct_answer_key payload
bekossy May 29, 2024
d285754
cleanup
bekossy May 29, 2024
2080680
Merge pull request #1711 from Agenta-AI/fix-all-columns
mmabrouk May 30, 2024
d6132bc
Merge branch 'main' into access-to-all-columns
mmabrouk May 31, 2024
b9dde52
update lock
mmabrouk May 31, 2024
45 changes: 24 additions & 21 deletions agenta-backend/agenta_backend/resources/evaluators/evaluators.py
@@ -2,11 +2,16 @@
{
"name": "Exact Match",
"key": "auto_exact_match",
"direct_use": True,
"direct_use": False,
"settings_template": {
"label": "Exact Match Settings",
"description": "Settings for the Exact Match evaluator",
"correct_answer": {
"label": "Correct Answer",
"type": "string",
},
},
"description": "Exact Match evaluator determines if the output exactly matches the specified correct answer, ensuring precise alignment with expected results.",
},
{
"name": "Contains Json",
@@ -31,7 +36,11 @@
"min": 0,
"max": 1,
"required": True,
}
},
"correct_answer": {
"label": "Correct Answer",
"type": "string",
},
},
"description": "Similarity Match evaluator checks if the generated answer is similar to the expected answer. You need to provide the similarity threshold. It uses the Jaccard similarity to compare the answers.",
},
@@ -67,7 +76,11 @@
"default": "",
"description": "The name of the field in the JSON output that you wish to evaluate",
"required": True,
}
},
"correct_answer": {
"label": "Correct Answer",
"type": "string",
},
},
"description": "JSON Field Match evaluator compares specific fields within JSON (JavaScript Object Notation) data. This matching can involve finding similarities or correspondences between fields in different JSON objects.",
},
@@ -112,27 +125,13 @@
"description": "https://your-webhook-url.com",
"required": True,
},
"correct_answer": {
"label": "Correct Answer",
"type": "string",
},
},
"description": "Webhook test evaluator sends the generated answer and the correct_answer to a webhook and expects a response indicating the correctness of the answer. You need to provide the URL of the webhook and the response of the webhook must be between 0 and 1.",
},
{
"name": "A/B Test",
"key": "human_a_b_testing",
"direct_use": False,
"settings_template": {
"label": "A/B Testing Settings",
"description": "Settings for A/B testing configurations",
},
},
{
"name": "Single Model Test",
"key": "human_single_model_test",
"direct_use": False,
"settings_template": {
"label": "Single Model Testing Settings",
"description": "Settings for single model testing configurations",
},
},
{
"name": "Starts With",
"key": "auto_starts_with",
@@ -245,6 +244,10 @@
"label": "Levenshtein Distance Settings",
"description": "Evaluates the Levenshtein distance between the output and the correct answer. If a threshold is specified, it checks if the distance is below this threshold and returns a boolean value. If no threshold is specified, it returns the numerical Levenshtein distance.",
"threshold": {"label": "Threshold", "type": "number", "required": False},
"correct_answer": {
"label": "Correct Answer",
"type": "string",
},
},
"description": "This evaluator calculates the Levenshtein distance between the output and the correct answer. If a threshold is provided in the settings, it returns a boolean indicating whether the distance is within the threshold. If no threshold is provided, it returns the actual Levenshtein distance as a numerical value.",
},
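
The recurring change in this file is that each auto evaluator which compares against a ground truth now declares a correct_answer entry in its settings_template, so the user can point it at any testset column. A minimal sketch of such an entry after the change, trimmed for illustration rather than copied verbatim from the file:

{
    "name": "Exact Match",
    "key": "auto_exact_match",
    "direct_use": False,
    "settings_template": {
        "label": "Exact Match Settings",
        "description": "Settings for the Exact Match evaluator",
        # New setting: names the testset column that holds the ground truth.
        "correct_answer": {
            "label": "Correct Answer",
            "type": "string",
        },
    },
},
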
6 changes: 0 additions & 6 deletions agenta-backend/agenta_backend/routers/evaluation_router.py
@@ -119,11 +119,6 @@ async def create_evaluation(
return response

evaluations = []
correct_answer_column = (
"correct_answer"
if payload.correct_answer_column is None
else payload.correct_answer_column
)

for variant_id in payload.variant_ids:
evaluation = await evaluation_service.create_new_evaluation(
@@ -141,7 +136,6 @@
evaluation_id=evaluation.id,
rate_limit_config=payload.rate_limit.dict(),
lm_providers_keys=payload.lm_providers_keys,
correct_answer_column=correct_answer_column,
)
evaluations.append(evaluation)

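
With this change the router no longer computes a fallback correct_answer_column and no longer forwards it to the evaluation task; the ground-truth column name travels inside each evaluator configuration instead. A rough sketch of the idea, not actual router code, with placeholder values:

# Before this PR (sketch): one column name was shared by the whole evaluation.
payload_before = {"correct_answer_column": "correct_answer"}  # field removed by this PR
# After this PR (sketch): each evaluator config names its own ground-truth column.
evaluator_settings_after = {"correct_answer": "correct_answer"}
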
68 changes: 43 additions & 25 deletions agenta-backend/agenta_backend/services/evaluators_service.py
@@ -1,7 +1,7 @@
import re
import json
import httpx
from typing import Any, Dict, Tuple
from typing import Any, Dict, Tuple, List

from agenta_backend.services.security import sandbox
from agenta_backend.models.db_models import Error, Result
@@ -18,13 +18,14 @@
def auto_exact_match(
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
) -> Result:
try:
exact_match = True if output == correct_answer else False
exact_match = True if output == data_point[correct_answer_key] else False
result = Result(type="bool", value=exact_match)
return result
except Exception as e:
@@ -40,14 +41,15 @@ def auto_exact_match(
def auto_similarity_match(
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
) -> Result:
try:
set1 = set(output.split())
set2 = set(correct_answer.split())
set2 = set(data_point[correct_answer_key].split())
intersect = set1.intersection(set2)
union = set1.union(set2)

@@ -72,7 +74,8 @@ def auto_similarity_match(
def auto_regex_test(
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
@@ -96,14 +99,17 @@ def auto_regex_test(
def field_match_test(
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
) -> Result:
try:
output_json = json.loads(output)
result = output_json[settings_values["json_field"]] == correct_answer
result = (
output_json[settings_values["json_field"]] == data_point[correct_answer_key]
)
return Result(type="bool", value=result)
except Exception as e:
logging.debug("Field Match Test Failed because of Error: " + str(e))
@@ -113,15 +119,16 @@ def field_match_test(
def auto_webhook_test(
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
) -> Result:
try:
with httpx.Client() as client:
payload = {
"correct_answer": correct_answer,
"correct_answer": data_point[correct_answer_key],
"output": output,
"inputs": inputs,
}
@@ -168,7 +175,8 @@ def auto_webhook_test(
def auto_custom_code_run(
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
@@ -178,7 +186,7 @@
app_params=app_params,
inputs=inputs,
output=output,
correct_answer=correct_answer,
data_point=data_point,
code=settings_values["code"],
)
return Result(type="number", value=result)
@@ -195,7 +203,8 @@ def auto_custom_code_run(
def auto_ai_critique(
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
@@ -206,7 +215,7 @@
Args:
inputs (Dict[str, Any]): Input parameters for the LLM app variant.
output (str): The output of the LLM app variant.
correct_answer (str): Correct answer for evaluation.
correct_answer_key (str): The key name of the correct answer in the datapoint.
app_params (Dict[str, Any]): Application parameters.
settings_values (Dict[str, Any]): Settings for the evaluation.
lm_providers_keys (Dict[str, Any]): Keys for language model providers.
@@ -224,7 +233,7 @@ def auto_ai_critique(
chain_run_args = {
"llm_app_prompt_template": app_params.get("prompt_user", ""),
"variant_output": output,
"correct_answer": correct_answer,
"correct_answer": data_point[correct_answer_key],
}

for key, value in inputs.items():
@@ -252,7 +261,8 @@
def auto_starts_with(
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
@@ -280,7 +290,8 @@ def auto_starts_with(
def auto_ends_with(
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
@@ -306,7 +317,8 @@ def auto_ends_with(
def auto_contains(
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
@@ -332,7 +344,8 @@ def auto_contains(
def auto_contains_any(
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
@@ -363,7 +376,8 @@ def auto_contains_any(
def auto_contains_all(
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
@@ -394,7 +408,8 @@ def auto_contains_all(
def auto_contains_json(
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
@@ -444,13 +459,14 @@ def levenshtein_distance(s1, s2):
def auto_levenshtein_distance(
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
) -> Result:
try:
distance = levenshtein_distance(output, correct_answer)
distance = levenshtein_distance(output, data_point[correct_answer_key])

if "threshold" in settings_values:
threshold = settings_values["threshold"]
@@ -474,7 +490,8 @@ def evaluate(
evaluator_key: str,
inputs: Dict[str, Any],
output: str,
correct_answer: str,
data_point: Dict[str, Any],
correct_answer_key: str,
app_params: Dict[str, Any],
settings_values: Dict[str, Any],
lm_providers_keys: Dict[str, Any],
@@ -486,7 +503,8 @@
return evaluation_function(
inputs,
output,
correct_answer,
data_point,
correct_answer_key,
app_params,
settings_values,
lm_providers_keys,
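
Summing up the service changes: every evaluator function, and the evaluate dispatcher, now receives the whole data_point plus a correct_answer_key naming the ground-truth column, instead of a pre-extracted correct_answer string. A hypothetical call against the new signature, with invented column names and values:

from agenta_backend.services.evaluators_service import evaluate  # module shown in this diff

# The data point is passed whole; the evaluator reads the column named by correct_answer_key.
data_point = {
    "country": "France",        # a regular testset input column (illustrative)
    "correct_answer": "Paris",  # the ground-truth column (illustrative)
}

result = evaluate(
    evaluator_key="auto_exact_match",
    inputs={"country": "France"},
    output="Paris",
    data_point=data_point,
    correct_answer_key="correct_answer",
    app_params={},
    settings_values={},
    lm_providers_keys={},
)
# With the diff above, auto_exact_match compares output to
# data_point["correct_answer"] and returns Result(type="bool", value=True).
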
9 changes: 6 additions & 3 deletions agenta-backend/agenta_backend/tasks/evaluations.py
@@ -71,7 +71,6 @@ def evaluate(
evaluation_id: str,
rate_limit_config: Dict[str, int],
lm_providers_keys: Dict[str, Any],
correct_answer_column: str,
):
"""
Evaluate function that performs the evaluation of an app variant using the provided evaluators and testset.
@@ -214,14 +213,18 @@
continue

# 3. We evaluate
evaluators_results: [EvaluationScenarioResult] = []
evaluators_results: List[EvaluationScenarioResult] = []
for evaluator_config_db in evaluator_config_dbs:
logger.debug(f"Evaluating with evaluator: {evaluator_config_db}")
correct_answer_column = evaluator_config_db.settings_values.get(
"correct_answer"
)
if correct_answer_column in data_point:
result = evaluators_service.evaluate(
evaluator_key=evaluator_config_db.evaluator_key,
output=app_output.result.value,
correct_answer=data_point[correct_answer_column],
data_point=data_point,
correct_answer_key=correct_answer_column,
settings_values=evaluator_config_db.settings_values,
app_params=app_variant_parameters,
inputs=data_point,
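
This loop is what makes the PR title literal: each evaluator configuration resolves its own ground-truth column from its settings, and a scenario is only evaluated when its data point actually contains that column. A simplified sketch under the assumption that configs are plain dicts (the real code reads evaluator_config_db.settings_values; column names and values are invented):

evaluator_configs = [
    {"evaluator_key": "auto_exact_match",
     "settings_values": {"correct_answer": "correct_answer"}},
    {"evaluator_key": "auto_similarity_match",
     "settings_values": {"correct_answer": "reference_answer"}},  # other settings omitted
]
data_point = {"country": "France", "correct_answer": "Paris", "reference_answer": "Paris, France"}

for config in evaluator_configs:
    correct_answer_column = config["settings_values"].get("correct_answer")
    if correct_answer_column in data_point:  # skip scenarios missing the column
        # evaluators_service.evaluate(..., data_point=data_point,
        #                             correct_answer_key=correct_answer_column, ...)
        pass
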