Custom Code Evaluations #610

Merged (105 commits, Sep 17, 2023)

Commits
d5774bf
Update - added restrictedpython
aybruhm Sep 10, 2023
e2d57bd
Feat - created security module
aybruhm Sep 10, 2023
3cefed1
Feat - implemented execute_code_safely function
aybruhm Sep 10, 2023
7e74d5a
Feat - created custom evaluation db collection
aybruhm Sep 11, 2023
3e0ee16
Feat - created custom evaluation type and store custom evaluation api…
aybruhm Sep 11, 2023
ad4fe0d
Feat - implemented store and execute custom code evaluation logics
aybruhm Sep 11, 2023
f4bd4ea
Feat - implemented function to check if module import is safe to ensu…
aybruhm Sep 11, 2023
05107f1
Cleanup - remove app_name from execute_custom_code_evaluation
aybruhm Sep 11, 2023
182c812
Feat - implemented store and execute custom evaluation routers
aybruhm Sep 11, 2023
7e8e805
Update - added custom_code_run to evaluation type and labels
aybruhm Sep 11, 2023
1e6c7d2
Feat - upload custom_code image
aybruhm Sep 11, 2023
b9c705d
Feat - created store custom evaluation type interface
aybruhm Sep 11, 2023
45e3f0d
Feat - created type interface for store custom evaluation success res…
aybruhm Sep 11, 2023
37ab07f
Feat - implemented save custom code evaluation api logic
aybruhm Sep 11, 2023
db97338
Feat - implemented custom evaluation dropdown component
aybruhm Sep 11, 2023
296b461
Update - added type dropdown component
aybruhm Sep 11, 2023
e97c9f6
Feat - implemented custom python code component
aybruhm Sep 11, 2023
9219fb7
Refactor - renamed component prop interface
aybruhm Sep 11, 2023
998a0af
Feat - created type interface for single custom evaluation
aybruhm Sep 11, 2023
9949093
Feat - implemented axios logic to fetch custom evaluations
aybruhm Sep 11, 2023
7cc58a8
Update - improve security in sandbox environment
aybruhm Sep 11, 2023
acdd892
Cleanup - removed custom evaluation type embedded model and some fiel…
aybruhm Sep 11, 2023
e4e3fa2
Feat - implemented fetch custom evaluations evaluation service
aybruhm Sep 11, 2023
193f619
Feat - implemented list custom evaluations api router
aybruhm Sep 11, 2023
7b9987a
Feat - created custom evaluation output and added new type in evaluat…
aybruhm Sep 11, 2023
7619a9b
Update - modified custom evaluations dropdown component to set custom…
aybruhm Sep 12, 2023
0b79b68
Update - include custom python code and evaluation dropdowns componen…
aybruhm Sep 12, 2023
5670435
Refactor - removed custom_code.png
aybruhm Sep 12, 2023
7e708d7
Feat - created evaluation api model to execute custom evaluation code
aybruhm Sep 12, 2023
3b1a5b6
Feat - implemented custom code run evaluation page
aybruhm Sep 12, 2023
e23e92a
Feat - implemented helper function to include dynamic values
aybruhm Sep 12, 2023
b008773
Update - add condition to save correct_answer for cusutom_code evalua…
aybruhm Sep 12, 2023
459f400
Feat - created type interface for execute custom eval code
aybruhm Sep 12, 2023
dd13b63
Feat - implemented axios logic to execute custom evaluation code
aybruhm Sep 12, 2023
530f72d
Update - added optional field (correct_answer)
aybruhm Sep 12, 2023
3b10006
Feat - implemented fetch average score for custom code run result ser…
aybruhm Sep 12, 2023
9d7a87c
Update - modified fetch_results and execute_custom_evaluation routers
aybruhm Sep 12, 2023
5ba12ea
Cleanup - remove unused code-blocks
aybruhm Sep 12, 2023
06ee96f
Feat - implemented custom code run evaluation table component
aybruhm Sep 12, 2023
e2aa26a
:art: Format - ran format-fix and black
aybruhm Sep 12, 2023
ed19a31
Merge branch 'main' into gh/custom-code-evaluation-in-ui
aybruhm Sep 12, 2023
67972ea
Feat - created create custom evaluation page
aybruhm Sep 12, 2023
5f7c7b3
Update - removed custom python code in evaluation component
aybruhm Sep 12, 2023
a80fdac
Cleanup - formatted custom evaluations dropdown component
aybruhm Sep 12, 2023
772635b
Refactor - renamed saveCutomCodeEvaluation to saveCustomCodeEvaluation
aybruhm Sep 12, 2023
6422f06
Update - added new styles
aybruhm Sep 12, 2023
ba348fc
Update - introduce pre-filled example of an evaluation function and s…
aybruhm Sep 12, 2023
2d9e359
:art: Format - ran format-fix and black
aybruhm Sep 12, 2023
4c2a6e1
Update - added variant_name to type interface ExecuteCustomEvalCode
aybruhm Sep 12, 2023
82aaa8c
Update - added app_params, output to sandbox and allow execute of eva…
aybruhm Sep 12, 2023
d47950b
Update - added output to execute_custom_code_execution service function
aybruhm Sep 12, 2023
4eae667
Update - added app_name, variant_name, and outputs to execute_custom_…
aybruhm Sep 12, 2023
4b3e778
Refactor - modified executeCustomEvaluationCode axios api logic
aybruhm Sep 12, 2023
6c6bb84
Update - added styles for copy btn in custom python code component
aybruhm Sep 12, 2023
70a7a0b
Update - refactor evaluate function and added new args in callCUstomC…
aybruhm Sep 12, 2023
c3e14ed
Update - add custom code evaluation id to evaluation
aybruhm Sep 12, 2023
6075cf1
Update - retrieve evaluations for custom code evals
aybruhm Sep 12, 2023
71736a9
Update -added custom code evalation id to router push
aybruhm Sep 12, 2023
3df9720
Update - added custom_code_evaluation_id and made it optional
aybruhm Sep 12, 2023
25b9715
Update - added btn to copy code example for custom evaluation function
aybruhm Sep 12, 2023
1b0e4b7
Update - created format_outputs helper function
aybruhm Sep 12, 2023
0905050
Update - added correct_answer to execute custom evaluation code api m…
aybruhm Sep 12, 2023
087160c
Update - modified evaluation function example description
aybruhm Sep 12, 2023
4ae2c74
Update - modified fetch_average_score_for_custom_code_run
aybruhm Sep 12, 2023
17235cd
Feat - created update_evaluation_scenario_score logic and added doc s…
aybruhm Sep 12, 2023
ef95da8
Update - include correct_answer to custom eval code params
aybruhm Sep 12, 2023
ded6649
Feat - implemented update evaluation scenario score axios logic
aybruhm Sep 12, 2023
d7f2584
Feat - created evaluation scenario score update api model
aybruhm Sep 12, 2023
d2ae93a
Update - receive put data by payload instead of query
aybruhm Sep 12, 2023
b6a80b6
Update - modified custom code run evaluation table component
aybruhm Sep 12, 2023
175ef80
:art: Format - ran format-fix and black
aybruhm Sep 12, 2023
66fb647
Update - installed packages.json
aybruhm Sep 12, 2023
d4a33f7
Update - integrated ace editor for code input and syntax highlighting
aybruhm Sep 12, 2023
c01e81e
Update - set result and avg_score to 2 decimal places
aybruhm Sep 12, 2023
f9b8929
:art: Format - ran format-fix
aybruhm Sep 12, 2023
253777b
Cleanup - add ? to handle undefined error
aybruhm Sep 13, 2023
3a98f9f
:art: Format - ran format-fix
aybruhm Sep 13, 2023
b05fa01
Merge branch 'main' into gh/custom-code-evaluation-in-ui
aybruhm Sep 13, 2023
7ef2d6a
:art: Format - ran format-fix and black
aybruhm Sep 13, 2023
6d74e30
Cleanup - removed raise exception when no custom evaluations is found
aybruhm Sep 13, 2023
67442cb
Refactor - override error interceptor for get all variant parameters …
aybruhm Sep 14, 2023
ec1452d
Cleanup - removed console log
aybruhm Sep 14, 2023
1160122
Feat - created backend router to get evaluation scenario score and ax…
aybruhm Sep 14, 2023
df09373
Update - round score by 2 decimal
aybruhm Sep 14, 2023
a6e908b
Refactor - removed CustomEvaluationsDropdown component
aybruhm Sep 14, 2023
493d166
Refactor - improve get_evaluation_scenario_score_router
aybruhm Sep 14, 2023
e498683
Refactor - directly include dropdown select of custom evaluations
aybruhm Sep 14, 2023
89b579e
Update - added logic to fetch results of ran evaluation scenarios
aybruhm Sep 14, 2023
b93c371
:art: Format - ran format-fix and black
aybruhm Sep 14, 2023
3c5fcfa
Cleanup - fix type error
aybruhm Sep 14, 2023
5750a80
custom code evaluation: ui enhancements and bug fixes
MohammedMaaz Sep 15, 2023
9dcb7ea
resolve type errors
MohammedMaaz Sep 15, 2023
52dfae1
ran prettier
MohammedMaaz Sep 15, 2023
89924b9
Refactor - renamed store to create
aybruhm Sep 17, 2023
a086a4c
:art: Format - ran black
aybruhm Sep 17, 2023
163297a
Cleanup - removed react-ace and installed monaco-editor
aybruhm Sep 17, 2023
09ae969
Refactor - switch from react-ace to monaco-editor
aybruhm Sep 17, 2023
00b66e5
Feat - created custom evaluation names api model
aybruhm Sep 17, 2023
bfa4beb
Feat - implemented fetch custom evaluation names service
aybruhm Sep 17, 2023
13dc4aa
Feat - implemented evaluation router to get custom evaluation names a…
aybruhm Sep 17, 2023
679e361
Feat - added validation to check if evaluation name (input) exists
aybruhm Sep 17, 2023
5547f4a
:art: Format - ran format-fix
aybruhm Sep 17, 2023
ce00151
Refactor - remove /create from evaluation_router and renamed all pref…
aybruhm Sep 17, 2023
7cb08fd
Refactor - renamed Store prefix to Create
aybruhm Sep 17, 2023
dbffbcc
Cleanup - renamed store custom evaluation success reponse to start wi…
aybruhm Sep 17, 2023
53 changes: 50 additions & 3 deletions agenta-backend/agenta_backend/models/api/evaluation_model.py
@@ -1,7 +1,7 @@
-from pydantic import BaseModel, Field
-from typing import Optional, List, Dict
-from datetime import datetime
 from enum import Enum
+from datetime import datetime
+from pydantic import BaseModel, Field
+from typing import Optional, List, Dict, Any


 class EvaluationTypeSettings(BaseModel):
@@ -19,6 +19,7 @@ class EvaluationType(str, Enum):
     auto_ai_critique = "auto_ai_critique"
     human_a_b_testing = "human_a_b_testing"
     human_scoring = "human_scoring"
+    custom_code_run = "custom_code_run"


 class EvaluationStatusEnum(str, Enum):
@@ -33,6 +34,9 @@ class Evaluation(BaseModel):
     status: str
     evaluation_type: EvaluationType
     evaluation_type_settings: Optional[EvaluationTypeSettings]
+    custom_code_evaluation_id: Optional[
+        str
+    ]  # will be added when running custom code evaluation
     llm_app_prompt_template: Optional[str]
     variants: Optional[List[str]]
     app_name: str
@@ -70,13 +74,21 @@ class EvaluationScenario(BaseModel):
 class EvaluationScenarioUpdate(BaseModel):
     vote: Optional[str]
     score: Optional[str]
+    correct_answer: Optional[str]  # will be used when running custom code evaluation
     outputs: List[EvaluationScenarioOutput]
     evaluation_prompt_template: Optional[str]
     open_ai_key: Optional[str]


+class EvaluationScenarioScoreUpdate(BaseModel):
+    score: float
+
+
 class NewEvaluation(BaseModel):
     evaluation_type: EvaluationType
+    custom_code_evaluation_id: Optional[
+        str
+    ]  # will be added when running custom code evaluation
     evaluation_type_settings: Optional[EvaluationTypeSettings]
     app_name: str
     variants: List[str]
@@ -90,5 +102,40 @@ class DeleteEvaluation(BaseModel):
     evaluations_ids: List[str]


+class CreateCustomEvaluation(BaseModel):
+    evaluation_name: str
+    python_code: str
+    app_name: str
+
+
+class CustomEvaluationOutput(BaseModel):
+    id: str
+    app_name: str
+    evaluation_name: str
+    created_at: datetime
+
+
+class CustomEvaluationDetail(BaseModel):
+    id: str
+    app_name: str
+    evaluation_name: str
+    python_code: str
+    created_at: datetime
+    updated_at: datetime
+
+
+class CustomEvaluationNames(BaseModel):
+    id: str
+    evaluation_name: str
+
+
+class ExecuteCustomEvaluationCode(BaseModel):
+    inputs: List[Dict[str, Any]]
+    app_name: str
+    variant_name: str
+    correct_answer: str
+    outputs: List[Dict[str, Any]]
+
+
 class EvaluationWebhook(BaseModel):
     score: float
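
For orientation, a hedged example of how the new request models might be populated. All values are illustrative, and the exact shape of the `inputs` and `outputs` entries is an assumption inferred from how the router below indexes the formatted outputs by `variant_name`:

# Illustrative only; field values and the shape of the inputs/outputs entries
# are assumptions, not fixed by the models themselves.
from agenta_backend.models.api.evaluation_model import (
    CreateCustomEvaluation,
    ExecuteCustomEvaluationCode,
)

create_payload = CreateCustomEvaluation(
    evaluation_name="exact_match",
    python_code="def evaluate(output, correct_answer):\n"
    "    return float(output == correct_answer)",
    app_name="my_app",
)

execute_payload = ExecuteCustomEvaluationCode(
    inputs=[{"input_name": "country", "input_value": "France"}],
    app_name="my_app",
    variant_name="v1",
    correct_answer="Paris",
    outputs=[{"variant_name": "v1", "variant_output": "Paris"}],
)
print(execute_payload.json())
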
13 changes: 13 additions & 0 deletions agenta-backend/agenta_backend/models/db_models.py
@@ -86,6 +86,7 @@ class EvaluationScenarioOutput(EmbeddedModel):
 class EvaluationDB(Model):
     status: str
     evaluation_type: str
+    custom_code_evaluation_id: Optional[str]
     evaluation_type_settings: EvaluationTypeSettings
     llm_app_prompt_template: str
     variants: List[str]
@@ -115,6 +116,18 @@ class Config:
         collection = "evaluation_scenarios"


+class CustomEvaluationDB(Model):
+    evaluation_name: str
+    python_code: str
+    app_name: str
+    user: UserDB = Reference()
+    created_at: Optional[datetime] = Field(default=datetime.utcnow())
+    updated_at: Optional[datetime] = Field(default=datetime.utcnow())
+
+    class Config:
+        collection = "custom_evaluations"
+
+
 class TestSetDB(Model):
     name: str
     app_name: str
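
One detail worth flagging in `CustomEvaluationDB`: `Field(default=datetime.utcnow())` calls `utcnow()` once, at class-definition time, so documents created later in the same process inherit that stale timestamp. The usual pydantic idiom defers the call per instance; a minimal sketch:

from datetime import datetime

from pydantic import BaseModel, Field


class TimestampedDoc(BaseModel):
    # default_factory is invoked on each instantiation,
    # so every document gets its own creation time.
    created_at: datetime = Field(default_factory=datetime.utcnow)
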
215 changes: 214 additions & 1 deletion agenta-backend/agenta_backend/routers/evaluation_router.py
@@ -1,22 +1,31 @@
 import os
+import random
+from bson import ObjectId
 from datetime import datetime
 from typing import List, Optional
-import random

 from fastapi.responses import JSONResponse
 from fastapi import HTTPException, APIRouter, Body, Depends

+from agenta_backend.services.helpers import format_inputs, format_outputs
 from agenta_backend.models.api.evaluation_model import (
+    CustomEvaluationNames,
     Evaluation,
     EvaluationScenario,
+    CustomEvaluationOutput,
+    CustomEvaluationDetail,
+    EvaluationScenarioScoreUpdate,
     EvaluationScenarioUpdate,
+    ExecuteCustomEvaluationCode,
     NewEvaluation,
     DeleteEvaluation,
     EvaluationType,
+    CreateCustomEvaluation,
     EvaluationUpdate,
     EvaluationWebhook,
 )
 from agenta_backend.services.results_service import (
+    fetch_average_score_for_custom_code_run,
     fetch_results_for_human_a_b_testing_evaluation,
     fetch_results_for_auto_exact_match_evaluation,
     fetch_results_for_auto_similarity_match_evaluation,
@@ -26,10 +35,17 @@
 )
 from agenta_backend.services.evaluation_service import (
     UpdateEvaluationScenarioError,
+    fetch_custom_evaluation_names,
+    fetch_custom_evaluations,
+    fetch_custom_evaluation_detail,
+    get_evaluation_scenario_score,
     update_evaluation_scenario,
+    update_evaluation_scenario_score,
     update_evaluation,
     create_new_evaluation,
     create_new_evaluation_scenario,
+    create_custom_code_evaluation,
+    execute_custom_code_evaluation,
 )
 from agenta_backend.services.db_manager import engine, query, get_user_object
 from agenta_backend.models.db_models import EvaluationDB, EvaluationScenarioDB
@@ -213,6 +229,60 @@ async def update_evaluation_scenario_router(
         raise HTTPException(status_code=500, detail=str(e)) from e


+@router.get("/evaluation_scenario/{evaluation_scenario_id}/score")
+async def get_evaluation_scenario_score_router(
+    evaluation_scenario_id: str,
+    stoken_session: SessionContainer = Depends(verify_session()),
+):
+    """Get the score of an evaluation scenario.
+
+    Args:
+        evaluation_scenario_id (str): the id of the evaluation scenario
+        stoken_session (SessionContainer, optional): session token.
+            Defaults to Depends(verify_session()).
+
+    Returns:
+        dict: the evaluation scenario score
+    """
+
+    # Get user and organization id
+    kwargs: dict = await get_user_and_org_id(stoken_session)
+    scenario_score = await get_evaluation_scenario_score(
+        evaluation_scenario_id, **kwargs
+    )
+    return scenario_score
+
+
+@router.put("/evaluation_scenario/{evaluation_scenario_id}/score")
+async def update_evaluation_scenario_score_router(
+    evaluation_scenario_id: str,
+    payload: EvaluationScenarioScoreUpdate,
+    stoken_session: SessionContainer = Depends(verify_session()),
+):
+    """Updates an evaluation scenario's score.
+
+    Args:
+        evaluation_scenario_id (str): the evaluation scenario to update
+        payload (EvaluationScenarioScoreUpdate): the new score value
+
+    Raises:
+        HTTPException: server error if the evaluation update went wrong
+    """
+
+    try:
+        # Get user and organization id
+        kwargs: dict = await get_user_and_org_id(stoken_session)
+        return await update_evaluation_scenario_score(
+            evaluation_scenario_id, payload.score, **kwargs
+        )
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e)) from e
+
+
 @router.get("/", response_model=List[Evaluation])
 async def fetch_list_evaluations(
     app_name: Optional[str] = None,
@@ -238,6 +308,7 @@ async def fetch_list_evaluations(
             id=str(evaluation.id),
             status=evaluation.status,
             evaluation_type=evaluation.evaluation_type,
+            custom_code_evaluation_id=evaluation.custom_code_evaluation_id,
             evaluation_type_settings=evaluation.evaluation_type_settings,
             llm_app_prompt_template=evaluation.llm_app_prompt_template,
             variants=evaluation.variants,
@@ -275,6 +346,7 @@ async def fetch_evaluation(
         id=str(evaluation.id),
         status=evaluation.status,
         evaluation_type=evaluation.evaluation_type,
+        custom_code_evaluation_id=evaluation.custom_code_evaluation_id,
         evaluation_type_settings=evaluation.evaluation_type_settings,
         llm_app_prompt_template=evaluation.llm_app_prompt_template,
         variants=evaluation.variants,
@@ -386,6 +458,147 @@ async def fetch_results(
         results = await fetch_results_for_auto_ai_critique(evaluation_id)
         return {"results_data": results}

+    elif evaluation.evaluation_type == EvaluationType.custom_code_run:
+        results = await fetch_average_score_for_custom_code_run(evaluation_id)
+        return {"avg_score": results}
+
+
+@router.post("/custom_evaluation/")
+async def create_custom_evaluation(
+    custom_evaluation_payload: CreateCustomEvaluation,
+    stoken_session: SessionContainer = Depends(verify_session()),
+):
+    """Create an evaluation with custom python code.
+
+    Args:
+        custom_evaluation_payload (CreateCustomEvaluation): the required payload
+    """
+
+    # Get user and organization id
+    kwargs: dict = await get_user_and_org_id(stoken_session)
+
+    # create custom evaluation in database
+    evaluation_id = await create_custom_code_evaluation(
+        custom_evaluation_payload, **kwargs
+    )
+
+    return JSONResponse(
+        {
+            "status": "success",
+            "message": "Evaluation created successfully.",
+            "evaluation_id": evaluation_id,
+        },
+        status_code=200,
+    )
+
+
+@router.get(
+    "/custom_evaluation/list/{app_name}",
+    response_model=List[CustomEvaluationOutput],
+)
+async def list_custom_evaluations(
+    app_name: str,
+    stoken_session: SessionContainer = Depends(verify_session()),
+):
+    """List the custom code evaluations for a given app.
+
+    Args:
+        app_name (str): the name of the app
+
+    Returns:
+        List[CustomEvaluationOutput]: a list of custom evaluations
+    """
+
+    # Get user and organization id
+    kwargs: dict = await get_user_and_org_id(stoken_session)
+
+    # Fetch custom evaluations from database
+    evaluations = await fetch_custom_evaluations(app_name, **kwargs)
+    return evaluations
+
+
+@router.get(
+    "/custom_evaluation/{id}",
+    response_model=CustomEvaluationDetail,
+)
+async def get_custom_evaluation(
+    id: str,
+    stoken_session: SessionContainer = Depends(verify_session()),
+):
+    """Get the details of a custom code evaluation.
+
+    Args:
+        id (str): the id of the custom evaluation
+
+    Returns:
+        CustomEvaluationDetail: detail of the custom evaluation
+    """
+
+    # Get user and organization id
+    kwargs: dict = await get_user_and_org_id(stoken_session)
+
+    # Fetch custom evaluation detail from database
+    evaluation = await fetch_custom_evaluation_detail(id, **kwargs)
+    return evaluation
+
+
+@router.get(
+    "/custom_evaluation/{app_name}/names/",
+    response_model=List[CustomEvaluationNames],
+)
+async def get_custom_evaluation_names(
+    app_name: str, stoken_session: SessionContainer = Depends(verify_session())
+):
+    """Get the names of the custom evaluations for a given app.
+
+    Args:
+        app_name (str): the name of the app the evaluations belong to
+
+    Returns:
+        List[CustomEvaluationNames]: the list of custom evaluation names
+    """
+    # Get user and organization id
+    kwargs: dict = await get_user_and_org_id(stoken_session)
+
+    custom_eval_names = await fetch_custom_evaluation_names(app_name, **kwargs)
+    return custom_eval_names
+
+
+@router.post(
+    "/custom_evaluation/execute/{evaluation_id}/",
+)
+async def execute_custom_evaluation(
+    evaluation_id: str,
+    payload: ExecuteCustomEvaluationCode,
+    stoken_session: SessionContainer = Depends(verify_session()),
+):
+    """Execute a custom evaluation's code.
+
+    Args:
+        evaluation_id (str): the custom evaluation id
+        payload (ExecuteCustomEvaluationCode): the required payload
+
+    Returns:
+        float: the result of the custom evaluation code
+    """
+
+    # Get user and organization id
+    kwargs: dict = await get_user_and_org_id(stoken_session)
+
+    # Execute custom code evaluation
+    formatted_inputs = format_inputs(payload.inputs)
+    formatted_outputs = format_outputs(payload.outputs)
+    result = await execute_custom_code_evaluation(
+        evaluation_id,
+        payload.app_name,
+        formatted_outputs[payload.variant_name],  # gets the output of the app variant
+        payload.correct_answer,
+        payload.variant_name,
+        formatted_inputs,
+        **kwargs,
+    )
+    return result
+
+
 @router.post("/webhook_example_fake", response_model=EvaluationWebhook)
 async def webhook_example_fake():
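
Taken together, the new routes support the following flow. This is a hedged sketch: the mount prefix, the session cookie, and the payload field shapes are assumptions not fixed by this diff, while the route paths come from the router above:

# Hedged usage sketch; BASE and the session cookie are assumptions, the paths
# are taken from the router above. Field shapes mirror the illustrative
# payloads shown after the evaluation_model.py diff.
import httpx

BASE = "http://localhost/api/evaluations"  # assumed mount prefix
COOKIES = {"sAccessToken": "..."}  # supertokens session cookie, elided

with httpx.Client(cookies=COOKIES) as client:
    # 1. Store a custom evaluation for an app.
    created = client.post(
        f"{BASE}/custom_evaluation/",
        json={
            "evaluation_name": "exact_match",
            "python_code": "def evaluate(output, correct_answer):\n"
            "    return float(output == correct_answer)",
            "app_name": "my_app",
        },
    ).json()

    # 2. Discover what is stored.
    listed = client.get(f"{BASE}/custom_evaluation/list/my_app").json()
    names = client.get(f"{BASE}/custom_evaluation/my_app/names/").json()

    # 3. Execute the stored code against one scenario's output.
    score = client.post(
        f"{BASE}/custom_evaluation/execute/{created['evaluation_id']}/",
        json={
            "inputs": [{"input_name": "country", "input_value": "France"}],
            "app_name": "my_app",
            "variant_name": "v1",
            "correct_answer": "Paris",
            "outputs": [{"variant_name": "v1", "variant_output": "Paris"}],
        },
    ).json()

    # 4. Persist the score on the evaluation scenario (id elided).
    client.put(
        f"{BASE}/evaluation_scenario/<scenario_id>/score",
        json={"score": score},
    )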