Custom Code Evaluations (#610)
* Update - added restrictedpython

* Feat - created security module

* Feat - implemented execute_code_safely function (sketched after the change summary below)

* Feat - created custom evaluation db collection

* Feat - created custom evaluation type and store custom evaluation api models

* Feat - implemented store and execute custom code evaluation logics

* Feat - implemented function to check if module import is safe to ensure code safe execution

* Cleanup - remove app_name from execute_custom_code_evaluation

* Feat - implemented store and execute custom evaluation routers

* Update - added custom_code_run to evaluation type and labels

* Feat - upload custom_code image

* Feat - created store custom evaluation type interface

* Feat - created type interface for store custom evaluation success response

* Feat - implemented save custom code evaluation api logic

* Feat - implemented custom evaluation dropdown component

* Update - added type dropdown component

* Feat - implemented custom python code component

* Refactor - renamed component prop interface

* Feat - created type interface for single custom evaluation

* Feat - implemented axios logic to fetch custom evaluations

* Update - improve security in sandbox environment

* Cleanup - removed custom evaluation type embedded model and some fields in custom evaluation db

* Feat - implemented fetch custom evaluations evaluation service

* Feat - implemented list custom evaluations api router

* Feat - created custom evaluation output and added new type in evaluation type schema

* Update - modified custom evaluations dropdown component to set custom evaluation id

* Update - include custom python code and evaluation dropdowns component in evaluations component

* Refactor - removed custom_code.png

* Feat - created evaluation api model to execute custom evaluation code

* Feat - implemented custom code run evaluation page

* Feat - implemented helper function to include dynamic values

* Update - add condition to save correct_answer for custom_code evaluations

* Feat - created type interface for execute custom eval code

* Feat - implemented axios logic to execute custom evaluation code

* Update - added optional field (correct_answer)

* Feat - implemented fetch average score for custom code run result service

* Update - modified fetch_results and execute_custom_evaluation routers

* Cleanup - remove unused code-blocks

* Feat - implemented custom code run evaluation table component

* 🎨 Format - ran format-fix and black

* Feat - created create custom evaluation page

* Update - removed custom python code in evaluation component

* Cleanup - formatted custom evaluations dropdown component

* Refactor - renamed saveCutomCodeEvaluation to saveCustomCodeEvaluation

* Update - added new styles

* Update - introduce pre-filled example of an evaluation function and syntax highlighting on the example code

* 🎨 Format - ran format-fix and black

* Update - added variant_name to type interface ExecuteCustomEvalCode

* Update - added app_params, output to sandbox and allow execution of the evaluation function

* Update - added output to execute_custom_code_execution service function

* Update - added app_name, variant_name, and outputs to execute_custom_evaluation_code api model

* Refactor - modified executeCustomEvaluationCode axios api logic

* Update - added styles for copy btn in custom python code component

* Update - refactor evaluate function and added new args in callCustomCodeHandler function

* Update - add custom code evaluation id to evaluation

* Update - retrieve evaluations for custom code evals

* Update - added custom code evaluation id to router push

* Update - added custom_code_evaluation_id and made it optional

* Update - added btn to copy code example for custom evaluation function

* Update - created format_outputs helper function

* Update - added correct_answer to execute custom evaluation code api model

* Update - modified evaluation function example description

* Update - modified fetch_average_score_for_custom_code_run

* Feat - created update_evaluation_scenario_score logic and added doc strings to custom evaluations services

* Update - include correct_answer to custom eval code params

* Feat - implemented update evaluation scenario score axios logic

* Feat - created evaluation scenario score update api model

* Update - receive put data by payload instead of query

* Update - modified custom code run evaluation table component

* 🎨 Format - ran format-fix and black

* Update - installed packages.json

* Update - integrated ace editor for code input and syntax highlighting

* Update - set result and avg_score to 2 decimal places

* 🎨 Format - ran format-fix

* Cleanup - add ? to handle undefined error

* 🎨 Format - ran format-fix

* 🎨 Format - ran format-fix and black

* Cleanup - removed raise exception when no custom evaluations are found

* Refactor - override error interceptor for get all variant parameters api call

* Cleanup - removed console log

* Feat - created backend router to get evaluation scenario score and axios logic to make backend call

* Update - round score by 2 decimal

* Refactor - removed CustomEvaluationsDropdown component

* Refactor - improve get_evaluation_scenario_score_router

* Refactor - directly include dropdown select of custom evaluations

* Update - added logic to fetch results of run evaluation scenarios

* 🎨 Format - ran format-fix and black

* Cleanup - fix type error

* custom code evaluation: ui enhancements and bug fixes

* resolve type errors

* ran prettier

* Refactor - renamed store to create

* 🎨 Format - ran black

* Cleanup - removed react-ace and installed monaco-editor

* Refactor - switch from react-ace to monaco-editor

* Feat - created custom evaluation names api model

* Feat - implemented fetch custom evaluation names service

* Feat - implemented evaluation router to get custom evaluation names and integrated router to axios

* Feat - added validation to check if evaluation name (input) exists

* 🎨 Format - ran format-fix

* Refactor - remove /create from evaluation_router and renamed all prefix store_ with create_

* Refactor - renamed Store prefix to Create

* Cleanup - renamed store custom evaluation success response to start with create and ran format-fix

---------

Co-authored-by: Abram <[email protected]>
MohammedMaaz and aybruhm authored Sep 17, 2023
1 parent 1ce56cf commit 6e7e421
Showing 26 changed files with 3,701 additions and 942 deletions.
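
Before the diffs: the commit messages above describe a restrictedpython-based security module with an execute_code_safely function that runs a user-supplied evaluation function against an app variant's output. Neither that module nor the pre-filled example function is part of the files shown below, so the following is only a minimal sketch of what those two pieces might look like. The evaluate() signature, the run_custom_evaluation helper, and the guard setup are assumptions inferred from the commit messages, not the repository's actual implementation.

# Hypothetical sketch, not code from this commit.
from RestrictedPython import compile_restricted, safe_globals
from RestrictedPython.Guards import safer_getattr

# The kind of evaluation function a user would paste into the UI.
# The signature is an assumption; the commits mention app_params,
# inputs, output, and correct_answer being passed to the sandbox.
EXAMPLE_CODE = """
def evaluate(app_params, inputs, output, correct_answer):
    # Score 1.0 when the variant output matches the expected answer.
    return 1.0 if output == correct_answer else 0.0
"""

def run_custom_evaluation(code, app_params, inputs, output, correct_answer):
    """Compile user code with RestrictedPython and call its evaluate()."""
    byte_code = compile_restricted(code, filename="<custom_evaluation>", mode="exec")
    restricted_globals = dict(safe_globals)
    restricted_globals["_getattr_"] = safer_getattr  # guarded attribute access
    restricted_locals = {}
    exec(byte_code, restricted_globals, restricted_locals)
    evaluate_fn = restricted_locals["evaluate"]
    return float(evaluate_fn(app_params, inputs, output, correct_answer))

print(run_custom_evaluation(EXAMPLE_CODE, {}, {"country": "France"}, "Paris", "Paris"))  # 1.0
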
53 changes: 50 additions & 3 deletions agenta-backend/agenta_backend/models/api/evaluation_model.py
@@ -1,7 +1,7 @@
from pydantic import BaseModel, Field
from typing import Optional, List, Dict
from datetime import datetime
from enum import Enum
from datetime import datetime
from pydantic import BaseModel, Field
from typing import Optional, List, Dict, Any


class EvaluationTypeSettings(BaseModel):
@@ -19,6 +19,7 @@ class EvaluationType(str, Enum):
auto_ai_critique = "auto_ai_critique"
human_a_b_testing = "human_a_b_testing"
human_scoring = "human_scoring"
custom_code_run = "custom_code_run"


class EvaluationStatusEnum(str, Enum):
@@ -33,6 +34,9 @@ class Evaluation(BaseModel):
status: str
evaluation_type: EvaluationType
evaluation_type_settings: Optional[EvaluationTypeSettings]
custom_code_evaluation_id: Optional[
str
] # will be added when running custom code evaluation
llm_app_prompt_template: Optional[str]
variants: Optional[List[str]]
app_name: str
@@ -70,13 +74,21 @@ class EvaluationScenario(BaseModel):
class EvaluationScenarioUpdate(BaseModel):
vote: Optional[str]
score: Optional[str]
correct_answer: Optional[str] # will be used when running custom code evaluation
outputs: List[EvaluationScenarioOutput]
evaluation_prompt_template: Optional[str]
open_ai_key: Optional[str]


class EvaluationScenarioScoreUpdate(BaseModel):
score: float


class NewEvaluation(BaseModel):
evaluation_type: EvaluationType
custom_code_evaluation_id: Optional[
str
] # will be added when running custom code evaluation
evaluation_type_settings: Optional[EvaluationTypeSettings]
app_name: str
variants: List[str]
@@ -90,5 +102,40 @@ class DeleteEvaluation(BaseModel):
evaluations_ids: List[str]


class CreateCustomEvaluation(BaseModel):
evaluation_name: str
python_code: str
app_name: str


class CustomEvaluationOutput(BaseModel):
id: str
app_name: str
evaluation_name: str
created_at: datetime


class CustomEvaluationDetail(BaseModel):
id: str
app_name: str
evaluation_name: str
python_code: str
created_at: datetime
updated_at: datetime


class CustomEvaluationNames(BaseModel):
id: str
evaluation_name: str


class ExecuteCustomEvaluationCode(BaseModel):
inputs: List[Dict[str, Any]]
app_name: str
variant_name: str
correct_answer: str
outputs: List[Dict[str, Any]]


class EvaluationWebhook(BaseModel):
score: float
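
For orientation, here is a client-side sketch of how the two new request models above might be populated. The concrete field values, and the key names inside inputs and outputs, are invented for illustration; the models themselves only constrain those fields to List[Dict[str, Any]].

# Illustrative values only.
create_payload = CreateCustomEvaluation(
    evaluation_name="exact_match_custom",
    python_code="def evaluate(app_params, inputs, output, correct_answer):\n    return 1.0",
    app_name="my_app",
)

execute_payload = ExecuteCustomEvaluationCode(
    inputs=[{"input_name": "country", "input_value": "France"}],
    app_name="my_app",
    variant_name="v1",
    correct_answer="Paris",
    outputs=[{"variant_name": "v1", "variant_output": "Paris"}],
)
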
13 changes: 13 additions & 0 deletions agenta-backend/agenta_backend/models/db_models.py
@@ -86,6 +86,7 @@ class EvaluationScenarioOutput(EmbeddedModel):
class EvaluationDB(Model):
status: str
evaluation_type: str
custom_code_evaluation_id: Optional[str]
evaluation_type_settings: EvaluationTypeSettings
llm_app_prompt_template: str
variants: List[str]
@@ -115,6 +116,18 @@ class Config:
collection = "evaluation_scenarios"


class CustomEvaluationDB(Model):
evaluation_name: str
python_code: str
app_name: str
user: UserDB = Reference()
created_at: Optional[datetime] = Field(default=datetime.utcnow())
updated_at: Optional[datetime] = Field(default=datetime.utcnow())

class Config:
collection = "custom_evaluations"


class TestSetDB(Model):
name: str
app_name: str
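
The new CustomEvaluationDB collection is persisted through the engine that the router below imports from db_manager. The following is a rough sketch of how the create service might use it, assuming an odmantic-style async engine; the real logic lives in evaluation_service.py, which is not part of this excerpt.

# Sketch only; assumes `engine` is an odmantic AIOEngine.
async def create_custom_code_evaluation_sketch(payload, user) -> str:
    custom_eval = CustomEvaluationDB(
        evaluation_name=payload.evaluation_name,
        python_code=payload.python_code,
        app_name=payload.app_name,
        user=user,
    )
    await engine.save(custom_eval)
    return str(custom_eval.id)
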
215 changes: 214 additions & 1 deletion agenta-backend/agenta_backend/routers/evaluation_router.py
@@ -1,22 +1,31 @@
import os
import random
from bson import ObjectId
from datetime import datetime
from typing import List, Optional
import random

from fastapi.responses import JSONResponse
from fastapi import HTTPException, APIRouter, Body, Depends

from agenta_backend.services.helpers import format_inputs, format_outputs
from agenta_backend.models.api.evaluation_model import (
CustomEvaluationNames,
Evaluation,
EvaluationScenario,
CustomEvaluationOutput,
CustomEvaluationDetail,
EvaluationScenarioScoreUpdate,
EvaluationScenarioUpdate,
ExecuteCustomEvaluationCode,
NewEvaluation,
DeleteEvaluation,
EvaluationType,
CreateCustomEvaluation,
EvaluationUpdate,
EvaluationWebhook,
)
from agenta_backend.services.results_service import (
fetch_average_score_for_custom_code_run,
fetch_results_for_human_a_b_testing_evaluation,
fetch_results_for_auto_exact_match_evaluation,
fetch_results_for_auto_similarity_match_evaluation,
@@ -26,10 +35,17 @@
)
from agenta_backend.services.evaluation_service import (
UpdateEvaluationScenarioError,
fetch_custom_evaluation_names,
fetch_custom_evaluations,
fetch_custom_evaluation_detail,
get_evaluation_scenario_score,
update_evaluation_scenario,
update_evaluation_scenario_score,
update_evaluation,
create_new_evaluation,
create_new_evaluation_scenario,
create_custom_code_evaluation,
execute_custom_code_evaluation,
)
from agenta_backend.services.db_manager import engine, query, get_user_object
from agenta_backend.models.db_models import EvaluationDB, EvaluationScenarioDB
@@ -213,6 +229,60 @@ async def update_evaluation_scenario_router(
raise HTTPException(status_code=500, detail=str(e)) from e


@router.get("/evaluation_scenario/{evaluation_scenario_id}/score")
async def get_evaluation_scenario_score_router(
evaluation_scenario_id: str,
stoken_session: SessionContainer = Depends(verify_session()),
):
"""Get the score of an evaluation scenario.
Args:
evaluation_scenario_id (str): the id of the evaluation scenario
stoken_session (SessionContainer, optional): the session token. Defaults to Depends(verify_session()).
Returns:
the score of the evaluation scenario
"""

# Get user and organization id
kwargs: dict = await get_user_and_org_id(stoken_session)
scenario_score = await get_evaluation_scenario_score(
evaluation_scenario_id, **kwargs
)
return scenario_score


@router.put("/evaluation_scenario/{evaluation_scenario_id}/score")
async def update_evaluation_scenario_score_router(
evaluation_scenario_id: str,
payload: EvaluationScenarioScoreUpdate,
stoken_session: SessionContainer = Depends(verify_session()),
):
"""Updates evaluation scenario score
Args:
evaluation_scenario_id (str): the evaluation scenario to update
payload (EvaluationScenarioScoreUpdate): carries the new score value
Raises:
HTTPException: server error if evaluation update went wrong
"""

try:
# Get user and organization id
kwargs: dict = await get_user_and_org_id(stoken_session)
return await update_evaluation_scenario_score(
evaluation_scenario_id, payload.score, **kwargs
)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e)) from e


@router.get("/", response_model=List[Evaluation])
async def fetch_list_evaluations(
app_name: Optional[str] = None,
@@ -238,6 +308,7 @@ async def fetch_list_evaluations(
id=str(evaluation.id),
status=evaluation.status,
evaluation_type=evaluation.evaluation_type,
custom_code_evaluation_id=evaluation.custom_code_evaluation_id,
evaluation_type_settings=evaluation.evaluation_type_settings,
llm_app_prompt_template=evaluation.llm_app_prompt_template,
variants=evaluation.variants,
@@ -275,6 +346,7 @@ async def fetch_evaluation(
id=str(evaluation.id),
status=evaluation.status,
evaluation_type=evaluation.evaluation_type,
custom_code_evaluation_id=evaluation.custom_code_evaluation_id,
evaluation_type_settings=evaluation.evaluation_type_settings,
llm_app_prompt_template=evaluation.llm_app_prompt_template,
variants=evaluation.variants,
@@ -386,6 +458,147 @@ async def fetch_results(
results = await fetch_results_for_auto_ai_critique(evaluation_id)
return {"results_data": results}

elif evaluation.evaluation_type == EvaluationType.custom_code_run:
results = await fetch_average_score_for_custom_code_run(evaluation_id)
return {"avg_score": results}


@router.post("/custom_evaluation/")
async def create_custom_evaluation(
custom_evaluation_payload: CreateCustomEvaluation,
stoken_session: SessionContainer = Depends(verify_session()),
):
"""Create evaluation with custom python code.
Args:
\n custom_evaluation_payload (CreateCustomEvaluation): the required payload
"""

# Get user and organization id
kwargs: dict = await get_user_and_org_id(stoken_session)

# create custom evaluation in database
evaluation_id = await create_custom_code_evaluation(
custom_evaluation_payload, **kwargs
)

return JSONResponse(
{
"status": "success",
"message": "Evaluation created successfully.",
"evaluation_id": evaluation_id,
},
status_code=200,
)


@router.get(
"/custom_evaluation/list/{app_name}",
response_model=List[CustomEvaluationOutput],
)
async def list_custom_evaluations(
app_name: str,
stoken_session: SessionContainer = Depends(verify_session()),
):
"""List the custom code evaluations for a given app.
Args:
app_name (str): the name of the app
Returns:
List[CustomEvaluationOutput]: a list of custom evaluation
"""

# Get user and organization id
kwargs: dict = await get_user_and_org_id(stoken_session)

# Fetch custom evaluations from database
evaluations = await fetch_custom_evaluations(app_name, **kwargs)
return evaluations


@router.get(
"/custom_evaluation/{id}",
response_model=CustomEvaluationDetail,
)
async def get_custom_evaluation(
id: str,
stoken_session: SessionContainer = Depends(verify_session()),
):
"""Get the custom code evaluation detail.
Args:
id (str): the id of the custom evaluation
Returns:
CustomEvaluationDetail: Detail of the custom evaluation
"""

# Get user and organization id
kwargs: dict = await get_user_and_org_id(stoken_session)

# Fetch custom evaluations from database
evaluation = await fetch_custom_evaluation_detail(id, **kwargs)
return evaluation


@router.get(
"/custom_evaluation/{app_name}/names/",
response_model=List[CustomEvaluationNames],
)
async def get_custom_evaluation_names(
app_name: str, stoken_session: SessionContainer = Depends(verify_session())
):
"""Get the names of custom evaluation for a given app.
Args:
app_name (str): the name of the app the evaluation belongs to
Returns:
List[CustomEvaluationNames]: the list of name of custom evaluations
"""
# Get user and organization id
kwargs: dict = await get_user_and_org_id(stoken_session)

custom_eval_names = await fetch_custom_evaluation_names(app_name, **kwargs)
return custom_eval_names


@router.post(
"/custom_evaluation/execute/{evaluation_id}/",
)
async def execute_custom_evaluation(
evaluation_id: str,
payload: ExecuteCustomEvaluationCode,
stoken_session: SessionContainer = Depends(verify_session()),
):
"""Execute a custom evaluation code.
Args:
evaluation_id (str): the custom evaluation id
payload (ExecuteCustomEvaluationCode): the required payload
Returns:
float: the result of the evaluation custom code
"""

# Get user and organization id
kwargs: dict = await get_user_and_org_id(stoken_session)

# Execute custom code evaluation
formatted_inputs = format_inputs(payload.inputs)
formatted_outputs = format_outputs(payload.outputs)
result = await execute_custom_code_evaluation(
evaluation_id,
payload.app_name,
formatted_outputs[payload.variant_name], # gets the output of the app variant
payload.correct_answer,
payload.variant_name,
formatted_inputs,
**kwargs,
)
return result


@router.post("/webhook_example_fake", response_model=EvaluationWebhook)
async def webhook_example_fake():
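
Putting the new routes together, an end-to-end client flow might look like the sketch below. The base URL and router prefix are assumptions, session authentication (verify_session) is omitted, and the inputs/outputs key names mirror the guesses above. Note that execute_custom_evaluation runs the payload through the format_inputs and format_outputs helpers first; judging from formatted_outputs[payload.variant_name], format_outputs appears to map each variant name to its output.

import requests

BASE = "http://localhost/api/evaluations"  # assumed prefix, adjust to your deployment

# 1. Register a custom evaluation for an app.
resp = requests.post(
    f"{BASE}/custom_evaluation/",
    json={
        "evaluation_name": "exact_match_custom",
        "python_code": "def evaluate(app_params, inputs, output, correct_answer):\n    return 1.0",
        "app_name": "my_app",
    },
)
evaluation_id = resp.json()["evaluation_id"]

# 2. Execute it against one scenario of a variant.
resp = requests.post(
    f"{BASE}/custom_evaluation/execute/{evaluation_id}/",
    json={
        "inputs": [{"input_name": "country", "input_value": "France"}],
        "app_name": "my_app",
        "variant_name": "v1",
        "correct_answer": "Paris",
        "outputs": [{"variant_name": "v1", "variant_output": "Paris"}],
    },
)
print(resp.json())  # score returned by the custom evaluation code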