Custom Code Evaluations (#610)
* Update - added restrictedpython

* Feat - created security module

* Feat - implemented execute_code_safely function (sketched after the change summary below)

* Feat - created custom evaluation db collection

* Feat - created custom evaluation type and store custom evaluation api models

* Feat - implemented store and execute custom code evaluation logics

* Feat - implemented function to check if module import is safe to ensure code safe execution

* Cleanup - remove app_name from execute_custom_code_evaluation

* Feat - implemented store and execute custom evaluation routers

* Update - added custom_code_run to evaluation type and labels

* Feat - upload custom_code image

* Feat - created store custom evaluation type interface

* Feat - created type interface for store custom evaluation success response

* Feat - implemented save custom code evaluation api logic

* Feat - implemented custom evaluation dropdown component

* Update - added type dropdown component

* Feat - implemented custom python code component

* Refactor - renamed component prop interface

* Feat - created type interface for single custom evaluation

* Feat - implemented axios logic to fetch custom evaluations

* Update - improve security in sandbox environment

* Cleanup - removed custom evaluation type embedded model and some fields in custom evaluation db

* Feat - implemented fetch custom evaluations evaluation service

* Feat - implemented list custom evaluations api router

* Feat - created custom evaluation output and added new type in evaluation type schema

* Update - modified custom evaluations dropdown component to set custom evaluation id

* Update - include custom python code and evaluation dropdowns component in evaluations component

* Refactor - removed custom_code.png

* Feat - created evaluation api model to execute custom evaluation code

* Feat - implemented custom code run evaluation page

* Feat - implemented helper function to include dynamic values

* Update - add condition to save correct_answer for custom_code evaluations

* Feat - created type interface for execute custom eval code

* Feat - implemented axios logic to execute custom evaluation code

* Update - added optional field (correct_answer)

* Feat - implemented fetch average score for custom code run result service

* Update - modified fetch_results and execute_custom_evaluation routers

* Cleanup - remove unused code-blocks

* Feat - implemented custom code run evaluation table component

* 🎨 Format - ran format-fix and black

* Feat - created create custom evaluation page

* Update - removed custom python code in evaluation component

* Cleanup - formatted custom evaluations dropdown component

* Refactor - renamed saveCutomCodeEvaluation to saveCustomCodeEvaluation

* Update - added new styles

* Update - introduce pre-filled example of an evaluation function and syntax highlighting on the example code

* 🎨 Format - ran format-fix and black

* Update - added variant_name to type interface ExecuteCustomEvalCode

* Update - added app_params, output to sandbox and allow execution of the evaluation function

* Update - added output to execute_custom_code_execution service function

* Update - added app_name, variant_name, and outputs to execute_custom_evaluation_code api model

* Refactor - modified executeCustomEvaluationCode axios api logic

* Update - added styles for copy btn in custom python code component

* Update - refactor evaluate function and added new args in callCustomCodeHandler function

* Update - add custom code evaluation id to evaluation

* Update - retrieve evaluations for custom code evals

* Update - added custom code evaluation id to router push

* Update - added custom_code_evaluation_id and made it optional

* Update - added btn to copy code example for custom evaluation function

* Update - created format_outputs helper function

* Update - added correct_answer to execute custom evaluation code api model

* Update - modified evaluation function example description

* Update - modified fetch_average_score_for_custom_code_run

* Feat - created update_evaluation_scenario_score logic and added doc strings to custom evaluations services

* Update - include correct_answer to custom eval code params

* Feat - implemented update evaluation scenario score axios logic

* Feat - created evaluation scenario score update api model

* Update - receive put data by payload instead of query

* Update - modified custom code run evaluation table component

* 🎨 Format - ran format-fix and black

* Update - installed packages.json

* Update - integrated ace editor for code input and syntax highlighting

* Update - set result and avg_score to 2 decimal places

* 🎨 Format - ran format-fix

* Cleanup - add ? to handle undefined error

* 🎨 Format - ran format-fix

* 🎨 Format - ran format-fix and black

* Cleanup - removed raise exception when no custom evaluations are found

* Refactor - override error interceptor for get all variant parameters api call

* Cleanup - removed console log

* Feat - created backend router to get evaluation scenario score and axios logic to make backend call

* Update - round score by 2 decimal

* Refactor - removed CustomEvaluationsDropdown component

* Refactor - improve get_evaluation_scenario_score_router

* Refactor - directly include dropdown select of custom evaluations

* Update - added logic to fetch results of run evaluation scenarios

* 🎨 Format - ran format-fix and black

* Cleanup - fix type error

* custom code evaluation: ui enhancements and bug fixes

* resolve type errors

* ran prettier

* Refactor - renamed store to create

* 🎨 Format - ran black

* Cleanup - removed react-ace and installed monaco-editor

* Refactor - switch from react-ace to monaco-editor

* Feat - created custom evaluation names api model

* Feat - implemented fetch custom evaluation names service

* Feat - implemented evaluation router to get custom evaluation names and integrated router to axios

* Feat - added validation to check if evaluation name (input) exists

* 🎨 Format - ran format-fix

* Refactor - remove /create from evaluation_router and renamed all prefix store_ with create_

* Refactor - renamed Store prefix to Create

* Cleanup - renamed store custom evaluation success response to start with create and ran format-fix

---------

Co-authored-by: Abram <[email protected]>
MohammedMaaz and aybruhm authored Sep 17, 2023
1 parent 1ce56cf commit 6e7e421
Showing 26 changed files with 3,701 additions and 942 deletions.
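
Before the diffs: the commit messages above describe a restrictedpython-based security module with an execute_code_safely function that runs a user-supplied evaluation function against an app variant's output. Neither that module nor the pre-filled example function is part of the files shown below, so the following is only a minimal sketch of what those two pieces might look like. The evaluate() signature, the run_custom_evaluation helper, and the guard setup are assumptions inferred from the commit messages, not the repository's actual implementation.

# Hypothetical sketch, not code from this commit.
from RestrictedPython import compile_restricted, safe_globals
from RestrictedPython.Guards import safer_getattr

# The kind of evaluation function a user would paste into the UI.
# The signature is an assumption; the commits mention app_params,
# inputs, output, and correct_answer being passed to the sandbox.
EXAMPLE_CODE = """
def evaluate(app_params, inputs, output, correct_answer):
    # Score 1.0 when the variant output matches the expected answer.
    return 1.0 if output == correct_answer else 0.0
"""

def run_custom_evaluation(code, app_params, inputs, output, correct_answer):
    """Compile user code with RestrictedPython and call its evaluate()."""
    byte_code = compile_restricted(code, filename="<custom_evaluation>", mode="exec")
    restricted_globals = dict(safe_globals)
    restricted_globals["_getattr_"] = safer_getattr  # guarded attribute access
    restricted_locals = {}
    exec(byte_code, restricted_globals, restricted_locals)
    evaluate_fn = restricted_locals["evaluate"]
    return float(evaluate_fn(app_params, inputs, output, correct_answer))

print(run_custom_evaluation(EXAMPLE_CODE, {}, {"country": "France"}, "Paris", "Paris"))  # 1.0
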
53 changes: 50 additions & 3 deletions agenta-backend/agenta_backend/models/api/evaluation_model.py
@@ -1,7 +1,7 @@
from pydantic import BaseModel, Field
from typing import Optional, List, Dict
from datetime import datetime
from enum import Enum
from datetime import datetime
from pydantic import BaseModel, Field
from typing import Optional, List, Dict, Any


class EvaluationTypeSettings(BaseModel):
@@ -19,6 +19,7 @@ class EvaluationType(str, Enum):
auto_ai_critique = "auto_ai_critique"
human_a_b_testing = "human_a_b_testing"
human_scoring = "human_scoring"
custom_code_run = "custom_code_run"


class EvaluationStatusEnum(str, Enum):
@@ -33,6 +34,9 @@ class Evaluation(BaseModel):
status: str
evaluation_type: EvaluationType
evaluation_type_settings: Optional[EvaluationTypeSettings]
custom_code_evaluation_id: Optional[
str
] # will be added when running custom code evaluation
llm_app_prompt_template: Optional[str]
variants: Optional[List[str]]
app_name: str
@@ -70,13 +74,21 @@ class EvaluationScenario(BaseModel):
class EvaluationScenarioUpdate(BaseModel):
vote: Optional[str]
score: Optional[str]
correct_answer: Optional[str] # will be used when running custom code evaluation
outputs: List[EvaluationScenarioOutput]
evaluation_prompt_template: Optional[str]
open_ai_key: Optional[str]


class EvaluationScenarioScoreUpdate(BaseModel):
score: float


class NewEvaluation(BaseModel):
evaluation_type: EvaluationType
custom_code_evaluation_id: Optional[
str
] # will be added when running custom code evaluation
evaluation_type_settings: Optional[EvaluationTypeSettings]
app_name: str
variants: List[str]
@@ -90,5 +102,40 @@ class DeleteEvaluation(BaseModel):
evaluations_ids: List[str]


class CreateCustomEvaluation(BaseModel):
evaluation_name: str
python_code: str
app_name: str


class CustomEvaluationOutput(BaseModel):
id: str
app_name: str
evaluation_name: str
created_at: datetime


class CustomEvaluationDetail(BaseModel):
id: str
app_name: str
evaluation_name: str
python_code: str
created_at: datetime
updated_at: datetime


class CustomEvaluationNames(BaseModel):
id: str
evaluation_name: str


class ExecuteCustomEvaluationCode(BaseModel):
inputs: List[Dict[str, Any]]
app_name: str
variant_name: str
correct_answer: str
outputs: List[Dict[str, Any]]


class EvaluationWebhook(BaseModel):
score: float
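
For orientation, here is a client-side sketch of how the two new request models above might be populated. The concrete field values, and the key names inside inputs and outputs, are invented for illustration; the models themselves only constrain those fields to List[Dict[str, Any]].

# Illustrative values only.
create_payload = CreateCustomEvaluation(
    evaluation_name="exact_match_custom",
    python_code="def evaluate(app_params, inputs, output, correct_answer):\n    return 1.0",
    app_name="my_app",
)

execute_payload = ExecuteCustomEvaluationCode(
    inputs=[{"input_name": "country", "input_value": "France"}],
    app_name="my_app",
    variant_name="v1",
    correct_answer="Paris",
    outputs=[{"variant_name": "v1", "variant_output": "Paris"}],
)
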
13 changes: 13 additions & 0 deletions agenta-backend/agenta_backend/models/db_models.py
@@ -86,6 +86,7 @@ class EvaluationScenarioOutput(EmbeddedModel):
class EvaluationDB(Model):
status: str
evaluation_type: str
custom_code_evaluation_id: Optional[str]
evaluation_type_settings: EvaluationTypeSettings
llm_app_prompt_template: str
variants: List[str]
@@ -115,6 +116,18 @@ class Config:
collection = "evaluation_scenarios"


class CustomEvaluationDB(Model):
evaluation_name: str
python_code: str
app_name: str
user: UserDB = Reference()
created_at: Optional[datetime] = Field(default=datetime.utcnow())
updated_at: Optional[datetime] = Field(default=datetime.utcnow())

class Config:
collection = "custom_evaluations"


class TestSetDB(Model):
name: str
app_name: str
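
The new CustomEvaluationDB collection is persisted through the engine that the router below imports from db_manager. The following is a rough sketch of how the create service might use it, assuming an odmantic-style async engine; the real logic lives in evaluation_service.py, which is not part of this excerpt.

# Sketch only; assumes `engine` is an odmantic AIOEngine.
async def create_custom_code_evaluation_sketch(payload, user) -> str:
    custom_eval = CustomEvaluationDB(
        evaluation_name=payload.evaluation_name,
        python_code=payload.python_code,
        app_name=payload.app_name,
        user=user,
    )
    await engine.save(custom_eval)
    return str(custom_eval.id)
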
215 changes: 214 additions & 1 deletion agenta-backend/agenta_backend/routers/evaluation_router.py
@@ -1,22 +1,31 @@
import os
import random
from bson import ObjectId
from datetime import datetime
from typing import List, Optional
import random

from fastapi.responses import JSONResponse
from fastapi import HTTPException, APIRouter, Body, Depends

from agenta_backend.services.helpers import format_inputs, format_outputs
from agenta_backend.models.api.evaluation_model import (
CustomEvaluationNames,
Evaluation,
EvaluationScenario,
CustomEvaluationOutput,
CustomEvaluationDetail,
EvaluationScenarioScoreUpdate,
EvaluationScenarioUpdate,
ExecuteCustomEvaluationCode,
NewEvaluation,
DeleteEvaluation,
EvaluationType,
CreateCustomEvaluation,
EvaluationUpdate,
EvaluationWebhook,
)
from agenta_backend.services.results_service import (
fetch_average_score_for_custom_code_run,
fetch_results_for_human_a_b_testing_evaluation,
fetch_results_for_auto_exact_match_evaluation,
fetch_results_for_auto_similarity_match_evaluation,
@@ -26,10 +35,17 @@
)
from agenta_backend.services.evaluation_service import (
UpdateEvaluationScenarioError,
fetch_custom_evaluation_names,
fetch_custom_evaluations,
fetch_custom_evaluation_detail,
get_evaluation_scenario_score,
update_evaluation_scenario,
update_evaluation_scenario_score,
update_evaluation,
create_new_evaluation,
create_new_evaluation_scenario,
create_custom_code_evaluation,
execute_custom_code_evaluation,
)
from agenta_backend.services.db_manager import engine, query, get_user_object
from agenta_backend.models.db_models import EvaluationDB, EvaluationScenarioDB
@@ -213,6 +229,60 @@ async def update_evaluation_scenario_router(
raise HTTPException(status_code=500, detail=str(e)) from e


@router.get("/evaluation_scenario/{evaluation_scenario_id}/score")
async def get_evaluation_scenario_score_router(
evaluation_scenario_id: str,
stoken_session: SessionContainer = Depends(verify_session()),
):
"""Get the score of an evaluation scenario.
Args:
evaluation_scenario_id (str): the id of the evaluation scenario
stoken_session (SessionContainer, optional): the session token. Defaults to Depends(verify_session()).
Returns:
the score of the evaluation scenario
"""

# Get user and organization id
kwargs: dict = await get_user_and_org_id(stoken_session)
scenario_score = await get_evaluation_scenario_score(
evaluation_scenario_id, **kwargs
)
return scenario_score


@router.put("/evaluation_scenario/{evaluation_scenario_id}/score")
async def update_evaluation_scenario_score_router(
evaluation_scenario_id: str,
payload: EvaluationScenarioScoreUpdate,
stoken_session: SessionContainer = Depends(verify_session()),
):
"""Updates evaluation scenario score
Args:
evaluation_scenario_id (str): the evaluation scenario to update
payload (EvaluationScenarioScoreUpdate): carries the new score value
Raises:
HTTPException: server error if evaluation update went wrong
"""

try:
# Get user and organization id
kwargs: dict = await get_user_and_org_id(stoken_session)
return await update_evaluation_scenario_score(
evaluation_scenario_id, payload.score, **kwargs
)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e)) from e


@router.get("/", response_model=List[Evaluation])
async def fetch_list_evaluations(
app_name: Optional[str] = None,
@@ -238,6 +308,7 @@ async def fetch_list_evaluations(
id=str(evaluation.id),
status=evaluation.status,
evaluation_type=evaluation.evaluation_type,
custom_code_evaluation_id=evaluation.custom_code_evaluation_id,
evaluation_type_settings=evaluation.evaluation_type_settings,
llm_app_prompt_template=evaluation.llm_app_prompt_template,
variants=evaluation.variants,
@@ -275,6 +346,7 @@ async def fetch_evaluation(
id=str(evaluation.id),
status=evaluation.status,
evaluation_type=evaluation.evaluation_type,
custom_code_evaluation_id=evaluation.custom_code_evaluation_id,
evaluation_type_settings=evaluation.evaluation_type_settings,
llm_app_prompt_template=evaluation.llm_app_prompt_template,
variants=evaluation.variants,
@@ -386,6 +458,147 @@ async def fetch_results(
results = await fetch_results_for_auto_ai_critique(evaluation_id)
return {"results_data": results}

elif evaluation.evaluation_type == EvaluationType.custom_code_run:
results = await fetch_average_score_for_custom_code_run(evaluation_id)
return {"avg_score": results}


@router.post("/custom_evaluation/")
async def create_custom_evaluation(
custom_evaluation_payload: CreateCustomEvaluation,
stoken_session: SessionContainer = Depends(verify_session()),
):
"""Create evaluation with custom python code.
Args:
\n custom_evaluation_payload (CreateCustomEvaluation): the required payload
"""

# Get user and organization id
kwargs: dict = await get_user_and_org_id(stoken_session)

# create custom evaluation in database
evaluation_id = await create_custom_code_evaluation(
custom_evaluation_payload, **kwargs
)

return JSONResponse(
{
"status": "success",
"message": "Evaluation created successfully.",
"evaluation_id": evaluation_id,
},
status_code=200,
)


@router.get(
"/custom_evaluation/list/{app_name}",
response_model=List[CustomEvaluationOutput],
)
async def list_custom_evaluations(
app_name: str,
stoken_session: SessionContainer = Depends(verify_session()),
):
"""List the custom code evaluations for a given app.
Args:
app_name (str): the name of the app
Returns:
List[CustomEvaluationOutput]: a list of custom evaluation
"""

# Get user and organization id
kwargs: dict = await get_user_and_org_id(stoken_session)

# Fetch custom evaluations from database
evaluations = await fetch_custom_evaluations(app_name, **kwargs)
return evaluations


@router.get(
"/custom_evaluation/{id}",
response_model=CustomEvaluationDetail,
)
async def get_custom_evaluation(
id: str,
stoken_session: SessionContainer = Depends(verify_session()),
):
"""Get the custom code evaluation detail.
Args:
id (str): the id of the custom evaluation
Returns:
CustomEvaluationDetail: Detail of the custom evaluation
"""

# Get user and organization id
kwargs: dict = await get_user_and_org_id(stoken_session)

# Fetch custom evaluations from database
evaluation = await fetch_custom_evaluation_detail(id, **kwargs)
return evaluation


@router.get(
"/custom_evaluation/{app_name}/names/",
response_model=List[CustomEvaluationNames],
)
async def get_custom_evaluation_names(
app_name: str, stoken_session: SessionContainer = Depends(verify_session())
):
"""Get the names of custom evaluation for a given app.
Args:
app_name (str): the name of the app the evaluation belongs to
Returns:
List[CustomEvaluationNames]: the list of name of custom evaluations
"""
# Get user and organization id
kwargs: dict = await get_user_and_org_id(stoken_session)

custom_eval_names = await fetch_custom_evaluation_names(app_name, **kwargs)
return custom_eval_names


@router.post(
"/custom_evaluation/execute/{evaluation_id}/",
)
async def execute_custom_evaluation(
evaluation_id: str,
payload: ExecuteCustomEvaluationCode,
stoken_session: SessionContainer = Depends(verify_session()),
):
"""Execute a custom evaluation code.
Args:
evaluation_id (str): the custom evaluation id
payload (ExecuteCustomEvaluationCode): the required payload
Returns:
float: the result of the evaluation custom code
"""

# Get user and organization id
kwargs: dict = await get_user_and_org_id(stoken_session)

# Execute custom code evaluation
formatted_inputs = format_inputs(payload.inputs)
formatted_outputs = format_outputs(payload.outputs)
result = await execute_custom_code_evaluation(
evaluation_id,
payload.app_name,
formatted_outputs[payload.variant_name], # gets the output of the app variant
payload.correct_answer,
payload.variant_name,
formatted_inputs,
**kwargs,
)
return result


@router.post("/webhook_example_fake", response_model=EvaluationWebhook)
async def webhook_example_fake():
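
Putting the new routes together, an end-to-end client flow might look like the sketch below. The base URL and router prefix are assumptions, session authentication (verify_session) is omitted, and the inputs/outputs key names mirror the guesses above. Note that execute_custom_evaluation runs the payload through the format_inputs and format_outputs helpers first; judging from formatted_outputs[payload.variant_name], format_outputs appears to map each variant name to its output.

import requests

BASE = "http://localhost/api/evaluations"  # assumed prefix, adjust to your deployment

# 1. Register a custom evaluation for an app.
resp = requests.post(
    f"{BASE}/custom_evaluation/",
    json={
        "evaluation_name": "exact_match_custom",
        "python_code": "def evaluate(app_params, inputs, output, correct_answer):\n    return 1.0",
        "app_name": "my_app",
    },
)
evaluation_id = resp.json()["evaluation_id"]

# 2. Execute it against one scenario of a variant.
resp = requests.post(
    f"{BASE}/custom_evaluation/execute/{evaluation_id}/",
    json={
        "inputs": [{"input_name": "country", "input_value": "France"}],
        "app_name": "my_app",
        "variant_name": "v1",
        "correct_answer": "Paris",
        "outputs": [{"variant_name": "v1", "variant_output": "Paris"}],
    },
)
print(resp.json())  # score returned by the custom evaluation code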