Add configurable evaluators and custom evaluator documentation
mmabrouk committed Feb 15, 2024
1 parent f8ce686 commit 5282e30
Showing 5 changed files with 85 additions and 28 deletions.
15 changes: 13 additions & 2 deletions docs/basic_guides/automatic_evaluation.mdx
@@ -7,11 +7,22 @@ The key to building production-ready LLM applications is to have a tight feedback

## Configuring Evaluators

Agenta comes with a set of built-in evaluators that can be configured. We are continuously adding more evaluators over time. By default, each project includes the following evaluators:
Agenta comes with a set of built-in evaluators that can be configured.

By default, each project includes the following evaluators (which do not require configuration):
- Exact match: This evaluator checks if the generated answer is an exact match to the expected answer. The aggregated result is the percentage of correct answers.

The following configurable evaluators are available and need to be added and configured before use. To add an evaluator, go to the Evaluators tab and click on the "Add Evaluator" button. A modal will appear where you can select the evaluator you want to add and configure it.
Additionally, the following configurable evaluators are available but need to be explicitly configured and added before use.

To add an evaluator, go to the Evaluators tab and click on the "Add Evaluator" button. A modal will appear where you can select the evaluator you want to add and configure it.

<img height="600" className="dark:hidden" src="/images/basic_guides/15_accessing_evaluator_page_light.png" />
<img height="600" className="hidden dark:block" src="/images/basic_guides/15_accessing_evaluator_page_dark.png" />

<img height="600" className="dark:hidden" src="/images/basic_guides/16_new_evaluator_modal_light.png" />
<img height="600" className="hidden dark:block" src="/images/basic_guides/16_new_evaluator_modal_dark.png" />

**Configurable evaluators**
- Regex match: This evaluator checks whether the generated answer matches a regular expression pattern. You need to provide the regex pattern and specify whether an answer is considered correct when it matches or when it does not match the regex.
- Webhook evaluator: This evaluator sends the generated answer and the correct_answer to a webhook and expects a response indicating the correctness of the answer. You need to provide the URL of the webhook.
- Similarity Match evaluator: This evaluator checks whether the generated answer is similar to the expected answer. You need to provide the similarity threshold. It uses the Jaccard similarity to compare the answers (a sketch of the underlying idea is shown below).
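
The Jaccard similarity used by the Similarity Match evaluator compares the sets of words in the two answers. Below is a minimal sketch of the idea in Python; the whitespace tokenization, lowercasing, and the example threshold are illustrative assumptions, not Agenta's exact implementation.

```python
def jaccard_similarity(output: str, correct_answer: str) -> float:
    # Tokenize both answers by whitespace and compare the resulting word sets.
    output_tokens = set(output.lower().split())
    expected_tokens = set(correct_answer.lower().split())
    if not output_tokens and not expected_tokens:
        return 1.0  # treat two empty answers as identical
    intersection = output_tokens & expected_tokens
    union = output_tokens | expected_tokens
    return len(intersection) / len(union)

# The evaluator marks an answer as correct when the similarity reaches the
# configured threshold; the threshold value here is only an example.
is_correct = jaccard_similarity("the cat sat", "the cat sat down") >= 0.5
```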
43 changes: 41 additions & 2 deletions docs/basic_guides/custom_evaluator.mdx
@@ -1,8 +1,13 @@
---
title: 'Custom Evaluator'
description: 'Learn how to create your custom evaluator with Agenta'
title: 'Writing Custom Evaluators'
description: 'Write the code for a custom evaluator on Agenta'
---

Sometimes, the default evaluators on Agenta may not be sufficient for your use case. In such cases, you can create a custom evaluator tailored to your needs. Custom evaluators are written in Python.

For the moment, there are limitations on the code that can be written in a custom evaluator. Our backend uses RestrictedPython to execute the code, which limits the libraries that can be used.

## Accessing the Evaluator Page
To create a custom evaluator on Agenta, simply click on the Evaluations button in the sidebar menu, and then select the "Evaluators" tab within the Evaluations page.
<img height="600" className="dark:hidden" src="/images/basic_guides/15_accessing_evaluator_page_light.png" />
@@ -16,3 +21,37 @@ On the Evaluators tab, click on the "New Evaluator" button at the top right corner
<img height="600" className="hidden dark:block" src="/images/basic_guides/16_new_evaluator_modal_dark.png" />
Click on the "Create" button within the modal to confirm and complete the creation of your custom evaluator.

## Evaluation code

Your code should include one function called `evaluate` with the following signature:
```python
from typing import Dict

def evaluate(
    app_params: Dict[str, str],
    inputs: Dict[str, str],
    output: str,
    correct_answer: str
) -> float:
    ...
```

The function should return a float value, which is the score of the evaluation. The score should be between 0 and 1, where 0 means the evaluation failed and 1 means it passed.

The parameters are as follows:
1. <b>app_params: </b> A dictionary containing the configuration of the app. This includes the prompt, the model, and all the other parameters specified in the playground, using the same names.
2. <b>inputs: </b> A dictionary containing the inputs of the app.
3. <b>output: </b> The generated output of the app.
4. <b>correct_answer: </b> The expected correct answer.

For instance, exact match would be implemented as follows:
```python
from typing import Dict

def evaluate(
    app_params: Dict[str, str],
    inputs: Dict[str, str],
    output: str,
    correct_answer: str
) -> float:
    return 1 if output == correct_answer else 0
```
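
As a further illustration, here is a hypothetical custom evaluator that checks whether the expected answer appears anywhere in the generated output, ignoring case and surrounding whitespace. It is a sketch, not a built-in evaluator, and it relies only on built-in string methods, which should keep it within the RestrictedPython limitations mentioned above.

```python
from typing import Dict

def evaluate(
    app_params: Dict[str, str],
    inputs: Dict[str, str],
    output: str,
    correct_answer: str
) -> float:
    # Normalize both strings: strip surrounding whitespace and lowercase them.
    normalized_output = output.strip().lower()
    normalized_answer = correct_answer.strip().lower()
    # Score 1.0 when the expected answer is contained in the output, 0.0 otherwise.
    return 1.0 if normalized_answer in normalized_output else 0.0
```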
55 changes: 31 additions & 24 deletions docs/basic_guides/human_evaluation.mdx
@@ -2,40 +2,47 @@
title: 'Human Evaluation'
---

- <a href="/basic_guides/human_evaluation#a-b-test">A/B Test</a>
- <a href="/basic_guides/human_evaluation#single-model-test">Single Model Test</a>
Sometimes, you may need to evaluate the performance of your models using human judgment. This is where the Human Evaluation feature comes in: it allows you to conduct A/B tests and single model tests.

## A/B Test
To start a new evaluation with A/B Test, you need to:
- Select the two variants you would like to evaluate
## Single Model Evaluation
A single model test allows you to manually score the performance of a single LLM app.

To start a new evaluation with the single model test, you need to:
- Select the variant you would like to evaluate
- Select a testset you want to use for the evaluation

### Invite Collaborators
In an A/B Test, you can invite members of your workspace to collaborate on the evaluation by sharing a link to the evaluation.
To start a new evaluation with the single model test, follow these steps:

<Note>Please refer [here](/basic_guides/team_management) for instructions on how to add members to your workspace.</Note>
1. Select the variant you would like to evaluate.
2. Choose the testset you want to use for the evaluation.

Click on the "Start a new evaluation" button to commence the evaluation process.
Once the evaluation is initiated, you will be directed to the A/B Test Evaluation view. Here, you can perform the following actions:
Click on the "Start a new evaluation" button to begin the evaluation process. Once the evaluation is initiated, you will be directed to the Single Model Test Evaluation view. Here, you can perform the following actions:

- <b>Scoring between variants: </b> Evaluate and score the performance of each variant for the expected output.
- <b>Score: </b> Enter a numerical score to evaluate the performance of the chosen variant.
- <b>Additional Notes: </b> Add any relevant notes or comments to provide context or details about the evaluation.
- <b>Export Results: </b> Utilize the "Export Results" functionality to save and export the results of the evaluation.
- <b>Export Results: </b> Use the "Export Results" functionality to save and export the evaluation results.

<img height="600" className="dark:hidden" src="/images/basic_guides/21_ab_test_view_light.png" />
<img height="600" className="hidden dark:block" src="/images/basic_guides/21_ab_test_view_dark.png" />
<img height="600" className="dark:hidden" src="/images/basic_guides/22_single_model_test_view_light.png" />
<img height="600" className="hidden dark:block" src="/images/basic_guides/22_single_model_test_view_dark.png" />

## Single Model Test
To start a new evaluation with the single model test, you need to:
- Select the variant you would like to evaluate
- Select a testset you want to use for the evaluation
## A/B Test
A/B tests allow you to manually compare the performance of two different variants. For each data point, you can select which variant is better, or whether they are equally good or bad.

Click on the "Start a new evaluation" button to commence the evaluation process.
Once the evaluation is initiated, you will be directed to the Single Model Test Evaluation view. Here, you can perform the following actions:
To start a new evaluation with an A/B Test, follow these steps:

- <b>Score: </b> Enter a numerical score to evaluate the performance of the chosen variant.
1. Select two variants that you would like to evaluate.
2. Choose the testset you want to use for the evaluation.

### Invite Collaborators

In an A/B Test, you can invite members of your workspace to collaborate on the evaluation by sharing a link to the evaluation. For information on how to add members to your workspace, please refer to [this guide](/basic_guides/team_management).

Click on the "Start a new evaluation" button to begin the evaluation process. Once the evaluation is initiated, you will be directed to the A/B Test Evaluation view. Here, you can perform the following actions:

- <b>Scoring between variants: </b> Evaluate and score the performance of each variant for the expected output.
- <b>Additional Notes: </b> Add any relevant notes or comments to provide context or details about the evaluation.
- <b>Export Results: </b> Utilize the "Export Results" functionality to save and export the results of the evaluation.
- <b>Export Results: </b> Use the "Export Results" functionality to save and export the evaluation results.

<img height="600" className="dark:hidden" src="/images/basic_guides/21_ab_test_view_light.png" />
<img height="600" className="hidden dark:block" src="/images/basic_guides/21_ab_test_view_dark.png" />

<img height="600" className="dark:hidden" src="/images/basic_guides/22_single_model_test_view_light.png" />
<img height="600" className="hidden dark:block" src="/images/basic_guides/22_single_model_test_view_dark.png" />
Binary file modified docs/images/basic_guides/15_accessing_evaluator_page_dark.png
Binary file modified docs/images/basic_guides/15_accessing_evaluator_page_light.png
