Commit

Merge pull request #1192 from kritinv/tutorial-dataset
tutorial datasets + other edits
penguine-ip authored Nov 27, 2024
2 parents a57bae9 + a66515d commit 80728d8
Showing 7 changed files with 312 additions and 400 deletions.
10 changes: 5 additions & 5 deletions deepeval/synthesizer/synthesizer.py
@@ -471,11 +471,11 @@ async def _a_generate_from_context(
additional_metadata={
"evolutions": evolutions_used,
"synthetic_input_quality": scores[i],
"context_quality": (
context_scores[i]
if context_scores is not None
else None
),
# "context_quality": (
# context_scores[i]
# if context_scores is not None
# else None
# ),
},
)
goldens.append(golden)
119 changes: 65 additions & 54 deletions docs/docs/tutorial-dataset-confident.mdx
@@ -1,62 +1,78 @@
---
id: tutorial-dataset-confident
title: Pushing your Dataset
sidebar_label: Pushing your Dataset
title: Using Datasets for Evaluation
sidebar_label: Using Datasets for Evaluation
---

### 4. Pushing Dataset
To **start using your datasets for evaluation**, you’ll need to:

Next, we’ll be pushing the dataset to Confident AI so you can review your dataset.
1. **Pull your dataset** from Confident AI.
2. Compute the actual outputs and retrieval contexts, and **convert your goldens into test cases**.
3. Begin running **evaluations**.

In this tutorial, we’ll pull the synthetic dataset generated in the previous section and run evaluations on the dataset against the three metrics we’ve defined: Answer Relevancy, Faithfulness, and Professionalism.

## Pulling Your Dataset

To pull a dataset from Confident AI, simply call the `pull` method on an `EvaluationDataset` and specify the alias of the dataset you wish to retrieve. By default, `auto_convert_goldens_to_test_cases` is set to `True`, but we'll set it to `False` for this tutorial, since `actual_output` is a required parameter of an `LLMTestCase` and we haven't generated the actual outputs yet.

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="Patients Seeking Diagnosis", auto_convert_goldens_to_test_cases=False)
```
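
For contrast, here's a minimal sketch of the default behavior, assuming the goldens on Confident AI already store actual outputs (ours don't yet): with `auto_convert_goldens_to_test_cases` left at its default of `True`, the pulled goldens are converted into test cases immediately, so no manual conversion step is needed.

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()

# Default pull: goldens are auto-converted to LLMTestCases.
# This assumes every golden already has an actual output recorded.
dataset.pull(alias="Patients Seeking Diagnosis")

print(len(dataset.test_cases))  # test cases are populated directly
```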

### 5. Reviewing Dataset
## Converting Goldens to Test Cases

You can easily review synthetically generated datasets on Confident AI. This is especially important for teams, particularly when non-technical team members—such as domain experts or human reviewers—are involved. To get started, simply navigate to the datasets page on the platform and select the dataset you uploaded.
Next, we'll convert the goldens in the dataset we pulled into `LLMTestCase`s and add them to our evaluation dataset. Although our goldens have contexts and expected outputs, we won’t need them for our current set of metrics.

<div
style={{
display: "flex",
alignItems: "center",
justifyContent: "center",
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_datasets_01.png"
alt="Datasets 1"
style={{
marginTop: "20px",
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
}}
/>
</div>
Confident AI enables project collaborators to edit each golden directly on the platform, including inputs, actual outputs, retrieval context, and more.

```python
from deepeval.test_case import LLMTestCase

for golden in dataset.goldens:
    # Compute actual output and retrieval context
    actual_output = "..."  # Replace with logic to compute actual output
    retrieval_context = "..."  # Replace with logic to compute retrieval context

    dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            actual_output=actual_output,
            retrieval_context=retrieval_context
        )
    )
```
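
To make the loop above more concrete, here's a hedged sketch in which a hypothetical `generate_answer` helper stands in for your actual LLM application; the helper, its name, and its return values are illustrative assumptions, not part of this tutorial's code.

```python
from deepeval.test_case import LLMTestCase

def generate_answer(query: str) -> tuple[str, list[str]]:
    # Hypothetical stand-in for your RAG pipeline: return the generated
    # answer and the list of retrieved context chunks for this query.
    retrieved_chunks = ["(retrieved chunk 1)", "(retrieved chunk 2)"]
    answer = f"(generated answer to: {query})"
    return answer, retrieved_chunks

for golden in dataset.goldens:
    answer, retrieved_chunks = generate_answer(golden.input)
    dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            actual_output=answer,
            retrieval_context=retrieved_chunks,
        )
    )
```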

<div
style={{
display: "flex",
alignItems: "center",
justifyContent: "center",
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_datasets_02.png"
alt="Datasets 2"
style={{
marginTop: "20px",
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
}}
/>
</div>

You can also leave comments for other team members or push comments directly from your code. Lastly, you have the option to toggle finalization for each golden, streamlining the review process.

## Run Evaluations on Your Dataset

Finally, we'll redefine our three metrics and use the `evaluate` function to run evaluations on our synthetic dataset.

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCaseParams
from deepeval import evaluate

# Metrics definition
answer_relevancy_metric = AnswerRelevancyMetric()
faithfulness_metric = FaithfulnessMetric()
professionalism_metric = GEval(
    name="Professionalism",
    criteria=criteria,  # the professionalism criteria defined in the previous section
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

evaluate(
    dataset,
    metrics=[answer_relevancy_metric, faithfulness_metric, professionalism_metric],
    hyperparameters={
        "model": "gpt-4o",
        "prompt template": "You are a...",
        "temperature": 0.8
    }
)
```

Here are the final evaluation results:

<div
style={{
@@ -66,19 +82,14 @@
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_datasets_03.png"
alt="Datasets 3"
src="https://confident-bucket.s3.amazonaws.com/tutorial_datasets_07.png"
alt="Datasets 1"
style={{
marginTop: "20px",
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
marginBottom: "20px"
}}
/>
</div>

Once dataset review is complete, engineers can pull the entire dataset with a single line of code and begin the evaluation process.

```python
dataset.pull(alias="Synthetic Test")
```
You can see that although we passed all 5 test cases previously, it's important to test on a larger dataset, as 4 out of the 15 test cases we generated are still failing. To learn more about iterating on your hyperparameters, you can [revisit this section](tutorial-evaluations-hyperparameters).