Commit

Merge pull request #1192 from kritinv/tutorial-dataset
tutorial datasets + other edits
penguine-ip authored Nov 27, 2024
2 parents a57bae9 + a66515d commit 80728d8
Showing 7 changed files with 312 additions and 400 deletions.
10 changes: 5 additions & 5 deletions deepeval/synthesizer/synthesizer.py
@@ -471,11 +471,11 @@ async def _a_generate_from_context(
additional_metadata={
"evolutions": evolutions_used,
"synthetic_input_quality": scores[i],
"context_quality": (
context_scores[i]
if context_scores is not None
else None
),
# "context_quality": (
# context_scores[i]
# if context_scores is not None
# else None
# ),
},
)
goldens.append(golden)
119 changes: 65 additions & 54 deletions docs/docs/tutorial-dataset-confident.mdx
@@ -1,62 +1,78 @@
---
id: tutorial-dataset-confident
title: Pushing your Dataset
sidebar_label: Pushing your Dataset
title: Using Datasets for Evaluation
sidebar_label: Using Datasets for Evaluation
---

### 4. Pushing Dataset
To **start using your datasets for evaluation**, you’ll need to:

Next, we’ll be pushing the dataset to Confident AI so you can review your dataset.
1. **Pull your dataset** from Confident AI.
2. Compute the actual outputs and retrieval contexts, and **convert your goldens into test cases**.
3. Begin running **evaluations**.

In this tutorial, we’ll pull the synthetic dataset generated in the previous section and run evaluations on the dataset against the three metrics we’ve defined: Answer Relevancy, Faithfulness, and Professionalism.

## Pulling Your Dataset

To pull a dataset from Confident AI, simply call the `pull` method on an `EvaluationDataset` and specify the alias of the dataset you wish to retrieve. By default, `auto_convert_goldens_to_test_cases` is set to `True`, but we'll set it to `False` for this tutorial, since `actual_output` is a required parameter of an `LLMTestCase` and we haven't generated the actual outputs yet.

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="Patients Seeking Diagnosis", auto_convert_goldens_to_test_cases=False)
```
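
For contrast, here's a minimal sketch of the default behavior, assuming the goldens on Confident AI already store actual outputs (ours don't yet): with `auto_convert_goldens_to_test_cases` left at its default of `True`, the pulled goldens are converted into test cases immediately, so no manual conversion step is needed.

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()

# Default pull: goldens are auto-converted to LLMTestCases.
# This assumes every golden already has an actual output recorded.
dataset.pull(alias="Patients Seeking Diagnosis")

print(len(dataset.test_cases))  # test cases are populated directly
```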

### 5. Reviewing Dataset
## Converting Goldens to Test Cases

You can easily review synthetically generated datasets on Confident AI. This is especially important for teams, particularly when non-technical team members—such as domain experts or human reviewers—are involved. To get started, simply navigate to the datasets page on the platform and select the dataset you uploaded.
Next, we'll convert the goldens in the dataset we pulled into `LLMTestCase`s and add them to our evaluation dataset. Although our goldens have contexts and expected outputs, we won’t need them for our current set of metrics.

<div
style={{
display: "flex",
alignItems: "center",
justifyContent: "center",
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_datasets_01.png"
alt="Datasets 1"
style={{
marginTop: "20px",
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
}}
/>
</div>
Confident AI enables project collaborators to edit each golden directly on the platform, including inputs, actual outputs, retrieval context, and more.

```python
from deepeval.test_case import LLMTestCase

for golden in dataset.goldens:
    # Compute actual output and retrieval context
    actual_output = "..."  # Replace with logic to compute actual output
    retrieval_context = "..."  # Replace with logic to compute retrieval context

    dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            actual_output=actual_output,
            retrieval_context=retrieval_context
        )
    )
```
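
To make the loop above more concrete, here's a hedged sketch in which a hypothetical `generate_answer` helper stands in for your actual LLM application; the helper, its name, and its return values are illustrative assumptions, not part of this tutorial's code.

```python
from deepeval.test_case import LLMTestCase

def generate_answer(query: str) -> tuple[str, list[str]]:
    # Hypothetical stand-in for your RAG pipeline: return the generated
    # answer and the list of retrieved context chunks for this query.
    retrieved_chunks = ["(retrieved chunk 1)", "(retrieved chunk 2)"]
    answer = f"(generated answer to: {query})"
    return answer, retrieved_chunks

for golden in dataset.goldens:
    answer, retrieved_chunks = generate_answer(golden.input)
    dataset.add_test_case(
        LLMTestCase(
            input=golden.input,
            actual_output=answer,
            retrieval_context=retrieved_chunks,
        )
    )
```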

<div
style={{
display: "flex",
alignItems: "center",
justifyContent: "center",
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_datasets_02.png"
alt="Datasets 2"
style={{
marginTop: "20px",
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
}}
/>
</div>

You can also leave comments for other team members or push comments directly from your code. Lastly, you have the option to toggle finalization for each golden, streamlining the review process.

## Run Evaluations on Your Dataset

Finally, we'll redefine our three metrics and use the `evaluate` function to run evaluations on our synthetic dataset.

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCaseParams
from deepeval import evaluate

# Metrics definition
answer_relevancy_metric = AnswerRelevancyMetric()
faithfulness_metric = FaithfulnessMetric()
professionalism_metric = GEval(
    name="Professionalism",
    criteria=criteria,  # the professionalism criteria defined in the previous section
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

evaluate(
    dataset,
    metrics=[answer_relevancy_metric, faithfulness_metric, professionalism_metric],
    hyperparameters={
        "model": "gpt-4o",
        "prompt template": "You are a...",
        "temperature": 0.8
    }
)
```

Here are the final evaluation results:

<div
style={{
@@ -66,19 +82,14 @@
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_datasets_03.png"
alt="Datasets 3"
src="https://confident-bucket.s3.amazonaws.com/tutorial_datasets_07.png"
alt="Datasets 1"
style={{
marginTop: "20px",
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
marginBottom: "20px"
}}
/>
</div>

Once dataset review is complete, engineers can pull the entire dataset with a single line of code and begin the evaluation process.

```python
dataset.pull(alias="Synthetic Test")
```
You can see that although we passed all 5 test cases previously, it's important to test on a larger dataset, as 4 out of the 15 test cases we generated are still failing. To learn more about iterating on your hyperparameters, you can [revisit this section](tutorial-evaluations-hyperparameters).