Merge pull request #1187 from kritinv/tutorial-evaluations
Tutorial evaluations
penguine-ip authored Nov 26, 2024
2 parents 40dc200 + ecc9575 commit 789bc1a
Showing 8 changed files with 984 additions and 112 deletions.
84 changes: 84 additions & 0 deletions docs/docs/tutorial-dataset-confident.mdx
@@ -0,0 +1,84 @@
---
id: tutorial-dataset-confident
title: Pushing your Dataset
sidebar_label: Pushing your Dataset
---

### 4. Pushing Dataset

Next, we'll push the dataset to Confident AI so you can review it on the platform.

```python
# Upload the generated goldens to Confident AI for review
dataset.push(alias="Synthetic Test")
```

### 5. Reviewing Dataset

You can easily review synthetically generated datasets on Confident AI. This is especially important for teams that include non-technical members, such as domain experts or human reviewers. To get started, simply navigate to the datasets page on the platform and select the dataset you uploaded.

<div
style={{
display: "flex",
alignItems: "center",
justifyContent: "center",
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_datasets_01.png"
alt="Datasets 1"
style={{
marginTop: "20px",
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
}}
/>
</div>

Confident AI enables project collaborators to edit each golden directly on the platform, including inputs, actual outputs, retrieval context, and more.

<div
style={{
display: "flex",
alignItems: "center",
justifyContent: "center",
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_datasets_02.png"
alt="Datasets 2"
style={{
marginTop: "20px",
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
}}
/>
</div>

You can also leave comments for other team members or push comments directly from your code. Lastly, you have the option to toggle finalization for each golden, streamlining the review process.

<div
style={{
display: "flex",
alignItems: "center",
justifyContent: "center",
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_datasets_03.png"
alt="Datasets 3"
style={{
marginTop: "20px",
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
}}
/>
</div>
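As mentioned above, comments can also be pushed from code rather than added through the UI. The sketch below is a hypothetical illustration: it assumes your installed `deepeval` version exposes a `comments` field on `Golden`, and the input, expected output, and comment text are placeholders.

```python
from deepeval.dataset import EvaluationDataset, Golden

# Hypothetical sketch: attach a review note to a golden before pushing.
# Assumes your deepeval version supports a `comments` field on Golden.
golden = Golden(
    input="I've been feeling dizzy every morning for a week. What could it be?",
    expected_output="A cautious list of possible causes with a recommendation to see a doctor.",
    comments="Domain expert: the expected output must avoid giving a definitive diagnosis.",
)

dataset = EvaluationDataset(goldens=[golden])
dataset.push(alias="Synthetic Test")
```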

Once dataset review is complete, engineers can pull the entire dataset in a single line of code and begin the evaluation process.

```python
dataset.pull(alias="Synthetic Test")
```
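Once pulled, the goldens can be converted into test cases and evaluated. Below is a minimal sketch of that flow; `chatbot.generate()` is a hypothetical stand-in for your own application call, and `AnswerRelevancyMetric` is just an illustrative metric choice.

```python
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset()
dataset.pull(alias="Synthetic Test")

# Convert each golden into a test case by generating an actual output
# with the application under test (`chatbot.generate()` is hypothetical).
test_cases = []
for golden in dataset.goldens:
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            actual_output=chatbot.generate(golden.input),
            expected_output=golden.expected_output,
        )
    )

# Run the evaluation with your chosen metrics.
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])
```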
116 changes: 12 additions & 104 deletions docs/docs/tutorial-dataset-synthesis.mdx
@@ -7,42 +7,30 @@ sidebar_label: Generating Synthetic Data

## Quick Summary

[Manually generating test data is time-consuming and can be non-comprehensive](https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms), so we'll be generating **synthetic data** to test our medical chatbot. This data should mimic your users' interactions with your LLM application, from typical usage behavior to unique test cases.
If you wish to evaluate your LLM application at scale, you can choose to curate your own dataset or synthetically generate an evaluation dataset. [Manually generating test data is time-consuming and often not comprehensive](https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms). Therefore, we'll be generating a **synthetic evaluation dataset** to evaluate our medical chatbot at scale.

DeepEval's synthesizer offers a fast and easy way to generate high-quality goldens (input, expected output, context) for your evaluation datasets in just a few lines of code. We'll be using 2 of its generation methods to produce our test data.
:::tip
Synthetic data should closely **mimic your users' interactions** with your LLM application, from typical use cases to unique edge cases.
:::

DeepEval's synthesizer offers a fast and easy way to generate high-quality goldens for your evaluation datasets in just a few lines of code. We'll be using the following 2 methods to generate our test data.

- **Generating from Documents**
- **Generating from Scratch**


:::info
DeepEval also allows you to [generate directly from contexts](https://docs.confident-ai.com/docs/synthesizer-generate-from-contexts), but since we don't have access to **pre-prepared contexts** and only our knowledge base, we'll be generating from documents instead.
:::

### Synthetic Data Generation Methods

#### From Documents

Generating synthetic data from documents involves **extracting contexts** from your knowledge base and generating synthetic data based on these contexts. This will be especially helpful for testing our RAG engine when it performs medical diagnoses.
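
A minimal sketch of what document-based generation looks like with DeepEval's `Synthesizer` is shown below; the document paths are placeholders, parameter names may differ slightly across versions, and the tutorial's fully configured call appears later on.

```python
from deepeval.dataset import EvaluationDataset
from deepeval.synthesizer import Synthesizer

# Extract contexts from knowledge-base documents and generate goldens from them.
# The document paths below are placeholders for your own knowledge base files.
synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=["knowledge_base/symptoms.pdf", "knowledge_base/treatments.docx"],
    max_goldens_per_context=2,
)

dataset = EvaluationDataset(goldens=synthesizer.synthetic_goldens)
```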

#### From Scratch

We'll generate synthetic data from scratch to test non-RAG parts of our LLM application, like filling in user emails and information, and handling **non-diagnosis-related queries**.
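
A minimal sketch of from-scratch generation is shown below, assuming a `StylingConfig` that describes the non-diagnosis scenario; all configuration values are illustrative and the exact parameters may vary by deepeval version.

```python
from deepeval.synthesizer import Synthesizer
from deepeval.synthesizer.config import StylingConfig

# Describe the non-RAG scenario so generated inputs mimic it; values are illustrative.
styling_config = StylingConfig(
    scenario="Patients updating contact details or asking administrative questions",
    task="Assist with account and appointment requests without giving a diagnosis",
    input_format="Short conversational messages from a patient",
    expected_output_format="Polite, concise replies that collect or confirm the required details",
)

synthesizer = Synthesizer(styling_config=styling_config)
synthesizer.generate_goldens_from_scratch(num_goldens=10)
```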

:::caution Reminder
If you haven't already, log in to Confident AI with your Confident AI key by running the following command in your CLI. We'll be reviewing the dataset on the platform in this tutorial.

```bash
deepeval login
```

**Goldens** in DeepEval are similar to `LLMTestCase`s but do not require an actual output or retrieval context, since these are computed at evaluation time (see the short sketch after this reminder).
:::
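
To make the distinction concrete, here is a small sketch contrasting the two; the field values are placeholders.

```python
from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase

# A golden only needs an input (optionally an expected output and context);
# the actual output and retrieval context are produced at evaluation time.
golden = Golden(
    input="I've had a sore throat and mild fever for three days. What could it be?",
    expected_output="A cautious differential with a recommendation to consult a doctor.",
)

# A test case additionally carries the actual output generated by your application.
test_case = LLMTestCase(
    input=golden.input,
    actual_output="It may be a viral infection, but please consult a doctor to be sure.",
    expected_output=golden.expected_output,
)
```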

## Generating Synthetic Data from Documents

### 1. Defining Style Configuration

DeepEval allows you to **customize the output style and format** of any `input` and/or `expected_output`. You can achieve this by creating an instance of the `Synthesizer` with specific `StylingConfig` settings.
Let's begin by generating synthetic data for a typical use case for our medical chatbot: **patients seeking diagnosis**. We'll first need to define the styling configurations that will allow us to mimic this user behavior.

:::tip
You can optionally **customize the output style and format** of any `input` and/or `expected_output` in your synthetic goldens by configuring a `StylingConfig` object, which is passed into your `Synthesizer`.
:::

```python
from deepeval.synthesizer.config import StylingConfig
@@ -92,86 +80,6 @@ dataset.generate_goldens_from_docs(
)
```


### 6. Exploring Additional Generation Configurations

You may want to explore additional styling or quality configurations for your dataset generation. This allows you to create a dataset that is **diverse and includes edge cases**.
161 changes: 161 additions & 0 deletions docs/docs/tutorial-evaluations-compare-test-runs.mdx
@@ -0,0 +1,161 @@
---
id: tutorial-evaluations-compare-test-runs
title: Comparing Evaluation Results
sidebar_label: Comparing Evaluation Results
---

In this section, you'll learn how to **compare evaluation results** for the same set of test cases, enabling you to identify improvements and regressions in your LLM performance.

:::tip
Detecting regressions is crucial as they reveal areas where your LLM's **performance has unexpectedly declined**.
:::

## Hyperparameters Iteration Recap

In the previous section, we updated our medical chatbot's model, temperature, and prompt template settings and re-ran the evaluation on the same test cases and metrics, which produced the following report:

<div
style={{
display: "flex",
alignItems: "center",
justifyContent: "center",
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_evaluation_11.png"
style={{
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
}}
/>
</div>

We found that while all previously failing test cases now pass, one test case has regressed. The first test case, which previously achieved near-perfect scores, is now **failing Faithfulness and Professionalism**.

While addressing failing test cases is critical, it's equally important to examine improvements. Reviewing the test cases where scores have increased, and understanding why, ensures that these **improvements align with our desired outcomes**.

:::note
Confident AI provides a simple way to **compare evaluation results** for the same test cases. In the next step, we'll explore how to use this feature.
:::

## Comparing Evaluations

To compare two evaluations, navigate to the **Comparing Test Runs** page, the third tab in the left navigation bar. Then, select the test run ID of the evaluation results you want to compare against your new results.

:::info
A **test run on Confident AI** represents a single evaluation of a collection of test cases using a defined set of metrics.
:::

<div
style={{
display: "flex",
alignItems: "center",
justifyContent: "center",
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_evaluation_15.png"
style={{
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
}}
/>
</div>

Once you select the test run to compare with, Confident AI will automatically align the test cases and visually highlight the differences—improvements are marked with green rows, while regressions are shown in red.

<div
style={{
display: "flex",
alignItems: "center",
justifyContent: "center",
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_evaluation_16.png"
style={{
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
}}
/>
</div>

:::info
Confident AI matches test cases based on the `input` of each `LLMTestCase`. If no matching test cases are found, no comparisons will be displayed.
:::

You can analyze each test case further by clicking on it to inspect individual regressing and improving metric scores. For instance, test cases 2, 4, and 5 show significant improvements in previously failing metrics, with their updated outputs aligning with our expectations.

<div
style={{
display: "flex",
alignItems: "center",
justifyContent: "center",
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_evaluation_13.png"
style={{
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
}}
/>
</div>

Let’s take a closer look at the regressing test case:

<div
style={{
display: "flex",
alignItems: "center",
justifyContent: "center",
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_evaluation_14.png"
style={{
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
}}
/>
</div>

Here, we observe that **introducing additional flexibility** into the prompt template may have inadvertently caused some confusion during the generation process. As a result, the chatbot appears uncertain about whether to proceed with diagnosing the patient or to request further details, and ultimately fails to meet the standards for Professionalism and Faithfulness.

:::tip
Increasing the complexity of your prompt template can make it harder for an LLM to process queries effectively. **Upgrading the underlying LLM** is one way to address this challenge.
:::

## Running One Final Evaluation

Let's iterate on our hyperparameters one last time by upgrading the underlying LLM to GPT-4o, then re-compute the outputs and re-run the evaluation.
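
A rough sketch of this final run is shown below; `chatbot` is a hypothetical handle to the medical chatbot now configured with GPT-4o, the input and metric are illustrative, and the `hyperparameters` argument to `evaluate` is assumed to be available in your deepeval version.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# The same user queries used in the previous test runs (illustrative sample).
inputs = ["I have persistent headaches and blurry vision. What should I do?"]

# Re-generate outputs with the upgraded model (`chatbot.generate()` is hypothetical).
test_cases = [
    LLMTestCase(input=user_input, actual_output=chatbot.generate(user_input))
    for user_input in inputs
]

# Logging hyperparameters lets Confident AI associate this test run with the new settings.
evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric()],  # illustrative; reuse your existing metrics
    hyperparameters={"model": "gpt-4o", "temperature": "1"},
)
```

Here are the final results: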

<div
style={{
display: "flex",
alignItems: "center",
justifyContent: "center",
}}
>
<img
src="https://confident-bucket.s3.amazonaws.com/tutorial_evaluation_17.png"
style={{
marginBottom: "20px",
height: "auto",
maxHeight: "800px",
}}
/>
</div>

We've finally managed to pass all test cases! After multiple iterations, our medical chatbot now passes every test case, despite initially failing the majority of them.

However, we've only evaluated 5 test cases so far. To truly evaluate your LLM application at scale, you'll need a larger and more diverse evaluation dataset. Such a dataset should include challenging scenarios and edge cases to rigorously test your model's capabilities. While you could manually curate this dataset, doing so can be both time-intensive and expensive.

:::note
In the next section, we'll dive into how you can **generate synthetic data** using DeepEval to efficiently scale the evaluation of your LLM application.
:::