Merge pull request #1187 from kritinv/tutorial-evaluations
Tutorial evaluations

Showing 8 changed files with 984 additions and 112 deletions.

@@ -0,0 +1,84 @@

---
id: tutorial-dataset-confident
title: Pushing your Dataset
sidebar_label: Pushing your Dataset
---

### 4. Pushing Dataset

Next, we'll push the dataset to Confident AI so you can review it on the platform.

```python
dataset.push(alias="Synthetic Test")
```
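
If you're running this step in a fresh script, a minimal sketch might look like the following. It assumes you're already logged in to Confident AI (e.g. via `deepeval login`), and that `dataset` is normally the `EvaluationDataset` you synthesized in the earlier steps; the single golden below is just a hypothetical stand-in:

```python
from deepeval.dataset import EvaluationDataset, Golden

# Hypothetical stand-in; in this tutorial, `dataset` is the synthesized dataset
dataset = EvaluationDataset(
    goldens=[Golden(input="What medication should I take for a mild fever?")]
)

# Upload the goldens to Confident AI under an alias so your team can review them
dataset.push(alias="Synthetic Test")
```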

### 5. Reviewing Dataset

You can easily review synthetically generated datasets on Confident AI. This is especially important for teams, particularly when non-technical team members, such as domain experts or human reviewers, are involved. To get started, simply navigate to the datasets page on the platform and select the dataset you uploaded.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://confident-bucket.s3.amazonaws.com/tutorial_datasets_01.png"
    alt="Datasets 1"
    style={{
      marginTop: "20px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

Confident AI enables project collaborators to edit each golden directly on the platform, including inputs, actual outputs, retrieval context, and more.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://confident-bucket.s3.amazonaws.com/tutorial_datasets_02.png"
    alt="Datasets 2"
    style={{
      marginTop: "20px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

You can also leave comments for other team members or push comments directly from your code. Lastly, you have the option to toggle finalization for each golden, streamlining the review process.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://confident-bucket.s3.amazonaws.com/tutorial_datasets_03.png"
    alt="Datasets 3"
    style={{
      marginTop: "20px",
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

Once dataset review is complete, engineers can pull the entire dataset with a single line of code and begin the evaluation process.

```python
dataset.pull(alias="Synthetic Test")
```
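
For instance, a fresh evaluation script might start like this (a sketch, assuming the alias used when pushing earlier):

```python
from deepeval.dataset import EvaluationDataset

# Pull the reviewed goldens from Confident AI by alias
dataset = EvaluationDataset()
dataset.pull(alias="Synthetic Test")

print(f"Pulled {len(dataset.goldens)} goldens, ready for evaluation")
```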

@@ -0,0 +1,161 @@

---
id: tutorial-evaluations-compare-test-runs
title: Comparing Evaluation Results
sidebar_label: Comparing Evaluation Results
---

In this section, you'll learn how to **compare evaluation results** for the same set of test cases, enabling you to identify improvements and regressions in your LLM performance.

:::tip
Detecting regressions is crucial as they reveal areas where your LLM's **performance has unexpectedly declined**.
:::

## Hyperparameter Iteration Recap

In the previous section, we updated our medical chatbot's model, temperature, and prompt template settings, and re-ran the evaluation on the same test cases and metrics, which produced the following report:

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://confident-bucket.s3.amazonaws.com/tutorial_evaluation_11.png"
    style={{
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

We found that while all previously failing test cases now pass, one test case has regressed. The first test case, which previously achieved near-perfect scores, is now **failing Faithfulness and Professionalism**.

While addressing failing test cases is critical, it's equally important to evaluate improvements. Examining specific test cases where scores have increased, and understanding the reasons behind these changes, ensures that these **improvements align with our desired outcomes**.

:::note
Confident AI provides a simple way to **compare evaluation results** for the same test cases. In the next step, we'll explore how to use this feature.
:::

## Comparing Evaluations

To compare two evaluations, navigate to the **Comparing Test Runs** page, the third tab in the left navigation bar. Then, select the test run ID of the evaluation results you want to compare with your new results.

:::info
A **test run on Confident AI** represents a single evaluation of a collection of test cases using a defined set of metrics.
:::
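
In code terms, each call to `evaluate()` produces one test run. Below is a minimal, self-contained sketch; the tutorial's actual test cases and custom Faithfulness and Professionalism metrics come from the previous sections, and `AnswerRelevancyMetric` here is only an illustrative stand-in:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# One call to evaluate() = one test run on Confident AI
test_cases = [
    LLMTestCase(
        input="Do I have the flu?",
        actual_output="Your symptoms may indicate the flu; please see a doctor.",
    )
]
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])
```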

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://confident-bucket.s3.amazonaws.com/tutorial_evaluation_15.png"
    style={{
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

Once you select the test run to compare with, Confident AI will automatically align the test cases and visually highlight the differences: improvements are marked with green rows, while regressions are shown in red.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://confident-bucket.s3.amazonaws.com/tutorial_evaluation_16.png"
    style={{
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

:::info
Confident AI matches test cases based on the `input` of each `LLMTestCase`. If no matching test cases are found, no comparisons will be displayed.
:::
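
In practice, this means you should keep the `input` of each test case identical across runs if you want them matched. A hypothetical pair (both outputs are invented for illustration):

```python
from deepeval.test_case import LLMTestCase

# Identical `input` across two test runs is the matching key
previous_run = LLMTestCase(
    input="Do I have the flu?",
    actual_output="You may have the flu. Please see a doctor.",
)
new_run = LLMTestCase(
    input="Do I have the flu?",  # same input, so the two cases are aligned
    actual_output="Your symptoms are consistent with influenza; a clinic visit is advised.",
)
```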

You can analyze each test case further by clicking on it to inspect individual regressing and improving metric scores. For instance, test cases 2, 4, and 5 show significant improvements in previously failing metrics, with their updated outputs aligning with our expectations.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://confident-bucket.s3.amazonaws.com/tutorial_evaluation_13.png"
    style={{
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

Let's take a closer look at the regressing test case:

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://confident-bucket.s3.amazonaws.com/tutorial_evaluation_14.png"
    style={{
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

Here, we observe that **introducing additional flexibility** into the prompt template may have inadvertently caused some confusion during the generation process. As a result, the chatbot appears uncertain about whether to proceed with diagnosing the patient or to request further details, and ultimately fails to meet the standards for Professionalism and Faithfulness.

:::tip
Increasing the complexity of your prompt template can make it harder for an LLM to process queries effectively. **Upgrading the LLM model** is one way to address this challenge.
:::

## Running One Final Evaluation

Let's iterate on our hyperparameters one last time by upgrading the underlying LLM model to GPT-4o. We'll re-compute the outputs and re-run the evaluation; a sketch of this final run is shown below, followed by the final results.
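
In this sketch, `ask_medical_chatbot` is a hypothetical helper standing in for the tutorial's chatbot, and `metrics` refers to the metrics defined in the earlier sections:

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase

# Re-compute each answer with the upgraded model, keeping inputs identical
# so Confident AI can match this test run against the previous ones
inputs = ["Do I have the flu?", "..."]  # same inputs as the previous runs
test_cases = [
    LLMTestCase(input=q, actual_output=ask_medical_chatbot(q, model="gpt-4o"))
    for q in inputs
]
evaluate(test_cases=test_cases, metrics=metrics)
```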

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://confident-bucket.s3.amazonaws.com/tutorial_evaluation_17.png"
    style={{
      marginBottom: "20px",
      height: "auto",
      maxHeight: "800px",
    }}
  />
</div>

After multiple iterations, our medical chatbot has finally passed all the test cases, despite initially failing the majority of them!

However, we've only evaluated 5 test cases so far. To truly evaluate your LLM application at scale, you'll need a larger and more diverse evaluation dataset. Such a dataset should include challenging scenarios and edge cases to rigorously test your model's capabilities. While you could manually curate this dataset, doing so can be both time-intensive and expensive.

:::note
In the next section, we'll dive into how you can **generate synthetic data** using DeepEval to efficiently scale the evaluation of your LLM application.
:::