---
id: metrics-json-correctness
title: Json Correctness
sidebar_label: Json Correctness
---

import Equation from "@site/src/components/equation";

The json correctness metric measures whether your LLM application is able to generate `actual_output`s with the correct **json schema**.

:::note

The `JsonCorrectnessMetric`, like the `ToolCorrectnessMetric`, is not an LLM-eval, and you'll have to supply your expected Json schema when creating a `JsonCorrectnessMetric`.

:::

## Required Arguments

To use the `JsonCorrectnessMetric`, you'll have to provide the following arguments when creating an `LLMTestCase`:

- `input`
- `actual_output`

## Example

```python
from pydantic import BaseModel

from deepeval import evaluate
from deepeval.metrics import JsonCorrectnessMetric
from deepeval.test_case import LLMTestCase

class ExampleSchema(BaseModel):
    name: str

metric = JsonCorrectnessMetric(
    expected_schema=ExampleSchema,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="Output me a random Json with the 'name' key",
    # Replace this with the actual output from your LLM application
    actual_output="{'name': null}"
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
```
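
The `evaluate` import above isn't used in this snippet. If you'd rather run the metric as part of a bulk evaluation than call `measure()` directly, a minimal sketch looks like the following (this assumes `evaluate` accepts `test_cases` and `metrics` lists, so double-check against your `deepeval` version):

```python
# Sketch only: run the same metric over a list of test cases in one call.
evaluate(test_cases=[test_case], metrics=[metric])
```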

There are one mandatory and six optional parameters when creating a `JsonCorrectnessMetric`:

- `expected_schema`: a `pydantic` `BaseModel` specifying the schema of the Json that is expected from your LLM.
- [Optional] `threshold`: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use to generate reasons, **OR** [any custom LLM model](metrics-introduction#using-a-custom-llm) of type `DeepEvalBaseLLM`. Defaulted to 'gpt-4o'.
- [Optional] `include_reason`: a boolean which when set to `True`, will include a reason for its evaluation score. Defaulted to `True`.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables [concurrent execution within the `measure()` method.](metrics-introduction#measuring-a-metric-in-async) Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the [How Is It Calculated](#how-is-it-calculated) section. Defaulted to `False`.

:::info
Unlike other metrics, the `model` is used for generating a reason instead of for evaluation. It will only be used if the `actual_output` has the wrong schema, **AND** if `include_reason` is set to `True`.
:::

## How Is It Calculated?

The `JsonCorrectnessMetric` score is calculated according to the following equation:

<Equation
  formula="\text{Json Correctness} = \begin{cases}
  1 & \text{If the actual output fits the expected schema}, \\
  0 & \text{Otherwise}
  \end{cases}"
/>

The `JsonCorrectnessMetric` does not use an LLM for evaluation and instead uses the provided `expected_schema` to determine whether the `actual_output` can be loaded into the schema.
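
Conceptually, the check amounts to parsing the `actual_output` as JSON and validating it against the `pydantic` model. The sketch below illustrates that idea using the `ExampleSchema` from earlier; it is not `deepeval`'s internal implementation, and the `fits_schema` helper is hypothetical:

```python
import json

from pydantic import BaseModel, ValidationError

class ExampleSchema(BaseModel):
    name: str

def fits_schema(actual_output: str, schema: type[BaseModel]) -> bool:
    """Return True if the output parses as JSON and validates against the schema."""
    try:
        data = json.loads(actual_output)
        schema.model_validate(data)  # pydantic v2; use schema.parse_obj(data) on v1
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

print(fits_schema('{"name": "Alice"}', ExampleSchema))  # True
print(fits_schema("{'name': null}", ExampleSchema))     # False: single quotes aren't valid JSON
```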

---
id: metrics-prompt-alignment
title: Prompt Alignment
sidebar_label: Prompt Alignment
---

import Equation from "@site/src/components/equation";

The prompt alignment metric measures whether your LLM application is able to generate `actual_output`s that align with any **instructions** specified in your prompt template. `deepeval`'s prompt alignment metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

## Required Arguments

To use the `PromptAlignmentMetric`, you'll have to provide the following arguments when creating an `LLMTestCase`:

- `input`
- `actual_output`

## Example

```python
from deepeval import evaluate
from deepeval.metrics import PromptAlignmentMetric
from deepeval.test_case import LLMTestCase

metric = PromptAlignmentMetric(
    prompt_instructions=["Reply in all uppercase"],
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra cost."
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
```

There are one mandatory and six optional parameters when creating a `PromptAlignmentMetric` (a configuration sketch follows this list):

- `prompt_instructions`: a list of strings specifying the instructions you want followed in your prompt template.
- [Optional] `threshold`: a float representing the minimum passing threshold, defaulted to 0.5.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, **OR** [any custom LLM model](metrics-introduction#using-a-custom-llm) of type `DeepEvalBaseLLM`. Defaulted to 'gpt-4o'.
- [Optional] `include_reason`: a boolean which when set to `True`, will include a reason for its evaluation score. Defaulted to `True`.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables [concurrent execution within the `measure()` method.](metrics-introduction#measuring-a-metric-in-async) Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the [How Is It Calculated](#how-is-it-calculated) section. Defaulted to `False`.
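
As a concrete illustration of the parameters above, a stricter configuration might look like the sketch below (the instruction strings and values are illustrative only, not recommendations):

```python
# Sketch: a PromptAlignmentMetric configured with the optional parameters listed above.
strict_metric = PromptAlignmentMetric(
    prompt_instructions=[
        "Reply in all uppercase",
        "Keep the reply under two sentences",
    ],
    model="gpt-4o",
    include_reason=True,
    strict_mode=True,    # binary score: 1 for perfection, 0 otherwise (threshold forced to 1)
    verbose_mode=True,   # print the intermediate steps to the console
)
```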

## How Is It Calculated?

The `PromptAlignmentMetric` score is calculated according to the following equation:

<Equation formula="\text{Prompt Alignment} = \frac{\text{Number of Instructions Followed}}{\text{Total Number of Instructions}}" />

The `PromptAlignmentMetric` uses an LLM to classify whether each prompt instruction is followed in the `actual_output`, using additional context from the `input`.
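
In other words, the LLM only produces a per-instruction verdict; the final score is simple arithmetic over those verdicts. Below is a minimal sketch of the scoring step, assuming the verdicts have already been collected as booleans (the empty-list behaviour is an assumption, not documented behaviour):

```python
def prompt_alignment_score(verdicts: list[bool]) -> float:
    """Number of instructions followed divided by total number of instructions."""
    if not verdicts:
        return 1.0  # assumption: no instructions means nothing to violate
    return sum(verdicts) / len(verdicts)

# Example: 2 of 3 instructions followed -> ~0.67, which passes the default 0.5 threshold
print(round(prompt_alignment_score([True, True, False]), 2))
```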

:::tip

By providing an initial list of `prompt_instructions` instead of the entire prompt template, the `PromptAlignmentMetric` is able to more accurately determine whether the core instructions laid out in your prompt template are followed.

:::