# proposal: rag evaluation results presentation #7462
Merged on Apr 8, 2024.
File: `proposals/text/7462-rag-evaluation.md`

- Title: Proposal for presentation of evaluation results
- Decision driver: David S. Batista
- Start Date: 2024-04-03
- Proposal PR: #7462

# Detailed design

The output results of an evaluation pipeline composed of `evaluator` components are passed to an `EvaluationResults` class
(this is a placeholder name), which stores them internally and acts as an interface to access and present the results.

The examples below are just for illustrative purposes and are subject to change.

Example of the data structure that the `EvaluationResults` class will receive for initialization:

```python

data = {
"queries": {
"inputs": {
"query_id": ["53c3b3e6", "225f87f7"],
"question": ["What is the capital of France?", "What is the capital of Spain?"],
"contexts": ["wiki_France", "wiki_Spain"],
"answer": ["Paris", "Madrid"]
"answer": ["Paris", "Madrid"],
"predicted_answer": ["Paris", "Madrid"]
},
"metrics":
[
        # ... (entries for the remaining metrics, e.g. reciprocal_rank, single_hit, multi_hit, context_relevance) ...
{"name": "faithfulness", "scores": [0.135581, 0.695974, 0.749861, 0.041999]},
{"name": "semantic_answer_similarity", "scores": [0.971241, 0.159320, 0.019722, 1]}
],
"pipeline_answers": ["Paris", "Madrid"]
},

```

- At least the `query_id` or the `question` and `contexts` should be present in the data structure.
- At least one of the metrics should be present in the data structure.
- The `predicted_answer` field is optional; it is used to compare the answers generated by the pipeline with the expected answers (see the usage sketch below).
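
For illustration, a minimal usage sketch of the placeholder class with the data structure above; the constructor signature and the method call are assumptions of this proposal, not a final API:

```python
# Hypothetical usage of the placeholder `EvaluationResults` class described in
# this proposal; the constructor signature is an assumption and may change.
results = EvaluationResults(data)

# Aggregate scores across all queries (see the report methods below).
results.individual_aggregate_score_report()
```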


The `EvaluationResults` class provides the following methods to different types of users:

Basic users:
- `individual_aggregate_score_report()`
- `comparative_aggregate_score_report()`

Intermediate users:
- `individual_detailed_score_report()`
- `comparative_detailed_score_report()`

Advanced users:
- `find_thresholds()`
- `find_scores_below_threshold()`
- `find_inputs_below_threshold()`


### Methods description
An evaluation report that provides a summary of the performance of the model across all queries, showing the
aggregated scores for all available metrics.

```python
def individual_aggregate_score_report():
```

Example output
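
The values below are illustrative; they reuse the aggregate scores shown in the user stories at the end of this proposal:

```bash
{'Reciprocal Rank': 0.448,
 'Single Hit': 0.5,
 'Multi Hit': 0.540,
 'Context Relevance': 0.537,
 'Faithfulness': 0.452,
 'Semantic Answer Similarity': 0.478
}
```
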
A detailed evaluation report that provides the scores of all available metrics for all queries or a subset of queries.

```python
def individual_detailed_score_report(queries: Union[List[str], str] = "all"):
```

Example output

```bash
| query_id | reciprocal_rank | single_hit | multi_hit | context_relevance | faithfulness | semantic_answer_similarity |
|----------|-----------------|------------|-----------|-------------------|-------------|----------------------------|
| 53c3b3e6 | 0.378064 | 1 | 0.706125 | 0.805466 | 0.135581 | 0.971241 |
| 225f87f7 | 0.534964 | 1 | 0.454976 | 0.410251 | 0.695974 | 0.159320 |
| 8ac473ec | 0.216058 | 0 | 0.445512 | 0.750070 | 0.749861 | 0.019722 |
| 97d284ca | 0.778642 | 1 | 0.250522 | 0.361332 | 0.041999 | 1 |
```
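
For example, a user could restrict the detailed report to a subset of queries; the calls below are a hypothetical sketch, assuming `results` is an `EvaluationResults` instance:

```python
# Illustrative call: restrict the detailed report to two specific queries.
results.individual_detailed_score_report(queries=["53c3b3e6", "225f87f7"])

# Or report on all queries (the default).
results.individual_detailed_score_report()
```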

### Comparative Evaluation Report
A comparative summary that compares the performance of the model with another model, showing the aggregated scores
for all available metrics.

```python
def comparative_aggregate_score_report(self, other: "EvaluationResults"):
```
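
Example output, as an illustrative sketch only: the exact layout is not final, the `pipeline_1`/`pipeline_2` keys are assumptions, and the values simply reuse the aggregate scores from above for both pipelines:

```bash
{'pipeline_1': {'Reciprocal Rank': 0.448, 'Single Hit': 0.5, 'Multi Hit': 0.540,
                'Context Relevance': 0.537, 'Faithfulness': 0.452,
                'Semantic Answer Similarity': 0.478},
 'pipeline_2': {'Reciprocal Rank': 0.448, 'Single Hit': 0.5, 'Multi Hit': 0.540,
                'Context Relevance': 0.537, 'Faithfulness': 0.452,
                'Semantic Answer Similarity': 0.478}}
```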

A detailed comparative summary that compares the performance of the model with another model, showing the scores of all
available metrics for all queries.


```python
def comparative_detailed_score_report(self, other: "EvaluationResults"):
```

```bash
| query_id | reciprocal_rank_model_1 | single_hit_model_1 | multi_hit_model_1 | context_relevance_model_1 | faithfulness_model_1 | semantic_answer_similarity_model_1 | reciprocal_rank_model_2 | single_hit_model_2 | multi_hit_model_2 | context_relevance_model_2 | faithfulness_model_2 | semantic_answer_similarity_model_2 |
|----------|-------------------------|--------------------|-------------------|---------------------------|----------------------|------------------------------------|-------------------------|--------------------|-------------------|---------------------------|----------------------|------------------------------------|
| 53c3b3e6 | 0.378064 | 1 | 0.706125 | 0.805466 | 0.135581 | 0.971241 | 0.378064 | 1 | 0.706125 | 0.805466 | 0.135581 | 0.971241 |
| 225f87f7 | 0.534964 | 1 | 0.454976 | 0.410251 | 0.695974 | 0.159320 | 0.534964 | 1 | 0.454976 | 0.410251 | 0.695974 | 0.159320 |
| 8ac473ec | 0.216058 | 0 | 0.445512 | 0.750070 | 0.749861 | 0.019722 | 0.216058 | 0 | 0.445512 | 0.750070 | 0.749861 | 0.019722 |
| 97d284ca | 0.778642 | 1 | 0.250522 | 0.361332 | 0.041999 | 1 | 0.778642 | 1 | 0.250522 | 0.361332 | 0.041999 | 1 |
```
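
A possible invocation, assuming two separately evaluated pipelines; the variable names are illustrative:

```python
# Illustrative call: compare this pipeline's results against another run.
pipeline_1_results.comparative_detailed_score_report(pipeline_2_results)
```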


Have a method to find interesting score thresholds for all available metrics, typically used for error analysis.
Some potentially interesting thresholds to find are: the 25th percentile, the 75th percentile, the median, and the average.

```python
def find_thresholds(self, metrics: List[str]) -> Dict[str, float]:
```

```bash
data = {
"thresholds": ["25th percentile", "75th percentile", "mean", "average"],
"thresholds": ["25th percentile", "75th percentile", "median", "average"],
"reciprocal_rank": [0.378064, 0.534964, 0.216058, 0.778642],
"context_relevance": [0.805466, 0.410251, 0.750070, 0.361332],
"faithfulness": [0.135581, 0.695974, 0.749861, 0.041999],
}
```
Another method can then be used to retrieve all the queries whose score falls below a given threshold for a given metric:

```python
def find_inputs_below_threshold(self, metric: str, threshold: float):
    """Get all the queries with a score below a certain threshold for a given metric."""
```
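
For concreteness, a minimal sketch of how these two advanced-user methods could be implemented internally, assuming the scores are kept in a pandas DataFrame (see the Drawbacks section below); none of this is final API, and the return shape of `find_thresholds` mirrors the example output above rather than the signature's `Dict[str, float]`:

```python
# Illustrative sketch only: one possible internal implementation of the two
# advanced-user methods, assuming scores are held in a pandas DataFrame and
# that there is one score per query, aligned by position with the inputs.
from typing import Dict, List

import pandas as pd


class EvaluationResults:  # placeholder name, as above
    def __init__(self, data: dict):
        self.inputs = pd.DataFrame(data["inputs"])
        # one column per metric, one row per query
        self.scores = pd.DataFrame({m["name"]: m["scores"] for m in data["metrics"]})

    def find_thresholds(self, metrics: List[str]) -> Dict[str, List]:
        # one entry per threshold type, mirroring the example output above
        thresholds: Dict[str, List] = {
            "thresholds": ["25th percentile", "75th percentile", "median", "average"]
        }
        for metric in metrics:
            values = self.scores[metric]
            thresholds[metric] = [
                values.quantile(0.25),
                values.quantile(0.75),
                values.median(),
                values.mean(),
            ]
        return thresholds

    def find_inputs_below_threshold(self, metric: str, threshold: float) -> List[str]:
        """Get all the queries with a score below a certain threshold for a given metric."""
        below = self.scores[metric] < threshold
        return self.inputs.loc[below, "query_id"].tolist()
```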


# Drawbacks

- Relying on a pandas DataFrame internally makes it easy to perform many of these operations.
- Nevertheless, it can be a burden, since it makes `pandas` a dependency of `haystack-ai`.
- Ideally, all the proposed methods should be implemented in a way that doesn't require `pandas`.
- Having the output in a table format may not be flexible enough, and it may be too verbose for datasets with a large number of queries.
- Maybe the option to export to a .csv file would be better than having the output in a table format.
- Maybe a JSON format would be better, giving advanced users the option to do further analysis and visualization.


# Adoption strategy
- A tutorial would be the best approach to teach users how to use this feature.
- Adding a new entry to the documentation.

# User stories

### 1. I would like to get a single summary score for my RAG pipeline so I can compare several pipeline configurations.

Run `individual_aggregate_score_report()` and get the following output:

```bash
{'Reciprocal Rank': 0.448,
'Single Hit': 0.5,
'Multi Hit': 0.540,
'Context Relevance': 0.537,
'Faithfulness': 0.452,
'Semantic Answer Similarity': 0.478
}
```

### 2. I am not sure what evaluation metrics work best for my RAG pipeline, especially when using the more novel LLM-based metrics.

Use `context_relevance` or `faithfulness`.

### 3. My RAG pipeline has a low aggregate score, so I would like to see examples of specific inputs where the score was low to be able to diagnose what the issue could be.

Let's say the low score is in `reciprocal_rank` and one already has an idea of what counts as "low" for a query/question; then simply run:

`find_inputs_below_threshold("reciprocal_rank", <threshold>)`

If one is not yet sure what counts as a low `reciprocal_rank` score, one can first get the thresholds for this metric using:

`find_thresholds(["reciprocal_rank"])`

This will give:

- 25th percentile (Q1): the value below which 25% of the data falls.
- median (Q2): the value below which 50% of the data falls.
- 75th percentile (Q3): the value below which 75% of the data falls.

This can help to decide what is considered a low score; one can then get, for instance, the queries with a score below
the Q2 (median) threshold using `find_inputs_below_threshold("reciprocal_rank", threshold)`.
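
Putting user story 3 together, a hypothetical end-to-end snippet; `results` is an `EvaluationResults` instance built as sketched earlier, and the threshold ordering is the illustrative one used above:

```python
# Hypothetical workflow for user story 3.
thresholds = results.find_thresholds(["reciprocal_rank"])

# Pick the median (Q2) as the working definition of a "low" score;
# index 2 corresponds to "median" in the illustrative ordering above.
q2 = thresholds["reciprocal_rank"][2]

low_scoring_queries = results.find_inputs_below_threshold("reciprocal_rank", q2)
print(low_scoring_queries)
```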