# proposal: rag evaluation results presentation #7462
Merged on Apr 8, 2024.
File: `proposals/text/7462-rag-evaluation.md`

- Title: Proposal for presentation of evaluation results
- Decision driver: David S. Batista
- Start Date: 2024-04-03
- Proposal PR: #7462

# Detailed design

The output results of an evaluation pipeline composed of `evaluator` components are passed to an `EvaluationResults` class
(this is a placeholder name), which stores them internally and acts as an interface to access and present the results.

The examples below are just for illustrative purposes and are subject to change.

Example of the data structure that the `EvaluationResults` class will receive for initialization:

```python

data = {
"queries": {
"inputs": {
"query_id": ["53c3b3e6", "225f87f7"],
"question": ["What is the capital of France?", "What is the capital of Spain?"],
"contexts": ["wiki_France", "wiki_Spain"],
"answer": ["Paris", "Madrid"]
"answer": ["Paris", "Madrid"],
"predicted_answer": ["Paris", "Madrid"]
},
"metrics":
[
        # ... (entries for the remaining metrics, e.g. reciprocal_rank, single_hit, multi_hit, context_relevance) ...
{"name": "faithfulness", "scores": [0.135581, 0.695974, 0.749861, 0.041999]},
{"name": "semantic_answer_similarity", "scores": [0.971241, 0.159320, 0.019722, 1]}
],
"pipeline_answers": ["Paris", "Madrid"]
},

```

- At least the `query_id` or the `question` and `contexts` should be present in the data structure.
- At least one of the metrics should be present in the data structure.
- The `predicted_answer` field is optional; it is used to compare the answers generated by the pipeline with the expected answers (see the usage sketch below).
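
For illustration, a minimal usage sketch of the placeholder class with the data structure above; the constructor signature and the method call are assumptions of this proposal, not a final API:

```python
# Hypothetical usage of the placeholder `EvaluationResults` class described in
# this proposal; the constructor signature is an assumption and may change.
results = EvaluationResults(data)

# Aggregate scores across all queries (see the report methods below).
results.individual_aggregate_score_report()
```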


The `EvaluationResults` class provides the following methods to different types of users:

Basic users:
- `individual_aggregate_score_report()`
- `comparative_aggregate_score_report()`

Intermediate users:
- `individual_detailed_score_report()`
- `comparative_detailed_score_report()`

Advanced users:
- `find_thresholds()`
- `find_scores_below_threshold()`
- `find_inputs_below_threshold()`


### Methods description
An evaluation report that provides a summary of the performance of the model across all queries, showing the
aggregated scores for all available metrics.

```python
def individual_aggregate_score_report():
```

Example output
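
The values below are illustrative; they reuse the aggregate scores shown in the user stories at the end of this proposal:

```bash
{'Reciprocal Rank': 0.448,
 'Single Hit': 0.5,
 'Multi Hit': 0.540,
 'Context Relevance': 0.537,
 'Faithfulness': 0.452,
 'Semantic Answer Similarity': 0.478
}
```
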
A detailed evaluation report that provides the scores of all available metrics for all queries or a subset of queries.

```python
def individual_detailed_score_report(queries: Union[List[str], str] = "all"):
```

Example output

```bash
| query_id | reciprocal_rank | single_hit | multi_hit | context_relevance | faithfulness | semantic_answer_similarity |
|----------|-----------------|------------|-----------|-------------------|-------------|----------------------------|
| 53c3b3e6 | 0.378064 | 1 | 0.706125 | 0.805466 | 0.135581 | 0.971241 |
| 225f87f7 | 0.534964 | 1 | 0.454976 | 0.410251 | 0.695974 | 0.159320 |
| 8ac473ec | 0.216058 | 0 | 0.445512 | 0.750070 | 0.749861 | 0.019722 |
| 97d284ca | 0.778642 | 1 | 0.250522 | 0.361332 | 0.041999 | 1 |
```
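
For example, a user could restrict the detailed report to a subset of queries; the calls below are a hypothetical sketch, assuming `results` is an `EvaluationResults` instance:

```python
# Illustrative call: restrict the detailed report to two specific queries.
results.individual_detailed_score_report(queries=["53c3b3e6", "225f87f7"])

# Or report on all queries (the default).
results.individual_detailed_score_report()
```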

### Comparative Evaluation Report
A comparative summary that compares the performance of the model with another model, showing the aggregated scores
for all available metrics.

```python
def comparative_aggregate_score_report(self, other: "EvaluationResults"):
```
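
Example output, as an illustrative sketch only: the exact layout is not final, the `pipeline_1`/`pipeline_2` keys are assumptions, and the values simply reuse the aggregate scores from above for both pipelines:

```bash
{'pipeline_1': {'Reciprocal Rank': 0.448, 'Single Hit': 0.5, 'Multi Hit': 0.540,
                'Context Relevance': 0.537, 'Faithfulness': 0.452,
                'Semantic Answer Similarity': 0.478},
 'pipeline_2': {'Reciprocal Rank': 0.448, 'Single Hit': 0.5, 'Multi Hit': 0.540,
                'Context Relevance': 0.537, 'Faithfulness': 0.452,
                'Semantic Answer Similarity': 0.478}}
```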

A detailed comparative summary that compares the performance of the model with another model, showing the scores of all
available metrics for all queries.


```python
def comparative_detailed_score_report(self, other: "EvaluationResults"):
```

```bash
| query_id | reciprocal_rank_model_1 | single_hit_model_1 | multi_hit_model_1 | context_relevance_model_1 | faithfulness_model_1 | semantic_answer_similarity_model_1 | reciprocal_rank_model_2 | single_hit_model_2 | multi_hit_model_2 | context_relevance_model_2 | faithfulness_model_2 | semantic_answer_similarity_model_2 |
|----------|-------------------------|--------------------|-------------------|---------------------------|----------------------|------------------------------------|-------------------------|--------------------|-------------------|---------------------------|----------------------|------------------------------------|
| 53c3b3e6 | 0.378064 | 1 | 0.706125 | 0.805466 | 0.135581 | 0.971241 | 0.378064 | 1 | 0.706125 | 0.805466 | 0.135581 | 0.971241 |
| 225f87f7 | 0.534964 | 1 | 0.454976 | 0.410251 | 0.695974 | 0.159320 | 0.534964 | 1 | 0.454976 | 0.410251 | 0.695974 | 0.159320 |
| 8ac473ec | 0.216058 | 0 | 0.445512 | 0.750070 | 0.749861 | 0.019722 | 0.216058 | 0 | 0.445512 | 0.750070 | 0.749861 | 0.019722 |
| 97d284ca | 0.778642 | 1 | 0.250522 | 0.361332 | 0.041999 | 1 | 0.778642 | 1 | 0.250522 | 0.361332 | 0.041999 | 1 |
```
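
A possible invocation, assuming two separately evaluated pipelines; the variable names are illustrative:

```python
# Illustrative call: compare this pipeline's results against another run.
pipeline_1_results.comparative_detailed_score_report(pipeline_2_results)
```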


Have a method to find interesting score thresholds for all available metrics, typically used for error analysis.
Some potentially interesting thresholds to find are: the 25th percentile, the 75th percentile, the median, and the average.

```python
def find_thresholds(self, metrics: List[str]) -> Dict[str, float]:
```

```bash
data = {
"thresholds": ["25th percentile", "75th percentile", "mean", "average"],
"thresholds": ["25th percentile", "75th percentile", "median", "average"],
"reciprocal_rank": [0.378064, 0.534964, 0.216058, 0.778642],
"context_relevance": [0.805466, 0.410251, 0.750070, 0.361332],
"faithfulness": [0.135581, 0.695974, 0.749861, 0.041999],
}
```
Another method can then be used to retrieve all the queries whose score falls below a given threshold for a given metric:

```python
def find_inputs_below_threshold(self, metric: str, threshold: float):
    """Get all the queries with a score below a certain threshold for a given metric."""
```
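
For concreteness, a minimal sketch of how these two advanced-user methods could be implemented internally, assuming the scores are kept in a pandas DataFrame (see the Drawbacks section below); none of this is final API, and the return shape of `find_thresholds` mirrors the example output above rather than the signature's `Dict[str, float]`:

```python
# Illustrative sketch only: one possible internal implementation of the two
# advanced-user methods, assuming scores are held in a pandas DataFrame and
# that there is one score per query, aligned by position with the inputs.
from typing import Dict, List

import pandas as pd


class EvaluationResults:  # placeholder name, as above
    def __init__(self, data: dict):
        self.inputs = pd.DataFrame(data["inputs"])
        # one column per metric, one row per query
        self.scores = pd.DataFrame({m["name"]: m["scores"] for m in data["metrics"]})

    def find_thresholds(self, metrics: List[str]) -> Dict[str, List]:
        # one entry per threshold type, mirroring the example output above
        thresholds: Dict[str, List] = {
            "thresholds": ["25th percentile", "75th percentile", "median", "average"]
        }
        for metric in metrics:
            values = self.scores[metric]
            thresholds[metric] = [
                values.quantile(0.25),
                values.quantile(0.75),
                values.median(),
                values.mean(),
            ]
        return thresholds

    def find_inputs_below_threshold(self, metric: str, threshold: float) -> List[str]:
        """Get all the queries with a score below a certain threshold for a given metric."""
        below = self.scores[metric] < threshold
        return self.inputs.loc[below, "query_id"].tolist()
```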


# Drawbacks

- Relying on a pandas DataFrame internally makes it easy to perform many of these operations.
- Nevertheless, it can be a burden, since it makes `pandas` a dependency of `haystack-ai`.
- Ideally, all the proposed methods should be implemented in a way that doesn't require `pandas`.
- Having the output in a table format may not be flexible enough, and it may be too verbose for datasets with a large number of queries.
- Maybe the option to export to a .csv file would be better than having the output in a table format.
- Maybe a JSON format would be better, giving advanced users the option to do further analysis and visualization.


# Adoption strategy
- A tutorial would be the best approach to teach users how to use this feature.
- Adding a new entry to the documentation.

# User stories

### 1. I would like to get a single summary score for my RAG pipeline so I can compare several pipeline configurations.

Run `individual_aggregate_score_report()` and get the following output:

```bash
{'Reciprocal Rank': 0.448,
'Single Hit': 0.5,
'Multi Hit': 0.540,
'Context Relevance': 0.537,
'Faithfulness': 0.452,
'Semantic Answer Similarity': 0.478
}
```

### 2. I am not sure what evaluation metrics work best for my RAG pipeline, especially when using the more novel LLM-based metrics.

Use `context_relevance` or `faithfulness`.

### 3. My RAG pipeline has a low aggregate score, so I would like to see examples of specific inputs where the score was low to be able to diagnose what the issue could be.

Let's say the low score is in `reciprocal_rank` and one already has an idea of what counts as "low" for a query/question; then simply run:

`find_inputs_below_threshold("reciprocal_rank", <threshold>)`

If one is not yet sure what counts as a low `reciprocal_rank` score, one can first get the thresholds for this metric using:

`find_thresholds(["reciprocal_rank"])`

This will give:

- 25th percentile (Q1): the value below which 25% of the data falls.
- median (Q2): the value below which 50% of the data falls.
- 75th percentile (Q3): the value below which 75% of the data falls.

This can help to decide what is considered a low score; one can then get, for instance, the queries with a score below
the Q2 (median) threshold using `find_inputs_below_threshold("reciprocal_rank", threshold)`.
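
Putting user story 3 together, a hypothetical end-to-end snippet; `results` is an `EvaluationResults` instance built as sketched earlier, and the threshold ordering is the illustrative one used above:

```python
# Hypothetical workflow for user story 3.
thresholds = results.find_thresholds(["reciprocal_rank"])

# Pick the median (Q2) as the working definition of a "low" score;
# index 2 corresponds to "median" in the illustrative ordering above.
q2 = thresholds["reciprocal_rank"][2]

low_scoring_queries = results.find_inputs_below_threshold("reciprocal_rank", q2)
print(low_scoring_queries)
```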