diff --git a/.gitignore b/.gitignore
index 619a3353c..cf158d3c5 100644
--- a/.gitignore
+++ b/.gitignore
@@ -166,3 +166,4 @@ experiments/**/storage
**/fil-result/
experiments/baselines/fiqa/datasets
src/ragas/_version.py
+.python-version
diff --git a/README.md b/README.md
index 2d8fc4c69..64716bc63 100644
--- a/README.md
+++ b/README.md
@@ -55,18 +55,23 @@ This is a small example program you can run to see ragas in action!
```python
from ragas import evaluate
+from datasets import Dataset
import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"
-ds = Dataset({
- features: ['question','context','answer'],
- num_rows: 25
-})
-results = evaluate(ds)
+# prepare your huggingface dataset in the format
+# Dataset({
+# features: ['question','contexts','answer'],
+# num_rows: 25
+# })
+
+dataset: Dataset
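+# for example, one way to build it in memory (illustrative field values only):
+# dataset = Dataset.from_dict(
+#     {"question": [...], "contexts": [[...]], "answer": [...]}
+# )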
+
+results = evaluate(dataset)
```
-If you want a more in-depth explanation of core components, check out our quick-start notebook
+If you want a more in-depth explanation of core components, check out our [quick-start notebook](./examples/quickstart.ipynb)
## :luggage: Metrics
Ragas measures your pipeline's performance against two dimensions
diff --git a/examples/quickstart.ipynb b/examples/quickstart.ipynb
index 2f7ccdc38..3793695eb 100644
--- a/examples/quickstart.ipynb
+++ b/examples/quickstart.ipynb
@@ -2,62 +2,101 @@
"cells": [
{
"cell_type": "markdown",
- "id": "aeb5819b",
+ "id": "2e63f667",
"metadata": {},
"source": [
- "# Quickstart"
+ "# Quickstart\n",
+ "\n",
+ "welcome to the ragas quickstart. We're going to get you up and running with ragas as qickly as you can so that you can go back to improving your Retrieval Augmented Generation pipelines while this library makes sure your changes are improving your entire pipeline.\n",
+ "\n",
+ "to kick things of lets start with the data"
]
},
{
"cell_type": "code",
"execution_count": 1,
- "id": "22c7dd25",
+ "id": "57585b55",
"metadata": {},
"outputs": [],
"source": [
- "# only run this if your have an editable install\n",
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "markdown",
- "id": "5af47053",
+ "id": "c77789bb",
+ "metadata": {},
+ "source": [
+ "Ragas also uses OpenAI for running a metric so make sure you have your openai key ready and available in your environment"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0b7179f7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "\n",
+ "os.environ[\"OPENAI_API_KEY\"] = \"your-openai-key\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "06c9fc7d",
"metadata": {},
"source": [
- "### load your data\n",
+ "## The Data\n",
+ "\n",
+ "Ragas performs a `ground_truth` free evaluation of your RAG pipelines. This is because for most people building a gold labeled dataset which represents in the distribution they get in production is a very expensive process.\n",
"\n",
- "For this quickstart we are going to be using a dataset that we prepared from [eli5](https://huggingface.co/datasets/eli5) dataset with the models response. The dataset is available in [huggingface](https://huggingface.co/datasets/explodinggradients/eli5-test).\n",
+ "Hence to work with ragas all you need are the following data\n",
+ "- question: `list[str]` - These are the questions you RAG pipeline will be evaluated on. \n",
+ "- answer: `list[str]` - The answer generated from the RAG pipeline and give to the user.\n",
+ "- contexts: `list[list[str]]` - The contexts which where passed into the LLM to answer the question.\n",
"\n",
- "The dataset is of the following format\n",
- "| column name | type | description |\n",
- "|----------------|-----------|-----------------------------------------------------------------------------------|\n",
- "| prompt | str | the prompt/question to answer |\n",
- "| context | str | context string that has any relevent priors the LLM needs to answer the questions |\n",
- "| references | list[str] | reference documents the LLM can use to respond to the prompt |\n",
- "| ground_truth | list[str] | accepted answers given by human annotators |\n",
- "| generated_text | str | the generated output from the LLM |"
+ "Ideally your list of questions should reflect the questions your users give, including those that you have been problamatic in the past.\n",
+ "\n",
+ "Here we're using an example dataset from on of the baselines we created for the [Financial Opinion Mining and Question Answering (fiqa) Dataset](https://sites.google.com/view/fiqa/) we created. If you want to want to know more about the baseline, feel free to check the `experiements/baseline` section"
]
},
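+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a7f3c2e1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A minimal, purely illustrative sketch of hand-building a dataset in the\n",
+ "# format described above (the values are made up); the next cell loads the\n",
+ "# prepared fiqa baseline instead.\n",
+ "from datasets import Dataset\n",
+ "\n",
+ "my_eval = Dataset.from_dict(\n",
+ "    {\n",
+ "        \"question\": [\"How do I deposit a cheque issued to my business?\"],\n",
+ "        \"answer\": [\"You can deposit it into your business checking account.\"],\n",
+ "        \"contexts\": [\n",
+ "            [\"Cheques issued to a business should go into a business account.\"]\n",
+ "        ],\n",
+ "    }\n",
+ ")"
+ ]
+ },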
{
"cell_type": "code",
"execution_count": 2,
- "id": "2bc9fb9d",
+ "id": "b658e02f",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
- "Found cached dataset parquet (/home/jjmachan/.cache/huggingface/datasets/explodinggradients___parquet/explodinggradients--eli5-test-217d92ce20e19249/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)\n"
+ "Found cached dataset fiqa (/home/jjmachan/.cache/huggingface/datasets/explodinggradients___fiqa/ragas_eval/1.0.0/3dc7b639f5b4b16509a3299a2ceb78bf5fe98ee6b5fee25e7d5e4d290c88efb8)\n"
]
},
{
"data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "c4a622ce9f774cf7b79b46d9fcf05f69",
+ "version_major": 2,
+ "version_minor": 0
+ },
"text/plain": [
- "Dataset({\n",
- " features: ['context', 'prompt', 'ground_truth', 'references', 'generated_text'],\n",
- " num_rows: 500\n",
+ " 0%| | 0/1 [00:00, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "DatasetDict({\n",
+ " baseline: Dataset({\n",
+ " features: ['question', 'ground_truths', 'answer', 'contexts'],\n",
+ " num_rows: 30\n",
+ " })\n",
"})"
]
},
@@ -67,159 +106,120 @@
}
],
"source": [
- "from datasets import load_dataset, concatenate_datasets\n",
+ "# data\n",
+ "from datasets import load_dataset\n",
"\n",
- "ds = load_dataset(\"explodinggradients/eli5-test\", split=\"test_eli5\")\n",
- "ds"
+ "fiqa_eval = load_dataset(\"explodinggradients/fiqa\", \"ragas_eval\")\n",
+ "fiqa_eval"
]
},
{
"cell_type": "markdown",
- "id": "1e9c0687",
+ "id": "84aa640f",
"metadata": {},
"source": [
- "### choose the metrics\n",
+ "## Metrics\n",
+ "\n",
+ "Ragas measures your pipeline's performance against two dimensions\n",
+ "\n",
+ "1. Factuality: measures the factual consistency of the generated answer against the given context.\n",
+ "2. Relevancy: measures how relevant retrieved contexts and the generated answer are to the question.\n",
+ "\n",
+ "Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.\n",
"\n",
- "ragas provides you with a wide range of metrics to evaluate the generated answers based on the latest research. You can see the entire list [here](https://github.com/explodinggradients/ragas#metrics). For this quickstart we will be using 3 from each type we support.\n",
- "1. `edit_ratio` - obtained by dividing the Levenshtein distance by sum of number of characters in generated text and ground truth.\n",
- "2. `bleu_score` - It measures precision by comparing clipped n-grams in generated text to ground truth text.\n",
- "3. `bert_score` - measures the similarity between ground truth text answers and generated text using SBERT vector embeddings."
+ "now lets import these metrics and understand more about what they denote"
]
},
{
"cell_type": "code",
- "execution_count": 5,
- "id": "0b5abd7d",
+ "execution_count": 3,
+ "id": "f17bcf9d",
"metadata": {},
"outputs": [],
"source": [
- "from ragas.metrics import edit_ratio, bleu_score, bert_score"
+ "from ragas.metrics import context_relevancy, answer_relevancy, factuality"
]
},
{
"cell_type": "markdown",
- "id": "1d95d887",
+ "id": "ef8c5e60",
"metadata": {},
"source": [
- "now we can initialize the `Evaluation` object. This will load your metrics and data and run the evaluation for you."
+ "here you can see that we are using 3 metrics, but what do the represent?\n",
+ "\n",
+ "1. context_relevancy - a measure of how relevent the retrieved context is to the question. Conveys quality of the retrieval pipeline.\n",
+ "2. answer_relevancy - a measure of how relevent the answer is to the question\n",
+ "3. factuality - the factual consistancy of the answer to the context base on the question.\n",
+ "\n",
+ "**Note:** *`factuality` using OpenAI's API to compute the score. If you using this metric make sure you set the environment key `OPENAI_API_KEY` with your API key.*\n",
+ "\n",
+ "**Note:** *`context_relevancy` and `answer_relevancy` use very small LLMs to compute the score. It will run on CPU but having a GPU is recommended.*\n",
+ "\n",
+ "If you're interested in learning more, feel free to check the [docs](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md)"
]
},
{
- "cell_type": "code",
- "execution_count": 7,
- "id": "a77c805d",
- "metadata": {
- "scrolled": true
- },
- "outputs": [],
+ "cell_type": "markdown",
+ "id": "8d6ecd5a",
+ "metadata": {},
"source": [
- "from ragas.metrics import Evaluation\n",
+ "## Evaluation\n",
"\n",
- "e = Evaluation(\n",
- " metrics=[bert_score, edit_ratio, bleu_score],\n",
- " batched=False,\n",
- " batch_size=30,\n",
- ")"
+ "Running the evalutation is as simple as calling evaluate on the `Dataset` with the metrics of your choice."
]
},
{
"cell_type": "code",
- "execution_count": 18,
- "id": "e879f51b",
+ "execution_count": 8,
+ "id": "22eb6f97",
"metadata": {},
"outputs": [
- {
- "data": {
- "application/vnd.jupyter.widget-view+json": {
- "model_id": "",
- "version_major": 2,
- "version_minor": 0
- },
- "text/plain": [
- "Map: 0%| | 0/500 [00:00, ? examples/s]"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
{
"name": "stderr",
"output_type": "stream",
"text": [
- "/home/jjmachan/miniconda3/envs/bench/lib/python3.10/site-packages/nltk/translate/bleu_score.py:552: UserWarning: \n",
- "The hypothesis contains 0 counts of 2-gram overlaps.\n",
- "Therefore the BLEU score evaluates to 0, independently of\n",
- "how many N-gram overlaps of lower order it contains.\n",
- "Consider using lower n-gram order or use SmoothingFunction()\n",
- " warnings.warn(_msg)\n",
- "/home/jjmachan/miniconda3/envs/bench/lib/python3.10/site-packages/nltk/translate/bleu_score.py:552: UserWarning: \n",
- "The hypothesis contains 0 counts of 3-gram overlaps.\n",
- "Therefore the BLEU score evaluates to 0, independently of\n",
- "how many N-gram overlaps of lower order it contains.\n",
- "Consider using lower n-gram order or use SmoothingFunction()\n",
- " warnings.warn(_msg)\n",
- "/home/jjmachan/miniconda3/envs/bench/lib/python3.10/site-packages/nltk/translate/bleu_score.py:552: UserWarning: \n",
- "The hypothesis contains 0 counts of 4-gram overlaps.\n",
- "Therefore the BLEU score evaluates to 0, independently of\n",
- "how many N-gram overlaps of lower order it contains.\n",
- "Consider using lower n-gram order or use SmoothingFunction()\n",
- " warnings.warn(_msg)\n"
+ "Loading cached processed dataset at /home/jjmachan/.cache/huggingface/datasets/explodinggradients___fiqa/ragas_eval/1.0.0/3dc7b639f5b4b16509a3299a2ceb78bf5fe98ee6b5fee25e7d5e4d290c88efb8/cache-f5ed219a49e8fb1f.arrow\n",
+ "100%|█████████████████████████████████████████████████████████████| 1/1 [00:18<00:00, 18.95s/it]\n",
+ "100%|█████████████████████████████████████████████████████████████| 2/2 [01:09<00:00, 34.97s/it]\n",
+ "Loading cached processed dataset at /home/jjmachan/.cache/huggingface/datasets/explodinggradients___fiqa/ragas_eval/1.0.0/3dc7b639f5b4b16509a3299a2ceb78bf5fe98ee6b5fee25e7d5e4d290c88efb8/cache-2a93a2841bc4d586.arrow\n",
+ "100%|█████████████████████████████████████████████████████████████| 1/1 [00:07<00:00, 7.49s/it]\n"
]
- }
- ],
- "source": [
- "# run it with .eval()\n",
- "result = e.eval(ds[\"ground_truth\"], ds[\"generated_text\"])"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "31fbe76c",
- "metadata": {},
- "source": [
- "### analysing results\n",
- "\n",
- "The return `Result` object is used to analyse the results."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "id": "474c0aad",
- "metadata": {},
- "outputs": [
+ },
{
"data": {
- "text/html": [
- "
{'BERTScore_cosine': 0.37552570906095206, 'edit_ratio': 0.41482407945510713, 'BLEU': 0.010848577619569451}\n",
- "
\n"
- ],
"text/plain": [
- "\u001b[1m{\u001b[0m\u001b[32m'BERTScore_cosine'\u001b[0m: \u001b[1;36m0.37552570906095206\u001b[0m, \u001b[32m'edit_ratio'\u001b[0m: \u001b[1;36m0.41482407945510713\u001b[0m, \u001b[32m'BLEU'\u001b[0m: \u001b[1;36m0.010848577619569451\u001b[0m\u001b[1m}\u001b[0m\n"
+ "{'ragas_score': 0.860, 'context_relavency': 0.817, 'factuality': 0.892, 'answer_relevancy': 0.874}"
]
},
+ "execution_count": 8,
"metadata": {},
- "output_type": "display_data"
+ "output_type": "execute_result"
}
],
"source": [
- "from rich.pretty import pprint\n",
+ "from ragas import evaluate\n",
+ "\n",
+ "result = evaluate(\n",
+ " fiqa_eval[\"baseline\"], metrics=[context_relevancy, factuality, answer_relevancy]\n",
+ ")\n",
"\n",
- "pprint(result)"
+ "result"
]
},
{
"cell_type": "markdown",
- "id": "eb07bbec",
+ "id": "a2dc0ec2",
"metadata": {},
"source": [
- "you can access individual metric results via `result['']`. it also has a `.describe()` function to show the distribution of the results and you can access the individual score from `.scores` attribute."
+ "and there you have the it, all the scores you need. `ragas_score` gives you a single metric that you can use while the other onces measure the different parts of your pipeline.\n",
+ "\n",
+ "now if we want to dig into the results and figure out examples where your pipeline performed worse or really good you can easily convert it into a pandas array and use your standard analytics tools too!"
]
},
{
"cell_type": "code",
- "execution_count": 16,
- "id": "4c8c51b1",
+ "execution_count": 12,
+ "id": "8686bf53",
"metadata": {},
"outputs": [
{
@@ -243,104 +243,127 @@
" \n",
" \n",
" | \n",
- " BERTScore_cosine | \n",
- " edit_ratio | \n",
- " BLEU | \n",
+ " question | \n",
+ " ground_truths | \n",
+ " answer | \n",
+ " contexts | \n",
+ " context_relavency | \n",
+ " factuality | \n",
+ " answer_relevancy | \n",
"
\n",
" \n",
" \n",
" \n",
- " mean | \n",
- " 0.375526 | \n",
- " 0.414824 | \n",
- " 1.084858e-02 | \n",
- "
\n",
- " \n",
- " 25% | \n",
- " 0.212339 | \n",
- " 0.399876 | \n",
- " 3.489775e-155 | \n",
+ " 0 | \n",
+ " How to deposit a cheque issued to an associate... | \n",
+ " [Have the check reissued to the proper payee.J... | \n",
+ " \\nThe best way to deposit a cheque issued to a... | \n",
+ " [Just have the associate sign the back and the... | \n",
+ " 0.867 | \n",
+ " 1.0 | \n",
+ " 0.922 | \n",
"
\n",
" \n",
- " 50% | \n",
- " 0.332697 | \n",
- " 0.429187 | \n",
- " 4.318061e-79 | \n",
+ " 1 | \n",
+ " Can I send a money order from USPS as a business? | \n",
+ " [Sure you can. You can fill in whatever you w... | \n",
+ " \\nYes, you can send a money order from USPS as... | \n",
+ " [Sure you can. You can fill in whatever you w... | \n",
+ " 0.855 | \n",
+ " 1.0 | \n",
+ " 0.923 | \n",
"
\n",
" \n",
- " 75% | \n",
- " 0.532642 | \n",
- " 0.449509 | \n",
- " 1.525948e-05 | \n",
+ " 2 | \n",
+ " 1 EIN doing business under multiple business n... | \n",
+ " [You're confusing a lot of things here. Compan... | \n",
+ " \\nYes, it is possible to have one EIN doing bu... | \n",
+ " [You're confusing a lot of things here. Compan... | \n",
+ " 0.768 | \n",
+ " 1.0 | \n",
+ " 0.824 | \n",
"
\n",
" \n",
- " min | \n",
- " 0.007017 | \n",
- " 0.102182 | \n",
- " 4.029193e-232 | \n",
+ " 3 | \n",
+ " Applying for and receiving business credit | \n",
+ " [\"I'm afraid the great myth of limited liabili... | \n",
+ " \\nApplying for and receiving business credit c... | \n",
+ " [Set up a meeting with the bank that handles y... | \n",
+ " 0.781 | \n",
+ " 1.0 | \n",
+ " 0.830 | \n",
"
\n",
" \n",
- " max | \n",
- " 0.910680 | \n",
- " 0.572917 | \n",
- " 1.506915e-01 | \n",
- "
\n",
- " \n",
- " std | \n",
- " 0.207559 | \n",
- " 0.058072 | \n",
- " 2.343307e-02 | \n",
+ " 4 | \n",
+ " 401k Transfer After Business Closure | \n",
+ " [You should probably consult an attorney. Howe... | \n",
+ " \\nIf your employer has closed and you need to ... | \n",
+ " [The time horizon for your 401K/IRA is essenti... | \n",
+ " 0.737 | \n",
+ " 1.0 | \n",
+ " 0.753 | \n",
"
\n",
" \n",
"\n",
""
],
"text/plain": [
- " BERTScore_cosine edit_ratio BLEU\n",
- "mean 0.375526 0.414824 1.084858e-02\n",
- "25% 0.212339 0.399876 3.489775e-155\n",
- "50% 0.332697 0.429187 4.318061e-79\n",
- "75% 0.532642 0.449509 1.525948e-05\n",
- "min 0.007017 0.102182 4.029193e-232\n",
- "max 0.910680 0.572917 1.506915e-01\n",
- "std 0.207559 0.058072 2.343307e-02"
+ " question \\\n",
+ "0 How to deposit a cheque issued to an associate... \n",
+ "1 Can I send a money order from USPS as a business? \n",
+ "2 1 EIN doing business under multiple business n... \n",
+ "3 Applying for and receiving business credit \n",
+ "4 401k Transfer After Business Closure \n",
+ "\n",
+ " ground_truths \\\n",
+ "0 [Have the check reissued to the proper payee.J... \n",
+ "1 [Sure you can. You can fill in whatever you w... \n",
+ "2 [You're confusing a lot of things here. Compan... \n",
+ "3 [\"I'm afraid the great myth of limited liabili... \n",
+ "4 [You should probably consult an attorney. Howe... \n",
+ "\n",
+ " answer \\\n",
+ "0 \\nThe best way to deposit a cheque issued to a... \n",
+ "1 \\nYes, you can send a money order from USPS as... \n",
+ "2 \\nYes, it is possible to have one EIN doing bu... \n",
+ "3 \\nApplying for and receiving business credit c... \n",
+ "4 \\nIf your employer has closed and you need to ... \n",
+ "\n",
+ " contexts context_relavency \\\n",
+ "0 [Just have the associate sign the back and the... 0.867 \n",
+ "1 [Sure you can. You can fill in whatever you w... 0.855 \n",
+ "2 [You're confusing a lot of things here. Compan... 0.768 \n",
+ "3 [Set up a meeting with the bank that handles y... 0.781 \n",
+ "4 [The time horizon for your 401K/IRA is essenti... 0.737 \n",
+ "\n",
+ " factuality answer_relevancy \n",
+ "0 1.0 0.922 \n",
+ "1 1.0 0.923 \n",
+ "2 1.0 0.824 \n",
+ "3 1.0 0.830 \n",
+ "4 1.0 0.753 "
]
},
- "execution_count": 16,
+ "execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
- "from pandas import DataFrame\n",
- "\n",
- "# view with pandas\n",
- "df = DataFrame(result.describe())\n",
- "df"
+ "df = result.to_pandas()\n",
+ "df.head()"
]
},
{
- "cell_type": "code",
- "execution_count": 29,
- "id": "421c60ab",
+ "cell_type": "markdown",
+ "id": "f668fce1",
"metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Dataset({\n",
- " features: ['BERTScore_cosine', 'edit_ratio', 'BLEU'],\n",
- " num_rows: 500\n",
- "})"
- ]
- },
- "execution_count": 29,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
"source": [
- "result.scores"
+ "And thats it!\n",
+ "\n",
+ "You can check out the [ragas in action] notebook to get a feel of what is like to use it while trying to improve your pipelines.\n",
+ "\n",
+ "if you have any suggestion/feedbacks/things your not happy about, please do share it in the [issue section](https://github.com/explodinggradients/ragas/issues). We love hearing from you 😁"
]
}
],
@@ -360,7 +383,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.10.11"
+ "version": "3.10.12"
}
},
"nbformat": 4,
diff --git a/experiments/assesments/metrics_assesments.ipynb b/experiments/assesments/metrics_assesments.ipynb
index cb8f06208..257da13fe 100644
--- a/experiments/assesments/metrics_assesments.ipynb
+++ b/experiments/assesments/metrics_assesments.ipynb
@@ -64,7 +64,7 @@
"metadata": {},
"outputs": [],
"source": [
- "os.chdir('/Users/shahules/belar/src/')"
+ "os.chdir(\"/Users/shahules/belar/src/\")"
]
},
{
diff --git a/experiments/baselines/fiqa/dataset-exploration-and-baseline.ipynb b/experiments/baselines/fiqa/dataset-exploration-and-baseline.ipynb
index be1c69a9c..97ca5e67a 100644
--- a/experiments/baselines/fiqa/dataset-exploration-and-baseline.ipynb
+++ b/experiments/baselines/fiqa/dataset-exploration-and-baseline.ipynb
@@ -48,7 +48,11 @@
"from beir.datasets.data_loader import GenericDataLoader\n",
"\n",
"dataset = \"fiqa\"\n",
- "url = \"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip\".format(dataset)\n",
+ "url = (\n",
+ " \"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip\".format(\n",
+ " dataset\n",
+ " )\n",
+ ")\n",
"data_path = util.download_and_unzip(url, \"datasets\")"
]
},
@@ -218,7 +222,7 @@
"source": [
"with open(os.path.join(data_path, \"corpus.jsonl\")) as f:\n",
" cs = [pd.Series(json.loads(l)) for l in f.readlines()]\n",
- " \n",
+ "\n",
"corpus_df = pd.DataFrame(cs)\n",
"corpus_df"
]
@@ -299,9 +303,7 @@
}
],
"source": [
- "corpus_df = corpus_df.rename(columns={\n",
- " \"_id\": \"corpus-id\", \"text\": \"ground_truth\"\n",
- "})\n",
+ "corpus_df = corpus_df.rename(columns={\"_id\": \"corpus-id\", \"text\": \"ground_truth\"})\n",
"corpus_df = corpus_df.drop(columns=[\"title\", \"metadata\"])\n",
"corpus_df[\"corpus-id\"] = corpus_df[\"corpus-id\"].astype(int)\n",
"corpus_df.head()"
@@ -387,9 +389,7 @@
" qs = [pd.Series(json.loads(l)) for l in f.readlines()]\n",
"\n",
"queries_df = pd.DataFrame(qs)\n",
- "queries_df = queries_df.rename(columns={\n",
- " \"_id\": \"query-id\", \"text\": \"question\"\n",
- "})\n",
+ "queries_df = queries_df.rename(columns={\"_id\": \"query-id\", \"text\": \"question\"})\n",
"queries_df = queries_df.drop(columns=[\"metadata\"])\n",
"queries_df[\"query-id\"] = queries_df[\"query-id\"].astype(int)\n",
"queries_df.head()"
@@ -474,10 +474,10 @@
"splits = [\"dev\", \"test\", \"train\"]\n",
"split_df = {}\n",
"for s in splits:\n",
- " split_df[s] = pd.read_csv(\n",
- " os.path.join(data_path, f\"qrels/{s}.tsv\"), sep=\"\\t\"\n",
- " ).drop(columns=[\"score\"])\n",
- " \n",
+ " split_df[s] = pd.read_csv(os.path.join(data_path, f\"qrels/{s}.tsv\"), sep=\"\\t\").drop(\n",
+ " columns=[\"score\"]\n",
+ " )\n",
+ "\n",
"split_df[\"dev\"].head()"
]
},
@@ -515,10 +515,14 @@
" df = queries_df.merge(split_df[split], on=\"query-id\")\n",
" df = df.merge(corpus_df, on=\"corpus-id\")\n",
" df = df.drop(columns=[\"corpus-id\"])\n",
- " grouped = df.groupby('query-id').apply(lambda x: pd.Series({\n",
- " 'question': x['question'].sample().values[0],\n",
- " 'ground_truths': x['ground_truth'].tolist()\n",
- " }))\n",
+ " grouped = df.groupby(\"query-id\").apply(\n",
+ " lambda x: pd.Series(\n",
+ " {\n",
+ " \"question\": x[\"question\"].sample().values[0],\n",
+ " \"ground_truths\": x[\"ground_truth\"].tolist(),\n",
+ " }\n",
+ " )\n",
+ " )\n",
"\n",
" grouped = grouped.reset_index()\n",
" grouped = grouped.drop(columns=\"query-id\")\n",
@@ -797,11 +801,8 @@
"assert os.path.exists(path_to_ds_repo), f\"{path_to_ds_repo} doesnot exist!\"\n",
"\n",
"for s in final_split_df:\n",
- " final_split_df[s].to_csv(\n",
- " os.path.join(path_to_ds_repo, f\"{s}.csv\"),\n",
- " index=False\n",
- " )\n",
- " \n",
+ " final_split_df[s].to_csv(os.path.join(path_to_ds_repo, f\"{s}.csv\"), index=False)\n",
+ "\n",
"corpus_df.to_csv(os.path.join(path_to_ds_repo, \"corpus.csv\"), index=False)"
]
},
@@ -1009,18 +1010,11 @@
"from llama_index.node_parser import SimpleNodeParser\n",
"from langchain.text_splitter import TokenTextSplitter\n",
"\n",
- "spliter = TokenTextSplitter(\n",
- " chunk_size = 100,\n",
- " chunk_overlap = 50\n",
- ")\n",
+ "spliter = TokenTextSplitter(chunk_size=100, chunk_overlap=50)\n",
"\n",
- "parser = SimpleNodeParser(\n",
- " text_splitter=spliter\n",
- ")\n",
+ "parser = SimpleNodeParser(text_splitter=spliter)\n",
"\n",
- "nodes = parser.get_nodes_from_documents(\n",
- " documents=docs\n",
- ")"
+ "nodes = parser.get_nodes_from_documents(documents=docs)"
]
},
{
@@ -1088,16 +1082,12 @@
"source": [
"# create index\n",
"index = GPTVectorStoreIndex.from_documents(\n",
- " documents=docs, \n",
+ " documents=docs,\n",
" service_context=openai_sc,\n",
")\n",
"\n",
"# query with embed_model specified\n",
- "qe = index.as_query_engine(\n",
- " mode=\"embedding\", \n",
- " verbose=True, \n",
- " service_context=openai_sc\n",
- ")"
+ "qe = index.as_query_engine(mode=\"embedding\", verbose=True, service_context=openai_sc)"
]
},
{
@@ -1171,10 +1161,7 @@
"\n",
"# query with embed_model specified\n",
"qe = index.as_query_engine(\n",
- " mode=\"embedding\", \n",
- " verbose=True, \n",
- " service_context=openai_sc,\n",
- " use_async = False\n",
+ " mode=\"embedding\", verbose=True, service_context=openai_sc, use_async=False\n",
")"
]
},
@@ -1195,15 +1182,13 @@
"\n",
"# configure retriever\n",
"retriever = VectorIndexRetriever(\n",
- " index=index, \n",
+ " index=index,\n",
" similarity_top_k=3,\n",
")\n",
"\n",
"# configure response synthesizer\n",
"response_synthesizer = ResponseSynthesizer.from_args(\n",
- " node_postprocessors=[\n",
- " SimilarityPostprocessor(similarity_cutoff=0.7)\n",
- " ]\n",
+ " node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)]\n",
")\n",
"\n",
"# assemble query engine\n",
@@ -1257,9 +1242,10 @@
" r = qe.query(row[\"question\"])\n",
" row[\"answer\"] = r.response\n",
" row[\"contexts\"] = [sn.node.text for sn in r.source_nodes]\n",
- " \n",
+ "\n",
" return row\n",
"\n",
+ "\n",
"# generate_response(test_ds[0])"
]
},
@@ -1530,10 +1516,7 @@
"from ragas.metrics import factuality, answer_relevancy, context_relevancy\n",
"from ragas import evaluate\n",
"\n",
- "evaluate(\n",
- " gen_ds, \n",
- " metrics=[factuality, answer_relevancy, context_relevancy]\n",
- ")"
+ "evaluate(gen_ds, metrics=[factuality, answer_relevancy, context_relevancy])"
]
},
{
diff --git a/experiments/baselines/fiqa/improving-baselines.ipynb b/experiments/baselines/fiqa/improving-baselines.ipynb
index 5d5a8fc50..23002df8c 100644
--- a/experiments/baselines/fiqa/improving-baselines.ipynb
+++ b/experiments/baselines/fiqa/improving-baselines.ipynb
@@ -22,7 +22,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
- "Found cached dataset fiqa (/home/jjmachan/.cache/huggingface/datasets/explodinggradients___fiqa/main/1.0.0/953cfddc4a440cf2e290172be2563e5b51a953f2e4266940fc2b311e135cea69)\n"
+ "Found cached dataset fiqa (/home/jjmachan/.cache/huggingface/datasets/explodinggradients___fiqa/main/1.0.0/3dc7b639f5b4b16509a3299a2ceb78bf5fe98ee6b5fee25e7d5e4d290c88efb8)\n"
]
},
{
@@ -97,10 +97,7 @@
"\n",
"# query with embed_model specified\n",
"qe = index.as_query_engine(\n",
- " mode=\"embedding\", \n",
- " verbose=True, \n",
- " service_context=openai_sc,\n",
- " use_async = False\n",
+ " mode=\"embedding\", verbose=True, service_context=openai_sc, use_async=False\n",
")"
]
},
@@ -121,15 +118,13 @@
"\n",
"# configure retriever\n",
"retriever = VectorIndexRetriever(\n",
- " index=index, \n",
+ " index=index,\n",
" similarity_top_k=1,\n",
")\n",
"\n",
"# configure response synthesizer\n",
"response_synthesizer = ResponseSynthesizer.from_args(\n",
- " node_postprocessors=[\n",
- " SimilarityPostprocessor(similarity_cutoff=0.7)\n",
- " ]\n",
+ " node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)]\n",
")\n",
"\n",
"# assemble query engine\n",
@@ -150,9 +145,10 @@
" r = qe.query(row[\"question\"])\n",
" row[\"answer\"] = r.response\n",
" row[\"contexts\"] = [sn.node.text for sn in r.source_nodes]\n",
- " \n",
+ "\n",
" return row\n",
"\n",
+ "\n",
"# generate_response(test_ds[0])"
]
},
@@ -272,10 +268,7 @@
"from ragas.metrics import factuality, answer_relevancy, context_relevancy\n",
"from ragas import evaluate\n",
"\n",
- "evaluate(\n",
- " gen_ds, \n",
- " metrics=[factuality, answer_relevancy, context_relevancy]\n",
- ")"
+ "evaluate(gen_ds, metrics=[factuality, answer_relevancy, context_relevancy])"
]
},
{
@@ -304,10 +297,7 @@
"\n",
"# query with embed_model specified\n",
"qe = index.as_query_engine(\n",
- " mode=\"embedding\", \n",
- " verbose=True, \n",
- " service_context=openai_sc,\n",
- " use_async = False\n",
+ " mode=\"embedding\", verbose=True, service_context=openai_sc, use_async=False\n",
")"
]
},
@@ -328,15 +318,13 @@
"\n",
"# configure retriever\n",
"retriever = VectorIndexRetriever(\n",
- " index=index, \n",
+ " index=index,\n",
" similarity_top_k=1,\n",
")\n",
"\n",
"# configure response synthesizer\n",
"response_synthesizer = ResponseSynthesizer.from_args(\n",
- " node_postprocessors=[\n",
- " SimilarityPostprocessor(similarity_cutoff=0.7)\n",
- " ]\n",
+ " node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)]\n",
")\n",
"\n",
"# assemble query engine\n",
@@ -357,15 +345,16 @@
" r = qe.query(row[\"question\"])\n",
" row[\"answer\"] = r.response\n",
" row[\"contexts\"] = [sn.node.text for sn in r.source_nodes]\n",
- " \n",
+ "\n",
" return row\n",
"\n",
+ "\n",
"# generate_response(test_ds[0])"
]
},
{
"cell_type": "code",
- "execution_count": 12,
+ "execution_count": 7,
"id": "661ad12b",
"metadata": {},
"outputs": [
@@ -383,13 +372,6 @@
"metadata": {},
"output_type": "display_data"
},
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "Retrying langchain.llms.openai.completion_with_retry.._completion_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..\n"
- ]
- },
{
"data": {
"text/plain": [
@@ -399,7 +381,7 @@
"})"
]
},
- "execution_count": 12,
+ "execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
@@ -411,21 +393,29 @@
},
{
"cell_type": "code",
- "execution_count": 13,
+ "execution_count": 8,
"id": "96e08092",
"metadata": {},
"outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "46e26286ecbc4a0891f8ee228898ca20",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading model.safetensors: 0%| | 0.00/892M [00:00, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
{
"name": "stderr",
"output_type": "stream",
"text": [
- "/home/jjmachan/miniconda3/envs/bench/lib/python3.10/site-packages/transformers/models/t5/tokenization_t5_fast.py:155: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.\n",
- "For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.\n",
- "- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.\n",
- "- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.\n",
- "- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.\n",
- " warnings.warn(\n",
- "100%|█████████████████████████████████████████████████████████████| 2/2 [00:57<00:00, 28.53s/it]\n"
+ "100%|█████████████████████████████████████████████████████████████| 2/2 [00:56<00:00, 28.39s/it]\n"
]
},
{
@@ -446,7 +436,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
- "100%|█████████████████████████████████████████████████████████████| 1/1 [00:04<00:00, 4.47s/it]\n"
+ "100%|█████████████████████████████████████████████████████████████| 1/1 [00:08<00:00, 8.04s/it]\n"
]
},
{
@@ -467,16 +457,16 @@
"name": "stderr",
"output_type": "stream",
"text": [
- "100%|█████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.58s/it]\n"
+ "100%|█████████████████████████████████████████████████████████████| 1/1 [00:18<00:00, 18.73s/it]\n"
]
},
{
"data": {
"text/plain": [
- "{'NLI_score': 0.798888888888889, 'answer_relevancy': 0.8641, 'context_relavency': 0.8236333333333333, 'ragas_score': 0.8280100357048794}"
+ "{'ragas_score': 0.8386, 'factuality': 0.8289, 'answer_relevancy': 0.8646, 'context_relavency': 0.8236}"
]
},
- "execution_count": 13,
+ "execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
@@ -486,10 +476,47 @@
"from ragas.metrics import factuality, answer_relevancy, context_relevancy\n",
"from ragas import evaluate\n",
"\n",
- "evaluate(\n",
- " gen_ds, \n",
- " metrics=[factuality, answer_relevancy, context_relevancy]\n",
- ")"
+ "evaluate(gen_ds, metrics=[factuality, answer_relevancy, context_relevancy])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "87054feb",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "da6babe02adf49369a6d708487eeb068",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Creating CSV from Arrow format: 0%| | 0/1 [00:00, ?ba/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ "82699"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# save to fiqa hub\n",
+ "import os\n",
+ "\n",
+ "path_to_dataset = \"../../../../datasets/fiqa/\"\n",
+ "\n",
+ "gen_ds.to_csv(os.path.join(path_to_dataset, \"baseline_chunk100_k1.csv\"))"
]
},
{
@@ -502,7 +529,7 @@
},
{
"cell_type": "code",
- "execution_count": 21,
+ "execution_count": 16,
"id": "15f4c130",
"metadata": {},
"outputs": [],
@@ -510,7 +537,7 @@
"from llama_index.indices.postprocessor.cohere_rerank import CohereRerank\n",
"import os\n",
"\n",
- "top_k = 4 \n",
+ "top_k = 4\n",
"cohere_rerank = CohereRerank(api_key=os.environ[\"COHERE_API_KEY\"], top_n=top_k)\n",
"reranking_qe = index.as_query_engine(\n",
" similarity_top_k=top_k,\n",
@@ -520,7 +547,7 @@
},
{
"cell_type": "code",
- "execution_count": 22,
+ "execution_count": 17,
"id": "6a73b189",
"metadata": {},
"outputs": [],
@@ -529,18 +556,26 @@
" r = reranking_qe.query(row[\"question\"])\n",
" row[\"answer\"] = r.response\n",
" row[\"contexts\"] = [sn.node.text for sn in r.source_nodes]\n",
- " \n",
+ "\n",
" return row\n",
"\n",
+ "\n",
"# generate_response(fiqa_test[0])"
]
},
{
"cell_type": "code",
- "execution_count": 23,
+ "execution_count": 18,
"id": "32bd4281",
"metadata": {},
"outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Parameter 'function'= of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.\n"
+ ]
+ },
{
"data": {
"application/vnd.jupyter.widget-view+json": {
@@ -564,7 +599,7 @@
"})"
]
},
- "execution_count": 23,
+ "execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
@@ -645,16 +680,36 @@
"from ragas.metrics import factuality, answer_relevancy, context_relevancy\n",
"from ragas import evaluate\n",
"\n",
- "evaluate(\n",
- " gen_ds, \n",
- " metrics=[factuality, answer_relevancy, context_relevancy]\n",
- ")"
+ "evaluate(gen_ds, metrics=[factuality, answer_relevancy, context_relevancy])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "4301895f",
+ "metadata": {},
+ "outputs": [
+ {
+ "ename": "NameError",
+ "evalue": "name 'gen_ds' is not defined",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
+ "Cell \u001b[0;32mIn[3], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m evals[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcohere_reranked\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m \u001b[43mgen_ds\u001b[49m\n\u001b[1;32m 2\u001b[0m evals\n",
+ "\u001b[0;31mNameError\u001b[0m: name 'gen_ds' is not defined"
+ ]
+ }
+ ],
+ "source": [
+ "evals[\"cohere_reranked\"] = gen_ds\n",
+ "evals"
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "a0991e58",
+ "id": "02cb461c",
"metadata": {},
"outputs": [],
"source": []
@@ -676,7 +731,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.10.11"
+ "version": "3.10.12"
}
},
"nbformat": 4,
diff --git a/requirements/dev.txt b/requirements/dev.txt
index 9b805b103..3f3a1f505 100644
--- a/requirements/dev.txt
+++ b/requirements/dev.txt
@@ -4,3 +4,4 @@ isort
black[jupyter]
pyright
langchain
+notebook
diff --git a/src/ragas/evaluation.py b/src/ragas/evaluation.py
index a612ab3b2..51f7c30fe 100644
--- a/src/ragas/evaluation.py
+++ b/src/ragas/evaluation.py
@@ -107,20 +107,6 @@ def __post_init__(self):
if len(values) == 3:
self["ragas_score"] = len(values) / np.sum(1.0 / np.array(values))
- def describe(self):
- description = {}
- for cn in self.scores.column_names:
- description[cn] = {
- "mean": np.mean(self.scores[cn]),
- "25%": np.percentile(self.scores[cn], 25),
- "50%": np.percentile(self.scores[cn], 50),
- "75%": np.percentile(self.scores[cn], 75),
- "min": np.min(self.scores[cn]),
- "max": np.max(self.scores[cn]),
- "std": np.std(self.scores[cn]),
- }
- return description
-
def to_pandas(self, batch_size: int | None = None, batched: bool = False):
if self.dataset is None:
raise ValueError("dataset is not provided for the results class")
@@ -132,6 +118,6 @@ def to_pandas(self, batch_size: int | None = None, batched: bool = False):
def __repr__(self) -> str:
scores = self.copy()
ragas_score = scores.pop("ragas_score")
- score_strs = [f"'ragas_score': {ragas_score:0.3f}"]
- score_strs.extend([f"'{k}': {v:0.3f}" for k, v in scores.items()])
+ score_strs = [f"'ragas_score': {ragas_score:0.4f}"]
+ score_strs.extend([f"'{k}': {v:0.4f}" for k, v in scores.items()])
return "{" + ", ".join(score_strs) + "}"