diff --git a/.gitignore b/.gitignore index 619a3353c..cf158d3c5 100644 --- a/.gitignore +++ b/.gitignore @@ -166,3 +166,4 @@ experiments/**/storage **/fil-result/ experiments/baselines/fiqa/datasets src/ragas/_version.py +.python-version diff --git a/README.md b/README.md index 2d8fc4c69..64716bc63 100644 --- a/README.md +++ b/README.md @@ -55,18 +55,23 @@ This is a small example program you can run to see ragas in action! ```python from ragas import evaluate +from datasets import Dataset import os os.environ["OPENAI_API_KEY"] = "your-openai-key" -ds = Dataset({ - features: ['question','context','answer'], - num_rows: 25 -}) -results = evaluate(ds) +# prepare your huggingface dataset in the format +# Dataset({ +# features: ['question','contexts','answer'], +# num_rows: 25 +# }) + +dataset: Dataset + +results = evaluate(dataset) ``` -If you want a more in-depth explanation of core components, check out our quick-start notebook +If you want a more in-depth explanation of core components, check out our [quick-start notebook](./examples/quickstart.ipynb) ## :luggage: Metrics Ragas measures your pipeline's performance against two dimensions diff --git a/examples/quickstart.ipynb b/examples/quickstart.ipynb index 2f7ccdc38..3793695eb 100644 --- a/examples/quickstart.ipynb +++ b/examples/quickstart.ipynb @@ -2,62 +2,101 @@ "cells": [ { "cell_type": "markdown", - "id": "aeb5819b", + "id": "2e63f667", "metadata": {}, "source": [ - "# Quickstart" + "# Quickstart\n", + "\n", + "Welcome to the ragas quickstart. We're going to get you up and running with ragas as quickly as possible so that you can go back to improving your Retrieval Augmented Generation pipelines while this library makes sure your changes are actually improving your entire pipeline.\n", + "\n", + "To kick things off, let's start with the data." ] }, { "cell_type": "code", "execution_count": 1, - "id": "22c7dd25", + "id": "57585b55", "metadata": {}, "outputs": [], "source": [ - "# only run this if your have an editable install\n", "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", - "id": "5af47053", + "id": "c77789bb", + "metadata": {}, + "source": [ + "Ragas also uses OpenAI for running the metrics, so make sure you have your OpenAI key ready and available in your environment." ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "0b7179f7", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "os.environ[\"OPENAI_API_KEY\"] = \"your-openai-key\"" ] }, + { + "cell_type": "markdown", + "id": "06c9fc7d", "metadata": {}, "source": [ - "### load your data\n", + "## The Data\n", + "\n", + "Ragas performs a `ground_truth`-free evaluation of your RAG pipelines. This is because, for most people, building a gold-labeled dataset that represents the distribution they see in production is a very expensive process.\n", "\n", - "For this quickstart we are going to be using a dataset that we prepared from [eli5](https://huggingface.co/datasets/eli5) dataset with the models response. The dataset is available in [huggingface](https://huggingface.co/datasets/explodinggradients/eli5-test).\n", + "Hence, to work with ragas all you need is the following data:\n", + "- question: `list[str]` - These are the questions your RAG pipeline will be evaluated on. 
\n", + "- answer: `list[str]` - The answer generated from the RAG pipeline and give to the user.\n", + "- contexts: `list[list[str]]` - The contexts which where passed into the LLM to answer the question.\n", "\n", - "The dataset is of the following format\n", - "| column name | type | description |\n", - "|----------------|-----------|-----------------------------------------------------------------------------------|\n", - "| prompt | str | the prompt/question to answer |\n", - "| context | str | context string that has any relevent priors the LLM needs to answer the questions |\n", - "| references | list[str] | reference documents the LLM can use to respond to the prompt |\n", - "| ground_truth | list[str] | accepted answers given by human annotators |\n", - "| generated_text | str | the generated output from the LLM |" + "Ideally your list of questions should reflect the questions your users give, including those that you have been problamatic in the past.\n", + "\n", + "Here we're using an example dataset from on of the baselines we created for the [Financial Opinion Mining and Question Answering (fiqa) Dataset](https://sites.google.com/view/fiqa/) we created. If you want to want to know more about the baseline, feel free to check the `experiements/baseline` section" ] }, { "cell_type": "code", "execution_count": 2, - "id": "2bc9fb9d", + "id": "b658e02f", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "Found cached dataset parquet (/home/jjmachan/.cache/huggingface/datasets/explodinggradients___parquet/explodinggradients--eli5-test-217d92ce20e19249/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)\n" + "Found cached dataset fiqa (/home/jjmachan/.cache/huggingface/datasets/explodinggradients___fiqa/ragas_eval/1.0.0/3dc7b639f5b4b16509a3299a2ceb78bf5fe98ee6b5fee25e7d5e4d290c88efb8)\n" ] }, { "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "c4a622ce9f774cf7b79b46d9fcf05f69", + "version_major": 2, + "version_minor": 0 + }, "text/plain": [ - "Dataset({\n", - " features: ['context', 'prompt', 'ground_truth', 'references', 'generated_text'],\n", - " num_rows: 500\n", + " 0%| | 0/1 [00:00{'BERTScore_cosine': 0.37552570906095206, 'edit_ratio': 0.41482407945510713, 'BLEU': 0.010848577619569451}\n", - "\n" - ], "text/plain": [ - "\u001b[1m{\u001b[0m\u001b[32m'BERTScore_cosine'\u001b[0m: \u001b[1;36m0.37552570906095206\u001b[0m, \u001b[32m'edit_ratio'\u001b[0m: \u001b[1;36m0.41482407945510713\u001b[0m, \u001b[32m'BLEU'\u001b[0m: \u001b[1;36m0.010848577619569451\u001b[0m\u001b[1m}\u001b[0m\n" + "{'ragas_score': 0.860, 'context_relavency': 0.817, 'factuality': 0.892, 'answer_relevancy': 0.874}" ] }, + "execution_count": 8, "metadata": {}, - "output_type": "display_data" + "output_type": "execute_result" } ], "source": [ - "from rich.pretty import pprint\n", + "from ragas import evaluate\n", + "\n", + "result = evaluate(\n", + " fiqa_eval[\"baseline\"], metrics=[context_relevancy, factuality, answer_relevancy]\n", + ")\n", "\n", - "pprint(result)" + "result" ] }, { "cell_type": "markdown", - "id": "eb07bbec", + "id": "a2dc0ec2", "metadata": {}, "source": [ - "you can access individual metric results via `result['']`. it also has a `.describe()` function to show the distribution of the results and you can access the individual score from `.scores` attribute." + "and there you have the it, all the scores you need. 
`ragas_score` gives you a single metric that you can use, while the other ones measure the different parts of your pipeline.\n", + "\n", + "Now, if you want to dig into the results and figure out examples where your pipeline performed badly or really well, you can easily convert them into a pandas DataFrame and use your standard analytics tools too!" ] }, { "cell_type": "code", - "execution_count": 16, - "id": "4c8c51b1", + "execution_count": 12, + "id": "8686bf53", "metadata": {}, "outputs": [ { @@ -243,104 +243,127 @@ " \n", " \n", " \n", - " BERTScore_cosine\n", - " edit_ratio\n", - " BLEU\n", + " question\n", + " ground_truths\n", + " answer\n", + " contexts\n", + " context_relavency\n", + " factuality\n", + " answer_relevancy\n", " \n", " \n", " \n", " \n", - " mean\n", - " 0.375526\n", - " 0.414824\n", - " 1.084858e-02\n", - " \n", - " \n", - " 25%\n", - " 0.212339\n", - " 0.399876\n", - " 3.489775e-155\n", + " 0\n", + " How to deposit a cheque issued to an associate...\n", + " [Have the check reissued to the proper payee.J...\n", + " \\nThe best way to deposit a cheque issued to a...\n", + " [Just have the associate sign the back and the...\n", + " 0.867\n", + " 1.0\n", + " 0.922\n", " \n", " \n", - " 50%\n", - " 0.332697\n", - " 0.429187\n", - " 4.318061e-79\n", + " 1\n", + " Can I send a money order from USPS as a business?\n", + " [Sure you can. You can fill in whatever you w...\n", + " \\nYes, you can send a money order from USPS as...\n", + " [Sure you can. You can fill in whatever you w...\n", + " 0.855\n", + " 1.0\n", + " 0.923\n", " \n", " \n", - " 75%\n", - " 0.532642\n", - " 0.449509\n", - " 1.525948e-05\n", + " 2\n", + " 1 EIN doing business under multiple business n...\n", + " [You're confusing a lot of things here. Compan...\n", + " \\nYes, it is possible to have one EIN doing bu...\n", + " [You're confusing a lot of things here. Compan...\n", + " 0.768\n", + " 1.0\n", + " 0.824\n", " \n", " \n", - " min\n", - " 0.007017\n", - " 0.102182\n", - " 4.029193e-232\n", + " 3\n", + " Applying for and receiving business credit\n", + " [\"I'm afraid the great myth of limited liabili...\n", + " \\nApplying for and receiving business credit c...\n", + " [Set up a meeting with the bank that handles y...\n", + " 0.781\n", + " 1.0\n", + " 0.830\n", " \n", " \n", - " max\n", - " 0.910680\n", - " 0.572917\n", - " 1.506915e-01\n", - " \n", - " \n", - " std\n", - " 0.207559\n", - " 0.058072\n", - " 2.343307e-02\n", + " 4\n", + " 401k Transfer After Business Closure\n", + " [You should probably consult an attorney. Howe...\n", + " \\nIf your employer has closed and you need to ...\n", + " [The time horizon for your 401K/IRA is essenti...\n", + " 0.737\n", + " 1.0\n", + " 0.753\n", " \n", " \n", "\n", "" ], "text/plain": [ - " BERTScore_cosine edit_ratio BLEU\n", - "mean 0.375526 0.414824 1.084858e-02\n", - "25% 0.212339 0.399876 3.489775e-155\n", - "50% 0.332697 0.429187 4.318061e-79\n", - "75% 0.532642 0.449509 1.525948e-05\n", - "min 0.007017 0.102182 4.029193e-232\n", - "max 0.910680 0.572917 1.506915e-01\n", - "std 0.207559 0.058072 2.343307e-02" + " question \\\n", + "0 How to deposit a cheque issued to an associate... \n", + "1 Can I send a money order from USPS as a business? \n", + "2 1 EIN doing business under multiple business n... \n", + "3 Applying for and receiving business credit \n", + "4 401k Transfer After Business Closure \n", + "\n", + " ground_truths \\\n", + "0 [Have the check reissued to the proper payee.J... \n", + "1 [Sure you can. You can fill in whatever you w... 
\n", + "2 [You're confusing a lot of things here. Compan... \n", + "3 [\"I'm afraid the great myth of limited liabili... \n", + "4 [You should probably consult an attorney. Howe... \n", + "\n", + " answer \\\n", + "0 \\nThe best way to deposit a cheque issued to a... \n", + "1 \\nYes, you can send a money order from USPS as... \n", + "2 \\nYes, it is possible to have one EIN doing bu... \n", + "3 \\nApplying for and receiving business credit c... \n", + "4 \\nIf your employer has closed and you need to ... \n", + "\n", + " contexts context_relavency \\\n", + "0 [Just have the associate sign the back and the... 0.867 \n", + "1 [Sure you can. You can fill in whatever you w... 0.855 \n", + "2 [You're confusing a lot of things here. Compan... 0.768 \n", + "3 [Set up a meeting with the bank that handles y... 0.781 \n", + "4 [The time horizon for your 401K/IRA is essenti... 0.737 \n", + "\n", + " factuality answer_relevancy \n", + "0 1.0 0.922 \n", + "1 1.0 0.923 \n", + "2 1.0 0.824 \n", + "3 1.0 0.830 \n", + "4 1.0 0.753 " ] }, - "execution_count": 16, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "from pandas import DataFrame\n", - "\n", - "# view with pandas\n", - "df = DataFrame(result.describe())\n", - "df" + "df = result.to_pandas()\n", + "df.head()" ] }, { - "cell_type": "code", - "execution_count": 29, - "id": "421c60ab", + "cell_type": "markdown", + "id": "f668fce1", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Dataset({\n", - " features: ['BERTScore_cosine', 'edit_ratio', 'BLEU'],\n", - " num_rows: 500\n", - "})" - ] - }, - "execution_count": 29, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ - "result.scores" + "And thats it!\n", + "\n", + "You can check out the [ragas in action] notebook to get a feel of what is like to use it while trying to improve your pipelines.\n", + "\n", + "if you have any suggestion/feedbacks/things your not happy about, please do share it in the [issue section](https://github.com/explodinggradients/ragas/issues). 
We love hearing from you 😁" ] } ], @@ -360,7 +383,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.11" + "version": "3.10.12" } }, "nbformat": 4, diff --git a/experiments/assesments/metrics_assesments.ipynb b/experiments/assesments/metrics_assesments.ipynb index cb8f06208..257da13fe 100644 --- a/experiments/assesments/metrics_assesments.ipynb +++ b/experiments/assesments/metrics_assesments.ipynb @@ -64,7 +64,7 @@ "metadata": {}, "outputs": [], "source": [ - "os.chdir('/Users/shahules/belar/src/')" + "os.chdir(\"/Users/shahules/belar/src/\")" ] }, { diff --git a/experiments/baselines/fiqa/dataset-exploration-and-baseline.ipynb b/experiments/baselines/fiqa/dataset-exploration-and-baseline.ipynb index be1c69a9c..97ca5e67a 100644 --- a/experiments/baselines/fiqa/dataset-exploration-and-baseline.ipynb +++ b/experiments/baselines/fiqa/dataset-exploration-and-baseline.ipynb @@ -48,7 +48,11 @@ "from beir.datasets.data_loader import GenericDataLoader\n", "\n", "dataset = \"fiqa\"\n", - "url = \"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip\".format(dataset)\n", + "url = (\n", + " \"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip\".format(\n", + " dataset\n", + " )\n", + ")\n", "data_path = util.download_and_unzip(url, \"datasets\")" ] }, @@ -218,7 +222,7 @@ "source": [ "with open(os.path.join(data_path, \"corpus.jsonl\")) as f:\n", " cs = [pd.Series(json.loads(l)) for l in f.readlines()]\n", - " \n", + "\n", "corpus_df = pd.DataFrame(cs)\n", "corpus_df" ] @@ -299,9 +303,7 @@ } ], "source": [ - "corpus_df = corpus_df.rename(columns={\n", - " \"_id\": \"corpus-id\", \"text\": \"ground_truth\"\n", - "})\n", + "corpus_df = corpus_df.rename(columns={\"_id\": \"corpus-id\", \"text\": \"ground_truth\"})\n", "corpus_df = corpus_df.drop(columns=[\"title\", \"metadata\"])\n", "corpus_df[\"corpus-id\"] = corpus_df[\"corpus-id\"].astype(int)\n", "corpus_df.head()" @@ -387,9 +389,7 @@ " qs = [pd.Series(json.loads(l)) for l in f.readlines()]\n", "\n", "queries_df = pd.DataFrame(qs)\n", - "queries_df = queries_df.rename(columns={\n", - " \"_id\": \"query-id\", \"text\": \"question\"\n", - "})\n", + "queries_df = queries_df.rename(columns={\"_id\": \"query-id\", \"text\": \"question\"})\n", "queries_df = queries_df.drop(columns=[\"metadata\"])\n", "queries_df[\"query-id\"] = queries_df[\"query-id\"].astype(int)\n", "queries_df.head()" @@ -474,10 +474,10 @@ "splits = [\"dev\", \"test\", \"train\"]\n", "split_df = {}\n", "for s in splits:\n", - " split_df[s] = pd.read_csv(\n", - " os.path.join(data_path, f\"qrels/{s}.tsv\"), sep=\"\\t\"\n", - " ).drop(columns=[\"score\"])\n", - " \n", + " split_df[s] = pd.read_csv(os.path.join(data_path, f\"qrels/{s}.tsv\"), sep=\"\\t\").drop(\n", + " columns=[\"score\"]\n", + " )\n", + "\n", "split_df[\"dev\"].head()" ] }, @@ -515,10 +515,14 @@ " df = queries_df.merge(split_df[split], on=\"query-id\")\n", " df = df.merge(corpus_df, on=\"corpus-id\")\n", " df = df.drop(columns=[\"corpus-id\"])\n", - " grouped = df.groupby('query-id').apply(lambda x: pd.Series({\n", - " 'question': x['question'].sample().values[0],\n", - " 'ground_truths': x['ground_truth'].tolist()\n", - " }))\n", + " grouped = df.groupby(\"query-id\").apply(\n", + " lambda x: pd.Series(\n", + " {\n", + " \"question\": x[\"question\"].sample().values[0],\n", + " \"ground_truths\": x[\"ground_truth\"].tolist(),\n", + " }\n", + " )\n", + " )\n", "\n", " grouped = grouped.reset_index()\n", " grouped = 
grouped.drop(columns=\"query-id\")\n", @@ -797,11 +801,8 @@ "assert os.path.exists(path_to_ds_repo), f\"{path_to_ds_repo} doesnot exist!\"\n", "\n", "for s in final_split_df:\n", - " final_split_df[s].to_csv(\n", - " os.path.join(path_to_ds_repo, f\"{s}.csv\"),\n", - " index=False\n", - " )\n", - " \n", + " final_split_df[s].to_csv(os.path.join(path_to_ds_repo, f\"{s}.csv\"), index=False)\n", + "\n", "corpus_df.to_csv(os.path.join(path_to_ds_repo, \"corpus.csv\"), index=False)" ] }, @@ -1009,18 +1010,11 @@ "from llama_index.node_parser import SimpleNodeParser\n", "from langchain.text_splitter import TokenTextSplitter\n", "\n", - "spliter = TokenTextSplitter(\n", - " chunk_size = 100,\n", - " chunk_overlap = 50\n", - ")\n", + "spliter = TokenTextSplitter(chunk_size=100, chunk_overlap=50)\n", "\n", - "parser = SimpleNodeParser(\n", - " text_splitter=spliter\n", - ")\n", + "parser = SimpleNodeParser(text_splitter=spliter)\n", "\n", - "nodes = parser.get_nodes_from_documents(\n", - " documents=docs\n", - ")" + "nodes = parser.get_nodes_from_documents(documents=docs)" ] }, { @@ -1088,16 +1082,12 @@ "source": [ "# create index\n", "index = GPTVectorStoreIndex.from_documents(\n", - " documents=docs, \n", + " documents=docs,\n", " service_context=openai_sc,\n", ")\n", "\n", "# query with embed_model specified\n", - "qe = index.as_query_engine(\n", - " mode=\"embedding\", \n", - " verbose=True, \n", - " service_context=openai_sc\n", - ")" + "qe = index.as_query_engine(mode=\"embedding\", verbose=True, service_context=openai_sc)" ] }, { @@ -1171,10 +1161,7 @@ "\n", "# query with embed_model specified\n", "qe = index.as_query_engine(\n", - " mode=\"embedding\", \n", - " verbose=True, \n", - " service_context=openai_sc,\n", - " use_async = False\n", + " mode=\"embedding\", verbose=True, service_context=openai_sc, use_async=False\n", ")" ] }, @@ -1195,15 +1182,13 @@ "\n", "# configure retriever\n", "retriever = VectorIndexRetriever(\n", - " index=index, \n", + " index=index,\n", " similarity_top_k=3,\n", ")\n", "\n", "# configure response synthesizer\n", "response_synthesizer = ResponseSynthesizer.from_args(\n", - " node_postprocessors=[\n", - " SimilarityPostprocessor(similarity_cutoff=0.7)\n", - " ]\n", + " node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)]\n", ")\n", "\n", "# assemble query engine\n", @@ -1257,9 +1242,10 @@ " r = qe.query(row[\"question\"])\n", " row[\"answer\"] = r.response\n", " row[\"contexts\"] = [sn.node.text for sn in r.source_nodes]\n", - " \n", + "\n", " return row\n", "\n", + "\n", "# generate_response(test_ds[0])" ] }, @@ -1530,10 +1516,7 @@ "from ragas.metrics import factuality, answer_relevancy, context_relevancy\n", "from ragas import evaluate\n", "\n", - "evaluate(\n", - " gen_ds, \n", - " metrics=[factuality, answer_relevancy, context_relevancy]\n", - ")" + "evaluate(gen_ds, metrics=[factuality, answer_relevancy, context_relevancy])" ] }, { diff --git a/experiments/baselines/fiqa/improving-baselines.ipynb b/experiments/baselines/fiqa/improving-baselines.ipynb index 5d5a8fc50..23002df8c 100644 --- a/experiments/baselines/fiqa/improving-baselines.ipynb +++ b/experiments/baselines/fiqa/improving-baselines.ipynb @@ -22,7 +22,7 @@ "name": "stderr", "output_type": "stream", "text": [ - "Found cached dataset fiqa (/home/jjmachan/.cache/huggingface/datasets/explodinggradients___fiqa/main/1.0.0/953cfddc4a440cf2e290172be2563e5b51a953f2e4266940fc2b311e135cea69)\n" + "Found cached dataset fiqa 
(/home/jjmachan/.cache/huggingface/datasets/explodinggradients___fiqa/main/1.0.0/3dc7b639f5b4b16509a3299a2ceb78bf5fe98ee6b5fee25e7d5e4d290c88efb8)\n" ] }, { @@ -97,10 +97,7 @@ "\n", "# query with embed_model specified\n", "qe = index.as_query_engine(\n", - " mode=\"embedding\", \n", - " verbose=True, \n", - " service_context=openai_sc,\n", - " use_async = False\n", + " mode=\"embedding\", verbose=True, service_context=openai_sc, use_async=False\n", ")" ] }, @@ -121,15 +118,13 @@ "\n", "# configure retriever\n", "retriever = VectorIndexRetriever(\n", - " index=index, \n", + " index=index,\n", " similarity_top_k=1,\n", ")\n", "\n", "# configure response synthesizer\n", "response_synthesizer = ResponseSynthesizer.from_args(\n", - " node_postprocessors=[\n", - " SimilarityPostprocessor(similarity_cutoff=0.7)\n", - " ]\n", + " node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)]\n", ")\n", "\n", "# assemble query engine\n", @@ -150,9 +145,10 @@ " r = qe.query(row[\"question\"])\n", " row[\"answer\"] = r.response\n", " row[\"contexts\"] = [sn.node.text for sn in r.source_nodes]\n", - " \n", + "\n", " return row\n", "\n", + "\n", "# generate_response(test_ds[0])" ] }, @@ -272,10 +268,7 @@ "from ragas.metrics import factuality, answer_relevancy, context_relevancy\n", "from ragas import evaluate\n", "\n", - "evaluate(\n", - " gen_ds, \n", - " metrics=[factuality, answer_relevancy, context_relevancy]\n", - ")" + "evaluate(gen_ds, metrics=[factuality, answer_relevancy, context_relevancy])" ] }, { @@ -304,10 +297,7 @@ "\n", "# query with embed_model specified\n", "qe = index.as_query_engine(\n", - " mode=\"embedding\", \n", - " verbose=True, \n", - " service_context=openai_sc,\n", - " use_async = False\n", + " mode=\"embedding\", verbose=True, service_context=openai_sc, use_async=False\n", ")" ] }, @@ -328,15 +318,13 @@ "\n", "# configure retriever\n", "retriever = VectorIndexRetriever(\n", - " index=index, \n", + " index=index,\n", " similarity_top_k=1,\n", ")\n", "\n", "# configure response synthesizer\n", "response_synthesizer = ResponseSynthesizer.from_args(\n", - " node_postprocessors=[\n", - " SimilarityPostprocessor(similarity_cutoff=0.7)\n", - " ]\n", + " node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)]\n", ")\n", "\n", "# assemble query engine\n", @@ -357,15 +345,16 @@ " r = qe.query(row[\"question\"])\n", " row[\"answer\"] = r.response\n", " row[\"contexts\"] = [sn.node.text for sn in r.source_nodes]\n", - " \n", + "\n", " return row\n", "\n", + "\n", "# generate_response(test_ds[0])" ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 7, "id": "661ad12b", "metadata": {}, "outputs": [ @@ -383,13 +372,6 @@ "metadata": {}, "output_type": "display_data" }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Retrying langchain.llms.openai.completion_with_retry.._completion_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..\n" - ] - }, { "data": { "text/plain": [ @@ -399,7 +381,7 @@ "})" ] }, - "execution_count": 12, + "execution_count": 7, "metadata": {}, "output_type": "execute_result" } @@ -411,21 +393,29 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 8, "id": "96e08092", "metadata": {}, "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "46e26286ecbc4a0891f8ee228898ca20", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Downloading model.safetensors: 0%| 
| 0.00/892M [00:00<?, ?B/s]" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Parameter 'function'=<function ... at 0x...> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.\n" + ] + }, { "data": { "application/vnd.jupyter.widget-view+json": { "})" ] }, - "execution_count": 23, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], @@ -645,16 +680,36 @@ "from ragas.metrics import factuality, answer_relevancy, context_relevancy\n", "from ragas import evaluate\n", - "evaluate(\n", - " gen_ds, \n", - " metrics=[factuality, answer_relevancy, context_relevancy]\n", - ")" + "evaluate(gen_ds, metrics=[factuality, answer_relevancy, context_relevancy])" ] }, + { + "cell_type": "code", + "execution_count": 3, + "id": "4301895f", + "metadata": {}, + "outputs": [ + { + "ename": "NameError", + "evalue": "name 'gen_ds' is not defined", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[3], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m evals[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcohere_reranked\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m \u001b[43mgen_ds\u001b[49m\n\u001b[1;32m 2\u001b[0m evals\n", + "\u001b[0;31mNameError\u001b[0m: name 'gen_ds' is not defined" ] } ], "source": [ + "evals[\"cohere_reranked\"] = gen_ds\n", + "evals" ] }, { "cell_type": "code", "execution_count": null, - "id": "a0991e58", + "id": "02cb461c", "metadata": {}, "outputs": [], "source": [] }, @@ -676,7 +731,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.11" + "version": "3.10.12" } }, "nbformat": 4, diff --git a/requirements/dev.txt b/requirements/dev.txt index 9b805b103..3f3a1f505 100644 --- a/requirements/dev.txt +++ b/requirements/dev.txt @@ -4,3 +4,4 @@ isort black[jupyter] pyright langchain +notebook diff --git a/src/ragas/evaluation.py b/src/ragas/evaluation.py index a612ab3b2..51f7c30fe 100644 --- a/src/ragas/evaluation.py +++ b/src/ragas/evaluation.py @@ -107,20 +107,6 @@ def __post_init__(self): if len(values) == 3: self["ragas_score"] = len(values) / np.sum(1.0 / np.array(values)) - def describe(self): - description = {} - for cn in self.scores.column_names: - description[cn] = { - "mean": np.mean(self.scores[cn]), - "25%": np.percentile(self.scores[cn], 25), - "50%": np.percentile(self.scores[cn], 50), - "75%": np.percentile(self.scores[cn], 75), - "min": np.min(self.scores[cn]), - "max": np.max(self.scores[cn]), - "std": np.std(self.scores[cn]), - } - return description - def to_pandas(self, batch_size: int | None = None, batched: bool = False): if self.dataset is None: raise ValueError("dataset is not provided for the results class") @@ -132,6 +118,6 @@ def to_pandas(self, batch_size: int | None = None, batched: bool = False): def __repr__(self) -> str: scores = self.copy() ragas_score = scores.pop("ragas_score") - score_strs = [f"'ragas_score': {ragas_score:0.3f}"] - score_strs.extend([f"'{k}': {v:0.3f}" for k, v in scores.items()]) + score_strs = [f"'ragas_score': {ragas_score:0.4f}"] + 
score_strs.extend([f"'{k}': {v:0.4f}" for k, v in scores.items()]) return "{" + ", ".join(score_strs) + "}"
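Taken together, the changes above settle on a small public surface: `evaluate()` returns a `Result`, the removed `Result.describe()` is superseded by `Result.to_pandas()`, and the new `__repr__` prints the rounded score dict. Below is a minimal sketch of that post-change quickstart flow. It is an illustration rather than part of the diff: the `explodinggradients/fiqa` dataset name and its `ragas_eval` config are inferred from the cached-dataset paths in the notebook outputs, and a working `OPENAI_API_KEY` is assumed to be set.

```python
# Illustrative sketch of the post-change quickstart flow; the dataset
# name/config are inferred from the notebook outputs in this diff.
import os

from datasets import load_dataset

from ragas import evaluate
from ragas.metrics import answer_relevancy, context_relevancy, factuality

# ragas calls OpenAI under the hood to score the metrics
os.environ["OPENAI_API_KEY"] = "your-openai-key"

# each row carries: question (str), answer (str), contexts (list[str])
fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")

result = evaluate(
    fiqa_eval["baseline"],
    metrics=[context_relevancy, factuality, answer_relevancy],
)
# the new __repr__ prints something like
# {'ragas_score': ..., 'context_relavency': ..., 'factuality': ..., 'answer_relevancy': ...}
print(result)

# Result.describe() is removed above; to_pandas() returns per-question rows,
# so pandas' own describe() yields the same mean/percentile/min/max/std summary.
df = result.to_pandas()
print(df.describe())
```

Dropping `describe()` in favour of `to_pandas()` keeps `Result` thin: once the per-question scores live in a DataFrame, pandas already provides the distribution summary the removed method computed by hand.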