diff --git a/RAGoon : Improve Large Language Models retrieval using dynamic web-search.ipynb b/RAGoon : Improve Large Language Models retrieval using dynamic web-search.ipynb
new file mode 100644
index 0000000..1e60f08
--- /dev/null
+++ b/RAGoon : Improve Large Language Models retrieval using dynamic web-search.ipynb
@@ -0,0 +1,256 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": [],
+ "authorship_tag": "ABX9TyMpTSF4s+d5/bhZIL8AVxpK",
+ "include_colab_link": true
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ }
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "view-in-github",
+ "colab_type": "text"
+ },
+ "source": [
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# RAGoon : Improve Large Language Models retrieval using dynamic web-search ⚡\n",
+ "[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg)](https://badge.fury.io/py/tensorflow) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)\n",
+ "\n",
+ "![Plot](https://github.com/louisbrulenaudet/ragoon/blob/main/thumbnail.png?raw=true)\n",
+ "\n",
+ "RAGoon is a Python library that aims to improve the performance of language models by providing contextually relevant information through retrieval-based querying, web scraping, and data augmentation techniques. It offers an integration of various APIs, enabling users to retrieve information from the web, enrich it with domain-specific knowledge, and feed it to language models for more informed responses.\n",
+ "\n",
+ "RAGoon's core functionality revolves around the concept of few-shot learning, where language models are provided with a small set of high-quality examples to enhance their understanding and generate more accurate outputs. By curating and retrieving relevant data from the web, RAGoon equips language models with the necessary context and knowledge to tackle complex queries and generate insightful responses.\n",
+ "\n",
+ "## Usage Example\n",
+ "Here's an example of how to use RAGoon:\n",
+ "\n",
+ "```python\n",
+ "from groq import Groq\n",
+ "# from openai import OpenAI\n",
+ "from ragoon import RAGoon\n",
+ "\n",
+ "# Initialize RAGoon instance\n",
+ "ragoon = RAGoon(\n",
+ " google_api_key=\"your_google_api_key\",\n",
+ " google_cx=\"your_google_cx\",\n",
+ " completion_client=Groq(api_key=\"your_groq_api_key\")\n",
+ ")\n",
+ "\n",
+ "# Search and get results\n",
+ "query = \"I want to do a left join in python polars\"\n",
+ "results = ragoon.search(\n",
+ " query=query,\n",
+ " completion_model=\"Llama3-70b-8192\",\n",
+ " max_tokens=512,\n",
+ " temperature=1,\n",
+ ")\n",
+ "\n",
+ "# Print results\n",
+ "print(results)\n",
+ "```\n",
+ "\n",
+ "## Citing this project\n",
+ "If you use this code in your research, please use the following BibTeX entry.\n",
+ "\n",
+ "```BibTeX\n",
+ "@misc{louisbrulenaudet2024,\n",
+ "\tauthor = {Louis Brulé Naudet},\n",
+ "\ttitle = {RAGoon : Improve Large Language Models retrieval using dynamic web-search},\n",
+ "\thowpublished = {\\url{https://github.com/louisbrulenaudet/ragoon}},\n",
+ "\tyear = {2024}\n",
+ "}\n",
+ "```\n",
+ "## Feedback\n",
+ "If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com)."
+ ],
+ "metadata": {
+ "id": "FtezbftS8uie"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Installation"
+ ],
+ "metadata": {
+ "id": "z-JUKvX_9EV5"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "Zbhnjl9c6Ztc",
+ "outputId": "5d02bc21-b845-4d61-8cad-a0c6a60ec78e"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Collecting ragoon\n",
+ " Downloading ragoon-0.0.3-py3-none-any.whl (14 kB)\n",
+ "Collecting groq\n",
+ " Downloading groq-0.8.0-py3-none-any.whl (105 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m105.4/105.4 kB\u001b[0m \u001b[31m2.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hCollecting openai\n",
+ " Downloading openai-1.30.3-py3-none-any.whl (320 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m320.6/320.6 kB\u001b[0m \u001b[31m5.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hRequirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.10/dist-packages (4.12.3)\n",
+ "Collecting httpx\n",
+ " Downloading httpx-0.27.0-py3-none-any.whl (75 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.6/75.6 kB\u001b[0m \u001b[31m6.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hRequirement already satisfied: google-api-python-client in /usr/local/lib/python3.10/dist-packages (2.84.0)\n",
+ "Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from groq) (3.7.1)\n",
+ "Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from groq) (1.7.0)\n",
+ "Requirement already satisfied: pydantic<3,>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from groq) (2.7.1)\n",
+ "Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from groq) (1.3.1)\n",
+ "Requirement already satisfied: typing-extensions<5,>=4.7 in /usr/local/lib/python3.10/dist-packages (from groq) (4.11.0)\n",
+ "Requirement already satisfied: tqdm>4 in /usr/local/lib/python3.10/dist-packages (from openai) (4.66.4)\n",
+ "Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4) (2.5)\n",
+ "Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx) (2024.2.2)\n",
+ "Collecting httpcore==1.* (from httpx)\n",
+ " Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.9/77.9 kB\u001b[0m \u001b[31m6.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hRequirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from httpx) (3.7)\n",
+ "Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx)\n",
+ " Downloading h11-0.14.0-py3-none-any.whl (58 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m4.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hRequirement already satisfied: httplib2<1dev,>=0.15.0 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client) (0.22.0)\n",
+ "Requirement already satisfied: google-auth<3.0.0dev,>=1.19.0 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client) (2.27.0)\n",
+ "Requirement already satisfied: google-auth-httplib2>=0.1.0 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client) (0.1.1)\n",
+ "Requirement already satisfied: google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client) (2.11.1)\n",
+ "Requirement already satisfied: uritemplate<5,>=3.0.1 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client) (4.1.1)\n",
+ "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->groq) (1.2.1)\n",
+ "Requirement already satisfied: googleapis-common-protos<2.0.dev0,>=1.56.2 in /usr/local/lib/python3.10/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-api-python-client) (1.63.0)\n",
+ "Requirement already satisfied: protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0.dev0,>=3.19.5 in /usr/local/lib/python3.10/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-api-python-client) (3.20.3)\n",
+ "Requirement already satisfied: requests<3.0.0.dev0,>=2.18.0 in /usr/local/lib/python3.10/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-api-python-client) (2.31.0)\n",
+ "Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from google-auth<3.0.0dev,>=1.19.0->google-api-python-client) (5.3.3)\n",
+ "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from google-auth<3.0.0dev,>=1.19.0->google-api-python-client) (0.4.0)\n",
+ "Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.10/dist-packages (from google-auth<3.0.0dev,>=1.19.0->google-api-python-client) (4.9)\n",
+ "Requirement already satisfied: pyparsing!=3.0.0,!=3.0.1,!=3.0.2,!=3.0.3,<4,>=2.4.2 in /usr/local/lib/python3.10/dist-packages (from httplib2<1dev,>=0.15.0->google-api-python-client) (3.1.2)\n",
+ "Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic<3,>=1.9.0->groq) (0.7.0)\n",
+ "Requirement already satisfied: pydantic-core==2.18.2 in /usr/local/lib/python3.10/dist-packages (from pydantic<3,>=1.9.0->groq) (2.18.2)\n",
+ "Requirement already satisfied: pyasn1<0.7.0,>=0.4.6 in /usr/local/lib/python3.10/dist-packages (from pyasn1-modules>=0.2.1->google-auth<3.0.0dev,>=1.19.0->google-api-python-client) (0.6.0)\n",
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0.dev0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-api-python-client) (3.3.2)\n",
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests<3.0.0.dev0,>=2.18.0->google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0dev,>=1.31.5->google-api-python-client) (2.0.7)\n",
+ "Installing collected packages: ragoon, h11, httpcore, httpx, openai, groq\n",
+ "Successfully installed groq-0.8.0 h11-0.14.0 httpcore-1.0.5 httpx-0.27.0 openai-1.30.3 ragoon-0.0.3\n"
+ ]
+ }
+ ],
+ "source": [
+ "!pip3 install ragoon groq openai beautifulsoup4 httpx google-api-python-client"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Configuration"
+ ],
+ "metadata": {
+ "id": "64PHbyTT9KGW"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from google.colab import userdata\n",
+ "from groq import Groq\n",
+ "from ragoon import RAGoon"
+ ],
+ "metadata": {
+ "id": "VdrJbXcH67J2"
+ },
+ "execution_count": 2,
+ "outputs": []
+ },
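+ {
+ "cell_type": "markdown",
+ "source": [
+ "The keys above are read from Colab's Secrets panel via `google.colab.userdata`. Outside Colab, the same setup can be sketched with environment variables (the variable names here are illustrative, not required by RAGoon):\n",
+ "\n",
+ "```python\n",
+ "import os\n",
+ "\n",
+ "from groq import Groq\n",
+ "from ragoon import RAGoon\n",
+ "\n",
+ "# Illustrative variable names; store your keys however your environment prefers.\n",
+ "ragoon = RAGoon(\n",
+ "    google_api_key=os.environ[\"GOOGLE_API_KEY\"],\n",
+ "    google_cx=os.environ[\"GOOGLE_CX\"],\n",
+ "    completion_client=Groq(api_key=os.environ[\"GROQ_API_KEY\"])\n",
+ ")\n",
+ "```"
+ ],
+ "metadata": {}
+ },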
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Usage"
+ ],
+ "metadata": {
+ "id": "pZaAfdeb9OBG"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Initialize RAGoon instance\n",
+ "ragoon = RAGoon(\n",
+ " google_api_key=userdata.get(\"google_api_key\"),\n",
+ " google_cx=userdata.get(\"google_cx\"),\n",
+ " completion_client=Groq(api_key=userdata.get(\"groq_api_key\"))\n",
+ ")\n",
+ "\n",
+ "# Search and get results\n",
+ "query = \"I want to do a left join in python polars\"\n",
+ "results = ragoon.search(\n",
+ " query=query,\n",
+ " completion_model=\"Llama3-70b-8192\",\n",
+ " max_tokens=512,\n",
+ " temperature=1,\n",
+ ")\n",
+ "\n",
+ "# Print results\n",
+ "print(results)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "bjmmC7sp6jhM",
+ "outputId": "e06b49fe-b4b5-4e6c-dfb7-d8c8dc130beb"
+ },
+ "execution_count": 3,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "An error occurred while scraping https://towardsdatascience.com/understand-polars-lack-of-indexes-526ea75e413: Redirect response '307 Temporary Redirect' for url 'https://towardsdatascience.com/understand-polars-lack-of-indexes-526ea75e413'\n",
+ "Redirect location: 'https://medium.com/m/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Funderstand-polars-lack-of-indexes-526ea75e413'\n",
+ "For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/307\n",
+ "An error occurred while scraping https://www.reddit.com/r/Python/comments/ululk1/i_used_a_new_dataframe_library_polars_to_wrangle/: Client error '403 Blocked' for url 'https://www.reddit.com/r/Python/comments/ululk1/i_used_a_new_dataframe_library_polars_to_wrangle/'\n",
+ "For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stderr",
+ "text": [
+ "WARNING:bs4.dammit:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.\n"
+ ]
+ },
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "['Python API reference\\nManipulation/selection\\npolars.DataF...\\npolars.DataFrame.join\\n#\\nDataFrame.\\njoin\\n(\\nother\\n:\\nDataFrame\\n,\\non\\n:\\nstr\\n|\\nExpr\\n|\\nSequence\\n[\\nstr\\n|\\nExpr\\n]\\n|\\nNone\\n=\\nNone\\n,\\nhow\\n:\\nJoinStrategy\\n=\\n\\'inner\\'\\n,\\n*\\n,\\nleft_on\\n:\\nstr\\n|\\nExpr\\n|\\nSequence\\n[\\nstr\\n|\\nExpr\\n]\\n|\\nNone\\n=\\nNone\\n,\\nright_on\\n:\\nstr\\n|\\nExpr\\n|\\nSequence\\n[\\nstr\\n|\\nExpr\\n]\\n|\\nNone\\n=\\nNone\\n,\\nsuffix\\n:\\nstr\\n=\\n\\'_right\\'\\n,\\nvalidate\\n:\\nJoinValidation\\n=\\n\\'m:m\\'\\n,\\njoin_nulls\\n:\\nbool\\n=\\nFalse\\n,\\ncoalesce\\n:\\nbool\\n|\\nNone\\n=\\nNone\\n,\\n)\\n→\\nDataFrame\\n[source]\\n#\\nJoin in SQL-like fashion.\\nParameters\\n:\\nother\\nDataFrame to join with.\\non\\nName(s) of the join columns in both DataFrames.\\nhow\\n{‘inner’, ‘left’, ‘full’, ‘semi’, ‘anti’, ‘cross’}\\nJoin strategy.\\ninner\\nReturns rows that have matching values in both tables\\nleft\\nReturns all rows from the left table, and the matched rows from the\\nright table\\nfull\\nReturns all rows when there is a match in either left or right table\\ncross\\nReturns the Cartesian product of rows from both tables\\nsemi\\nFilter rows that have a match in the right table.\\nanti\\nFilter rows that do not have a match in the right table.\\nNote\\nA left join preserves the row order of the left DataFrame.\\nleft_on\\nName(s) of the left join column(s).\\nright_on\\nName(s) of the right join column(s).\\nsuffix\\nSuffix to append to columns with a duplicate name.\\nvalidate: {‘m:m’, ‘m:1’, ‘1:m’, ‘1:1’}\\nChecks if join is of specified type.\\nmany_to_many\\n“m:m”: default, does not result in checks\\none_to_one\\n“1:1”: check if join keys are unique in both left and right datasets\\none_to_many\\n“1:m”: check if join keys are unique in left dataset\\nmany_to_one\\n“m:1”: check if join keys are unique in right dataset\\nNote\\nThis is currently not supported the streaming engine.\\njoin_nulls\\nJoin on null values. 
By default null values will never produce matches.\\ncoalesce\\nCoalescing behavior (merging of join columns).\\n- None: -> join specific.\\n- True: -> Always coalesce join columns.\\n- False: -> Never coalesce join columns.\\nReturns\\n:\\nDataFrame\\nSee also\\njoin_asof\\nNotes\\nFor joining on columns with categorical data, see\\npolars.StringCache\\n.\\nExamples\\n>>>\\ndf\\n=\\npl\\n.\\nDataFrame\\n(\\n...\\n{\\n...\\n\"foo\"\\n:\\n[\\n1\\n,\\n2\\n,\\n3\\n],\\n...\\n\"bar\"\\n:\\n[\\n6.0\\n,\\n7.0\\n,\\n8.0\\n],\\n...\\n\"ham\"\\n:\\n[\\n\"a\"\\n,\\n\"b\"\\n,\\n\"c\"\\n],\\n...\\n}\\n...\\n)\\n>>>\\nother_df\\n=\\npl\\n.\\nDataFrame\\n(\\n...\\n{\\n...\\n\"apple\"\\n:\\n[\\n\"x\"\\n,\\n\"y\"\\n,\\n\"z\"\\n],\\n...\\n\"ham\"\\n:\\n[\\n\"a\"\\n,\\n\"b\"\\n,\\n\"d\"\\n],\\n...\\n}\\n...\\n)\\n>>>\\ndf\\n.\\njoin\\n(\\nother_df\\n,\\non\\n=\\n\"ham\"\\n)\\nshape: (2, 4)\\n┌─────┬─────┬─────┬───────┐\\n│ foo ┆ bar ┆ ham ┆ apple │\\n│ --- ┆ --- ┆ --- ┆ --- │\\n│ i64 ┆ f64 ┆ str ┆ str │\\n╞═════╪═════╪═════╪═══════╡\\n│ 1 ┆ 6.0 ┆ a ┆ x │\\n│ 2 ┆ 7.0 ┆ b ┆ y │\\n└─────┴─────┴─────┴───────┘\\n>>>\\ndf\\n.\\njoin\\n(\\nother_df\\n,\\non\\n=\\n\"ham\"\\n,\\nhow\\n=\\n\"full\"\\n)\\nshape: (4, 5)\\n┌──────┬──────┬──────┬───────┬───────────┐\\n│ foo ┆ bar ┆ ham ┆ apple ┆ ham_right │\\n│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\\n│ i64 ┆ f64 ┆ str ┆ str ┆ str │\\n╞══════╪══════╪══════╪═══════╪═══════════╡\\n│ 1 ┆ 6.0 ┆ a ┆ x ┆ a │\\n│ 2 ┆ 7.0 ┆ b ┆ y ┆ b │\\n│ null ┆ null ┆ null ┆ z ┆ d │\\n│ 3 ┆ 8.0 ┆ c ┆ null ┆ null │\\n└──────┴──────┴──────┴───────┴───────────┘\\n>>>\\ndf\\n.\\njoin\\n(\\nother_df\\n,\\non\\n=\\n\"ham\"\\n,\\nhow\\n=\\n\"left\"\\n)\\nshape: (3, 4)\\n┌─────┬─────┬─────┬───────┐\\n│ foo ┆ bar ┆ ham ┆ apple │\\n│ --- ┆ --- ┆ --- ┆ --- │\\n│ i64 ┆ f64 ┆ str ┆ str │\\n╞═════╪═════╪═════╪═══════╡\\n│ 1 ┆ 6.0 ┆ a ┆ x │\\n│ 2 ┆ 7.0 ┆ b ┆ y │\\n│ 3 ┆ 8.0 ┆ c ┆ null │\\n└─────┴─────┴─────┴───────┘\\n>>>\\ndf\\n.\\njoin\\n(\\nother_df\\n,\\non\\n=\\n\"ham\"\\n,\\nhow\\n=\\n\"semi\"\\n)\\nshape: (2, 3)\\n┌─────┬─────┬─────┐\\n│ foo ┆ bar ┆ ham │\\n│ --- ┆ --- ┆ --- │\\n│ i64 ┆ f64 ┆ str │\\n╞═════╪═════╪═════╡\\n│ 1 ┆ 6.0 ┆ a │\\n│ 2 ┆ 7.0 ┆ b │\\n└─────┴─────┴─────┘\\n>>>\\ndf\\n.\\njoin\\n(\\nother_df\\n,\\non\\n=\\n\"ham\"\\n,\\nhow\\n=\\n\"anti\"\\n)\\nshape: (1, 3)\\n┌─────┬─────┬─────┐\\n│ foo ┆ bar ┆ ham │\\n│ --- ┆ --- ┆ --- │\\n│ i64 ┆ f64 ┆ str │\\n╞═════╪═════╪═════╡\\n│ 3 ┆ 8.0 ┆ c │\\n└─────┴─────┴─────┘\\nprevious\\npolars.DataFrame.iter_slices\\nnext\\npolars.DataFrame.join_asof\\nOn this page\\nDataFrame.join()', '', 'Polars user guide\\npola-rs/polars\\nUser guide\\nUser guide\\nGetting started\\nInstallation\\nConcepts\\nConcepts\\nData types\\nData types\\nOverview\\nCategorical data\\nData structures\\nContexts\\nExpressions\\nLazy / eager API\\nStreaming API\\nExpressions\\nExpressions\\nBasic operators\\nColumn selections\\nFunctions\\nCasting\\nStrings\\nAggregation\\nMissing data\\nWindow functions\\nFolds\\nLists and Arrays\\nExpression plugins\\nUser-defined functions (Python)\\nThe Struct datatype\\nNumpy\\nTransformations\\nTransformations\\nJoins\\nJoins\\nTable of contents\\nJoin strategies\\nInner join\\nLeft join\\nOuter join\\nCross join\\nSemi join\\nAnti join\\nAsof join\\nConcatenation\\nPivots\\nMelts\\nTime series\\nTime series\\nParsing\\nFiltering\\nGrouping\\nResampling\\nTime zones\\nLazy API\\nLazy API\\nUsage\\nOptimizations\\nSchema\\nQuery plan\\nQuery execution\\nStreaming\\nIO\\nIO\\nCSV\\nExcel\\nParquet\\nJSON files\\nMultiple\\nDatabases\\nCloud 
storage\\nGoogle BigQuery\\nSQL\\nSQL\\nIntroduction\\nSHOW TABLES\\nSELECT\\nCREATE\\nCommon Table Expressions\\nMigrating\\nMigrating\\nComing from Pandas\\nComing from Apache Spark\\nEcosystem\\nMisc\\nMisc\\nMultiprocessing\\nVisualization\\nComparison with other tools\\nAPI reference\\nDevelopment\\nDevelopment\\nContributing\\nContributing\\nIDE configuration\\nTest suite\\nContinuous integration\\nCode style\\nVersioning\\nReleases\\nReleases\\nChangelog\\nUpgrade guides\\nUpgrade guides\\nVersion 0.20\\nVersion 0.19\\nTable of contents\\nJoin strategies\\nInner join\\nLeft join\\nOuter join\\nCross join\\nSemi join\\nAnti join\\nAsof join\\nJoins\\nJoin strategies\\nPolars supports the following join strategies by specifying the\\nhow\\nargument:\\nStrategy\\nDescription\\ninner\\nReturns row with matching keys in\\nboth\\nframes. Non-matching rows in either the left or right frame are discarded.\\nleft\\nReturns all rows in the left dataframe, whether or not a match in the right-frame is found. Non-matching rows have their right columns null-filled.\\nfull\\nReturns all rows from both the left and right dataframe. If no match is found in one frame, columns from the other frame are null-filled.\\ncross\\nReturns the Cartesian product of all rows from the left frame with all rows from the right frame. Duplicates rows are retained; the table length of\\nA\\ncross-joined with\\nB\\nis always\\nlen(A) × len(B)\\n.\\nsemi\\nReturns all rows from the left frame in which the join key is also present in the right frame.\\nanti\\nReturns all rows from the left frame in which the join key is\\nnot\\npresent in the right frame.\\nA separate\\ncoalesce\\nparameter determines whether to merge key columns with the same name from the left and right\\nframes.\\nInner join\\nAn\\ninner\\njoin produces a\\nDataFrame\\nthat contains only the rows where the join key exists in both\\nDataFrames\\n. 
Let\\'s\\ntake for example the following two\\nDataFrames\\n:\\nPython\\nRust\\nDataFrame\\ndf_customers\\n=\\npl\\n.\\nDataFrame\\n(\\n{\\n\"customer_id\"\\n:\\n[\\n1\\n,\\n2\\n,\\n3\\n],\\n\"name\"\\n:\\n[\\n\"Alice\"\\n,\\n\"Bob\"\\n,\\n\"Charlie\"\\n],\\n}\\n)\\nprint\\n(\\ndf_customers\\n)\\nDataFrame\\nlet\\ndf_customers\\n=\\ndf\\n!\\n(\\n\"customer_id\"\\n=>\\n&\\n[\\n1\\n,\\n2\\n,\\n3\\n],\\n\"name\"\\n=>\\n&\\n[\\n\"Alice\"\\n,\\n\"Bob\"\\n,\\n\"Charlie\"\\n],\\n)\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_customers\\n);\\nshape: (3, 2)\\n┌─────────────┬─────────┐\\n│ customer_id ┆ name │\\n│ --- ┆ --- │\\n│ i64 ┆ str │\\n╞═════════════╪═════════╡\\n│ 1 ┆ Alice │\\n│ 2 ┆ Bob │\\n│ 3 ┆ Charlie │\\n└─────────────┴─────────┘\\nPython\\nRust\\nDataFrame\\ndf_orders\\n=\\npl\\n.\\nDataFrame\\n(\\n{\\n\"order_id\"\\n:\\n[\\n\"a\"\\n,\\n\"b\"\\n,\\n\"c\"\\n],\\n\"customer_id\"\\n:\\n[\\n1\\n,\\n2\\n,\\n2\\n],\\n\"amount\"\\n:\\n[\\n100\\n,\\n200\\n,\\n300\\n],\\n}\\n)\\nprint\\n(\\ndf_orders\\n)\\nDataFrame\\nlet\\ndf_orders\\n=\\ndf\\n!\\n(\\n\"order_id\"\\n=>\\n&\\n[\\n\"a\"\\n,\\n\"b\"\\n,\\n\"c\"\\n],\\n\"customer_id\"\\n=>\\n&\\n[\\n1\\n,\\n2\\n,\\n2\\n],\\n\"amount\"\\n=>\\n&\\n[\\n100\\n,\\n200\\n,\\n300\\n],\\n)\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_orders\\n);\\nshape: (3, 3)\\n┌──────────┬─────────────┬────────┐\\n│ order_id ┆ customer_id ┆ amount │\\n│ --- ┆ --- ┆ --- │\\n│ str ┆ i64 ┆ i64 │\\n╞══════════╪═════════════╪════════╡\\n│ a ┆ 1 ┆ 100 │\\n│ b ┆ 2 ┆ 200 │\\n│ c ┆ 2 ┆ 300 │\\n└──────────┴─────────────┴────────┘\\nTo get a\\nDataFrame\\nwith the orders and their associated customer we can do an\\ninner\\njoin on the\\ncustomer_id\\ncolumn:\\nPython\\nRust\\njoin\\ndf_inner_customer_join\\n=\\ndf_customers\\n.\\njoin\\n(\\ndf_orders\\n,\\non\\n=\\n\"customer_id\"\\n,\\nhow\\n=\\n\"inner\"\\n)\\nprint\\n(\\ndf_inner_customer_join\\n)\\njoin\\nlet\\ndf_inner_customer_join\\n=\\ndf_customers\\n.\\nclone\\n()\\n.\\nlazy\\n()\\n.\\njoin\\n(\\ndf_orders\\n.\\nclone\\n().\\nlazy\\n(),\\n[\\ncol\\n(\\n\"customer_id\"\\n)],\\n[\\ncol\\n(\\n\"customer_id\"\\n)],\\nJoinArgs\\n::\\nnew\\n(\\nJoinType\\n::\\nInner\\n),\\n)\\n.\\ncollect\\n()\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_inner_customer_join\\n);\\nshape: (3, 4)\\n┌─────────────┬───────┬──────────┬────────┐\\n│ customer_id ┆ name ┆ order_id ┆ amount │\\n│ --- ┆ --- ┆ --- ┆ --- │\\n│ i64 ┆ str ┆ str ┆ i64 │\\n╞═════════════╪═══════╪══════════╪════════╡\\n│ 1 ┆ Alice ┆ a ┆ 100 │\\n│ 2 ┆ Bob ┆ b ┆ 200 │\\n│ 2 ┆ Bob ┆ c ┆ 300 │\\n└─────────────┴───────┴──────────┴────────┘\\nLeft join\\nThe\\nleft\\nouter join produces a\\nDataFrame\\nthat contains all the rows from the left\\nDataFrame\\nand only the rows from\\nthe right\\nDataFrame\\nwhere the join key exists in the left\\nDataFrame\\n. 
If we now take the example from above and want\\nto have a\\nDataFrame\\nwith all the customers and their associated orders (regardless of whether they have placed an\\norder or not) we can do a\\nleft\\njoin:\\nPython\\nRust\\njoin\\ndf_left_join\\n=\\ndf_customers\\n.\\njoin\\n(\\ndf_orders\\n,\\non\\n=\\n\"customer_id\"\\n,\\nhow\\n=\\n\"left\"\\n)\\nprint\\n(\\ndf_left_join\\n)\\njoin\\nlet\\ndf_left_join\\n=\\ndf_customers\\n.\\nclone\\n()\\n.\\nlazy\\n()\\n.\\njoin\\n(\\ndf_orders\\n.\\nclone\\n().\\nlazy\\n(),\\n[\\ncol\\n(\\n\"customer_id\"\\n)],\\n[\\ncol\\n(\\n\"customer_id\"\\n)],\\nJoinArgs\\n::\\nnew\\n(\\nJoinType\\n::\\nLeft\\n),\\n)\\n.\\ncollect\\n()\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_left_join\\n);\\nshape: (4, 4)\\n┌─────────────┬─────────┬──────────┬────────┐\\n│ customer_id ┆ name ┆ order_id ┆ amount │\\n│ --- ┆ --- ┆ --- ┆ --- │\\n│ i64 ┆ str ┆ str ┆ i64 │\\n╞═════════════╪═════════╪══════════╪════════╡\\n│ 1 ┆ Alice ┆ a ┆ 100 │\\n│ 2 ┆ Bob ┆ b ┆ 200 │\\n│ 2 ┆ Bob ┆ c ┆ 300 │\\n│ 3 ┆ Charlie ┆ null ┆ null │\\n└─────────────┴─────────┴──────────┴────────┘\\nNotice, that the fields for the customer with the\\ncustomer_id\\nof\\n3\\nare null, as there are no orders for this\\ncustomer.\\nOuter join\\nThe\\nfull\\nouter join produces a\\nDataFrame\\nthat contains all the rows from both\\nDataFrames\\n. Columns are null, if the\\njoin key does not exist in the source\\nDataFrame\\n. Doing a\\nfull\\nouter join on the two\\nDataFrames\\nfrom above produces\\na similar\\nDataFrame\\nto the\\nleft\\njoin:\\nPython\\nRust\\njoin\\ndf_outer_join\\n=\\ndf_customers\\n.\\njoin\\n(\\ndf_orders\\n,\\non\\n=\\n\"customer_id\"\\n,\\nhow\\n=\\n\"full\"\\n)\\nprint\\n(\\ndf_outer_join\\n)\\njoin\\nlet\\ndf_full_join\\n=\\ndf_customers\\n.\\nclone\\n()\\n.\\nlazy\\n()\\n.\\njoin\\n(\\ndf_orders\\n.\\nclone\\n().\\nlazy\\n(),\\n[\\ncol\\n(\\n\"customer_id\"\\n)],\\n[\\ncol\\n(\\n\"customer_id\"\\n)],\\nJoinArgs\\n::\\nnew\\n(\\nJoinType\\n::\\nFull\\n),\\n)\\n.\\ncollect\\n()\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_full_join\\n);\\nshape: (4, 5)\\n┌─────────────┬─────────┬──────────┬───────────────────┬────────┐\\n│ customer_id ┆ name ┆ order_id ┆ customer_id_right ┆ amount │\\n│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\\n│ i64 ┆ str ┆ str ┆ i64 ┆ i64 │\\n╞═════════════╪═════════╪══════════╪═══════════════════╪════════╡\\n│ 1 ┆ Alice ┆ a ┆ 1 ┆ 100 │\\n│ 2 ┆ Bob ┆ b ┆ 2 ┆ 200 │\\n│ 2 ┆ Bob ┆ c ┆ 2 ┆ 300 │\\n│ 3 ┆ Charlie ┆ null ┆ null ┆ null │\\n└─────────────┴─────────┴──────────┴───────────────────┴────────┘\\nPython\\nRust\\njoin\\ndf_outer_coalesce_join\\n=\\ndf_customers\\n.\\njoin\\n(\\ndf_orders\\n,\\non\\n=\\n\"customer_id\"\\n,\\nhow\\n=\\n\"full\"\\n,\\ncoalesce\\n=\\nTrue\\n)\\nprint\\n(\\ndf_outer_coalesce_join\\n)\\njoin\\nlet\\ndf_full_join\\n=\\ndf_customers\\n.\\nclone\\n()\\n.\\nlazy\\n()\\n.\\njoin\\n(\\ndf_orders\\n.\\nclone\\n().\\nlazy\\n(),\\n[\\ncol\\n(\\n\"customer_id\"\\n)],\\n[\\ncol\\n(\\n\"customer_id\"\\n)],\\nJoinArgs\\n::\\nnew\\n(\\nJoinType\\n::\\nFull\\n).\\nwith_coalesce\\n(\\nJoinCoalesce\\n::\\nCoalesceColumns\\n),\\n)\\n.\\ncollect\\n()\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_full_join\\n);\\nshape: (4, 4)\\n┌─────────────┬─────────┬──────────┬────────┐\\n│ customer_id ┆ name ┆ order_id ┆ amount │\\n│ --- ┆ --- ┆ --- ┆ --- │\\n│ i64 ┆ str ┆ str ┆ i64 │\\n╞═════════════╪═════════╪══════════╪════════╡\\n│ 1 ┆ Alice ┆ a ┆ 100 │\\n│ 2 ┆ Bob ┆ b ┆ 200 │\\n│ 2 ┆ Bob ┆ c ┆ 300 │\\n│ 3 ┆ Charlie ┆ null ┆ null 
│\\n└─────────────┴─────────┴──────────┴────────┘\\nCross join\\nA\\ncross\\njoin is a Cartesian product of the two\\nDataFrames\\n. This means that every row in the left\\nDataFrame\\nis\\njoined with every row in the right\\nDataFrame\\n. The\\ncross\\njoin is useful for creating a\\nDataFrame\\nwith all possible\\ncombinations of the columns in two\\nDataFrames\\n. Let\\'s take for example the following two\\nDataFrames\\n.\\nPython\\nRust\\nDataFrame\\ndf_colors\\n=\\npl\\n.\\nDataFrame\\n(\\n{\\n\"color\"\\n:\\n[\\n\"red\"\\n,\\n\"blue\"\\n,\\n\"green\"\\n],\\n}\\n)\\nprint\\n(\\ndf_colors\\n)\\nDataFrame\\nlet\\ndf_colors\\n=\\ndf\\n!\\n(\\n\"color\"\\n=>\\n&\\n[\\n\"red\"\\n,\\n\"blue\"\\n,\\n\"green\"\\n],\\n)\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_colors\\n);\\nshape: (3, 1)\\n┌───────┐\\n│ color │\\n│ --- │\\n│ str │\\n╞═══════╡\\n│ red │\\n│ blue │\\n│ green │\\n└───────┘\\nPython\\nRust\\nDataFrame\\ndf_sizes\\n=\\npl\\n.\\nDataFrame\\n(\\n{\\n\"size\"\\n:\\n[\\n\"S\"\\n,\\n\"M\"\\n,\\n\"L\"\\n],\\n}\\n)\\nprint\\n(\\ndf_sizes\\n)\\nDataFrame\\nlet\\ndf_sizes\\n=\\ndf\\n!\\n(\\n\"size\"\\n=>\\n&\\n[\\n\"S\"\\n,\\n\"M\"\\n,\\n\"L\"\\n],\\n)\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_sizes\\n);\\nshape: (3, 1)\\n┌──────┐\\n│ size │\\n│ --- │\\n│ str │\\n╞══════╡\\n│ S │\\n│ M │\\n│ L │\\n└──────┘\\nWe can now create a\\nDataFrame\\ncontaining all possible combinations of the colors and sizes with a\\ncross\\njoin:\\nPython\\nRust\\njoin\\ndf_cross_join\\n=\\ndf_colors\\n.\\njoin\\n(\\ndf_sizes\\n,\\nhow\\n=\\n\"cross\"\\n)\\nprint\\n(\\ndf_cross_join\\n)\\njoin\\nlet\\ndf_cross_join\\n=\\ndf_colors\\n.\\nclone\\n()\\n.\\nlazy\\n()\\n.\\ncross_join\\n(\\ndf_sizes\\n.\\nclone\\n().\\nlazy\\n())\\n.\\ncollect\\n()\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_cross_join\\n);\\nshape: (9, 2)\\n┌───────┬──────┐\\n│ color ┆ size │\\n│ --- ┆ --- │\\n│ str ┆ str │\\n╞═══════╪══════╡\\n│ red ┆ S │\\n│ red ┆ M │\\n│ red ┆ L │\\n│ blue ┆ S │\\n│ blue ┆ M │\\n│ blue ┆ L │\\n│ green ┆ S │\\n│ green ┆ M │\\n│ green ┆ L │\\n└───────┴──────┘\\nThe\\ninner\\n,\\nleft\\n,\\nfull\\nand\\ncross\\njoin strategies are standard amongst dataframe libraries. We provide more\\ndetails on the less familiar\\nsemi\\n,\\nanti\\nand\\nasof\\njoin strategies below.\\nSemi join\\nThe\\nsemi\\njoin returns all rows from the left frame in which the join key is also present in the right frame. 
Consider\\nthe following scenario: a car rental company has a\\nDataFrame\\nshowing the cars that it owns with each car having a\\nunique\\nid\\n.\\nPython\\nRust\\nDataFrame\\ndf_cars\\n=\\npl\\n.\\nDataFrame\\n(\\n{\\n\"id\"\\n:\\n[\\n\"a\"\\n,\\n\"b\"\\n,\\n\"c\"\\n],\\n\"make\"\\n:\\n[\\n\"ford\"\\n,\\n\"toyota\"\\n,\\n\"bmw\"\\n],\\n}\\n)\\nprint\\n(\\ndf_cars\\n)\\nDataFrame\\nlet\\ndf_cars\\n=\\ndf\\n!\\n(\\n\"id\"\\n=>\\n&\\n[\\n\"a\"\\n,\\n\"b\"\\n,\\n\"c\"\\n],\\n\"make\"\\n=>\\n&\\n[\\n\"ford\"\\n,\\n\"toyota\"\\n,\\n\"bmw\"\\n],\\n)\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_cars\\n);\\nshape: (3, 2)\\n┌─────┬────────┐\\n│ id ┆ make │\\n│ --- ┆ --- │\\n│ str ┆ str │\\n╞═════╪════════╡\\n│ a ┆ ford │\\n│ b ┆ toyota │\\n│ c ┆ bmw │\\n└─────┴────────┘\\nThe company has another\\nDataFrame\\nshowing each repair job carried out on a vehicle.\\nPython\\nRust\\nDataFrame\\ndf_repairs\\n=\\npl\\n.\\nDataFrame\\n(\\n{\\n\"id\"\\n:\\n[\\n\"c\"\\n,\\n\"c\"\\n],\\n\"cost\"\\n:\\n[\\n100\\n,\\n200\\n],\\n}\\n)\\nprint\\n(\\ndf_repairs\\n)\\nDataFrame\\nlet\\ndf_repairs\\n=\\ndf\\n!\\n(\\n\"id\"\\n=>\\n&\\n[\\n\"c\"\\n,\\n\"c\"\\n],\\n\"cost\"\\n=>\\n&\\n[\\n100\\n,\\n200\\n],\\n)\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_repairs\\n);\\nshape: (2, 2)\\n┌─────┬──────┐\\n│ id ┆ cost │\\n│ --- ┆ --- │\\n│ str ┆ i64 │\\n╞═════╪══════╡\\n│ c ┆ 100 │\\n│ c ┆ 200 │\\n└─────┴──────┘\\nYou want to answer this question: which of the cars have had repairs carried out?\\nAn inner join does not answer this question directly as it produces a\\nDataFrame\\nwith multiple rows for each car that\\nhas had multiple repair jobs:\\nPython\\nRust\\njoin\\ndf_inner_join\\n=\\ndf_cars\\n.\\njoin\\n(\\ndf_repairs\\n,\\non\\n=\\n\"id\"\\n,\\nhow\\n=\\n\"inner\"\\n)\\nprint\\n(\\ndf_inner_join\\n)\\njoin\\nlet\\ndf_inner_join\\n=\\ndf_cars\\n.\\nclone\\n()\\n.\\nlazy\\n()\\n.\\ninner_join\\n(\\ndf_repairs\\n.\\nclone\\n().\\nlazy\\n(),\\ncol\\n(\\n\"id\"\\n),\\ncol\\n(\\n\"id\"\\n))\\n.\\ncollect\\n()\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_inner_join\\n);\\nshape: (2, 3)\\n┌─────┬──────┬──────┐\\n│ id ┆ make ┆ cost │\\n│ --- ┆ --- ┆ --- │\\n│ str ┆ str ┆ i64 │\\n╞═════╪══════╪══════╡\\n│ c ┆ bmw ┆ 100 │\\n│ c ┆ bmw ┆ 200 │\\n└─────┴──────┴──────┘\\nHowever, a semi join produces a single row for each car that has had a repair job carried out.\\nPython\\nRust\\njoin\\ndf_semi_join\\n=\\ndf_cars\\n.\\njoin\\n(\\ndf_repairs\\n,\\non\\n=\\n\"id\"\\n,\\nhow\\n=\\n\"semi\"\\n)\\nprint\\n(\\ndf_semi_join\\n)\\njoin\\nlet\\ndf_semi_join\\n=\\ndf_cars\\n.\\nclone\\n()\\n.\\nlazy\\n()\\n.\\njoin\\n(\\ndf_repairs\\n.\\nclone\\n().\\nlazy\\n(),\\n[\\ncol\\n(\\n\"id\"\\n)],\\n[\\ncol\\n(\\n\"id\"\\n)],\\nJoinArgs\\n::\\nnew\\n(\\nJoinType\\n::\\nSemi\\n),\\n)\\n.\\ncollect\\n()\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_semi_join\\n);\\nshape: (1, 2)\\n┌─────┬──────┐\\n│ id ┆ make │\\n│ --- ┆ --- │\\n│ str ┆ str │\\n╞═════╪══════╡\\n│ c ┆ bmw │\\n└─────┴──────┘\\nAnti join\\nContinuing this example, an alternative question might be: which of the cars have\\nnot\\nhad a repair job carried out?\\nAn anti join produces a\\nDataFrame\\nshowing all the cars from\\ndf_cars\\nwhere the\\nid\\nis not present 
in\\nthe\\ndf_repairs\\nDataFrame\\n.\\nPython\\nRust\\njoin\\ndf_anti_join\\n=\\ndf_cars\\n.\\njoin\\n(\\ndf_repairs\\n,\\non\\n=\\n\"id\"\\n,\\nhow\\n=\\n\"anti\"\\n)\\nprint\\n(\\ndf_anti_join\\n)\\njoin\\nlet\\ndf_anti_join\\n=\\ndf_cars\\n.\\nclone\\n()\\n.\\nlazy\\n()\\n.\\njoin\\n(\\ndf_repairs\\n.\\nclone\\n().\\nlazy\\n(),\\n[\\ncol\\n(\\n\"id\"\\n)],\\n[\\ncol\\n(\\n\"id\"\\n)],\\nJoinArgs\\n::\\nnew\\n(\\nJoinType\\n::\\nAnti\\n),\\n)\\n.\\ncollect\\n()\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_anti_join\\n);\\nshape: (2, 2)\\n┌─────┬────────┐\\n│ id ┆ make │\\n│ --- ┆ --- │\\n│ str ┆ str │\\n╞═════╪════════╡\\n│ a ┆ ford │\\n│ b ┆ toyota │\\n└─────┴────────┘\\nAsof join\\nAn\\nasof\\njoin is like a left join except that we match on nearest key rather than equal keys.\\nIn Polars we can do an asof join with the\\njoin_asof\\nmethod.\\nConsider the following scenario: a stock market broker has a\\nDataFrame\\ncalled\\ndf_trades\\nshowing transactions it has\\nmade for different stocks.\\nPython\\nRust\\nDataFrame\\ndf_trades\\n=\\npl\\n.\\nDataFrame\\n(\\n{\\n\"time\"\\n:\\n[\\ndatetime\\n(\\n2020\\n,\\n1\\n,\\n1\\n,\\n9\\n,\\n1\\n,\\n0\\n),\\ndatetime\\n(\\n2020\\n,\\n1\\n,\\n1\\n,\\n9\\n,\\n1\\n,\\n0\\n),\\ndatetime\\n(\\n2020\\n,\\n1\\n,\\n1\\n,\\n9\\n,\\n3\\n,\\n0\\n),\\ndatetime\\n(\\n2020\\n,\\n1\\n,\\n1\\n,\\n9\\n,\\n6\\n,\\n0\\n),\\n],\\n\"stock\"\\n:\\n[\\n\"A\"\\n,\\n\"B\"\\n,\\n\"B\"\\n,\\n\"C\"\\n],\\n\"trade\"\\n:\\n[\\n101\\n,\\n299\\n,\\n301\\n,\\n500\\n],\\n}\\n)\\nprint\\n(\\ndf_trades\\n)\\nDataFrame\\nuse\\nchrono\\n::\\nprelude\\n::\\n*\\n;\\nlet\\ndf_trades\\n=\\ndf\\n!\\n(\\n\"time\"\\n=>\\n&\\n[\\nNaiveDate\\n::\\nfrom_ymd_opt\\n(\\n2020\\n,\\n1\\n,\\n1\\n).\\nunwrap\\n().\\nand_hms_opt\\n(\\n9\\n,\\n1\\n,\\n0\\n).\\nunwrap\\n(),\\nNaiveDate\\n::\\nfrom_ymd_opt\\n(\\n2020\\n,\\n1\\n,\\n1\\n).\\nunwrap\\n().\\nand_hms_opt\\n(\\n9\\n,\\n1\\n,\\n0\\n).\\nunwrap\\n(),\\nNaiveDate\\n::\\nfrom_ymd_opt\\n(\\n2020\\n,\\n1\\n,\\n1\\n).\\nunwrap\\n().\\nand_hms_opt\\n(\\n9\\n,\\n3\\n,\\n0\\n).\\nunwrap\\n(),\\nNaiveDate\\n::\\nfrom_ymd_opt\\n(\\n2020\\n,\\n1\\n,\\n1\\n).\\nunwrap\\n().\\nand_hms_opt\\n(\\n9\\n,\\n6\\n,\\n0\\n).\\nunwrap\\n(),\\n],\\n\"stock\"\\n=>\\n&\\n[\\n\"A\"\\n,\\n\"B\"\\n,\\n\"B\"\\n,\\n\"C\"\\n],\\n\"trade\"\\n=>\\n&\\n[\\n101\\n,\\n299\\n,\\n301\\n,\\n500\\n],\\n)\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_trades\\n);\\nshape: (4, 3)\\n┌─────────────────────┬───────┬───────┐\\n│ time ┆ stock ┆ trade │\\n│ --- ┆ --- ┆ --- │\\n│ datetime[μs] ┆ str ┆ i64 │\\n╞═════════════════════╪═══════╪═══════╡\\n│ 2020-01-01 09:01:00 ┆ A ┆ 101 │\\n│ 2020-01-01 09:01:00 ┆ B ┆ 299 │\\n│ 2020-01-01 09:03:00 ┆ B ┆ 301 │\\n│ 2020-01-01 09:06:00 ┆ C ┆ 500 │\\n└─────────────────────┴───────┴───────┘\\nThe broker has another\\nDataFrame\\ncalled\\ndf_quotes\\nshowing prices it has quoted for these 
stocks.\\nPython\\nRust\\nDataFrame\\ndf_quotes\\n=\\npl\\n.\\nDataFrame\\n(\\n{\\n\"time\"\\n:\\n[\\ndatetime\\n(\\n2020\\n,\\n1\\n,\\n1\\n,\\n9\\n,\\n0\\n,\\n0\\n),\\ndatetime\\n(\\n2020\\n,\\n1\\n,\\n1\\n,\\n9\\n,\\n2\\n,\\n0\\n),\\ndatetime\\n(\\n2020\\n,\\n1\\n,\\n1\\n,\\n9\\n,\\n4\\n,\\n0\\n),\\ndatetime\\n(\\n2020\\n,\\n1\\n,\\n1\\n,\\n9\\n,\\n6\\n,\\n0\\n),\\n],\\n\"stock\"\\n:\\n[\\n\"A\"\\n,\\n\"B\"\\n,\\n\"C\"\\n,\\n\"A\"\\n],\\n\"quote\"\\n:\\n[\\n100\\n,\\n300\\n,\\n501\\n,\\n102\\n],\\n}\\n)\\nprint\\n(\\ndf_quotes\\n)\\nDataFrame\\nlet\\ndf_quotes\\n=\\ndf\\n!\\n(\\n\"time\"\\n=>\\n&\\n[\\nNaiveDate\\n::\\nfrom_ymd_opt\\n(\\n2020\\n,\\n1\\n,\\n1\\n).\\nunwrap\\n().\\nand_hms_opt\\n(\\n9\\n,\\n0\\n,\\n0\\n).\\nunwrap\\n(),\\nNaiveDate\\n::\\nfrom_ymd_opt\\n(\\n2020\\n,\\n1\\n,\\n1\\n).\\nunwrap\\n().\\nand_hms_opt\\n(\\n9\\n,\\n2\\n,\\n0\\n).\\nunwrap\\n(),\\nNaiveDate\\n::\\nfrom_ymd_opt\\n(\\n2020\\n,\\n1\\n,\\n1\\n).\\nunwrap\\n().\\nand_hms_opt\\n(\\n9\\n,\\n4\\n,\\n0\\n).\\nunwrap\\n(),\\nNaiveDate\\n::\\nfrom_ymd_opt\\n(\\n2020\\n,\\n1\\n,\\n1\\n).\\nunwrap\\n().\\nand_hms_opt\\n(\\n9\\n,\\n6\\n,\\n0\\n).\\nunwrap\\n(),\\n],\\n\"stock\"\\n=>\\n&\\n[\\n\"A\"\\n,\\n\"B\"\\n,\\n\"C\"\\n,\\n\"A\"\\n],\\n\"quote\"\\n=>\\n&\\n[\\n100\\n,\\n300\\n,\\n501\\n,\\n102\\n],\\n)\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_quotes\\n);\\nshape: (4, 3)\\n┌─────────────────────┬───────┬───────┐\\n│ time ┆ stock ┆ quote │\\n│ --- ┆ --- ┆ --- │\\n│ datetime[μs] ┆ str ┆ i64 │\\n╞═════════════════════╪═══════╪═══════╡\\n│ 2020-01-01 09:00:00 ┆ A ┆ 100 │\\n│ 2020-01-01 09:02:00 ┆ B ┆ 300 │\\n│ 2020-01-01 09:04:00 ┆ C ┆ 501 │\\n│ 2020-01-01 09:06:00 ┆ A ┆ 102 │\\n└─────────────────────┴───────┴───────┘\\nYou want to produce a\\nDataFrame\\nshowing for each trade the most recent quote provided\\nbefore\\nthe trade. You do this\\nwith\\njoin_asof\\n(using the default\\nstrategy = \"backward\"\\n).\\nTo avoid joining between trades on one stock with a quote on another you must specify an exact preliminary join on the\\nstock column with\\nby=\"stock\"\\n.\\nPython\\nRust\\njoin_asof\\ndf_asof_join\\n=\\ndf_trades\\n.\\njoin_asof\\n(\\ndf_quotes\\n,\\non\\n=\\n\"time\"\\n,\\nby\\n=\\n\"stock\"\\n)\\nprint\\n(\\ndf_asof_join\\n)\\njoin_asof\\nlet\\ndf_asof_join\\n=\\ndf_trades\\n.\\njoin_asof_by\\n(\\n&\\ndf_quotes\\n,\\n\"time\"\\n,\\n\"time\"\\n,\\n[\\n\"stock\"\\n],\\n[\\n\"stock\"\\n],\\nAsofStrategy\\n::\\nBackward\\n,\\nNone\\n,\\n)\\n?\\n;\\nprintln!\\n(\\n\"{}\"\\n,\\n&\\ndf_asof_join\\n);\\nshape: (4, 4)\\n┌─────────────────────┬───────┬───────┬───────┐\\n│ time ┆ stock ┆ trade ┆ quote │\\n│ --- ┆ --- ┆ --- ┆ --- │\\n│ datetime[μs] ┆ str ┆ i64 ┆ i64 │\\n╞═════════════════════╪═══════╪═══════╪═══════╡\\n│ 2020-01-01 09:01:00 ┆ A ┆ 101 ┆ 100 │\\n│ 2020-01-01 09:01:00 ┆ B ┆ 299 ┆ null │\\n│ 2020-01-01 09:03:00 ┆ B ┆ 301 ┆ 300 │\\n│ 2020-01-01 09:06:00 ┆ C ┆ 500 ┆ 501 │\\n└─────────────────────┴───────┴───────┴───────┘\\nIf you want to make sure that only quotes within a certain time range are joined to the trades you can specify\\nthe\\ntolerance\\nargument. 
In this case we want to make sure that the last preceding quote is within 1 minute of the\\ntrade so we set\\ntolerance = \"1m\"\\n.\\nPython\\ndf_asof_tolerance_join\\n=\\ndf_trades\\n.\\njoin_asof\\n(\\ndf_quotes\\n,\\non\\n=\\n\"time\"\\n,\\nby\\n=\\n\"stock\"\\n,\\ntolerance\\n=\\n\"1m\"\\n)\\nprint\\n(\\ndf_asof_tolerance_join\\n)\\nshape: (4, 4)\\n┌─────────────────────┬───────┬───────┬───────┐\\n│ time ┆ stock ┆ trade ┆ quote │\\n│ --- ┆ --- ┆ --- ┆ --- │\\n│ datetime[μs] ┆ str ┆ i64 ┆ i64 │\\n╞═════════════════════╪═══════╪═══════╪═══════╡\\n│ 2020-01-01 09:01:00 ┆ A ┆ 101 ┆ 100 │\\n│ 2020-01-01 09:01:00 ┆ B ┆ 299 ┆ null │\\n│ 2020-01-01 09:03:00 ┆ B ┆ 301 ┆ 300 │\\n│ 2020-01-01 09:06:00 ┆ C ┆ 500 ┆ null │\\n└─────────────────────┴───────┴───────┴───────┘', '', 'pola-rs\\n/\\npolars\\nPublic\\nNotifications\\nFork\\n1.6k\\nStar\\n26.8k\\nCode\\nIssues\\n1.6k\\nPull requests\\n90\\nActions\\nProjects\\n1\\nSecurity\\nInsights\\nAdditional navigation options\\nCode\\nIssues\\nPull requests\\nActions\\nProjects\\nSecurity\\nInsights\\nNew issue\\nHave a question about this project?\\nSign up for a free GitHub account to open an issue and contact its maintainers and the community.\\nSign up for GitHub\\nBy clicking “Sign up for GitHub”, you agree to our\\nterms of service\\nand\\nprivacy statement\\n. We’ll occasionally send you account related emails.\\nAlready on GitHub?\\nSign in\\nto your account\\nJump to bottom\\nIntermittent out-of-memory performing join after 0.20.6 update\\n#14201\\nClosed\\n2 tasks done\\ndavid-waterworth\\nopened this issue\\nFeb 1, 2024\\n· 24 comments\\n· Fixed by\\n#14264\\nClosed\\n2 tasks done\\nIntermittent out-of-memory performing join after 0.20.6 update\\n#14201\\ndavid-waterworth\\nopened this issue\\nFeb 1, 2024\\n· 24 comments\\n· Fixed by\\n#14264\\nAssignees\\nLabels\\nA-dtype\\nArea: data types in general\\naccepted\\nReady for implementation\\nbug\\nSomething isn\\'t working\\nP-high\\nPriority: high\\nperformance\\nPerformance issues or improvements\\npython\\nRelated to Python Polars\\nregression\\nIssue introduced by a new release\\nComments\\nCopy link\\ndavid-waterworth\\ncommented\\nFeb 1, 2024\\nChecks\\nI have checked that this issue has not already been reported.\\nI have confirmed this bug exists on the\\nlatest version\\nof Polars.\\nReproducible example\\nI\\'ve not found a reliable way of reproducing.\\nLog output\\nNo response\\nIssue description\\nAfter upgrading to 0.20.6 I started experiencing high memory/cpu usage when using the\\njoin\\noperator. I have a large script that I\\'m in the process of migrating from a jupyter/pandas implementation which has several steps so I\\'ve not been able to reliably reproduce. The issue occurs randomly (i.e. different lines in the script) but the general symptoms are:\\nThe\\nleft\\ntable as ~90k rows x 15 columns. The\\nright\\ntable is very small (i.e. 500 rows, 2 columns) and is a\\nlookup\\n(i.e. match on a key column and return a value). So the code is simply:\\ndf = left.join(right, on=\"key\", how=\"left\")\\nThe \"key\" column is pl.String, both left and right tables contain nested types.\\nThis type of join is repeated multiple times (basically appending columns by performing lookups) and usually takes <1s to execute. But randomly one of them will start using excessive CPU, memory will start linearly increasing until my swap is exhausted. 
But which operation seems random.\\nI tried saving the two tables to parquet to reproduce but no luck, it only happens when I run my script end to end (and not reliably in the same place)\\nI downgraded to 0.20.5 and the issue hasn\\'t re-occurred.\\nIt\\'s happening on 2 machines (my workstation and in a docker container on AWS)\\nI\\'m not using lazy execution, that\\'s on my todo after I\\'ve finished migrating the original code.\\nI\\'m not really sure what else i can do to help?\\nExpected behavior\\nStable memory usage\\nInstalled versions\\nReplace this line with the output of pl.show_versions(). Leave the backticks in place.\\nThe text was updated successfully, but these errors were encountered:\\n👍\\n4\\ncoinflip112, jsarbach, LevMartinZachar, and ion-elgreco reacted with thumbs up emoji\\nAll reactions\\n👍\\n4 reactions\\ndavid-waterworth\\nadded\\nbug\\nSomething isn\\'t working\\nneeds triage\\nAwaiting prioritization by a maintainer\\npython\\nRelated to Python Polars\\nlabels\\nFeb 1, 2024\\nCopy link\\nContributor\\njsarbach\\ncommented\\nFeb 2, 2024\\nI experience the same, with lazy execution.\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\nhagsted\\ncommented\\nFeb 2, 2024\\nI have also seen and endless running process. Polars 0.20.5 is fine. I have tried to make a minimal example, but is not able to reproduce with a something simple.\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\nContributor\\ncoinflip112\\ncommented\\nFeb 2, 2024\\nMe and my colleagues experience similar issues. Lazy reads (parquet) with filters and selects seemed to read much more data to memory after upgrade then before. Some of us were successful when downgrading to\\n0.20.5\\nsome of us had to go down to\\n0.20.3\\n.\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\nhagsted\\ncommented\\nFeb 2, 2024\\nI might add, that I am not sure if it is the join of if is the read_csv functions that are slow.\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\nantonl\\ncommented\\nFeb 2, 2024\\nI\\'m seeing out of memory crashes also.\\n@david-waterworth\\nare you using the conda-forge build of polars?\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\nAuthor\\ndavid-waterworth\\ncommented\\nFeb 2, 2024\\n@antonl\\nno I\\'m using PyPI (via poetry).\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\nCollaborator\\ndeanm0000\\ncommented\\nFeb 2, 2024\\nCan you put some numbers and specifics behind this?\\nHow long does it usually take?\\nHow long does it take in the bad cases? Does it get stuck in an endless loop and never stop or it\\'s just slower than you\\'d like?\\nWhat is the cpu activity like in good and bad case?\\nWhat is the memory usage in the good and bad case? (for example by looking at htop or win task manager)\\nYou mention you have nested types, are you talking about just structs or structs with lists? lists with structs?\\nDo you make any mutations, (if so how many and of what sort?) 
to your key variable before the join?\\nIf you cast to categorical before the join, do you still have the issue:\\nwith pl.StringCache():\\n left=left.with_columns(pl.col(\\'key\\').cast(pl.Categorical))\\n right=right.with_columns(pl.col(\\'key\\').cast(pl.Categorical))\\ndf=left.join(right, on=\"key\", how=\"left\")\\nFor reference I can run this standalone simulation with no issue.\\nimport polars as pl\\nimport numpy as np\\n\\n\\ndef gen_long_string(str_len, n_rows):\\n rng = np.random.default_rng()\\n return rng.integers(low=96, high=122, size=n_rows * str_len, dtype=\"uint32\").view(\\n f\"U{str_len}\"\\n )\\n\\nleft_n=90_000\\nleft = pl.DataFrame({\\n \\'key\\':gen_long_string(20,int(left_n/5)).tolist()*5,\\n **{str(x):np.random.normal(0,1,int(left_n/5)*5) for x in range(1,16)} \\n})\\n\\nright_n=500\\nright = pl.DataFrame({\\n \\'key\\':left.sample(right_n).get_column(\\'key\\').to_list(),\\n \\'zz\\':np.random.normal(0,1,right_n)\\n})\\n\\nleft.join(right, on=\"key\", how=\"left\")\\n👍\\n1\\nion-elgreco reacted with thumbs up emoji\\nAll reactions\\n👍\\n1 reaction\\nSorry, something went wrong.\\nCopy link\\nAuthor\\ndavid-waterworth\\ncommented\\nFeb 3, 2024\\n•\\nedited\\n@deanm0000\\nyes I can also run a similar stand-alone example with the exact same data that causes the issue in my script (i.e. I added a breakpoint and saved the 2 df\\'s as parquet then loaded them in a notebook).\\nIf I run with\\nv0.20.5\\nit takes < 0.5s and the CPU/memory usage is negligible.\\nIf I run with\\nv0.20.6\\nit never completes as it crashes my machine. This takes a few minutes, initially all my cores spin up to 100% for maybe 30 seconds or so, then after that the CPU usage drops to one core but the memory starts ramping up and then eventually fails.\\nI have multiple struct fields, list[str] and list[int] and list[struct].\\nThere\\'s several steps as I\\'m migrating a bunch of manually run notebooks, and I more or less replicated the original steps, so at this points it\\'s not overly optimised or refactored.\\nFirst I construct the main dataframe from a list (~90k elements) of dicts\\ntasks_df = pl.from_dicts(tasks, schema=schema)\\nThe I add rank, I wasn\\'t sure how to do this as a single operation so I generated the rank over a subset and joined it back (the filter is also over a list[int] column.\\nmasked_rank = (\\n tasks_df.filter(pl.col(\\'missing_metadata\\').is_null())\\n .with_columns(rank=pl.col(\\'priority\\').rank(method=\\'ordinal\\').over(\\'rule_name\\', \\'primary_equipment\\'))\\n .select(\"row_id\", \"rank\")\\n )\\n tasks_df = tasks_df.join(masked_rank, on=\\'row_id\\', how=\\'left\\')\\nThen there\\'s a series of other joins, which is where the issue usually surfaces.\\ni.e.\\ntasks_df = tasks_df.join(types_df, on=\"template\", how=\"left\")\\nThe only mutation of the key variable I think is a rename (which I assume isn\\'t technically a mutation?). 
But one of the joins that has defn failed does a\\ngroup by\\non the key variable of the left table before the join (this is supposed to create one row per key in the right table, and converting the value to a list[str].\\nprimary_equipment_types_df = (\\n templates_df.join(template_metadata_types_df, on=\"template_id\")\\n .group_by(\"template_name\")\\n .agg(primary_equipment_types=pl.col(\"type_code\"))\\n .rename({\"template_name\": \"template\"})\\n )\\n tasks_df = tasks_df.join(primary_equipment_types_df, on=\"template\", how=\"left\")\\nIf there\\'s any tracing I can run, I\\'ll create a\\n0.20.6\\nbranch next week if you need to me run anything.\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\nCollaborator\\ndeanm0000\\ncommented\\nFeb 3, 2024\\nCould you try your joins with your strings cast to categorical before joining? That will strengthen the theory that the new string implementation is the cause.\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\nAuthor\\ndavid-waterworth\\ncommented\\nFeb 3, 2024\\n•\\nedited\\n@deanm0000\\nYeah I think your theory is correct. I reproduced the issue in my projects 0.20.6 branch. Then I ran again with the key cast to\\npl.Categorical\\nand it ran fine.\\nTo be 100% sure I ran the original code and broke on the actual line where it was crashing, performed the join in the debug window using\\npl.Categorical\\nand it worked fine, then ran the\\npl.String\\nversion and it crashed.\\nSo it seems fairly likely that the\\nstr\\nrefactor has introduced this.\\n😕\\n1\\nion-elgreco reacted with confused emoji\\nAll reactions\\n😕\\n1 reaction\\nSorry, something went wrong.\\ndeanm0000\\nadded\\nperformance\\nPerformance issues or improvements\\nregression\\nIssue introduced by a new release\\nP-high\\nPriority: high\\nA-dtype\\nArea: data types in general\\nand removed\\nneeds triage\\nAwaiting prioritization by a maintainer\\nlabels\\nFeb 3, 2024\\nCopy link\\ndetrin\\ncommented\\nFeb 3, 2024\\nWe had to fix the production pipeline partially to\\n0.20.5\\n. We will be looking forward to seeing the fix.\\n👍\\n2\\nLevMartinZachar and ion-elgreco reacted with thumbs up emoji\\nAll reactions\\n👍\\n2 reactions\\nSorry, something went wrong.\\nCopy link\\nMember\\nritchie46\\ncommented\\nFeb 4, 2024\\n•\\nedited\\nCan someone please get a reproduction? How should I fix it if I cannot cause it?\\n@david-waterworth\\ncan you sent me your script perhaps (and remove parts that don\\'t influence this behavior).\\nI really want to fix this, so help is greatly appreciated.\\n@detrin\\ndo you have something that we can use to reproduce this?\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\ndetrin\\ncommented\\nFeb 4, 2024\\nI have relatively free Sunday, I will be glad to prepare a minimal example.\\n❤️\\n1\\nritchie46 reacted with heart emoji\\nAll reactions\\n❤️\\n1 reaction\\nSorry, something went wrong.\\nCopy link\\ndetrin\\ncommented\\nFeb 4, 2024\\n@ritchie46\\nI don\\'t know how to profile it well, from python I am getting too great variance to confirm the issue. I know for a fact that at this point of script I was getting an OOM. 
Here I extracted for you two intermediate tables that have 1000 lines for the sake of an example, in production score they have millions of rows.\\nimport\\npolars\\nas\\npl\\nfrom\\nmemory_profiler\\nimport\\nmemory_usage\\nprogram\\n=\\npl\\n.\\nread_csv\\n(\\n\"https://eu2.contabostorage.com/62824c32198b4d53a08054da7a8b4df1:polarsissue14201/program_example.csv\"\\n)\\nsource\\n=\\npl\\n.\\nread_csv\\n(\\n\"https://eu2.contabostorage.com/62824c32198b4d53a08054da7a8b4df1:polarsissue14201/source_example.csv\"\\n)\\ndef\\nmain\\n():\\nprogram_csfd\\n=\\nprogram\\n.\\nlazy\\n().\\njoin\\n(\\nsource\\n.\\nlazy\\n(),\\nhow\\n=\\n\"left\"\\n,\\nleft_on\\n=\\n\"program_title_norm\"\\n,\\nright_on\\n=\\n\"title_norm\"\\n).\\ncollect\\n(\\nstreaming\\n=\\nTrue\\n)\\nif\\n__name__\\n==\\n\\'__main__\\'\\n:\\nmem_usage\\n=\\nmemory_usage\\n(\\nmain\\n)\\n# print(f\"polars version: {pl.__version__}\")\\nprint\\n(\\n\\'Maximum memory usage: %s MB\\'\\n%\\nmax\\n(\\nmem_usage\\n))\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\nMember\\nritchie46\\ncommented\\nFeb 4, 2024\\nHave you got some more info? E.g. what you did up to this front? And I likely need more rows, 1000 rows is nothing. (I cannot concatenate that into a large dataset as that will lead to duplicates which will explode the join output).\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\nMember\\nritchie46\\ncommented\\nFeb 4, 2024\\n@hagsted\\n@jsarbach\\n@antonl\\ndoes any of you have a reproducable examply?\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\ndetrin\\ncommented\\nFeb 4, 2024\\nHave you got some more info? E.g. what you did up to this front? And I likely need more rows, 1000 rows is nothing. (I cannot concatenate that into a large dataset as that will lead to duplicates which will explode the join output).\\nI can\\'t provide the whole tables, I could perhaps reduce the number of columns and mask the titles, so that I would still see the increase of RAM used. That requires a lot of trial and error attempts, I am afraid I don\\'t have that kind of time today or following days. However, I might prepare bigger reproducible example during workday. It most likely will be during this next week.\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\nMember\\nritchie46\\ncommented\\nFeb 4, 2024\\nThanks\\n@detrin\\n. If I can reproduce, I can fix it. So I will wait until then.\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\nContributor\\njsarbach\\ncommented\\nFeb 4, 2024\\n@ritchie46\\nTurns out mine is not related to a\\njoin\\nat all. Instead, it seems to come from\\nlist.eval\\n.\\nhttps://storage.googleapis.com/endless-dialect-336512-ew6/14201.parquet\\ndf = pl.read_parquet(\\'14201.parquet\\')\\ndf = df.filter(pl.col(\\'id_numeric\\').is_not_null() & pl.col(\\'Team\\').is_not_null()).group_by(by=[\\'id_numeric\\', \\'Team\\']).agg(pl.col(\\'Items\\'))\\n\\ndf.with_columns(pl.col(\\'Items\\').list.eval(pl.element().drop_nulls()))\\nThe last line causes it to go out of memory with 0.20.6, but not with <0.20.6.\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\nMember\\nritchie46\\ncommented\\nFeb 4, 2024\\nThank you\\n@jsarbach\\n. I can do something with this. :)\\n👍\\n1\\njsarbach reacted with thumbs up emoji\\nAll reactions\\n👍\\n1 reaction\\nSorry, something went wrong.\\nCopy link\\nhagsted\\ncommented\\nFeb 4, 2024\\nHi. I found some time to look into this. 
And found that read_csv behaves a little different between 0.20.5 and 0.20.6.\\nI had a csv input with every second line empty. In 0.20.5 the empty lines are just dropped, but in 0.20.6, they are included in the data frame as a row with all Null. This gave me a follow up problem when I used rle_id() on a column. Iti will give a group for each row, as they would be value, Null, value, Null and so on, so never two consecutive rows with the same value.\\nI then later try to group_by on my rle_id column, and will of course not get any grouping. Finally add a Box to a plotly figure for each group. In principle it is then my plotting of the figure that hangs the system, as I will add a large amount of boxes to the figure, and make it extremely slow to draw.\\nI hope it make some sense.\\nA small example to the changed reading:\\ntext = \"\"\"\\n A, B, C,\\n\\n 1,1,1,\\n\\n 2,2,2,\\n\\n 3,3,3,\\n\\n\"\"\"\\ndf = pl.read_csv(io.StringIO(text))\\nprint(df)\\nWhich in 0.20.5 gives:\\nshape: (3, 4)\\n┌───────────┬─────┬─────┬──────┐\\n│ A ┆ B ┆ C ┆ │\\n│ --- ┆ --- ┆ --- ┆ --- │\\n│ i64 ┆ i64 ┆ i64 ┆ str │\\n╞═══════════╪═════╪═════╪══════╡\\n│ 1 ┆ 1 ┆ 1 ┆ null │\\n│ 2 ┆ 2 ┆ 2 ┆ null │\\n│ 3 ┆ 3 ┆ 3 ┆ null │\\n└───────────┴─────┴─────┴──────┘\\nbut in 0.20.6 gives:\\nshape: (7, 4)\\n┌───────┬──────┬──────┬──────┐\\n│ A ┆ B ┆ C ┆ │\\n│ --- ┆ --- ┆ --- ┆ --- │\\n│ i64 ┆ i64 ┆ i64 ┆ str │\\n╞═══════╪══════╪══════╪══════╡\\n│ null ┆ null ┆ null ┆ null │\\n│ 1 ┆ 1 ┆ 1 ┆ null │\\n│ null ┆ null ┆ null ┆ null │\\n│ 2 ┆ 2 ┆ 2 ┆ null │\\n│ null ┆ null ┆ null ┆ null │\\n│ 3 ┆ 3 ┆ 3 ┆ null │\\n│ null ┆ null ┆ null ┆ null │\\n└───────┴──────┴──────┴──────┘\\nBest regards Kristian\\nAll reactions\\nSorry, something went wrong.\\nCopy link\\nMember\\nritchie46\\ncommented\\nFeb 4, 2024\\n@hagsted\\nthis is unrelated to the issue. And this is because a bug was fixed. Whitespace belongs to the value and is not longer ignored int he value. You can argue that in this case the schema inference is incorrect as it should all be string columns. Can you open a new issue for that? 
Then we can leave this issue on topic.\\nAll reactions\\nSorry, something went wrong.\\nritchie46\\nmentioned this issue\\nFeb 4, 2024\\nfix: deduplicate recursive growables\\n#14264\\nMerged\\nritchie46\\nclosed this as\\ncompleted\\nin\\n#14264\\nFeb 4, 2024\\nCopy link\\nAuthor\\ndavid-waterworth\\ncommented\\nFeb 5, 2024\\nThanks\\n@ritchie46\\nthis seems to have fixed this issue on the dateset I originally observed the problem on!\\n🎉\\n1\\nritchie46 reacted with hooray emoji\\nAll reactions\\n🎉\\n1 reaction\\nSorry, something went wrong.\\nc-peters\\nadded\\n the\\naccepted\\nReady for implementation\\nlabel\\nFeb 5, 2024\\nc-peters\\nassigned\\nritchie46\\nFeb 5, 2024\\nkarlwiese\\nmentioned this issue\\nApr 12, 2024\\nGrouping to list of\\nstruct\\nis slower in 0.20.6 than in 0.20.5 and leads to out-of-memory eventually\\n#15615\\nClosed\\n2 tasks\\nCopy link\\nkarlwiese\\ncommented\\nApr 12, 2024\\nI think I found a reproducible example:\\n#15615\\nAll reactions\\nSorry, something went wrong.\\ndhruvyy\\nmentioned this issue\\nApr 19, 2024\\nStreaming pipeline runs out of memory\\n#15771\\nOpen\\n2 tasks\\nSign up for free\\nto join this conversation on GitHub\\n.\\n Already have an account?\\nSign in to comment\\nAssignees\\nritchie46\\nLabels\\nA-dtype\\nArea: data types in general\\naccepted\\nReady for implementation\\nbug\\nSomething isn\\'t working\\nP-high\\nPriority: high\\nperformance\\nPerformance issues or improvements\\npython\\nRelated to Python Polars\\nregression\\nIssue introduced by a new release\\nProjects\\nBacklog\\nArchived in project\\nMilestone\\nNo milestone\\nDevelopment\\nSuccessfully merging a pull request may close this issue.\\nfix: deduplicate recursive growables\\npola-rs/polars\\n10 participants', '', 'pola-rs\\n/\\npolars\\nPublic\\nNotifications\\nFork\\n1.6k\\nStar\\n26.8k\\nCode\\nIssues\\n1.6k\\nPull requests\\n90\\nActions\\nProjects\\n1\\nSecurity\\nInsights\\nAdditional navigation options\\nCode\\nIssues\\nPull requests\\nActions\\nProjects\\nSecurity\\nInsights\\nNew issue\\nHave a question about this project?\\nSign up for a free GitHub account to open an issue and contact its maintainers and the community.\\nSign up for GitHub\\nBy clicking “Sign up for GitHub”, you agree to our\\nterms of service\\nand\\nprivacy statement\\n. 
Bug Report: Joining multiple DataFrames is causing the joined DataFrame to grow exponentially #2986

robinhaecker opened this issue Mar 25, 2022 · 5 comments · Closed.

What language are you using? Rust
Which feature gates did you use? None.
What version of polars are you using? 0.20.0
What operating system are you using polars on? Ubuntu 20.04.4 LTS
What language version are you using? Rust 2021 edition, rustc 1.58.1 (db9d1b20b 2022-01-20)

Describe your bug.
When joining DataFrames repeatedly, columns seem to get duplicated and the DataFrame's height grows exponentially.

What are the steps to reproduce the behavior?
A minimal example can be found in this Gist: https://gist.github.com/robinhaecker/f962f678f2f21da143f576afaff92585
DataFrames are loaded from a fixed CSV file and joined into an aggregated DataFrame repeatedly. As the data itself (and hence the "timestamp" column used for the join) is identical, I would expect the DataFrame's height to stay the same. However, the DataFrame grows after each iteration.

What is the actual behavior?
The DataFrame grows exponentially with each iteration. Actual output:
A) (39091, 2)
B) (40713, 3)
C) (82717, 4)
D) (1411941, 5)

What is the expected behavior?
Expected output:
A) (39091, 2)
B) (39091, 3)
C) (39091, 4)
D) (39091, 5)

What do you think polars should have done?
When joining two DataFrames on identical columns, the columns should be appended and the number of rows should stay the same.
PS: I really like Polars, it's a great library. :-)

robinhaecker added the bug label on Mar 25, 2022.

ritchie46 (Member) commented Mar 25, 2022:

What kind of join are you doing? Only a left join guarantees the same number of rows; an inner join creates more rows for every match.

robinhaecker (Author) commented Mar 25, 2022 (edited):

The code snippet is here: https://gist.githubusercontent.com/robinhaecker/f962f678f2f21da143f576afaff92585/raw/c438b0e43012c1dda089fb306983ec3e946c001e/main.rs
In this example I've used an outer join, but I have also tried a left join and experienced the same issue. I have not tried an inner join, as it was not suitable for my specific use case.

ritchie46 (Member) commented Mar 25, 2022:

Now that I think of it, a left join also returns more rows if it has multiple matches.
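To make ritchie46's point concrete, here is a small illustrative sketch (in Python Polars rather than the reporter's Rust; the column names are made up): a left join emits one output row per matching pair, so a key that appears twice on the right grows the result beyond the left frame's height.

```python
import polars as pl

left = pl.DataFrame({"timestamp": [1, 2, 3], "a": [10, 20, 30]})
right = pl.DataFrame({"timestamp": [1, 2, 2, 3], "b": [1, 2, 3, 4]})  # key 2 appears twice

# One output row per matching (left, right) pair: 1 + 2 + 1 = 4 rows, not 3.
print(left.join(right, on="timestamp", how="left").shape)  # (4, 3)
```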
robinhaecker (Author) commented Mar 26, 2022:

Ok, thanks for the info. I just double-checked: it seems this is in fact caused by duplicated values; when I remove those, the problem is gone. :-) Still, I'm wondering whether this is expected behaviour? As someone not too familiar with the internals of Polars, it was certainly unexpected for me. In case this behaviour is in fact expected and intended, I would suggest documenting it on the DataFrame left_join and outer_join functions, to avoid people getting as confused as I did. Additional note: out of curiosity I also tried an inner join, and experienced exactly the same DataFrame growth in the case of duplicated values in the join columns.

zundertj (Collaborator) commented Mar 27, 2022:

This is also the behaviour in Pandas and SQL. The reason is that a join always returns all combinations of keys that match left with right, aka the Cartesian product. So if a key exists twice in the right dataframe, every record in the left dataframe with that same key will be returned twice. Pandas has a good write-up on this: https://pandas.pydata.org/docs/user_guide/merging.html#brief-primer-on-merge-methods-relational-algebra. An inner join means that only rows which have a match in the other dataframe are returned; a left (right) join also returns entries in the left (right) dataframe which do not have a match in the right (left) dataframe. See also https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins for a visual.

zundertj removed the bug label on Mar 27, 2022, and ritchie46 closed this issue as completed on Mar 28, 2022. 3 participants.
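Following zundertj's explanation, the height only stays constant across repeated joins if the key is unique on at least one side. A sketch of the fix the reporter describes (removing the duplicated values), under the assumption that keeping the first row per key is acceptable:

```python
import polars as pl

agg = pl.DataFrame({"timestamp": [1, 2, 3], "v0": [0.0, 0.1, 0.2]})
new = pl.DataFrame({"timestamp": [1, 1, 2, 3], "v1": [1.0, 1.5, 2.0, 3.0]})

# De-duplicate the join key first, so each key matches at most once and the
# Cartesian product per key is 1x1: the joined height equals the left height.
deduped = new.unique(subset="timestamp", keep="first")
print(agg.join(deduped, on="timestamp", how="left").shape)  # (3, 3)
```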
set_sorted seems too often necessary #9931

m-legrand opened this issue Jul 17, 2023 · 7 comments · Fixed by #9933 · Closed. Labels: accepted (ready for implementation), bug, python.

Checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of Polars.

Reproducible example

```python
# 1. Inner joins
>>> df1 = pl.DataFrame({"x": [1, 2, 3], "y": [2, 4, 6]})
>>> df2 = pl.DataFrame({"x": [1, 2, 3], "z": [1, 4, 9]})
>>> df1.set_sorted("x").join(df2, on="x", how="inner")["x"].flags
{'SORTED_ASC': False, 'SORTED_DESC': False}
>>> df1.join(df2.set_sorted("x"), on="x", how="inner")["x"].flags
{'SORTED_ASC': False, 'SORTED_DESC': False}

# 2. Singleton dataframes
>>> pl.DataFrame({"x": [1]})["x"].flags
{'SORTED_ASC': False, 'SORTED_DESC': False}

# 3. Left of an asof join
>>> df1.join_asof(df2.set_sorted("x"), on="x")
InvalidOperationError: argument in operation 'asof_join' is not explicitly sorted
```

Issue description

These days, working with Polars, I very often run into the now-dreaded

```
exceptions.InvalidOperationError: argument in operation 'asof_join' is not explicitly sorted
- If your data is ALREADY sorted, set the sorted flag with: '.set_sorted()'.
- If your data is NOT sorted, sort the 'expr/series/column' first.
```

I therefore end up adding set_sorted flags absolutely everywhere, and am actually tempted to implement an extension roughly like `df1.ext.unsafe_asof(df2, on=cols) = df1.set_sorted(cols).join_asof(df2.set_sorted(cols), on=cols)`, which is obviously not ideal. Digging into the cases of my code where this comes from, I isolated the examples reproduced above, but I'm sure there are more where those came from. I just want to stress that this is not a critical bug (it errs on the conservative side rather than producing wrong results). However, it greatly impacts the ease and joy of using Polars, and ultimately weighs into the decision of whether to use it for a given data-combination task.

Related issues: partition_by(maintain_order=True) doesn't preserve SORTED_ASC/SORTED_DESC flags #9757 (partition_by).

Expected behavior

In case 1 I would expect a sorted result, as the inner join should preserve the order.
In case 2 it would be convenient to have the dataframe already flagged as sorted whenever it contains only one row. It happens to me a lot when constructing a dummy single-row dataframe to extract the "as-of value" of another, non-trivial dataset (cf. case 3).
In case 3 I genuinely don't understand why the left dataframe would need to be sorted. The operation on its side is similar to a join, and I often don't even want to sort it by the time value! Only the right dataframe (the one on which we're running the as-of logic) should be required to be sorted.
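For context on the flags the example inspects (an editorial sketch, not from the issue): operations that Polars knows preserve sortedness, such as sort(), set the flag automatically, while set_sorted() merely asserts it and trusts the caller.

```python
import polars as pl

s = pl.Series("x", [3, 1, 2])
print(s.flags)         # {'SORTED_ASC': False, 'SORTED_DESC': False}
print(s.sort().flags)  # {'SORTED_ASC': True, 'SORTED_DESC': False}

# set_sorted() does not sort anything; it only asserts the flag. Asserting
# it on unsorted data can silently corrupt downstream results, which is the
# trade-off discussed in this thread.
print(s.set_sorted().flags)  # {'SORTED_ASC': True, 'SORTED_DESC': False}
```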
Installed versions

```
--------Version info---------
Polars: 0.18.7
Index type: UInt32
Platform: Windows-10-10.0.19041-SP0
Python: 3.9.0 (tags/v3.9.0:9cf6752, Oct 5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_sqlite:
connectorx:
deltalake:
fsspec: 2021.11.1
matplotlib: 3.5.1
numpy: 1.21.5
pandas: 1.3.5
pyarrow: 6.0.1
pydantic: 1.8.2
sqlalchemy: 1.4.28
xlsx2csv:
xlsxwriter:
```

ritchie46 (Member) commented Jul 17, 2023 (edited):

I don't agree with the sentiment that asking for a set_sorted is worse than silently producing flawed results. However, I do agree that we must maintain those flags. So in the case of a singleton, yes, we must set the sorted flag. And in the case of the inner join, if and only if sortedness is maintained (I have to check), we must maintain the flags. Case 3 I also have to investigate; I don't have it clear off the top of my head.

cmdlineluser (Contributor) commented Jul 17, 2023:

"In case 3 I genuinely don't understand why the left dataframe would need to be sorted." It's similar to a .search_sorted() operation, i.e. it only "works" on sorted data.

```python
s = pl.Series([5, 9, 2, 10, 8])
s.search_sorted(6, side="left")  # 3
# NOT OK - 2 and 10 are not the nearest values
s[2:4]
# shape: (2,)
# Series: '' [i64]
# [
#     2
#     10
# ]
```

If the data is sorted, the nearest values are on either side of the insertion index.

```python
s = pl.Series([5, 9, 2, 10, 8]).sort()
s.search_sorted(6, side="left")  # 2
# OK - 5 and 8 are the nearest values
s[1:3]
# shape: (2,)
# Series: '' [i64]
# [
#     5  # <- backward
#     8  # <- forward
# ]
```

ritchie46 (Member) commented Jul 17, 2023:

Yes, I can confirm that case 3 simply needs sorted data on both sides, so that is a correct error.
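The safe pattern that follows from this exchange is to sort, rather than merely flag, both sides before an asof join; sorting sets the sorted flag as a side effect. A sketch with made-up column names:

```python
import polars as pl

quotes = pl.DataFrame({"t": [5, 1, 3], "price": [1.05, 1.01, 1.03]})
trades = pl.DataFrame({"t": [4, 2], "qty": [100, 200]})

# join_asof requires both sides to be sorted on the key. sort() both sorts
# the data and sets the sorted flag, so no set_sorted() assertion is needed.
out = trades.sort("t").join_asof(quotes.sort("t"), on="t", strategy="backward")
print(out)  # each trade picks up the most recent quote at or before its time
```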
ritchie46 (Member) commented Jul 17, 2023 (edited):

Case 1: an inner join is not guaranteed to return rows in sorted order, which is why the flags reflect that; the order can change with the number of rows in your data. So the correct query in your case is to sort the data after the inner join. This is a correct error, and it would possibly have saved you from silently producing incorrect data if your input size changed. See the example below, where an inner join on sorted data produces a Series that is not sorted:

```python
df1 = pl.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 6]}).set_sorted("x")
df2 = pl.DataFrame({"x": [4, 2, 3, 1], "z": [1, 4, 9, 1]})
df1.join(df2, on="x", how="inner")["x"]
# shape: (4,)
# Series: 'x' [i64]
# [
#     4
#     2
#     3
#     1
# ]
```

This was referenced on Jul 17, 2023: feat(rust, python): keep sorted flag in streaming left join #9932 (merged) and feat(rust, python): sorted flag on singletons #9933 (merged).

ritchie46 (Member) commented Jul 17, 2023:

Case 2: #9933 adds sorted flags for singletons. I think we can close this now, as cases 1 and 3 are correct positives and case 2 is now fixed.

ritchie46 mentioned this issue on Jul 17, 2023: partition_by(maintain_order=True) doesn't preserve SORTED_ASC/SORTED_DESC flags #9757 (closed). stinodego added the accepted (ready for implementation) label on Jul 17, 2023, and ritchie46 closed this issue as completed in #9933 on Jul 18, 2023.

m-legrand (Author) commented Jul 18, 2023:

Case 1: agreed that the behaviour in practice doesn't preserve order. Your example, @ritchie46, seems to suggest that the order of the second dataframe could be preserved, so here is a counter-example I found while testing, for curious minds:

```python
>>> df = pl.DataFrame({"x": [0, 2, 1, 0]})
>>> df.join(pl.DataFrame({"x": [1, 2, 3]}), on="x", how="inner")
shape: (2, 1)
┌─────┐
│ x   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
│ 1   │
└─────┘
>>> df.join(pl.DataFrame({"x": [0, 1, 2, 3]}), on="x", how="inner")
shape: (4, 1)
┌─────┐
│ x   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
│ 0   │
│ 1   │
│ 2   │
└─────┘
```

I still think the order of at least one of the dataframes could be preserved (ideally the left one, to be consistent with other cases), maybe by adding a maintain_order: bool argument? A naive implementation could look like this:

```python
def inner_join(df1, df2, on: str, *, maintain_order: bool = False):
    if maintain_order:
        df1 = df1.filter(pl.col(on).is_in(df2[on]))
        df2 = df2.filter(pl.col(on).is_in(df1[on]))
        return df1.join(df2, on=on, how="left")
    else:
        return df1.join(df2, on=on, how="inner")
```

Case 3: yes, it is similar to a search_sorted operation, but on the right dataframe, not the left. Functionally, an asof join is a search_sorted operation on the right dataframe, mapped to each row of the left dataframe. That the left dataframe has to be sorted does make sense after some thought from a performance point of view (you don't have to search the parts of the right dataframe you've already searched), but not because of the search_sorted semantics, IMHO. Maybe a mention in the docs of why each dataframe needs to be sorted would help users like me develop a better intuition and write code more naturally.
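Applying m-legrand's inner_join sketch to the counter-example above does preserve the left frame's order, under two assumptions the sketch leaves implicit: the right frame's keys are unique (otherwise the left join itself duplicates rows), and a left join preserves the left frame's order (which held in the Polars versions discussed here).

```python
import polars as pl

df = pl.DataFrame({"x": [0, 2, 1, 0]})
other = pl.DataFrame({"x": [0, 1, 2, 3]})

# Using the inner_join() sketch defined above:
print(inner_join(df, other, on="x", maintain_order=True)["x"].to_list())
# [0, 2, 1, 0] - the left frame's order

# The built-in alternative ritchie46 recommends: sort after the join.
print(df.join(other, on="x", how="inner").sort("x")["x"].to_list())
# [0, 0, 1, 2]
```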
ritchie46 (Member) commented Jul 18, 2023:

"I still think the order of at least one of the dataframes could be preserved." Yes, it could, but it would have a runtime cost and add more complexity to our engine. If you want to maintain sortedness, my answer for now is: sort after the join.

"Yes it is similar to a search_sorted operation, but on the right dataframe, not the left." If the left DataFrame weren't sorted, we would have to do a binary search on the right-hand side for every tuple in the left, which would have very bad performance. The sortedness requirement is there because of implementation constraints.

Labels: accepted (ready for implementation), bug, python. Closed by pola-rs/polars#9933 (feat(rust, python): sorted flag on singletons). 4 participants.
+ ]
+ }
+ ]
+ }
+ ]
+}
\ No newline at end of file