diff --git a/notebooks/RAGoon_SimilaritySearch_cookbook.ipynb b/notebooks/RAGoon_SimilaritySearch_cookbook.ipynb new file mode 100644 index 0000000..c614580 --- /dev/null +++ b/notebooks/RAGoon_SimilaritySearch_cookbook.ipynb @@ -0,0 +1,6060 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# RAGoon SimilaritySearch cookbook ⚡\n", + "[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg)](https://badge.fury.io/py/tensorflow) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)\n", + "[![GitHub](https://img.shields.io/badge/GitHub-Project-blue?logo=github)](https://github.com/louisbrulenaudet/ragoon)\n", + "\n", + "![Plot](https://github.com/louisbrulenaudet/ragoon/blob/main/thumbnail.png?raw=true)\n", + "\n", + "RAGoon is a set of NLP utilities for multi-model embedding production and high-dimensional vector visualization. It aims to improve language model performance by providing contextually relevant information through search-based querying, web scraping, and data augmentation techniques.\n", + "\n", + "In this notebook, you will learn how to create and search documents in a corpus using scalar (int8) rescoring.\n", + "\n", + "## Quick install\n", + "The reference page for RAGoon is available on PyPI: [RAGoon](https://pypi.org/project/ragoon/).\n", + "\n", + "```shell\n", + "pip install ragoon\n", + "```\n", + "\n", + "## Citing this project\n", + "If you use this code in your research, please use the following BibTeX entry.\n", + "\n", + "```BibTeX\n", + "@misc{louisbrulenaudet2024,\n", + "\tauthor = {Louis Brulé Naudet},\n", + "\ttitle = {RAGoon : High level library for batched embeddings generation, blazingly-fast web-based RAG and quantitized 
indexes processing},\n", + "\thowpublished = {\\url{https://github.com/louisbrulenaudet/ragoon}},\n", + "\tyear = {2024}\n", + "}\n", + "```\n", + "\n", + "## Feedback\n", + "If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).\n", + "\n" + ], + "metadata": { + "id": "E1qMPnLpqcr3" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Installation\n", + "\n", + "The RAGoon project leverages a variety of libraries to provide robust functionality for tasks such as embedding generation, retrieval-augmented generation (RAG), and web-based processing. Below is an overview of some key dependencies:\n", + "\n", + "- `transformers`: This library from Hugging Face is essential for working with state-of-the-art language models, enabling the project to perform tasks like text generation and model inference.\n", + "- `torch`: PyTorch is used for deep learning operations, particularly for model training and inference. It is a fundamental component for handling neural networks and tensor computations.\n", + "- `sentence_transformers`: This library simplifies the generation of dense vector representations (embeddings) from text, which is crucial for tasks like semantic search and information retrieval.\n", + "- `faiss-cpu`: FAISS is a powerful library for efficient similarity search, used in RAGoon to handle large-scale indexing and retrieval tasks with high performance.\n", + "- `httpx` and `beautifulsoup4`: These libraries are used for web scraping and making HTTP requests, enabling the project to fetch and process data from web sources efficiently.\n", + "- `openai`: This library connects to OpenAI's APIs, allowing integration with models like GPT for advanced text generation capabilities.\n", + "- `huggingface_hub`: Essential for interacting with Hugging Face’s model repository, enabling easy access to pre-trained models and datasets.\n", + "\n", + "These dependencies work together to empower RAGoon with 
advanced capabilities in natural language processing, machine learning, and web data processing, making it a versatile tool for developers and researchers in AI." + ], + "metadata": { + "id": "-UbYh3VCrikh" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "u4Bq23-p34KP", + "outputId": "0f8edae4-fb9f-4faa-a7a9-b93a5ca0233b" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Collecting ragoon\n", + " Downloading ragoon-0.0.8-py3-none-any.whl.metadata (7.7 kB)\n", + "Requirement already satisfied: beautifulsoup4==4.12.3 in /usr/local/lib/python3.10/dist-packages (from ragoon) (4.12.3)\n", + "Collecting datasets==2.20.0 (from ragoon)\n", + " Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)\n", + "Collecting faiss-cpu==1.8.0 (from ragoon)\n", + " Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)\n", + "Collecting google-api-python-client==2.126.0 (from ragoon)\n", + " Downloading google_api_python_client-2.126.0-py2.py3-none-any.whl.metadata (6.7 kB)\n", + "Collecting groq==0.9.0 (from ragoon)\n", + " Downloading groq-0.9.0-py3-none-any.whl.metadata (13 kB)\n", + "Collecting httpx==0.27.0 (from ragoon)\n", + " Downloading httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)\n", + "Collecting huggingface-hub==0.24.2 (from ragoon)\n", + " Downloading huggingface_hub-0.24.2-py3-none-any.whl.metadata (13 kB)\n", + "Collecting myst-parser==3.0.1 (from ragoon)\n", + " Downloading myst_parser-3.0.1-py3-none-any.whl.metadata (5.5 kB)\n", + "Requirement already satisfied: numpy<2 in /usr/local/lib/python3.10/dist-packages (from ragoon) (1.26.4)\n", + "Collecting numpydoc==1.7.0 (from ragoon)\n", + " Downloading numpydoc-1.7.0-py3-none-any.whl.metadata (4.2 kB)\n", + "Collecting openai==1.37.1 (from ragoon)\n", + " Downloading openai-1.37.1-py3-none-any.whl.metadata (22 kB)\n", + 
"Collecting overload==1.1 (from ragoon)\n", + " Downloading overload-1.1.tar.gz (4.6 kB)\n", + " Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + "Collecting plotly==5.23.0 (from ragoon)\n", + " Downloading plotly-5.23.0-py3-none-any.whl.metadata (7.3 kB)\n", + "Collecting pydata-sphinx-theme==0.15.4 (from ragoon)\n", + " Downloading pydata_sphinx_theme-0.15.4-py3-none-any.whl.metadata (7.5 kB)\n", + "Collecting pytest==8.3.2 (from ragoon)\n", + " Downloading pytest-8.3.2-py3-none-any.whl.metadata (7.5 kB)\n", + "Collecting scikit-learn==1.5.1 (from ragoon)\n", + " Downloading scikit_learn-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)\n", + "Collecting sentence-transformers==3.0.1 (from ragoon)\n", + " Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)\n", + "Collecting sphinx==7.4.7 (from ragoon)\n", + " Downloading sphinx-7.4.7-py3-none-any.whl.metadata (6.1 kB)\n", + "Collecting sphinx-book-theme==1.1.3 (from ragoon)\n", + " Downloading sphinx_book_theme-1.1.3-py3-none-any.whl.metadata (5.7 kB)\n", + "Requirement already satisfied: torch==2.3.1 in /usr/local/lib/python3.10/dist-packages (from ragoon) (2.3.1+cu121)\n", + "Requirement already satisfied: transformers in /usr/local/lib/python3.10/dist-packages (from ragoon) (4.42.4)\n", + "Collecting tqdm==4.66.4 (from ragoon)\n", + " Downloading tqdm-4.66.4-py3-none-any.whl.metadata (57 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m57.6/57.6 kB\u001b[0m \u001b[31m3.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hCollecting umap==0.1.1 (from ragoon)\n", + " Downloading umap-0.1.1.tar.gz (3.2 kB)\n", + " Preparing metadata (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n", + "Collecting umap-learn==0.5.6 (from ragoon)\n", + " Downloading umap_learn-0.5.6-py3-none-any.whl.metadata (21 kB)\n", + "Collecting usearch==2.12.0 (from ragoon)\n", + " Downloading usearch-2.12.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (28 kB)\n", + "Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4==4.12.3->ragoon) (2.5)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets==2.20.0->ragoon) (3.15.4)\n", + "Collecting pyarrow>=15.0.0 (from datasets==2.20.0->ragoon)\n", + " Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)\n", + "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets==2.20.0->ragoon) (0.6)\n", + "Collecting dill<0.3.9,>=0.3.0 (from datasets==2.20.0->ragoon)\n", + " Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)\n", + "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets==2.20.0->ragoon) (2.1.4)\n", + "Requirement already satisfied: requests>=2.32.2 in /usr/local/lib/python3.10/dist-packages (from datasets==2.20.0->ragoon) (2.32.3)\n", + "Collecting xxhash (from datasets==2.20.0->ragoon)\n", + " Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)\n", + "Collecting multiprocess (from datasets==2.20.0->ragoon)\n", + " Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)\n", + "Collecting fsspec<=2024.5.0,>=2023.1.0 (from fsspec[http]<=2024.5.0,>=2023.1.0->datasets==2.20.0->ragoon)\n", + " Downloading fsspec-2024.5.0-py3-none-any.whl.metadata (11 kB)\n", + "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets==2.20.0->ragoon) (3.10.1)\n", + "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets==2.20.0->ragoon) 
(24.1)\n", + "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets==2.20.0->ragoon) (6.0.2)\n", + "Requirement already satisfied: httplib2<1.dev0,>=0.19.0 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client==2.126.0->ragoon) (0.22.0)\n", + "Requirement already satisfied: google-auth!=2.24.0,!=2.25.0,<3.0.0.dev0,>=1.32.0 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client==2.126.0->ragoon) (2.27.0)\n", + "Requirement already satisfied: google-auth-httplib2<1.0.0,>=0.2.0 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client==2.126.0->ragoon) (0.2.0)\n", + "Requirement already satisfied: google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client==2.126.0->ragoon) (2.19.1)\n", + "Requirement already satisfied: uritemplate<5,>=3.0.1 in /usr/local/lib/python3.10/dist-packages (from google-api-python-client==2.126.0->ragoon) (4.1.1)\n", + "Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from groq==0.9.0->ragoon) (3.7.1)\n", + "Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from groq==0.9.0->ragoon) (1.7.0)\n", + "Requirement already satisfied: pydantic<3,>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from groq==0.9.0->ragoon) (2.8.2)\n", + "Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from groq==0.9.0->ragoon) (1.3.1)\n", + "Requirement already satisfied: typing-extensions<5,>=4.7 in /usr/local/lib/python3.10/dist-packages (from groq==0.9.0->ragoon) (4.12.2)\n", + "Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx==0.27.0->ragoon) (2024.7.4)\n", + "Collecting httpcore==1.* (from httpx==0.27.0->ragoon)\n", + " Downloading httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)\n", + "Requirement already satisfied: idna in 
/usr/local/lib/python3.10/dist-packages (from httpx==0.27.0->ragoon) (3.7)\n", + "Requirement already satisfied: docutils<0.22,>=0.18 in /usr/local/lib/python3.10/dist-packages (from myst-parser==3.0.1->ragoon) (0.18.1)\n", + "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from myst-parser==3.0.1->ragoon) (3.1.4)\n", + "Requirement already satisfied: markdown-it-py~=3.0 in /usr/local/lib/python3.10/dist-packages (from myst-parser==3.0.1->ragoon) (3.0.0)\n", + "Requirement already satisfied: mdit-py-plugins~=0.4 in /usr/local/lib/python3.10/dist-packages (from myst-parser==3.0.1->ragoon) (0.4.1)\n", + "Requirement already satisfied: tabulate>=0.8.10 in /usr/local/lib/python3.10/dist-packages (from numpydoc==1.7.0->ragoon) (0.9.0)\n", + "Requirement already satisfied: tomli>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from numpydoc==1.7.0->ragoon) (2.0.1)\n", + "Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from plotly==5.23.0->ragoon) (9.0.0)\n", + "Requirement already satisfied: Babel in /usr/local/lib/python3.10/dist-packages (from pydata-sphinx-theme==0.15.4->ragoon) (2.15.0)\n", + "Requirement already satisfied: pygments>=2.7 in /usr/local/lib/python3.10/dist-packages (from pydata-sphinx-theme==0.15.4->ragoon) (2.16.1)\n", + "Collecting accessible-pygments (from pydata-sphinx-theme==0.15.4->ragoon)\n", + " Downloading accessible_pygments-0.0.5-py3-none-any.whl.metadata (10 kB)\n", + "Requirement already satisfied: iniconfig in /usr/local/lib/python3.10/dist-packages (from pytest==8.3.2->ragoon) (2.0.0)\n", + "Requirement already satisfied: pluggy<2,>=1.5 in /usr/local/lib/python3.10/dist-packages (from pytest==8.3.2->ragoon) (1.5.0)\n", + "Requirement already satisfied: exceptiongroup>=1.0.0rc8 in /usr/local/lib/python3.10/dist-packages (from pytest==8.3.2->ragoon) (1.2.2)\n", + "Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from 
scikit-learn==1.5.1->ragoon) (1.13.1)\n", + "Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.5.1->ragoon) (1.4.2)\n", + "Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn==1.5.1->ragoon) (3.5.0)\n", + "Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from sentence-transformers==3.0.1->ragoon) (9.4.0)\n", + "Requirement already satisfied: sphinxcontrib-applehelp in /usr/local/lib/python3.10/dist-packages (from sphinx==7.4.7->ragoon) (2.0.0)\n", + "Requirement already satisfied: sphinxcontrib-devhelp in /usr/local/lib/python3.10/dist-packages (from sphinx==7.4.7->ragoon) (2.0.0)\n", + "Requirement already satisfied: sphinxcontrib-jsmath in /usr/local/lib/python3.10/dist-packages (from sphinx==7.4.7->ragoon) (1.0.1)\n", + "Requirement already satisfied: sphinxcontrib-htmlhelp>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from sphinx==7.4.7->ragoon) (2.1.0)\n", + "Requirement already satisfied: sphinxcontrib-serializinghtml>=1.1.9 in /usr/local/lib/python3.10/dist-packages (from sphinx==7.4.7->ragoon) (2.0.0)\n", + "Requirement already satisfied: sphinxcontrib-qthelp in /usr/local/lib/python3.10/dist-packages (from sphinx==7.4.7->ragoon) (2.0.0)\n", + "Collecting pygments>=2.7 (from pydata-sphinx-theme==0.15.4->ragoon)\n", + " Downloading pygments-2.18.0-py3-none-any.whl.metadata (2.5 kB)\n", + "Collecting docutils<0.22,>=0.18 (from myst-parser==3.0.1->ragoon)\n", + " Downloading docutils-0.21.2-py3-none-any.whl.metadata (2.8 kB)\n", + "Requirement already satisfied: snowballstemmer>=2.2 in /usr/local/lib/python3.10/dist-packages (from sphinx==7.4.7->ragoon) (2.2.0)\n", + "Requirement already satisfied: alabaster~=0.7.14 in /usr/local/lib/python3.10/dist-packages (from sphinx==7.4.7->ragoon) (0.7.16)\n", + "Requirement already satisfied: imagesize>=1.3 in /usr/local/lib/python3.10/dist-packages (from 
sphinx==7.4.7->ragoon) (1.4.1)\n", + "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch==2.3.1->ragoon) (1.13.1)\n", + "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch==2.3.1->ragoon) (3.3)\n", + "Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.3.1->ragoon)\n", + " Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n", + "Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.3.1->ragoon)\n", + " Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n", + "Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.3.1->ragoon)\n", + " Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)\n", + "Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch==2.3.1->ragoon)\n", + " Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)\n", + "Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.3.1->ragoon)\n", + " Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n", + "Collecting nvidia-cufft-cu12==11.0.2.54 (from torch==2.3.1->ragoon)\n", + " Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n", + "Collecting nvidia-curand-cu12==10.3.2.106 (from torch==2.3.1->ragoon)\n", + " Using cached nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)\n", + "Collecting nvidia-cusolver-cu12==11.4.5.107 (from torch==2.3.1->ragoon)\n", + " Using cached nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)\n", + "Collecting nvidia-cusparse-cu12==12.1.0.106 (from torch==2.3.1->ragoon)\n", + " Using cached nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)\n", + "Collecting nvidia-nccl-cu12==2.20.5 (from torch==2.3.1->ragoon)\n", + " Using cached 
nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl.metadata (1.8 kB)\n", + "Collecting nvidia-nvtx-cu12==12.1.105 (from torch==2.3.1->ragoon)\n", + " Using cached nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.7 kB)\n", + "Requirement already satisfied: triton==2.3.1 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.1->ragoon) (2.3.1)\n", + "Requirement already satisfied: numba>=0.51.2 in /usr/local/lib/python3.10/dist-packages (from umap-learn==0.5.6->ragoon) (0.60.0)\n", + "Collecting pynndescent>=0.5 (from umap-learn==0.5.6->ragoon)\n", + " Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)\n", + "Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx==0.27.0->ragoon)\n", + " Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)\n", + "Collecting nvidia-nvjitlink-cu12 (from nvidia-cusolver-cu12==11.4.5.107->torch==2.3.1->ragoon)\n", + " Using cached nvidia_nvjitlink_cu12-12.6.20-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\n", + "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers->ragoon) (2024.5.15)\n", + "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers->ragoon) (0.4.4)\n", + "Requirement already satisfied: tokenizers<0.20,>=0.19 in /usr/local/lib/python3.10/dist-packages (from transformers->ragoon) (0.19.1)\n", + "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets==2.20.0->ragoon) (2.3.4)\n", + "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets==2.20.0->ragoon) (1.3.1)\n", + "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets==2.20.0->ragoon) (24.2.0)\n", + "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from 
aiohttp->datasets==2.20.0->ragoon) (1.4.1)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets==2.20.0->ragoon) (6.0.5)\n", + "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets==2.20.0->ragoon) (1.9.4)\n", + "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets==2.20.0->ragoon) (4.0.3)\n", + "Requirement already satisfied: googleapis-common-protos<2.0.dev0,>=1.56.2 in /usr/local/lib/python3.10/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5->google-api-python-client==2.126.0->ragoon) (1.63.2)\n", + "Requirement already satisfied: protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<6.0.0.dev0,>=3.19.5 in /usr/local/lib/python3.10/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5->google-api-python-client==2.126.0->ragoon) (3.20.3)\n", + "Requirement already satisfied: proto-plus<2.0.0dev,>=1.22.3 in /usr/local/lib/python3.10/dist-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.0,<3.0.0.dev0,>=1.31.5->google-api-python-client==2.126.0->ragoon) (1.24.0)\n", + "Requirement already satisfied: cachetools<6.0,>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0.dev0,>=1.32.0->google-api-python-client==2.126.0->ragoon) (5.4.0)\n", + "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.10/dist-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0.dev0,>=1.32.0->google-api-python-client==2.126.0->ragoon) (0.4.0)\n", + "Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.10/dist-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0.dev0,>=1.32.0->google-api-python-client==2.126.0->ragoon) (4.9)\n", + "Requirement already satisfied: 
pyparsing!=3.0.0,!=3.0.1,!=3.0.2,!=3.0.3,<4,>=2.4.2 in /usr/local/lib/python3.10/dist-packages (from httplib2<1.dev0,>=0.19.0->google-api-python-client==2.126.0->ragoon) (3.1.2)\n", + "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->myst-parser==3.0.1->ragoon) (2.1.5)\n", + "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py~=3.0->myst-parser==3.0.1->ragoon) (0.1.2)\n", + "Requirement already satisfied: llvmlite<0.44,>=0.43.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba>=0.51.2->umap-learn==0.5.6->ragoon) (0.43.0)\n", + "Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic<3,>=1.9.0->groq==0.9.0->ragoon) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.20.1 in /usr/local/lib/python3.10/dist-packages (from pydantic<3,>=1.9.0->groq==0.9.0->ragoon) (2.20.1)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets==2.20.0->ragoon) (3.3.2)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.32.2->datasets==2.20.0->ragoon) (2.0.7)\n", + "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets==2.20.0->ragoon) (2.8.2)\n", + "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets==2.20.0->ragoon) (2024.1)\n", + "Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets==2.20.0->ragoon) (2024.1)\n", + "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->torch==2.3.1->ragoon) (1.3.0)\n", + "Requirement already satisfied: pyasn1<0.7.0,>=0.4.6 in /usr/local/lib/python3.10/dist-packages (from 
pyasn1-modules>=0.2.1->google-auth!=2.24.0,!=2.25.0,<3.0.0.dev0,>=1.32.0->google-api-python-client==2.126.0->ragoon) (0.6.0)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas->datasets==2.20.0->ragoon) (1.16.0)\n", + "Downloading ragoon-0.0.8-py3-none-any.whl (37 kB)\n", + "Downloading datasets-2.20.0-py3-none-any.whl (547 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m547.8/547.8 kB\u001b[0m \u001b[31m15.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m27.0/27.0 MB\u001b[0m \u001b[31m50.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading google_api_python_client-2.126.0-py2.py3-none-any.whl (12.6 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.6/12.6 MB\u001b[0m \u001b[31m73.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading groq-0.9.0-py3-none-any.whl (103 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m103.5/103.5 kB\u001b[0m \u001b[31m8.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading httpx-0.27.0-py3-none-any.whl (75 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m75.6/75.6 kB\u001b[0m \u001b[31m6.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading huggingface_hub-0.24.2-py3-none-any.whl (417 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m417.2/417.2 kB\u001b[0m \u001b[31m26.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading myst_parser-3.0.1-py3-none-any.whl (83 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m 
\u001b[32m83.2/83.2 kB\u001b[0m \u001b[31m7.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading numpydoc-1.7.0-py3-none-any.whl (62 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m62.8/62.8 kB\u001b[0m \u001b[31m4.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading openai-1.37.1-py3-none-any.whl (337 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m337.0/337.0 kB\u001b[0m \u001b[31m22.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading plotly-5.23.0-py3-none-any.whl (17.3 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m17.3/17.3 MB\u001b[0m \u001b[31m25.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading pydata_sphinx_theme-0.15.4-py3-none-any.whl (4.6 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.6/4.6 MB\u001b[0m \u001b[31m71.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading pytest-8.3.2-py3-none-any.whl (341 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m341.8/341.8 kB\u001b[0m \u001b[31m21.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading scikit_learn-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.4 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.4/13.4 MB\u001b[0m \u001b[31m68.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m227.1/227.1 kB\u001b[0m \u001b[31m16.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading sphinx-7.4.7-py3-none-any.whl (3.4 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m 
\u001b[32m3.4/3.4 MB\u001b[0m \u001b[31m70.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading sphinx_book_theme-1.1.3-py3-none-any.whl (430 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m430.1/430.1 kB\u001b[0m \u001b[31m27.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading tqdm-4.66.4-py3-none-any.whl (78 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m78.3/78.3 kB\u001b[0m \u001b[31m6.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading umap_learn-0.5.6-py3-none-any.whl (85 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m85.7/85.7 kB\u001b[0m \u001b[31m7.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading usearch-2.12.0-cp310-cp310-manylinux_2_28_x86_64.whl (1.5 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.5/1.5 MB\u001b[0m \u001b[31m54.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading httpcore-1.0.5-py3-none-any.whl (77 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m77.9/77.9 kB\u001b[0m \u001b[31m6.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hUsing cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)\n", + "Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)\n", + "Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)\n", + "Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)\n", + "Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)\n", + "Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)\n", + "Using cached nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)\n", + "Using cached 
nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)\n", + "Using cached nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)\n", + "Using cached nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)\n", + "Using cached nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)\n", + "Downloading dill-0.3.8-py3-none-any.whl (116 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m116.3/116.3 kB\u001b[0m \u001b[31m423.5 kB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading docutils-0.21.2-py3-none-any.whl (587 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m587.4/587.4 kB\u001b[0m \u001b[31m35.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading fsspec-2024.5.0-py3-none-any.whl (316 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m316.1/316.1 kB\u001b[0m \u001b[31m24.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m39.9/39.9 MB\u001b[0m \u001b[31m16.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading pygments-2.18.0-py3-none-any.whl (1.2 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.2/1.2 MB\u001b[0m \u001b[31m58.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading pynndescent-0.5.13-py3-none-any.whl (56 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m56.9/56.9 kB\u001b[0m \u001b[31m4.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading accessible_pygments-0.0.5-py3-none-any.whl (1.4 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.4/1.4 MB\u001b[0m 
\u001b[31m66.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading multiprocess-0.70.16-py310-none-any.whl (134 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m13.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m194.1/194.1 kB\u001b[0m \u001b[31m15.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hDownloading h11-0.14.0-py3-none-any.whl (58 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m58.3/58.3 kB\u001b[0m \u001b[31m5.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hUsing cached nvidia_nvjitlink_cu12-12.6.20-py3-none-manylinux2014_x86_64.whl (19.7 MB)\n", + "Building wheels for collected packages: overload, umap\n", + " Building wheel for overload (setup.py) ... \u001b[?25l\u001b[?25hdone\n", + " Created wheel for overload: filename=overload-1.1-py3-none-any.whl size=5675 sha256=bd134871ea1dd33588cb0eb38faa5141ef6e5bf1581a2df164a784a16e4f7fee\n", + " Stored in directory: /root/.cache/pip/wheels/c2/bd/04/b71278036f82f85e09d62b31d780f87df6f2a2dd378a185b3e\n", + " Building wheel for umap (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n", + " Created wheel for umap: filename=umap-0.1.1-py3-none-any.whl size=3542 sha256=64a33bfe9c627bd2873973ff4d15a11bd54edb51ca0ae728bb1e2e868e4cb9ff\n", + " Stored in directory: /root/.cache/pip/wheels/15/f1/28/53dcf7a309118ed35d810a5f9cb995217800f3f269ab5771cb\n", + "Successfully built overload umap\n", + "Installing collected packages: umap, overload, xxhash, tqdm, pytest, pygments, pyarrow, plotly, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, h11, fsspec, faiss-cpu, docutils, dill, usearch, sphinx, scikit-learn, nvidia-cusparse-cu12, nvidia-cudnn-cu12, multiprocess, huggingface-hub, httpcore, accessible-pygments, pynndescent, pydata-sphinx-theme, nvidia-cusolver-cu12, numpydoc, myst-parser, httpx, umap-learn, sphinx-book-theme, openai, groq, google-api-python-client, datasets, sentence-transformers, ragoon\n", + " Attempting uninstall: tqdm\n", + " Found existing installation: tqdm 4.66.5\n", + " Uninstalling tqdm-4.66.5:\n", + " Successfully uninstalled tqdm-4.66.5\n", + " Attempting uninstall: pytest\n", + " Found existing installation: pytest 7.4.4\n", + " Uninstalling pytest-7.4.4:\n", + " Successfully uninstalled pytest-7.4.4\n", + " Attempting uninstall: pygments\n", + " Found existing installation: Pygments 2.16.1\n", + " Uninstalling Pygments-2.16.1:\n", + " Successfully uninstalled Pygments-2.16.1\n", + " Attempting uninstall: pyarrow\n", + " Found existing installation: pyarrow 14.0.2\n", + " Uninstalling pyarrow-14.0.2:\n", + " Successfully uninstalled pyarrow-14.0.2\n", + " Attempting uninstall: plotly\n", + " Found existing installation: plotly 5.15.0\n", + " Uninstalling plotly-5.15.0:\n", + " Successfully uninstalled plotly-5.15.0\n", + " Attempting uninstall: fsspec\n", + " Found existing installation: fsspec 2024.6.1\n", + " Uninstalling fsspec-2024.6.1:\n", + " 
Successfully uninstalled fsspec-2024.6.1\n", + " Attempting uninstall: docutils\n", + " Found existing installation: docutils 0.18.1\n", + " Uninstalling docutils-0.18.1:\n", + " Successfully uninstalled docutils-0.18.1\n", + " Attempting uninstall: sphinx\n", + " Found existing installation: Sphinx 5.0.2\n", + " Uninstalling Sphinx-5.0.2:\n", + " Successfully uninstalled Sphinx-5.0.2\n", + " Attempting uninstall: scikit-learn\n", + " Found existing installation: scikit-learn 1.3.2\n", + " Uninstalling scikit-learn-1.3.2:\n", + " Successfully uninstalled scikit-learn-1.3.2\n", + " Attempting uninstall: huggingface-hub\n", + " Found existing installation: huggingface-hub 0.23.5\n", + " Uninstalling huggingface-hub-0.23.5:\n", + " Successfully uninstalled huggingface-hub-0.23.5\n", + " Attempting uninstall: google-api-python-client\n", + " Found existing installation: google-api-python-client 2.137.0\n", + " Uninstalling google-api-python-client-2.137.0:\n", + " Successfully uninstalled google-api-python-client-2.137.0\n", + "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\n", + "ipython 7.34.0 requires jedi>=0.16, which is not installed.\n", + "cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 17.0.0 which is incompatible.\n", + "gcsfs 2024.6.1 requires fsspec==2024.6.1, but you have fsspec 2024.5.0 which is incompatible.\n", + "ibis-framework 8.0.0 requires pyarrow<16,>=2, but you have pyarrow 17.0.0 which is incompatible.\u001b[0m\u001b[31m\n", + "\u001b[0mSuccessfully installed accessible-pygments-0.0.5 datasets-2.20.0 dill-0.3.8 docutils-0.21.2 faiss-cpu-1.8.0 fsspec-2024.5.0 google-api-python-client-2.126.0 groq-0.9.0 h11-0.14.0 httpcore-1.0.5 httpx-0.27.0 huggingface-hub-0.24.2 multiprocess-0.70.16 myst-parser-3.0.1 numpydoc-1.7.0 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.6.20 nvidia-nvtx-cu12-12.1.105 openai-1.37.1 overload-1.1 plotly-5.23.0 pyarrow-17.0.0 pydata-sphinx-theme-0.15.4 pygments-2.18.0 pynndescent-0.5.13 pytest-8.3.2 ragoon-0.0.8 scikit-learn-1.5.1 sentence-transformers-3.0.1 sphinx-7.4.7 sphinx-book-theme-1.1.3 tqdm-4.66.4 umap-0.1.1 umap-learn-0.5.6 usearch-2.12.0 xxhash-3.4.1\n" + ] + } + ], + "source": [ + "!pip3 install ragoon polars" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "id": "r_zE8a8z4HMV" + }, + "outputs": [], + "source": [ + "import polars as pl\n", + "\n", + "from ragoon import (\n", + " dataset_loader,\n", + " SimilaritySearch,\n", + " EmbeddingsVisualizer\n", + ")" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Instance creation\n", + "\n", + "The `SimilaritySearch` class is instantiated with specific parameters to configure the embedding model and search infrastructure. 
The chosen model, `louisbrulenaudet/tsdae-lemone-mbert-base`, is likely a multilingual BERT model fine-tuned with TSDAE (Transformer-based Sequential Denoising Auto-Encoder) on a custom dataset. This model choice suggests a focus on multilingual capabilities and improved semantic representations.\n", + "\n", + "The `cuda` device specification leverages GPU acceleration, crucial for efficient processing of large datasets. The embedding dimension of `768` is typical for BERT-based models, representing a balance between expressiveness and computational efficiency. The `ip` (inner product) metric is selected for similarity comparisons; on normalized vectors it produces the same ranking as cosine similarity while skipping the norm computation. The `i8` dtype indicates 8-bit integer quantization, a technique that significantly reduces memory usage and speeds up similarity search at the cost of a small loss in accuracy." + ], + "metadata": { + "id": "Sb6QMUxtMC7x" + } + }, + { + "cell_type": "code", + "source": [ + "instance = SimilaritySearch(\n", + " model_name=\"louisbrulenaudet/tsdae-lemone-mbert-base\",\n", + " device=\"cuda\",\n", + " ndim=768,\n", + " metric=\"ip\",\n", + " dtype=\"i8\"\n", + ")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 491, + "referenced_widgets": [ + "3b0740b7c1184b11af19a2575c2028d7", + "ed036c7e98d34ca1b1fefb5f7d7da690", + "041dcd84dd8047b8bb00dcf0a8605366", + "6ffed718be9d47b08b0f319823627692", + "bd764b4c8d3a4218a51ac6cfa016b176", + "459b6bd8f7804c6fa2cfbbf3574fea81", + "310db8d9db6e4b05b3340d9ac4e6a885", + "454e18ed66a543e88829d0eb1c7585dd", + "341d5f623a4e4d97a0b02a8b1f8372fa", + "bf1831cf76294173ad2f4506bab90601", + "d840ecb454ce401b9781ca181204320f", + "990877dd1448431ca676da862de76e66", + "215f503d83524f22b534383c76bb7aa5", + "57d8c511f130440ab43cd26946d73484", + "6ad57261fbcc4033bcf3c56e01f14099", + "9d590a1002364797a243818665f95cea", + "31967af398094a5b8003a78d7cdef06a", + "ee1e1101bcae43f8b938b70758635e28", + 
"fcadc705920c4dc0a965eff5b1de8088", + "9cedabb67be046b7b0dfb7ca89c8e299", + "91962d7fbd814ae0b6a6e10072f621bc", + "910e040f53c246048d85236b525d7e95", + "017b2cc0c5594dc6a4624b0bec38fe73", + "476cac6399c142efa95e85a5b159dd20", + "ce8b3efd425e43158a2b55f8e47b8d56", + "1fe1c159c31a4f8dba9beb0c53e49bbc", + "f121aa8e2b4a41feb4e7966f1a697d6f", + "66abd6f0b54c47548f3f8074d9af2af8", + "8347c3e484fa43b785c2a1392d4bf1aa", + "5a1a0dd9fd9f4f0195f8ead413dfd1a9", + "ef52f1ca67554e29a478abcd91eae44f", + "2629efe6d2044c29a22d9736b71523f4", + "2c0e040a88e948a2835e07dd3fac05f0", + "5d298e5197c94cb183ca668444f8ca81", + "1f4ea0c70e9c4923b7de5e307c01fcd5", + "8527969d2f9d4b8da907cb8d586c27eb", + "a10b2bca915841479e634c16af092d00", + "da023e01ccb34cd6a0d63ac05c4caf0f", + "cc58183628524f24b2f1cbe66527cf03", + "77a0828fab334bf280fde83577adbc36", + "874f00bd481e40c0802f3b6312d1e7c1", + "aab5b68dd3ce4d019b582188871e937e", + "f51d2b0311ad4fa895f1c7c5c2c2ecee", + "910a5796ac244e74b3fcd189166a7279", + "897fb40e1b0347a089aa4cb0fab582f3", + "744c5ae12aac4e9faa4a4845af24e3e9", + "c4616271c618426cbe1043dd0ef541a5", + "c245609de8ea43bb88c500e484638238", + "60fd5071bae7484f991e1fc6aca170de", + "de77ef90160f4d89a01e99fab514e0bb", + "ceb21ab3f6ef43c68642439fa948590c", + "141eb847045b48f69698e36bbff54e9e", + "0f598dd46d844505a64a85ccb22aa462", + "87d22f7577a94256aeca8146b6797101", + "a0c1f63835ea4700b5d59de386045f82", + "6b07d0ea362e442fb4a9f46fa66b8155", + "4d8991a173e5485a966fc51a0d58222d", + "d57b5e6e0017437db70929616ad041d7", + "9bbe6c83fe2f4bb6817e3177006af255", + "cbcb40306ff84710983dfc47b3ed0ad0", + "76bd90a147dd4d3da5b7bcb1bd759b9d", + "a3ee7025242c48aea17ce4935c467742", + "4267bf1283a04abda6e7877b43d04b52", + "0bdddb3424f44dbe97674c45488dd4eb", + "8330426f6ff140b6aba7190543688c2f", + "0fabe9fd86344ae1928752e6b4abc883", + "78a210fcb03a4450b34e73b7f4dc2280", + "ccb12bd8b80242c38f644fc73bc4edd5", + "8b647e9a5eb84dada7715f764258a7df", + "4398e38513ae40d5a0a9399c89fd60a7", + 
"fee8560569fa4c8797ce9d25fe587bff", + "3ff7fe78226542c4adc84898179249a6", + "0c959e89a32e42d8885bb9168ab3036f", + "bc7a5be54a8e43fc92a49b7c812b8a48", + "f8af455ea779482cab33235f505fe5ef", + "1e2db34d262f41cabd288f1100fed025", + "d26d7c429d9a4e80aba1c7de6e7e64f0", + "558cf7b57e9c4f729d834ac6d5ef6bfe", + "4cf43d2ded134c06ab1df7509d226175", + "14e16de286304a94a75960fe47176237", + "c435a072016d4dc3b30e409f2410450e", + "b4295cdb89d14877b60eb45c913f7409", + "91db42f83e454d1baf20cc1e0af0e1e1", + "227f6358cc4a47b3b6da3001a0e3fdbf", + "cf4bae6c34b34d3abeaa1a77454f4097", + "834f4a6d80eb4c1a8a3cf6276e071553", + "c5d2f8d7eb8d40c8b976c43406ed0204", + "6848d43f27bc4ce195806fb51e741a6f", + "c182ee1819344032ad954b39d261e35e", + "b20264f18ecc4f89bdb2ad9ec3052d0f", + "0ab071ebac92435791261ebebca4d103", + "115636ffa31a4212a2bb7294733744f3", + "c8b243e465c74845b21ce64729bd7e34", + "09d7e4295da04cc68569f096f60dc0da", + "18a5275ac2864347ba605046e798f344", + "a193ba8df21d4434a83079faaace6a16", + "810c7730004b4d92adde4a835b9e8896", + "9c17c34bca294f6ab92dc2973ad831bd", + "9beabf1daab340ae82a12ae4ad44defd", + "0d52cd598ddb4f63ab68ebb3a1672936", + "a7378ef70c964373873ac13eb9188af7", + "532eeb99350e444fad7fe3f659e118cb", + "99f18945bfa4496badf1b6398eced269", + "26924bd271a34fd3b6bc4cd4ef7d0fc8", + "1069c57ab76349acb1df1e267836877b", + "0d5f036ec3d84483b88de5c312c24631", + "b35af0a4b0394d6ca66388187a12b9c9", + "fd08587944f34eb2a2e72d6ebdf11dc9", + "5f8560ee9dea418a85f693dd4f771a72", + "961a5e9e866f4a0aa772e0395d2c5cc4", + "9ba34593f7c449acacde80b276e47e5e", + "b227d4b242384297883d73e2ef4ad36b", + "2aa6f24c423a4c9fbae4c9e2f59cd1ef", + "13147575bc324653ad03fe2b764c0f30", + "622c32270e6548ed9514345c0c1450cb", + "15f0540100a44d1cada3d2f6a0e927d4", + "183376a937604feca5a8fe1e01e8d8e6", + "e192d2a4b79b45ff858d90751fc7ebea", + "2693cf39ba874fff92e1b5294e796ed7", + "a070c177e8f94802a073d9be6a3226b6", + "984117efb8c548ccb2debd825d89d156" + ] + }, + "id": "DO8LihEaL9Es", + "outputId": 
"05761d16-36e5-4278-c36b-f883b46a39fd" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n", + "The secret `HF_TOKEN` does not exist in your Colab secrets.\n", + "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n", + "You will be able to reuse this secret in all of your notebooks.\n", + "Please note that authentication is recommended but still optional to access public models or datasets.\n", + " warnings.warn(\n" + ] + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "modules.json: 0%| | 0.00/229 [00:00\n", + " .ndarray_repr .ndarray_raw_data {\n", + " display: none;\n", + " }\n", + " .ndarray_repr.show_array .ndarray_raw_data {\n", + " display: block;\n", + " }\n", + " .ndarray_repr.show_array .ndarray_image_preview {\n", + " display: none;\n", + " }\n", + " \n", + "
ndarray (414, 96) 
array([[152,  86,  70, ..., 173, 112,  84],\n",
+              "       [ 42, 215, 109, ...,  13,  60, 198],\n",
+              "       [136, 151, 117, ...,  77, 208,  22],\n",
+              "       ...,\n",
+              "       [136, 148,  85, ...,  46, 248, 198],\n",
+              "       [204, 222, 134, ..., 223, 216, 244],\n",
+              "       [ 44,  21,  70, ..., 189, 244, 246]], dtype=uint8)
" + ] + }, + "metadata": {}, + "execution_count": 8 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Quantizing embeddings to 8-bit integers\n", + "\n", + "Int8 quantization maps the continuous embedding values to a discrete set of 256 values represented by 8-bit integers. This process typically involves scaling the original values to fit within the int8 range (-128 to 127) and may use techniques like asymmetric quantization to preserve more information. While less extreme than binary quantization, int8 still offers substantial memory savings (reducing each dimension to 1 byte) while preserving more of the original information. This quantization enables efficient SIMD (Single Instruction, Multiple Data) operations on modern CPUs, significantly accelerating similarity computations." + ], + "metadata": { + "id": "KSWoo96YVlpd" + } + }, + { + "cell_type": "code", + "source": [ + "int8_embeddings = instance.quantize_embeddings(\n", + " embeddings=embeddings,\n", + " quantization_type=\"int8\"\n", + ")\n", + "\n", + "int8_embeddings" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "SCEiWKlXVyAO", + "outputId": "cac70c19-8441-4872-9cf3-afdba953bbaa" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "array([[ 55, -4, -13, ..., 5, -64, -11],\n", + " [-53, -30, 71, ..., -7, 43, 32],\n", + " [ 4, -27, -15, ..., -25, -34, 18],\n", + " ...,\n", + " [ 36, -33, -3, ..., 15, -24, 29],\n", + " [ 15, 16, -97, ..., 20, -48, 51],\n", + " [-15, -15, 96, ..., 65, -33, 37]], dtype=int8)" + ] + }, + "metadata": {}, + "execution_count": 9 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Creating a USEARCH index\n", + "\n", + "USEARCH is designed for high-performance approximate nearest neighbor search. 
The index creation process builds a hierarchical navigable small world (HNSW) graph, a layered structure that allows for efficient traversal during search operations. The use of int8 quantized embeddings enables USEARCH to leverage SIMD instructions for rapid distance calculations. The resulting index balances search speed and accuracy, allowing for fast retrieval with a controlled trade-off in precision." + ], + "metadata": { + "id": "fA8GqkFxk9AF" + } + }, + { + "cell_type": "code", + "source": [ + "instance.create_usearch_index(\n", + " int8_embeddings=int8_embeddings,\n", + " index_path=\"./usearch_int8.index\",\n", + " save=True\n", + ")" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Uyt4yUfCV3Jy", + "outputId": "29ba3af3-841d-4079-e7c6-bd8033221c88" + }, + "execution_count": 17, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "usearch.Index\n", + "- config\n", + "-- data type: ScalarKind.I8\n", + "-- dimensions: 768\n", + "-- metric: MetricKind.IP\n", + "-- multi: False\n", + "-- connectivity: 16\n", + "-- expansion on addition :128 candidates\n", + "-- expansion on search: 64 candidates\n", + "- binary\n", + "-- uses OpenMP: 0\n", + "-- uses SimSIMD: 1\n", + "-- supports half-precision: 1\n", + "-- uses hardware acceleration: haswell\n", + "- state\n", + "-- size: 414 vectors\n", + "-- memory usage: 20,975,808 bytes\n", + "-- max level: 2\n", + "--- 0. 414 nodes\n", + "--- 1. 26 nodes\n", + "--- 2. 2 nodes" + ] + }, + "metadata": {}, + "execution_count": 17 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "# Creating a FAISS index\n", + "\n", + "FAISS (Facebook AI Similarity Search) is a library that provides efficient similarity search and clustering of dense vectors. For binary vectors, FAISS typically uses specialized index structures such as `IndexBinaryFlat`. 
This index performs exhaustive search using Hamming distance, which can be computed extremely efficiently on modern hardware using XOR and population count (popcount) operations. The binary nature of the index allows for compact storage and very fast search operations, albeit with reduced granularity in similarity scores compared to float-based indices." + ], + "metadata": { + "id": "bu4tRZdnlDxe" + } + }, + { + "cell_type": "code", + "source": [ + "instance.create_faiss_index(\n", + " ubinary_embeddings=ubinary_embeddings,\n", + " index_path=\"./faiss_ubinary.index\",\n", + " save=True\n", + ")" + ], + "metadata": { + "id": "aT6qW9OZlDFZ" + }, + "execution_count": 16, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Performing a similarity search\n", + "\n", + "The search process combines the strengths of both USEARCH and FAISS indices. It likely first uses the binary FAISS index for a rapid initial filtering step, leveraging the efficiency of Hamming distance calculations. The top candidates from this step (oversampled by `rescore_multiplier` for better recall) are then refined using the more precise int8 USEARCH index. This two-stage approach balances speed and accuracy, allowing for quick pruning of unlikely candidates followed by more accurate rescoring.\n", + "\n", + "The query is first encoded using the same model and quantization processes as the corpus. With `top_k=10`, a `rescore_multiplier` of 4 means the initial retrieval fetches 40 candidates (`rescore_multiplier * top_k`), which are then reranked to produce the final top 10 results. This oversampling helps mitigate the potential loss of relevant results due to quantization approximations." 
+ ], + "metadata": { + "id": "uhPo-7y4lJyZ" + } + }, + { + "cell_type": "code", + "source": [ + "top_k_scores, top_k_indices = instance.search(\n", + " query=\"Définir le rôle d'un intermédiaire concepteur conformément à l'article 1649 AE du Code général des Impôts.\",\n", + " top_k=10,\n", + " rescore_multiplier=4\n", + ")\n", + "print(top_k_scores, top_k_indices)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 86, + "referenced_widgets": [ + "1b7a1c758c9844f6bb3297f5e907b125", + "a99fdb97b94b42b1ae2ede146cc02512", + "ce8efeb255504bcd8797022b70c89ac0", + "769a7ab752aa41a297ffa878f3b717b6", + "902ca7262ca94a7180f522df6b9b54f9", + "5f85ba1a93d04dbda65acab723e60ba2", + "125128e3613f40feb8153e3dd9e3b3aa", + "c2d320a7c814446db7159feeb907b6a2", + "f5d20157226846fd983f11904cf9eecf", + "fb117e7bda4a445ba65b472a64971440", + "ffcdbc40630a4056a8929fb5a8a88cdd" + ] + }, + "id": "SjyvMgbJlIBn", + "outputId": "b24a2110-9631-4a20-a38f-cb72d137d698" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "Batches: 0%| | 0/1 [00:00