feat: add Google BigQueryVectorSearch in vectorstore (#14829)
BigQuery vector search lets you use GoogleSQL to do semantic search,
using vector indexes for fast but approximate results, or using brute
force for exact results.

This PR integrates LangChain vectorstore with BigQuery Vector Search.


---------

Co-authored-by: Vlad Kolesnikov <[email protected]>
ashleyxuu and vladkol authored Jan 2, 2024
1 parent 02f59c2 commit 0ce7858
Showing 5 changed files with 1,322 additions and 0 deletions.
22 changes: 22 additions & 0 deletions docs/docs/integrations/platforms/google.mdx
@@ -202,6 +202,28 @@ See a [usage example](/docs/integrations/vectorstores/matchingengine).
from langchain_community.vectorstores import MatchingEngine
```

### Google BigQuery Vector Search

> [Google BigQuery](https://cloud.google.com/bigquery) is a serverless and cost-effective enterprise data warehouse in Google Cloud.
>
> Google BigQuery Vector Search lets you use GoogleSQL to do semantic search, using vector indexes for fast but approximate results, or using brute force for exact results.
> It can calculate Euclidean or cosine distance. The LangChain integration uses Euclidean distance by default.

Install the required Python package:

```bash
pip install google-cloud-bigquery
```

See a [usage example](/docs/integrations/vectorstores/bigquery_vector_search).

```python
from langchain_community.vectorstores import BigQueryVectorSearch
```
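
To make the difference between the two distance measures concrete, here is a minimal pure-Python sketch of brute-force (exact) nearest-neighbor search. It does not use BigQuery and is not how BigQuery or `BigQueryVectorSearch` is implemented; the function names and sample vectors are made up for illustration.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus cosine similarity; ignores vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_search(query, vectors, distance, k=1):
    """Compare the query against every vector and return the k nearest indices."""
    scored = sorted((distance(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in scored[:k]]


# Hypothetical 2-D "embeddings" chosen so the two measures disagree.
vectors = [[10.0, 0.0], [0.9, 0.5], [0.0, 1.0]]
query = [1.0, 0.0]

print(brute_force_search(query, vectors, euclidean_distance))  # [1]
print(brute_force_search(query, vectors, cosine_distance))     # [0]
```

Euclidean distance favors the vector that is close in absolute position, while cosine distance favors the vector pointing in the same direction regardless of length, so the choice of distance strategy can change which documents rank first.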

### Google ScaNN

>[Google ScaNN](https://github.com/google-research/google-research/tree/master/scann)
353 changes: 353 additions & 0 deletions docs/docs/integrations/vectorstores/bigquery_vector_search.ipynb
@@ -0,0 +1,353 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "E_RJy7C1bpCT"
},
"source": [
"# BigQuery Vector Search\n",
"> **BigQueryVectorSearch**:\n",
"BigQuery vector search lets you use GoogleSQL to do semantic search, using vector indexes for fast approximate results, or using brute force for exact results.\n",
"\n",
"\n",
"This tutorial illustrates how to work with an end-to-end data and embedding management system in LangChain and how to provide scalable semantic search in BigQuery."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EmPJkpOCckyh"
},
"source": [
"## Getting started\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IR54BmgvdHT_"
},
"source": [
"### Install the library"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "0ZITIDE160OD",
"outputId": "e184bc0d-6541-4e0a-82d2-1e216db00a2d"
},
"outputs": [],
"source": [
"! pip install langchain google-cloud-aiplatform google-cloud-bigquery --upgrade --user"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "v40bB_GMcr9f"
},
"source": [
"**Colab only:** Uncomment the following cell to restart the kernel, or use the restart button instead. For Vertex AI Workbench, you can restart the terminal using the button at the top."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6o0iGVIdDD6K"
},
"outputs": [],
"source": [
"# # Automatically restart kernel after installs so that your environment can access the new packages\n",
"# import IPython\n",
"\n",
"# app = IPython.Application.instance()\n",
"# app.kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Before you begin"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Set your project ID\n",
"\n",
"If you don't know your project ID, try the following:\n",
"* Run `gcloud config list`.\n",
"* Run `gcloud projects list`.\n",
"* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# @title Project { display-mode: \"form\" }\n",
"PROJECT_ID = \"\" # @param {type:\"string\"}\n",
"\n",
"# Set the project id\n",
"! gcloud config set project {PROJECT_ID}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Set the region\n",
"\n",
"You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# @title Region { display-mode: \"form\" }\n",
"REGION = \"US\" # @param {type: \"string\"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Set the dataset and table names\n",
"\n",
"Together, they identify the BigQuery table that serves as your vector store."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# @title Dataset and Table { display-mode: \"form\" }\n",
"DATASET = \"my_langchain_dataset\" # @param {type: \"string\"}\n",
"TABLE = \"doc_and_vectors\" # @param {type: \"string\"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Authenticating your notebook environment\n",
"\n",
"- If you are using **Colab** to run this notebook, run the cell below and continue.\n",
"- If you are using **Vertex AI Workbench**, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from google.colab import auth as google_auth\n",
"\n",
"google_auth.authenticate_user()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AD3yG49BdLlr"
},
"source": [
"## Demo: BigQueryVectorSearch"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create an embedding class instance\n",
"\n",
"You may need to enable the Vertex AI API in your project by running\n",
"`gcloud services enable aiplatform.googleapis.com --project {PROJECT_ID}`\n",
"(replace `{PROJECT_ID}` with your project ID).\n",
"\n",
"You can use any [LangChain embeddings model](https://python.langchain.com/docs/integrations/text_embedding/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Vb2RJocV9_LQ",
"outputId": "37f5dc74-2512-47b2-c135-f34c10afdcf4"
},
"outputs": [],
"source": [
"from langchain_community.embeddings import VertexAIEmbeddings\n",
"\n",
"embedding = VertexAIEmbeddings(\n",
" model_name=\"textembedding-gecko@latest\", project=PROJECT_ID\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create BigQuery Dataset\n",
"\n",
"Optional step to create the dataset if it doesn't exist."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from google.cloud import bigquery\n",
"\n",
"client = bigquery.Client(project=PROJECT_ID, location=REGION)\n",
"client.create_dataset(dataset=DATASET, exists_ok=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initialize BigQueryVectorSearch Vector Store with an existing BigQuery dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.vectorstores.utils import DistanceStrategy\n",
"from langchain_community.vectorstores import BigQueryVectorSearch\n",
"\n",
"store = BigQueryVectorSearch(\n",
" project_id=PROJECT_ID,\n",
" dataset_name=DATASET,\n",
" table_name=TABLE,\n",
" location=REGION,\n",
" embedding=embedding,\n",
" distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Add texts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"all_texts = [\"Apples and oranges\", \"Cars and airplanes\", \"Pineapple\", \"Train\", \"Banana\"]\n",
"metadatas = [{\"len\": len(t)} for t in all_texts]\n",
"\n",
"store.add_texts(all_texts, metadatas=metadatas)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Search for documents"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"I'd like a fruit.\"\n",
"docs = store.similarity_search(query)\n",
"print(docs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Search for documents by vector"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query_vector = embedding.embed_query(query)\n",
"docs = store.similarity_search_by_vector(query_vector, k=2)\n",
"print(docs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Search for documents with metadata filter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This should return only the \"Banana\" document (the only text whose length is 6).\n",
"docs = store.similarity_search_by_vector(query_vector, filter={\"len\": 6})\n",
"print(docs)"
]
}
],
"metadata": {
"colab": {
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 0
}