-
Notifications
You must be signed in to change notification settings - Fork 15.8k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
docs: Add Databricks Vector Search example notebook (#14158)
This PR adds an example notebook for the Databricks Vector Search vector store. It also adds an introduction to the Databricks Vector Search product on the Databricks's provider page. --------- Co-authored-by: Bagatur <[email protected]>
- Loading branch information
Showing
3 changed files
with
392 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
385 changes: 385 additions & 0 deletions
385
docs/docs/integrations/vectorstores/databricks_vector_search.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,385 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "5a8c5767-adfe-4b9d-a665-a898756d7a6c", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"source": [ | ||
"# Databricks Vector Search\n", | ||
"\n", | ||
"Databricks Vector Search is a serverless similarity search engine that allows you to store a vector representation of your data, including metadata, in a vector database. With Vector Search, you can create auto-updating vector search indexes from Delta tables managed by Unity Catalog and query them with a simple API to return the most similar vectors.\n", | ||
"\n", | ||
"This notebook shows how to use LangChain with Databricks Vector Search." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "746cfacd-fb30-48fd-96a5-bbcc0d15ae49", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"source": [ | ||
"Install `databricks-vectorsearch` and related Python packages used in this notebook." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "9258a3e7-e050-4390-9d3f-9adff1460dab", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"!pip install langchain-core databricks-vectorsearch openai tiktoken" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "f4f09d6d-002d-4cb0-a664-0a83bd2a13da", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"source": [ | ||
"Use `OpenAIEmbeddings` for the embeddings." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "f11b902d-a772-45e0-bbd9-526218b717cc", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import getpass\n", | ||
"import os\n", | ||
"\n", | ||
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "59b568f3-8db2-427e-9a4a-1df6fa7a1739", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"source": [ | ||
"Split documents and get embeddings." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "b28e1c7b-eae4-4be8-abbd-8433c7557dc2", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from langchain.document_loaders import TextLoader\n", | ||
"from langchain.embeddings import OpenAIEmbeddings\n", | ||
"from langchain.text_splitter import CharacterTextSplitter\n", | ||
"\n", | ||
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n", | ||
"documents = loader.load()\n", | ||
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n", | ||
"docs = text_splitter.split_documents(documents)\n", | ||
"\n", | ||
"embeddings = OpenAIEmbeddings()\n", | ||
"emb_dim = len(embeddings.embed_query(\"hello\"))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "e8fcdda1-208a-45c9-816e-ff0d2c8f59d6", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"source": [ | ||
"## Setup Databricks Vector Search client" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "9b87fff1-99e5-4d9f-aba3-d21a7ccc498e", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from databricks.vector_search.client import VectorSearchClient\n", | ||
"\n", | ||
"vsc = VectorSearchClient()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Create a Vector Search Endpoint\n", | ||
"This endpoint is used to create and access vector search indexes." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"vsc.create_endpoint(name=\"vector_search_demo_endpoint\", endpoint_type=\"STANDARD\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "81090f87-3efd-4c1e-9f58-8d6adba7553d", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"source": [ | ||
"## Create Direct Vector Access Index\n", | ||
"Direct Vector Access Index supports direct read and write of embedding vectors and metadata through a REST API or an SDK. For this index, you manage embedding vectors and index updates yourself." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "9389ec6b-5885-411f-a26e-1a4b03651f5c", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"vector_search_endpoint_name = \"vector_search_demo_endpoint\"\n", | ||
"index_name = \"ml.llm.demo_index\"\n", | ||
"\n", | ||
"index = vsc.create_direct_access_index(\n", | ||
" endpoint_name=vector_search_endpoint_name,\n", | ||
" index_name=index_name,\n", | ||
" primary_key=\"id\",\n", | ||
" embedding_dimension=emb_dim,\n", | ||
" embedding_vector_column=\"text_vector\",\n", | ||
" schema={\n", | ||
" \"id\": \"string\",\n", | ||
" \"text\": \"string\",\n", | ||
" \"text_vector\": \"array<float>\",\n", | ||
" \"source\": \"string\",\n", | ||
" },\n", | ||
")\n", | ||
"\n", | ||
"index.describe()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "047a14c9-2f06-4f74-883d-815b2c69786c", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from langchain.vectorstores import DatabricksVectorSearch\n", | ||
"\n", | ||
"dvs = DatabricksVectorSearch(\n", | ||
" index, text_column=\"text\", embedding=embeddings, columns=[\"source\"]\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "951bd581-2ced-497f-9c70-4fda902fd3a1", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"source": [ | ||
"## Add docs to the index" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "1e85f235-901f-4cf5-845f-5dbf4ce42078", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"dvs.add_documents(docs)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "8bea6f0a-b305-455a-acba-99cc8c9350b5", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"source": [ | ||
"## Similarity search" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "25c5a044-a61a-4929-9e65-a0f0462925df", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"query = \"What did the president say about Ketanji Brown Jackson\"\n", | ||
"dvs.similarity_search(query)\n", | ||
"print(docs[0].page_content)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "46e3f41b-dac2-4bed-91cb-a3914c25d275", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"source": [ | ||
"## Work with Delta Sync Index\n", | ||
"\n", | ||
"You can also use `DatabricksVectorSearch` to search in a Delta Sync Index. Delta Sync Index automatically syncs from a Delta table. You don't need to call `add_text`/`add_documents` manually. See [Databricks documentation page](https://docs.databricks.com/en/generative-ai/vector-search.html#delta-sync-index-with-managed-embeddings) for more details." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"application/vnd.databricks.v1+cell": { | ||
"cellMetadata": {}, | ||
"inputWidgets": {}, | ||
"nuid": "0c1f448e-77ca-41ce-887c-15948e866a0e", | ||
"showTitle": false, | ||
"title": "" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"dvs_delta_sync = DatabricksVectorSearch(\"catalog_name.schema_name.delta_sync_index\")\n", | ||
"dvs_delta_sync.similarity_search(query)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"application/vnd.databricks.v1+notebook": { | ||
"dashboards": [], | ||
"language": "python", | ||
"notebookMetadata": { | ||
"pythonIndentUnit": 2 | ||
}, | ||
"notebookName": "databricks_vector_search", | ||
"widgets": {} | ||
}, | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.10" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
Oops, something went wrong.