Skip to content

Commit

Permalink
docs: Add Databricks Vector Search example notebook (#14158)
Browse files Browse the repository at this point in the history
This PR adds an example notebook for the Databricks Vector Search vector
store. It also adds an introduction to the Databricks Vector Search
product on the Databricks's provider page.

---------

Co-authored-by: Bagatur <[email protected]>
  • Loading branch information
freemso and baskaryan authored Dec 13, 2023
1 parent 414bddd commit df95abb
Show file tree
Hide file tree
Showing 3 changed files with 392 additions and 1 deletion.
6 changes: 6 additions & 0 deletions docs/docs/integrations/providers/databricks.md
Original file line number Diff line number Diff line change
Expand Up @@ -79,3 +79,9 @@ Databricks as an LLM provider

The notebook [Wrap Databricks endpoints as LLMs](/docs/integrations/llms/databricks#wrapping-a-serving-endpoint-custom-model) demonstrates how to serve a custom model that has been registered by MLflow as a Databricks endpoint.
It supports two types of endpoints: the serving endpoint, which is recommended for both production and development, and the cluster driver proxy app, which is recommended for interactive development.


Databricks Vector Search
------------------------

Databricks Vector Search is a serverless similarity search engine that allows you to store a vector representation of your data, including metadata, in a vector database. With Vector Search, you can create auto-updating vector search indexes from Delta tables managed by Unity Catalog and query them with a simple API to return the most similar vectors. See the notebook [Databricks Vector Search](/docs/integrations/vectorstores/databricks_vector_search) for instructions to use it with LangChain.
385 changes: 385 additions & 0 deletions docs/docs/integrations/vectorstores/databricks_vector_search.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,385 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "5a8c5767-adfe-4b9d-a665-a898756d7a6c",
"showTitle": false,
"title": ""
}
},
"source": [
"# Databricks Vector Search\n",
"\n",
"Databricks Vector Search is a serverless similarity search engine that allows you to store a vector representation of your data, including metadata, in a vector database. With Vector Search, you can create auto-updating vector search indexes from Delta tables managed by Unity Catalog and query them with a simple API to return the most similar vectors.\n",
"\n",
"This notebook shows how to use LangChain with Databricks Vector Search."
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "746cfacd-fb30-48fd-96a5-bbcc0d15ae49",
"showTitle": false,
"title": ""
}
},
"source": [
"Install `databricks-vectorsearch` and related Python packages used in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "9258a3e7-e050-4390-9d3f-9adff1460dab",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"!pip install langchain-core databricks-vectorsearch openai tiktoken"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "f4f09d6d-002d-4cb0-a664-0a83bd2a13da",
"showTitle": false,
"title": ""
}
},
"source": [
"Use `OpenAIEmbeddings` for the embeddings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "f11b902d-a772-45e0-bbd9-526218b717cc",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"import getpass\n",
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "59b568f3-8db2-427e-9a4a-1df6fa7a1739",
"showTitle": false,
"title": ""
}
},
"source": [
"Split documents and get embeddings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "b28e1c7b-eae4-4be8-abbd-8433c7557dc2",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"from langchain.document_loaders import TextLoader\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.text_splitter import CharacterTextSplitter\n",
"\n",
"loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
"documents = loader.load()\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"docs = text_splitter.split_documents(documents)\n",
"\n",
"embeddings = OpenAIEmbeddings()\n",
"emb_dim = len(embeddings.embed_query(\"hello\"))"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "e8fcdda1-208a-45c9-816e-ff0d2c8f59d6",
"showTitle": false,
"title": ""
}
},
"source": [
"## Setup Databricks Vector Search client"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "9b87fff1-99e5-4d9f-aba3-d21a7ccc498e",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"from databricks.vector_search.client import VectorSearchClient\n",
"\n",
"vsc = VectorSearchClient()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a Vector Search Endpoint\n",
"This endpoint is used to create and access vector search indexes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vsc.create_endpoint(name=\"vector_search_demo_endpoint\", endpoint_type=\"STANDARD\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "81090f87-3efd-4c1e-9f58-8d6adba7553d",
"showTitle": false,
"title": ""
}
},
"source": [
"## Create Direct Vector Access Index\n",
"Direct Vector Access Index supports direct read and write of embedding vectors and metadata through a REST API or an SDK. For this index, you manage embedding vectors and index updates yourself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "9389ec6b-5885-411f-a26e-1a4b03651f5c",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"vector_search_endpoint_name = \"vector_search_demo_endpoint\"\n",
"index_name = \"ml.llm.demo_index\"\n",
"\n",
"index = vsc.create_direct_access_index(\n",
" endpoint_name=vector_search_endpoint_name,\n",
" index_name=index_name,\n",
" primary_key=\"id\",\n",
" embedding_dimension=emb_dim,\n",
" embedding_vector_column=\"text_vector\",\n",
" schema={\n",
" \"id\": \"string\",\n",
" \"text\": \"string\",\n",
" \"text_vector\": \"array<float>\",\n",
" \"source\": \"string\",\n",
" },\n",
")\n",
"\n",
"index.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "047a14c9-2f06-4f74-883d-815b2c69786c",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"from langchain.vectorstores import DatabricksVectorSearch\n",
"\n",
"dvs = DatabricksVectorSearch(\n",
" index, text_column=\"text\", embedding=embeddings, columns=[\"source\"]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "951bd581-2ced-497f-9c70-4fda902fd3a1",
"showTitle": false,
"title": ""
}
},
"source": [
"## Add docs to the index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "1e85f235-901f-4cf5-845f-5dbf4ce42078",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"dvs.add_documents(docs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "8bea6f0a-b305-455a-acba-99cc8c9350b5",
"showTitle": false,
"title": ""
}
},
"source": [
"## Similarity search"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "25c5a044-a61a-4929-9e65-a0f0462925df",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"query = \"What did the president say about Ketanji Brown Jackson\"\n",
"dvs.similarity_search(query)\n",
"print(docs[0].page_content)"
]
},
{
"cell_type": "markdown",
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "46e3f41b-dac2-4bed-91cb-a3914c25d275",
"showTitle": false,
"title": ""
}
},
"source": [
"## Work with Delta Sync Index\n",
"\n",
"You can also use `DatabricksVectorSearch` to search in a Delta Sync Index. Delta Sync Index automatically syncs from a Delta table. You don't need to call `add_text`/`add_documents` manually. See [Databricks documentation page](https://docs.databricks.com/en/generative-ai/vector-search.html#delta-sync-index-with-managed-embeddings) for more details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"application/vnd.databricks.v1+cell": {
"cellMetadata": {},
"inputWidgets": {},
"nuid": "0c1f448e-77ca-41ce-887c-15948e866a0e",
"showTitle": false,
"title": ""
}
},
"outputs": [],
"source": [
"dvs_delta_sync = DatabricksVectorSearch(\"catalog_name.schema_name.delta_sync_index\")\n",
"dvs_delta_sync.similarity_search(query)"
]
}
],
"metadata": {
"application/vnd.databricks.v1+notebook": {
"dashboards": [],
"language": "python",
"notebookMetadata": {
"pythonIndentUnit": 2
},
"notebookName": "databricks_vector_search",
"widgets": {}
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Loading

0 comments on commit df95abb

Please sign in to comment.