Commit
Merge pull request #334 from Shreyanand/milvus
Add Milvus database compatibility with the RAG recipe
rhatdan authored May 16, 2024
2 parents 0df104d + ef4b6f0 commit ae88cd7
Showing 14 changed files with 216 additions and 111 deletions.
12 changes: 6 additions & 6 deletions .github/workflows/rag.yaml
@@ -5,16 +5,16 @@ on:
     branches:
       - main
     paths:
-      - ./recipes/common/Makefile.common
-      - ./recipes/natural_language_processing/rag/**
-      - .github/workflows/rag.yaml
+      - 'recipes/common/Makefile.common'
+      - 'recipes/natural_language_processing/rag/**'
+      - '.github/workflows/rag.yaml'
   push:
     branches:
       - main
     paths:
-      - ./recipes/common/Makefile.common
-      - ./recipes/natural_language_processing/rag/**
-      - .github/workflows/rag.yaml
+      - 'recipes/common/Makefile.common'
+      - 'recipes/natural_language_processing/rag/**'
+      - '.github/workflows/rag.yaml'
 
   workflow_dispatch:

1 change: 1 addition & 0 deletions .gitignore
@@ -12,3 +12,4 @@ recipes/common/bin/*
 */.venv/
 training/cloud/examples
 training/instructlab/instructlab
+vector_dbs/milvus/volumes/milvus/*
33 changes: 29 additions & 4 deletions recipes/natural_language_processing/rag/README.md
@@ -4,7 +4,7 @@ This demo provides a simple recipe to help developers start to build out their o
 
 There are a few options today for local Model Serving, but this recipe will use [`llama-cpp-python`](https://github.com/abetlen/llama-cpp-python) and their OpenAI compatible Model Service. There is a Containerfile provided that can be used to build this Model Service within the repo, [`model_servers/llamacpp_python/base/Containerfile`](/model_servers/llamacpp_python/base/Containerfile).
 
-In order for the LLM to interact with our documents, we need them stored and available in such a manner that we can retrieve a small subset of them that are relevant to our query. To do this we employ a Vector Database alongside an embedding model. The embedding model converts our documents into numerical representations, vectors, such that similarity searches can be easily performed. The Vector Database stores these vectors for us and makes them available to the LLM. In this recipe we will use [chromaDB](https://docs.trychroma.com/) as our Vector Database.
+In order for the LLM to interact with our documents, we need them stored and available in such a manner that we can retrieve a small subset of them that are relevant to our query. To do this we employ a Vector Database alongside an embedding model. The embedding model converts our documents into numerical representations, vectors, such that similarity searches can be easily performed. The Vector Database stores these vectors for us and makes them available to the LLM. In this recipe we can use [chromaDB](https://docs.trychroma.com/) or [Milvus](https://milvus.io/) as our Vector Database.
 
 Our AI Application will connect to our Model Service via its OpenAI compatible API. In this example we rely on [Langchain's](https://python.langchain.com/docs/get_started/introduction) python package to simplify communication with our Model Service and we use [Streamlit](https://streamlit.io/) for our UI layer. Below please see an example of the RAG application.

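[Editor's note] To make the embedding step described in the added paragraph concrete, here is a minimal sketch (not part of the recipe) using the sentence-transformers package and the BAAI/bge-base-en-v1.5 model that this recipe defaults to; the document strings and query are illustrative:

```python
# Minimal sketch: an embedding model turns documents into vectors that can
# be compared by similarity -- the operation a Vector Database serves at scale.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

documents = [
    "Podman is a daemonless container engine.",
    "Milvus is an open-source vector database.",
]
query = "Which vector database can I use?"

doc_vectors = model.encode(documents)   # one 768-dim vector per document
query_vector = model.encode(query)

# Cosine similarity: the highest-scoring document is the most relevant one.
scores = util.cos_sim(query_vector, doc_vectors)
print(documents[scores.argmax()])
```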
@@ -78,16 +78,41 @@ snapshot_download(repo_id="BAAI/bge-base-en-v1.5",
 
 ### Deploy the Vector Database
 
-To deploy the Vector Database service locally, simply use the existing ChromaDB image.
+To deploy the Vector Database service locally, simply use the existing ChromaDB or Milvus image. The Vector Database is ephemeral and will need to be re-populated each time the container restarts. When implementing RAG in production, you will want a long running and backed up Vector Database.
+
+#### ChromaDB
 ```bash
 podman pull chromadb/chroma
 ```
 ```bash
 podman run --rm -it -p 8000:8000 chroma
 ```
 
-This Vector Database is ephemeral and will need to be re-populated each time the container restarts. When implementing RAG in production, you will want a long running and backed up Vector Database.
+#### Milvus
+```bash
+podman pull milvusdb/milvus:master-20240426-bed6363f
+```
+```bash
+podman run -it \
+    --name milvus-standalone \
+    --security-opt seccomp:unconfined \
+    -e ETCD_USE_EMBED=true \
+    -e ETCD_CONFIG_PATH=/milvus/configs/embedEtcd.yaml \
+    -e COMMON_STORAGETYPE=local \
+    -v $(pwd)/volumes/milvus:/var/lib/milvus \
+    -v $(pwd)/embedEtcd.yaml:/milvus/configs/embedEtcd.yaml \
+    -p 19530:19530 \
+    -p 9091:9091 \
+    -p 2379:2379 \
+    --health-cmd="curl -f http://localhost:9091/healthz" \
+    --health-interval=30s \
+    --health-start-period=90s \
+    --health-timeout=20s \
+    --health-retries=3 \
+    milvusdb/milvus:master-20240426-bed6363f \
+    milvus run standalone 1> /dev/null
+```
+Note: For running the Milvus instance, make sure you have the `$(pwd)/volumes/milvus` directory and `$(pwd)/embedEtcd.yaml` file as shown in this repository. These are required by the database for its operations.
 
 
 ### Build the Model Service
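[Editor's note] The `embedEtcd.yaml` referenced in the added Note ships in the repository (see the `vector_dbs/milvus/Containerfile` later in this commit) but its contents are not shown in this diff. A minimal version, assuming Milvus's documented standalone defaults for its embedded etcd, looks roughly like:

```yaml
# Hypothetical reconstruction -- the real file is in the repo, not this diff.
# Values follow the defaults Milvus documents for standalone embedded etcd.
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://0.0.0.0:2379
quota-backend-bytes: 4294967296
auto-compaction-mode: revision
auto-compaction-retention: '1000'
```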
1 change: 1 addition & 0 deletions recipes/natural_language_processing/rag/app/Containerfile
@@ -16,6 +16,7 @@ COPY requirements.txt .
 RUN pip install --upgrade pip
 RUN pip install --no-cache-dir --upgrade -r /rag/requirements.txt
 COPY rag_app.py .
+COPY manage_vectordb.py .
 EXPOSE 8501
 ENV HF_HUB_CACHE=/rag/models/
 ENTRYPOINT [ "streamlit", "run" ,"rag_app.py" ]
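[Editor's note] Once built, this image is wired to the other services through the environment variables `rag_app.py` reads (`MODEL_ENDPOINT`, `VECTORDB_VENDOR`, `VECTORDB_HOST`, `VECTORDB_PORT`). A sketch of a local run, assuming the image was tagged `rag` and the addresses are placeholders for wherever the Model Service and Milvus actually run:

```bash
# Sketch only: tag `rag` and the host addresses are assumptions, not from this commit.
podman run --rm -it -p 8501:8501 \
    -e MODEL_ENDPOINT=http://10.88.0.1:8001 \
    -e VECTORDB_VENDOR=milvus \
    -e VECTORDB_HOST=10.88.0.1 \
    -e VECTORDB_PORT=19530 \
    rag
```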
81 changes: 81 additions & 0 deletions recipes/natural_language_processing/rag/app/manage_vectordb.py
@@ -0,0 +1,81 @@
from langchain_community.vectorstores import Chroma
from chromadb import HttpClient
from chromadb.config import Settings
import chromadb.utils.embedding_functions as embedding_functions
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Milvus
from pymilvus import MilvusClient
from pymilvus import connections, utility

class VectorDB:
    def __init__(self, vector_vendor, host, port, collection_name, embedding_model):
        self.vector_vendor = vector_vendor
        self.host = host
        self.port = port
        self.collection_name = collection_name
        self.embedding_model = embedding_model

    def connect(self):
        # Connection logic
        print(f"Connecting to {self.host}:{self.port}...")
        if self.vector_vendor == "chromadb":
            self.client = HttpClient(host=self.host,
                                     port=self.port,
                                     settings=Settings(allow_reset=True,))
        elif self.vector_vendor == "milvus":
            self.client = MilvusClient(uri=f"http://{self.host}:{self.port}")
        return self.client

    def populate_db(self, documents):
        # Logic to populate the VectorDB with vectors
        e = SentenceTransformerEmbeddings(model_name=self.embedding_model)
        print("Populating VectorDB with vectors...")
        if self.vector_vendor == "chromadb":
            embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=self.embedding_model)
            collection = self.client.get_or_create_collection(self.collection_name,
                                                              embedding_function=embedding_func)
            if collection.count() < 1:
                db = Chroma.from_documents(
                    documents=documents,
                    embedding=e,
                    collection_name=self.collection_name,
                    client=self.client
                )
                print("DB populated")
            else:
                db = Chroma(client=self.client,
                            collection_name=self.collection_name,
                            embedding_function=e,
                            )
                print("DB already populated")

        elif self.vector_vendor == "milvus":
            connections.connect(host=self.host, port=self.port)
            if not utility.has_collection(self.collection_name):
                db = Milvus.from_documents(
                    documents,
                    e,
                    collection_name=self.collection_name,
                    connection_args={"host": self.host, "port": self.port},
                )
                print("DB populated")
            else:
                print("DB already populated")
                db = Milvus(
                    e,
                    collection_name=self.collection_name,
                    connection_args={"host": self.host, "port": self.port},
                )
        return db

    def clear_db(self):
        print("Clearing VectorDB...")
        try:
            if self.vector_vendor == "chromadb":
                self.client.delete_collection(self.collection_name)
            elif self.vector_vendor == "milvus":
                self.client.drop_collection(self.collection_name)
            print("Cleared DB")
        except Exception:
            print("Couldn't clear the collection, possibly because it doesn't exist")
36 changes: 0 additions & 36 deletions recipes/natural_language_processing/rag/app/populate_vectordb.py

This file was deleted.

86 changes: 29 additions & 57 deletions recipes/natural_language_processing/rag/app/rag_app.py
@@ -1,91 +1,68 @@
 from langchain_openai import ChatOpenAI
 from langchain_core.prompts import ChatPromptTemplate
 from langchain_core.runnables import RunnablePassthrough
-from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
 from langchain.text_splitter import CharacterTextSplitter
 from langchain_community.callbacks import StreamlitCallbackHandler
-from langchain_community.vectorstores import Chroma
+from langchain_community.document_loaders import TextLoader
 from langchain_community.document_loaders import PyPDFLoader
-from langchain.schema.document import Document
-from chromadb import HttpClient
-from chromadb.config import Settings
-import chromadb.utils.embedding_functions as embedding_functions
-import streamlit as st
+from manage_vectordb import VectorDB
 import tempfile
-import uuid
+import streamlit as st
 import os
 
 model_service = os.getenv("MODEL_ENDPOINT","http://0.0.0.0:8001")
 model_service = f"{model_service}/v1"
 chunk_size = os.getenv("CHUNK_SIZE", 150)
 embedding_model = os.getenv("EMBEDDING_MODEL","BAAI/bge-base-en-v1.5")
+vdb_vendor = os.getenv("VECTORDB_VENDOR", "chromadb")
 vdb_host = os.getenv("VECTORDB_HOST", "0.0.0.0")
 vdb_port = os.getenv("VECTORDB_PORT", "8000")
 vdb_name = os.getenv("VECTORDB_NAME", "test_collection")
 
+vdb = VectorDB(vdb_vendor, vdb_host, vdb_port, vdb_name, embedding_model)
+vectorDB_client = vdb.connect()
+def split_docs(raw_documents):
+    text_splitter = CharacterTextSplitter(separator = ".",
+                                          chunk_size=int(chunk_size),
+                                          chunk_overlap=0)
+    docs = text_splitter.split_documents(raw_documents)
+    return docs
 
-vectorDB_client = HttpClient(host=vdb_host,
-                             port=vdb_port,
-                             settings=Settings(allow_reset=True,))
-
-def clear_vdb():
-    global vectorDB_client
-    try:
-        vectorDB_client.delete_collection(vdb_name)
-        print("Cleared DB")
-    except:
-        pass
 
 def read_file(file):
     file_type = file.type
 
     if file_type == "application/pdf":
         temp = tempfile.NamedTemporaryFile()
         with open(temp.name, "wb") as f:
             f.write(file.getvalue())
         loader = PyPDFLoader(temp.name)
-        pages = loader.load()
-        text = "".join([p.page_content for p in pages])
 
     if file_type == "text/plain":
-        text = file.read().decode()
-
-    return text
+        temp = tempfile.NamedTemporaryFile()
+        with open(temp.name, "wb") as f:
+            f.write(file.getvalue())
+        loader = TextLoader(temp.name)
+    raw_documents = loader.load()
+    return raw_documents
 
 st.title("📚 RAG DEMO")
 with st.sidebar:
     file = st.file_uploader(label="📄 Upload Document",
-                            type=[".txt",".pdf"],
-                            on_change=clear_vdb
-                            )
+                            type=[".txt",".pdf"],
+                            on_change=vdb.clear_db
+                            )
 
 ### populate the DB ####
 os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
-embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=embedding_model)
-e = SentenceTransformerEmbeddings(model_name=embedding_model)
-
-collection = vectorDB_client.get_or_create_collection(vdb_name,
-                                                      embedding_function=embedding_func)
-if collection.count() < 1 and file != None:
-    print("populating db")
+if file != None:
     text = read_file(file)
-    raw_documents = [Document(page_content=text,
-                              metadata={"":""})]
-    text_splitter = CharacterTextSplitter(separator = ".",
-                                          chunk_size=int(chunk_size),
-                                          chunk_overlap=0)
-    docs = text_splitter.split_documents(raw_documents)
-    for doc in docs:
-        collection.add(
-            ids=[str(uuid.uuid1())],
-            metadatas=doc.metadata,
-            documents=doc.page_content
-        )
-if file == None:
-    print("Empty VectorDB")
+    documents = split_docs(text)
+    db = vdb.populate_db(documents)
+    retriever = db.as_retriever(threshold=0.75)
 else:
-    print("DB already populated")
+    retriever = {}
+    print("Empty VectorDB")
 
 
 ########################
 
 if "messages" not in st.session_state:
@@ -95,11 +72,6 @@ def read_file(file):
 for msg in st.session_state.messages:
     st.chat_message(msg["role"]).write(msg["content"])
 
-db = Chroma(client=vectorDB_client,
-            collection_name=vdb_name,
-            embedding_function=e
-            )
-retriever = db.as_retriever(threshold=0.75)
 
 llm = ChatOpenAI(base_url=model_service,
                  api_key="EMPTY",
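[Editor's note] The diff is truncated at the `ChatOpenAI` construction. For context, a retriever and an OpenAI-compatible LLM like these are typically composed into a chain using the imported `ChatPromptTemplate` and `RunnablePassthrough`; the sketch below shows the likely wiring, with the prompt wording as an assumption since it is not visible in this diff:

```python
# Sketch of how retriever and llm are typically combined in a Langchain RAG chain.
# The prompt text is illustrative, not taken from this commit.
prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the following context:
{context}

Question: {input}
"""
)

chain = (
    {"context": retriever, "input": RunnablePassthrough()}
    | prompt
    | llm
)

response = chain.invoke("What does this document say about Milvus?")
```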
1 change: 1 addition & 0 deletions recipes/natural_language_processing/rag/app/requirements.txt
@@ -4,4 +4,5 @@ chromadb
 sentence-transformers
 streamlit
 pypdf
+pymilvus
12 changes: 5 additions & 7 deletions vector_dbs/README.md
@@ -1,12 +1,10 @@
 # Directory to store vector_dbs files
+This directory has Makefiles and Containerfiles for open source vector databases. The built container images are used by recipes like `rag` to provide required database functions.
 
-[Chroma](https://www.trychroma.com/) is the open-source embedding database.
+## Chroma
+[Chroma](https://www.trychroma.com/) is an AI-native open-source embedding database.
 Chroma makes it easy to build LLM apps by making knowledge, facts, and skills
 pluggable for LLMs.
 
-chromadb is an the AI-native open-source embedding database.
-
-This container image is used by recipes like `rag` to provide required database
-functions.
-
-Use the included Makefile to build the container image.
+## Milvus
+[Milvus](https://milvus.io/) is an open-source vector database built to power embedding similarity search and AI applications. It is highly scalable and offers many production-ready features for search.
2 changes: 1 addition & 1 deletion vector_dbs/Makefile → vector_dbs/chromadb/Makefile
@@ -3,4 +3,4 @@ APPIMAGE ?= quay.io/ai-lab/${APP}:latest
 
 .PHONY: build
 build:
-	podman build -f chromadb/Containerfile -t ${APPIMAGE} .
+	podman build -f Containerfile -t ${APPIMAGE} .
2 changes: 2 additions & 0 deletions vector_dbs/milvus/Containerfile
@@ -0,0 +1,2 @@
FROM docker.io/milvusdb/milvus:master-20240426-bed6363f
ADD embedEtcd.yaml /milvus/configs/embedEtcd.yaml
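[Editor's note] By analogy with the chromadb Makefile change above, this image can presumably be built the same way from inside `vector_dbs/milvus/`; the tag below is an assumption following the repo's `quay.io/ai-lab/${APP}:latest` pattern, not something shown in this commit:

```bash
# Assumed build command, mirroring the chromadb Makefile's pattern.
cd vector_dbs/milvus
podman build -f Containerfile -t quay.io/ai-lab/milvus:latest .
```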