nomic: init pkg (#16853)

Co-authored-by: Lance Martin <[email protected]>
langchain-ai · Feb 1, 2024 · 17e8863 · 17e8863
1 parent 2e5949b
commit 17e8863
Show file tree

Hide file tree

Showing 24 changed files with 2,179 additions and 0 deletions.
diff --git a/.github/workflows/_integration_test.yml b/.github/workflows/_integration_test.yml
@@ -56,6 +56,7 @@ jobs:
           GOOGLE_SEARCH_API_KEY: ${{ secrets.GOOGLE_SEARCH_API_KEY }}
           GOOGLE_CSE_ID: ${{ secrets.GOOGLE_CSE_ID }}
           EXA_API_KEY: ${{ secrets.EXA_API_KEY }}
+          NOMIC_API_KEY: ${{ secrets.NOMIC_API_KEY }}
         run: |
           make integration_tests
 

diff --git a/.github/workflows/_release.yml b/.github/workflows/_release.yml
@@ -175,6 +175,7 @@ jobs:
           GOOGLE_SEARCH_API_KEY: ${{ secrets.GOOGLE_SEARCH_API_KEY }}
           GOOGLE_CSE_ID: ${{ secrets.GOOGLE_CSE_ID }}
           EXA_API_KEY: ${{ secrets.EXA_API_KEY }}
+          NOMIC_API_KEY: ${{ secrets.NOMIC_API_KEY }}
         run: make integration_tests
         working-directory: ${{ inputs.working-directory }}
 

diff --git a/cookbook/nomic_embeddings.ipynb b/cookbook/nomic_embeddings.ipynb
@@ -0,0 +1,301 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d8da6094-30c7-43f3-a608-c91717b673db",
+   "metadata": {},
+   "source": [
+    "# Nomic Embeddings\n",
+    "\n",
+    "Nomic has released a new embedding model with strong performance for long context retrieval (8k context window).\n",
+    "\n",
+    "## Signup\n",
+    "\n",
+    "Get your API token, then run:\n",
+    "```\n",
+    "! nomic login\n",
+    "```\n",
+    "\n",
+    "Then run with your generated API token \n",
+    "```\n",
+    "! nomic login < token > \n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f737ec15-e9ab-4629-b54c-24be69e8b60b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! nomic login"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "8ab7434a-2930-42b5-9164-dc2c03abe232",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! nomic login token"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a3501e2a-4686-4b95-8a1c-f19e035ea354",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! pip install -U langchain-nomic"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "134475f2-f256-4c13-9712-c55783e6a4e2",
+   "metadata": {},
+   "source": [
+    "## Document Loading\n",
+    "\n",
+    "Let's test 3 interesting blog posts."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "01c4d270-171e-45c2-a1b6-e350faa74117",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import WebBaseLoader\n",
+    "\n",
+    "urls = [\n",
+    "    \"https://lilianweng.github.io/posts/2023-06-23-agent/\",\n",
+    "    \"https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/\",\n",
+    "    \"https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/\",\n",
+    "]\n",
+    "\n",
+    "docs = [WebBaseLoader(url).load() for url in urls]\n",
+    "docs_list = [item for sublist in docs for item in sublist]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "75ab7f74-873c-4d84-af5a-5cf19c61239d",
+   "metadata": {},
+   "source": [
+    "## Splitting \n",
+    "\n",
+    "Long context retrieval "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "f512e128-629e-4304-926f-94fe5c999527",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.text_splitter import CharacterTextSplitter\n",
+    "\n",
+    "text_splitter = CharacterTextSplitter.from_tiktoken_encoder(\n",
+    "    chunk_size=7500, chunk_overlap=100\n",
+    ")\n",
+    "doc_splits = text_splitter.split_documents(docs_list)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "d2a69cf0-e3ab-4c92-a1d0-10da45c08b3b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The document is 6562 tokens\n",
+      "The document is 3037 tokens\n",
+      "The document is 6092 tokens\n",
+      "The document is 1050 tokens\n",
+      "The document is 6933 tokens\n",
+      "The document is 5560 tokens\n"
+     ]
+    }
+   ],
+   "source": [
+    "import tiktoken\n",
+    "\n",
+    "encoding = tiktoken.get_encoding(\"cl100k_base\")\n",
+    "encoding = tiktoken.encoding_for_model(\"gpt-3.5-turbo\")\n",
+    "for d in doc_splits:\n",
+    "    print(\"The document is %s tokens\" % len(encoding.encode(d.page_content)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c58d1e9b-e98e-4bd9-b52f-4dfc2a4e69f4",
+   "metadata": {},
+   "source": [
+    "## Index \n",
+    "\n",
+    "Nomic embeddings [here](https://docs.nomic.ai/reference/endpoints/nomic-embed-text). "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "76447866-bf8b-412b-93bc-d6ea8ec35952",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "from langchain_community.vectorstores import Chroma\n",
+    "from langchain_core.output_parsers import StrOutputParser\n",
+    "from langchain_core.runnables import RunnableLambda, RunnablePassthrough\n",
+    "from langchain_nomic import NomicEmbeddings\n",
+    "from langchain_nomic.embeddings import NomicEmbeddings"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "15b3eab2-2689-49d4-8cb0-67ef2adcbc49",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Add to vectorDB\n",
+    "vectorstore = Chroma.from_documents(\n",
+    "    documents=doc_splits,\n",
+    "    collection_name=\"rag-chroma\",\n",
+    "    embedding=NomicEmbeddings(model=\"nomic-embed-text-v1\"),\n",
+    ")\n",
+    "retriever = vectorstore.as_retriever()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "41131122-3591-4566-aac1-ed19d496820a",
+   "metadata": {},
+   "source": [
+    "## RAG Chain\n",
+    "\n",
+    "We can use the Mistral `v0.2`, which is [fine-tuned for 32k context](https://x.com/dchaplot/status/1734198245067243629?s=20).\n",
+    "\n",
+    "We can [use Ollama](https://ollama.ai/library/mistral) -\n",
+    "```\n",
+    "ollama pull mistral:instruct\n",
+    "```\n",
+    "\n",
+    "We can also run [GPT-4 128k](https://openai.com/blog/new-models-and-developer-products-announced-at-devday). "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "1397de64-5b4a-4001-adc5-570ff8d31ff6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.chat_models import ChatOllama\n",
+    "from langchain_core.prompts import ChatPromptTemplate\n",
+    "from langchain_openai import ChatOpenAI\n",
+    "\n",
+    "# Prompt\n",
+    "template = \"\"\"Answer the question based only on the following context:\n",
+    "{context}\n",
+    "\n",
+    "Question: {question}\n",
+    "\"\"\"\n",
+    "prompt = ChatPromptTemplate.from_template(template)\n",
+    "\n",
+    "# LLM API\n",
+    "model = ChatOpenAI(temperature=0, model=\"gpt-4-1106-preview\")\n",
+    "\n",
+    "# Local LLM\n",
+    "ollama_llm = \"mistral:instruct\"\n",
+    "model_local = ChatOllama(model=ollama_llm)\n",
+    "\n",
+    "# Chain\n",
+    "chain = (\n",
+    "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
+    "    | prompt\n",
+    "    | model_local\n",
+    "    | StrOutputParser()\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "1548e00c-1ff6-4e88-aa13-69badf2088fb",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "' Agents, especially those used in artificial intelligence and natural language processing, can have different types of memory. Here are some common types:\\n\\n1. **Short-term memory** or working memory: This is a small capacity, high-turnover memory that holds information temporarily while the agent processes it. Short-term memory is essential for tasks requiring attention and quick response, such as parsing sentences or following instructions.\\n\\n2. **Long-term memory**: This is a large capacity, low-turnover memory where agents store information for extended periods. Long-term memory enables learning from experiences, accessing past knowledge, and improving performance over time.\\n\\n3. **Explicit memory** or declarative memory: Agents use explicit memory to store and recall facts, concepts, and rules that can be expressed in natural language. This type of memory is crucial for problem solving and reasoning.\\n\\n4. **Implicit memory** or procedural memory: Implicit memory refers to the acquisition and retention of skills and habits. The agent learns through repeated experiences without necessarily being aware of it.\\n\\n5. **Connectionist memory**: Connectionist memory, also known as neural networks, is inspired by the structure and function of biological brains. Connectionist models learn and store information in interconnected nodes or artificial neurons. This type of memory enables the model to recognize patterns and generalize knowledge.\\n\\n6. **Hybrid memory systems**: Many advanced agents employ a combination of different memory types to maximize their learning potential and performance. These hybrid systems can integrate short-term, long-term, explicit, implicit, and connectionist memories.'"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Question\n",
+    "chain.invoke(\"What are the types of agent memory?\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5ec5b4c3-757d-44df-92ea-dd5f08017dd6",
+   "metadata": {},
+   "source": [
+    "**Mistral**\n",
+    "\n",
+    "Trace: 24k prompt tokens.\n",
+    "\n",
+    "* https://smith.langchain.com/public/3e04d475-ea08-4ee3-ae66-6416a93d8b08/r\n",
+    "\n",
+    "--- \n",
+    "\n",
+    "Some considerations are noted in the [needle in a haystack analysis](https://twitter.com/GregKamradt/status/1722386725635580292?lang=en):\n",
+    "\n",
+    "* LLMs may suffer with retrieval from large context depending on where the information is placed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6ffb6b63-17ee-42d8-b1fb-d6a866e98458",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/docs/docs/integrations/providers/nomic.ipynb b/docs/docs/integrations/providers/nomic.ipynb
@@ -0,0 +1,53 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Nomic\n",
+    "\n",
+    "Nomic currently offers two products:\n",
+    "\n",
+    "- Atlas: their Visual Data Engine\n",
+    "- GPT4All: their Open Source Edge Language Model Ecosystem\n",
+    "\n",
+    "Currently, you can import their hosted [embedding model](/docs/integrations/text_embedding/nomic) as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "id": "y8ku6X96sebl"
+   },
+   "outputs": [],
+   "source": [
+    "from langchain_nomic import NomicEmbeddings"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}