Updating code blocks

mongodb-developer · Jul 16, 2024 · aa3dd72
1 parent f15c2a1

Showing 9 changed files with 74 additions and 47 deletions.
diff --git a/docs/50-prepare-the-data/2-load-data.mdx b/docs/50-prepare-the-data/2-load-data.mdx
@@ -1,5 +1,5 @@
 # 👐 Load the dataset
 
-First, let's download the dataset for our lab. We'll use four RAG-focused blogs from our Developer Center as the source data for our RAG application.
+First, let's download the dataset for our lab. We'll use a subset of articles from the MongoDB Developer Center as the source data for our RAG application.
 
-Run all the cells under the **Step 3: Load the dataset** section in the notebook to load the blog content as LangChain Document objects.
+Run all the cells under the **Step 3: Load the dataset** section in the notebook to load the articles as a list of Python objects consisting of the content and relevant metadata.
diff --git a/docs/50-prepare-the-data/3-chunk-data.mdx b/docs/50-prepare-the-data/3-chunk-data.mdx
@@ -2,7 +2,7 @@
 
 Since we are working with large documents, we first need to break them up into smaller chunks before embedding and storing them in MongoDB.
 
-Fill in any `<CODE_BLOCK_N>` placeholders and run the cells under the **Step 4: Chunk up the data** section in the notebook to chunk up the documents we loaded.
+Fill in any `<CODE_BLOCK_N>` placeholders and run the cells under the **Step 4: Chunk up the data** section in the notebook to chunk up the articles we loaded.
 
 The answers for code blocks in this section are as follows:
 
@@ -13,7 +13,7 @@ The answers for code blocks in this section are as follows:
 <div>
 ```python
 RecursiveCharacterTextSplitter.from_tiktoken_encoder(
-    encoding_name="cl100k_base", chunk_size=200, chunk_overlap=30
+    encoding_name="cl100k_base", separators=separators, chunk_size=200, chunk_overlap=30
 )
 ```
 </div>
@@ -25,7 +25,7 @@ RecursiveCharacterTextSplitter.from_tiktoken_encoder(
 <summary>Answer</summary>
 <div>
 ```python
-text_splitter.split_documents(docs)
+doc[text_field]
 ```
 </div>
 </details>
@@ -36,7 +36,34 @@ text_splitter.split_documents(docs)
 <summary>Answer</summary>
 <div>
 ```python
-doc.dict() for doc in split_docs
+text_splitter.split_text(text)
+```
+</div>
+</details>
+
+**CODE_BLOCK_6**
+
+<details>
+<summary>Answer</summary>
+<div>
+```python
+for chunk in chunks:
+    temp = doc.copy()
+    temp[text_field] = chunk
+    chunked_data.append(temp)
+```
+</div>
+</details>
+
+**CODE_BLOCK_7**
+
+<details>
+<summary>Answer</summary>
+<div>
+```python
+for doc in docs:
+    chunks = get_chunks(doc, "body")
+    split_docs.extend(chunks)
 ```
 </div>
 </details>
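
Reviewer note: the answers for CODE_BLOCK_4 through CODE_BLOCK_7 above fit together as one helper plus a loop. The following is a minimal sketch of that flow, not the notebook's actual code. To stay runnable without LangChain or tiktoken, it swaps in a hypothetical character-based `split_text` function in place of `text_splitter.split_text`, so chunk boundaries will differ from the token-based splitter in the lab. The `title` field and the sample document are also assumptions made for illustration.

```python
from typing import Dict, List


def split_text(text: str, chunk_size: int = 200, chunk_overlap: int = 30) -> List[str]:
    """Hypothetical stand-in for text_splitter.split_text.

    Splits by characters into fixed-size windows with overlap.
    The lab's splitter counts tokens and respects separators instead.
    """
    if not text:
        return []
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i : i + chunk_size] for i in range(0, last_start, step)]


def get_chunks(doc: Dict, text_field: str) -> List[Dict]:
    """Return one copy of `doc` per chunk, with `text_field` replaced by the chunk."""
    text = doc[text_field]        # CODE_BLOCK_4 in the diff
    chunks = split_text(text)     # CODE_BLOCK_5, using the stand-in splitter
    chunked_data = []
    for chunk in chunks:          # CODE_BLOCK_6
        temp = doc.copy()         # shallow copy: nested values such as metadata are shared
        temp[text_field] = chunk
        chunked_data.append(temp)
    return chunked_data


# Sample input (assumed shape: a dict with a "body" field plus metadata).
docs = [{"title": "Example", "body": "x" * 500, "metadata": {"contentType": "Article"}}]

split_docs = []
for doc in docs:                  # CODE_BLOCK_7
    split_docs.extend(get_chunks(doc, "body"))
```

Because `doc.copy()` is shallow, every chunk carries the same metadata object as its source article. That is usually fine for read-only ingestion, but mutating one chunk's metadata would affect the others.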
diff --git a/docs/50-prepare-the-data/4-embed-data.mdx b/docs/50-prepare-the-data/4-embed-data.mdx
@@ -2,11 +2,11 @@
 
 To perform vector search on our data, we need to embed it (i.e. generate embedding vectors) before ingesting it into MongoDB.
 
-Fill in any `<CODE_BLOCK_N>` placeholders and run the cells under the **Step 5: Generate embeddings** section in the notebook to generate embeddings for the chunked documents.
+Fill in any `<CODE_BLOCK_N>` placeholders and run the cells under the **Step 5: Generate embeddings** section in the notebook to embed the chunked articles.
 
 The answers for code blocks in this section are as follows:
 
-**CODE_BLOCK_6**
+**CODE_BLOCK_8**
 
 <details>
 <summary>Answer</summary>
@@ -17,7 +17,7 @@ SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
 </div>
 </details>
 
-**CODE_BLOCK_7**
+**CODE_BLOCK_9**
 
 <details>
 <summary>Answer</summary>
@@ -29,15 +29,15 @@ return embedding.tolist()
 </div>
 </details>
 
-**CODE_BLOCK_8**
+**CODE_BLOCK_10**
 
 <details>
 <summary>Answer</summary>
 <div>
 ```python
 for doc in split_docs:
     temp = doc.copy()
-    temp["embedding"] = get_embedding(temp["page_content"])
+    temp["embedding"] = get_embedding(temp["body"])
     embedded_docs.append(temp)
 ```
 </div>

diff --git a/docs/50-prepare-the-data/5-ingest-data.mdx b/docs/50-prepare-the-data/5-ingest-data.mdx
@@ -2,13 +2,13 @@ import Screenshot from "@site/src/components/Screenshot";
 
 # 👐 Ingest data into MongoDB
 
-The final step to build a MongoDB vector store for our RAG application is to ingest the embedded documents into MongoDB.
+The final step to build a MongoDB vector store for our RAG application is to ingest the embedded article chunks into MongoDB.
 
 Fill in any `<CODE_BLOCK_N>` placeholders and run the cells under the **Step 6: Ingest data into MongoDB** section in the notebook to ingest the embedded documents into MongoDB.
 
 The answers for code blocks in this section are as follows:
 
-**CODE_BLOCK_9**
+**CODE_BLOCK_11**
 
 <details>
 <summary>Answer</summary>
@@ -19,7 +19,7 @@ MongoClient(MONGODB_URI)
 </div>
 </details>
 
-**CODE_BLOCK_10**
+**CODE_BLOCK_12**
 
 <details>
 <summary>Answer</summary>
@@ -30,7 +30,7 @@ mongo_client[DB_NAME][COLLECTION_NAME]
 </div>
 </details>
 
-**CODE_BLOCK_11**
+**CODE_BLOCK_13**
 
 <details>
 <summary>Answer</summary>
@@ -41,7 +41,7 @@ collection.delete_many({})
 </div>
 </details>
 
-**CODE_BLOCK_12**
+**CODE_BLOCK_14**
 
 <details>
 <summary>Answer</summary>

diff --git a/docs/60-perform-semantic-search/3-vector-search.mdx b/docs/60-perform-semantic-search/3-vector-search.mdx
@@ -6,7 +6,7 @@ Fill in any `<CODE_BLOCK_N>` placeholders and run the cells under the **Step 8:
 
 The answers for code blocks in this section are as follows:
 
-**CODE_BLOCK_13**
+**CODE_BLOCK_15**
 
 <details>
 <summary>Answer</summary>
@@ -17,7 +17,7 @@ get_embedding(user_query)
 </div>
 </details>
 
-**CODE_BLOCK_14**
+**CODE_BLOCK_16**
 
 <details>
 <summary>Answer</summary>
@@ -36,7 +36,7 @@ get_embedding(user_query)
     {
         "$project": {
             "_id": 0,
-            "page_content": 1,
+            "body": 1,
             "score": {"$meta": "vectorSearchScore"},
         }
     },
@@ -45,7 +45,7 @@ get_embedding(user_query)
 </div>
 </details>
 
-**CODE_BLOCK_15**
+**CODE_BLOCK_17**
 
 <details>
 <summary>Answer</summary>

diff --git a/docs/60-perform-semantic-search/4-pre-filtering.mdx b/docs/60-perform-semantic-search/4-pre-filtering.mdx
@@ -10,7 +10,7 @@ Fill in any `<CODE_BLOCK_N>` placeholders and run the cells under the **🦹‍
 
 The answers for code blocks in this section are as follows:
 
-**CODE_BLOCK_16**
+**CODE_BLOCK_18**
 
 <details>
 <summary>Answer</summary>
@@ -25,7 +25,7 @@ The answers for code blocks in this section are as follows:
             "type": "vector"
         },
         {
-            "path": "metadata.language"
+            "path": "metadata.contentType",
             "type": "filter"
         }
     ]
@@ -34,7 +34,7 @@ The answers for code blocks in this section are as follows:
 </div>
 </details>
 
-**CODE_BLOCK_17**
+**CODE_BLOCK_19**
 
 <details>
 <summary>Answer</summary>
@@ -48,13 +48,13 @@ The answers for code blocks in this section are as follows:
             "path": "embedding",
             "numCandidates": 150,
             "limit": 5,
-            "filter": {"metadata.language": "en"}
+            "filter": {"metadata.contentType": "Video"}
         }
     },
     {
         "$project": {
             "_id": 0,
-            "page_content": 1,
+            "body": 1,
             "score": {"$meta": "vectorSearchScore"}
         }
     }
@@ -63,7 +63,7 @@ The answers for code blocks in this section are as follows:
 </div>
 </details>
 
-**CODE_BLOCK_18**
+**CODE_BLOCK_20**
 
 <details>
 <summary>Answer</summary>
@@ -78,11 +78,11 @@ The answers for code blocks in this section are as follows:
             "type": "vector"
         },
         {
-            "path": "metadata.language"
+            "path": "metadata.contentType",
             "type": "filter"
         },
         {
-            "path": "type"
+            "path": "updated",
             "type": "filter"
         }
     ]
@@ -91,7 +91,7 @@ The answers for code blocks in this section are as follows:
 </div>
 </details>
 
-**CODE_BLOCK_19**
+**CODE_BLOCK_21**
 
 <details>
 <summary>Answer</summary>
@@ -107,16 +107,16 @@ The answers for code blocks in this section are as follows:
             "limit": 5,
             "filter": {
                 "$and": [
-                    {"metadata.language": "en"},
-                    {"type": "Document"}
+                    {"metadata.contentType": "Video"},
+                    {"updated": {"$gte": "2024-05-20"}}
                 ]
             }
         }
     },
     {
         "$project": {
             "_id": 0,
-            "page_content": 1,
+            "body": 1,
             "score": {"$meta": "vectorSearchScore"}
         }
     }
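
Reviewer note: the hunks above show only fragments of the `$vectorSearch` pipelines, so here is a minimal sketch of how the filtered pipeline from CODE_BLOCK_21 might be assembled as a whole. The helper name `build_pipeline`, the default index name `vector_index`, and the sample query vector are assumptions, since the diff does not show them. The field names `embedding`, `body`, `metadata.contentType`, and `updated` come from the changed lines.

```python
from typing import Any, Dict, List, Optional


def build_pipeline(
    query_vector: List[float],
    filter_expr: Optional[Dict[str, Any]] = None,
    index_name: str = "vector_index",  # assumed name, not shown in the diff
    num_candidates: int = 150,
    limit: int = 5,
) -> List[Dict[str, Any]]:
    """Assemble a $vectorSearch + $project pipeline, with an optional pre-filter."""
    vector_search: Dict[str, Any] = {
        "index": index_name,
        "queryVector": query_vector,
        "path": "embedding",
        "numCandidates": num_candidates,
        "limit": limit,
    }
    if filter_expr is not None:
        # Fields used here must also be declared with type "filter" in the index.
        vector_search["filter"] = filter_expr
    return [
        {"$vectorSearch": vector_search},
        {
            "$project": {
                "_id": 0,
                "body": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# Same filter as the updated CODE_BLOCK_21 answer.
pipeline = build_pipeline(
    query_vector=[0.1, 0.2, 0.3],  # placeholder; the lab uses get_embedding(user_query)
    filter_expr={
        "$and": [
            {"metadata.contentType": "Video"},
            {"updated": {"$gte": "2024-05-20"}},
        ]
    },
)
```

The resulting list would be passed to `collection.aggregate(pipeline)`. Calling `build_pipeline` without `filter_expr` gives the unfiltered shape of the pipeline.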

diff --git a/docs/70-build-rag-app/2-build-rag-app.mdx b/docs/70-build-rag-app/2-build-rag-app.mdx
@@ -6,7 +6,7 @@ Fill in any `<CODE_BLOCK_N>` placeholders and run the cells under the **Step 9:
 
 The answers for code blocks in this section are as follows:
 
-**CODE_BLOCK_20**
+**CODE_BLOCK_22**
 
 <details>
 <summary>Answer</summary>
@@ -17,18 +17,18 @@ vector_search(user_query)
 </div>
 </details>
 
-**CODE_BLOCK_21**
+**CODE_BLOCK_23**
 
 <details>
 <summary>Answer</summary>
 <div>
 ```python
-"\n\n".join([d.get("page_content", "") for d in context])
+"\n\n".join([d.get("body", "") for d in context])
 ```
 </div>
 </details>
 
-**CODE_BLOCK_22**
+**CODE_BLOCK_24**
 
 <details>
 <summary>Answer</summary>
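
Reviewer note: the `page_content` to `body` rename in CODE_BLOCK_23 matters because `d.get("body", "")` silently returns an empty string when the key is missing, so a stale field name would produce an empty context rather than an error. A minimal sketch of that behaviour follows. The helper names and the prompt wording are hypothetical, and the sample results stand in for the output of `vector_search(user_query)`.

```python
from typing import Dict, List


def build_context(context: List[Dict]) -> str:
    """Join the retrieved chunks into one string, as in CODE_BLOCK_23."""
    return "\n\n".join([d.get("body", "") for d in context])


def build_prompt(user_query: str, context: List[Dict]) -> str:
    """Hypothetical prompt template; the lab's actual wording is not shown in the diff."""
    return (
        "Answer the question based only on the following context.\n\n"
        f"Context:\n{build_context(context)}\n\n"
        f"Question: {user_query}"
    )


# Stand-in for vector search results after the $project stage.
results = [
    {"body": "First chunk.", "score": 0.91},
    {"body": "Second chunk.", "score": 0.88},
]

prompt = build_prompt("What is vector search?", results)

# A result that still uses the old field name contributes nothing to the context.
stale = build_context([{"page_content": "Old field name."}])
```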

diff --git a/docs/70-build-rag-app/3-stream-responses.mdx b/docs/70-build-rag-app/3-stream-responses.mdx
@@ -6,7 +6,7 @@ Fill in any `<CODE_BLOCK_N>` placeholders and run the cells under the **🦹‍
 
 The answers for code blocks in this section are as follows:
 
-**CODE_BLOCK_23**
+**CODE_BLOCK_25**
 
 <details>
 <summary>Answer</summary>
@@ -27,7 +27,7 @@ fw_client.chat.completions.create(
 </div>
 </details>
 
-**CODE_BLOCK_24**
+**CODE_BLOCK_26**
 
 <details>
 <summary>Answer</summary>