From 5cc69d4f419117a75d9fb561af9b0badb2350e87 Mon Sep 17 00:00:00 2001
From: ajosh0504
Date: Thu, 1 Aug 2024 14:13:28 +0200
Subject: [PATCH] Updating embedding section

---
 docs/50-prepare-the-data/4-embed-data.mdx                 | 10 +++++++---
 .../2-create-vector-index.mdx                             |  4 ++--
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/docs/50-prepare-the-data/4-embed-data.mdx b/docs/50-prepare-the-data/4-embed-data.mdx
index c9fe502..c0511bc 100644
--- a/docs/50-prepare-the-data/4-embed-data.mdx
+++ b/docs/50-prepare-the-data/4-embed-data.mdx
@@ -12,7 +12,7 @@ The answers for code blocks in this section are as follows:
 Answer
 ```python
-SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
+SentenceTransformer("thenlper/gte-small")
 ```
@@ -35,9 +35,13 @@ return embedding.tolist()
 Answer
 ```python
-for doc in split_docs:
+for doc in tqdm(split_docs):
     doc["embedding"] = get_embedding(doc["body"])
     embedded_docs.append(doc)
 ```
-
\ No newline at end of file
+
+
+:::caution
+If the embedding generation is taking too long (> 2-3 min), kill/interrupt the cell and move on to the next step with the documents that have been embedded up until that point.
+:::
\ No newline at end of file
diff --git a/docs/60-perform-semantic-search/2-create-vector-index.mdx b/docs/60-perform-semantic-search/2-create-vector-index.mdx
index e4a71b9..530b4c9 100644
--- a/docs/60-perform-semantic-search/2-create-vector-index.mdx
+++ b/docs/60-perform-semantic-search/2-create-vector-index.mdx
@@ -23,7 +23,7 @@ Select the `mongodb_rag_lab` database and the `knowledge` collection, change the
     {
       "type": "vector",
       "path": "embedding",
-      "numDimensions": 1024,
+      "numDimensions": 384,
       "similarity": "cosine"
     }
   ]
@@ -31,5 +31,5 @@ Select the `mongodb_rag_lab` database and the `knowledge` collection, change the
 ```
 :::info
-The number of dimensions in the index definition is 1024 since we are using Mixedbread AI's open-source [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) model to generate embeddings in this lab.
+The number of dimensions in the index definition is 384 since we are using the [gte-small](https://huggingface.co/thenlper/gte-small) model to generate embeddings in this lab.
 :::
\ No newline at end of file
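
The two files in this patch have to change together: the `numDimensions` value in the index definition must equal the length of the vectors that the embedding model produces (1024 before, 384 after). The check below is a minimal sketch of that invariant, not part of the lab code. The `find_dimension_mismatches` helper and the sample documents are hypothetical, and the `"fields"` wrapper around the index entry is an assumption, since the hunk only shows the inner object.

```python
# Index definition mirroring the patched values; the outer "fields" key is an assumption.
index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 384,
            "similarity": "cosine",
        }
    ]
}


def find_dimension_mismatches(docs, definition):
    """Return (position, actual_length) for docs whose embedding length differs from the index.

    Hypothetical helper: only handles top-level field paths such as "embedding".
    """
    mismatches = []
    for field in definition["fields"]:
        if field["type"] != "vector":
            continue
        expected = field["numDimensions"]
        for position, doc in enumerate(docs):
            embedding = doc.get(field["path"])
            actual = None if embedding is None else len(embedding)
            if actual != expected:
                mismatches.append((position, actual))
    return mismatches


# Stand-in documents: one vector of the new length, one left over from the old model.
embedded_docs = [
    {"body": "first chunk", "embedding": [0.0] * 384},
    {"body": "second chunk", "embedding": [0.0] * 1024},
]

mismatches = find_dimension_mismatches(embedded_docs, index_definition)
print(mismatches)  # [(1, 1024)]
```

Running a check of this kind before inserting documents would flag vectors generated with the previous model, which no longer match the 384-dimension index.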