Add Astra DB blog post announcement (#284)

* Add AstraDB announcement blog post * add some more flavor text * add the rest of the code blocks and a conclusion * add image * fix image * sparkles * last => latest * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * Update content/blog/astradb-haystack-integration/index.md * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * Update content/blog/astradb-haystack-integration/index.md Co-authored-by: Daniel Sauble <[email protected]> * change date * make logo white in thumbnail --------- Co-authored-by: Daniel Sauble <[email protected]> Co-authored-by: 
Tuana Çelik <[email protected]>
deepset-ai · Jan 19, 2024 · d92d992 · d92d992 · vercel · Jan 19, 2024
1 parent cab0e73
commit d92d992
Showing 2 changed files with 173 additions and 0 deletions.
diff --git a/content/blog/astradb-haystack-integration/index.md b/content/blog/astradb-haystack-integration/index.md
@@ -0,0 +1,173 @@
+---
+layout: blog-post
+title: Announcing the Astra DB Haystack Integration
+description: Learn how to use the new Astra DB integrations for Haystack 2.0 in your RAG pipelines.
+featured_image: thumbnail.png
+images: ["blog/astradb-haystack-integration/thumbnail.png"]
+alt_image: The logos for Haystack and Astra DB hang out on a blue background in front of some people tending to pipelines, and inexplicably a giant lightbulb.
+toc: True
+date: 2024-01-19
+last_updated:  2024-01-19
+authors:
+  - Tilde Thurium
+tags: ["Embeddings", "Haystack 2.0", "Vector Database"]
+cookbook: astradb_haystack_integration.ipynb
+---
+The Haystack extension family is growing so fast, it's hard to keep up! Our latest addition is the Astra DB extension by [DataStax](https://datastax.com/). It's an open source package that helps you use Astra DB as a vector database for your Haystack pipelines.
+
+Let's learn about the benefits of Astra DB and how to use it with Haystack.
+
+### Benefits of Astra DB
+
+DataStax Astra DB is a serverless vector database built on [Apache Cassandra](https://cassandra.apache.org/_/index.html). What makes Astra DB special?
+
+- **Interoperability** with Cassandra's open source ecosystem and tooling. 
+- Astra DB **supports a variety of different embedding models**. One Astra database instance can have multiple `collections` with different vector sizes. This makes it easy to test different embedding models and find the best one for your use case.
+- **It's serverless**. What does that mean for a database? You don't have to manage individual instances, or deal with cumbersome upgrading or scaling. All of that is taken care of for you behind the scenes.
+- **Enterprise scalability**. Astra DB can be deployed across the major cloud providers (AWS, GCP, or Azure) and across multiple regions depending on your needs.
+- At the time of this writing, **there's a free tier available** so you can try it without a credit card.
+
+### Create your Astra DB database
+To ensure these instructions remain up to date, we're going to point you to the Astra DB docs to explain how to create a database.
+
+1. [Create a free Astra DB database](https://docs.datastax.com/en/astra/astra-db-vector/databases/create-database.html#create-vector-database). Make a note of your credentials - you'll need your database ID, application token, keyspace, and database region to use the Haystack extension.
+2. Choose the number of dimensions that matches the [embedding model](https://haystack.deepset.ai/blog/what-is-text-vectorization-in-nlp) you plan on using. For this example we'll use a 384-dimension model, [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).
+3. [Create a collection](https://docs.datastax.com/en/astra/astra-db-vector/databases/manage-collections.html#create-collection) with the same number of dimensions as your embedding model. Save the name of your collection since you'll need this as well. 
+
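The dimension requirement in steps 2 and 3 can be checked before anything is written to the database. The following is a minimal sketch in plain Python, not part of the integration: `check_dimension` and `EXPECTED_DIM` are hypothetical names, and the stand-in vector takes the place of a real embedding from the model.

```python
from typing import Sequence

# Dimension of sentence-transformers/all-MiniLM-L6-v2, as stated in this post.
EXPECTED_DIM = 384


def check_dimension(embedding: Sequence[float], expected_dim: int = EXPECTED_DIM) -> None:
    """Raise ValueError if the vector length differs from the collection's dimension."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, collection expects {expected_dim}"
        )


# A stand-in vector; a real one would come from the embedding model.
fake_embedding = [0.0] * 384
check_dimension(fake_embedding)  # passes silently because the sizes match
```

Running a check like this on one sample embedding is a cheap way to catch a mismatched collection early, for example after switching to a different embedding model.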
+### Get started with the Astra DB Haystack Integration
+
+First, install the integration:
+
+```bash
+pip install astra-haystack sentence-transformers
+```
+
+Remember earlier when I mentioned you were going to need your credentials? I hope you saved them. If not, that's okay, you can go back to the [Astra Portal](https://astra.datastax.com/) and grab them.
+
+> Note: if you were running this code in production, you'd want to save these as environment variables to keep things nice and secure.
+
+```python
+from getpass import getpass
+
+OPENAI_API_KEY = getpass("Enter your OpenAI key: ")
+ASTRA_DB_ID = getpass("Enter your Astra database ID: ")
+ASTRA_DB_APPLICATION_TOKEN = getpass("Enter your Astra application token (e.g. AstraCS:xxx): ")
+ASTRA_DB_REGION = getpass("Enter your Astra DB region: ")
+ASTRA_DB_COLLECTION_NAME = getpass("Enter your Astra collection name: ")
+ASTRA_DB_KEYSPACE_NAME = getpass("Enter your Astra keyspace name: ")
+```
+
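For the production case mentioned in the note above, one option is to read the same values from environment variables instead of prompting for them. This is a sketch under assumptions: `load_settings` and `REQUIRED_SETTINGS` are hypothetical helpers, and the environment variable names simply mirror the Python variable names used in this post.

```python
import os
from typing import Mapping

# Names mirror the variables used in this post (an assumption, not a convention
# required by the integration).
REQUIRED_SETTINGS = [
    "OPENAI_API_KEY",
    "ASTRA_DB_ID",
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_REGION",
    "ASTRA_DB_COLLECTION_NAME",
    "ASTRA_DB_KEYSPACE_NAME",
]


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return the required settings, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SETTINGS}
```

Calling `settings = load_settings()` at startup would then replace the `getpass` prompts, with `settings["ASTRA_DB_ID"]` and so on passed along in place of the prompted values.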
+## Using the `AstraDocumentStore` in an indexing pipeline
+Next, we'll make a Haystack pipeline, create some embeddings from documents, and add them into the [`AstraDocumentStore`](https://docs.haystack.deepset.ai/v2.0/docs/astradocumentstore).
+
+```python
+import logging
+
+from haystack import Document, Pipeline
+
+from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
+from haystack.components.writers import DocumentWriter
+from haystack.document_stores.types import DuplicatePolicy
+
+from astra_haystack.document_store import AstraDocumentStore
+
+logger = logging.getLogger(__name__)
+logging.basicConfig(level=logging.INFO)
+
+embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
+
+# embedding_dim must match both the collection's dimension and the embedding model's output size (384 for all-MiniLM-L6-v2).
+document_store = AstraDocumentStore(
+    astra_id=ASTRA_DB_ID,
+    astra_region=ASTRA_DB_REGION,
+    astra_collection=ASTRA_DB_COLLECTION_NAME,
+    astra_keyspace=ASTRA_DB_KEYSPACE_NAME,
+    astra_application_token=ASTRA_DB_APPLICATION_TOKEN,
+    duplicates_policy=DuplicatePolicy.SKIP,
+    embedding_dim=384,
+)
+
+
+# Add Documents
+documents = [
+    Document(content="There are over 7,000 languages spoken around the world today."),
+    Document(
+        content="Elephants have been observed to behave in a way that indicates"
+        " a high level of self-awareness, such as recognizing themselves in mirrors."
+    ),
+    Document(
+        content="In certain parts of the world, like the Maldives, Puerto Rico, "
+        "and San Diego, you can witness the phenomenon of bioluminescent waves."
+    ),
+]
+index_pipeline = Pipeline()
+index_pipeline.add_component(
+    instance=SentenceTransformersDocumentEmbedder(model=embedding_model_name),
+    name="embedder",
+)
+index_pipeline.add_component(instance=DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP), name="writer")
+index_pipeline.connect("embedder.documents", "writer.documents")
+
+index_pipeline.run({"embedder": {"documents": documents}})
+
+print(document_store.count_documents())
+```
+If all has gone well, there should be 3 documents. 🎉
+
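The pipeline above writes with `DuplicatePolicy.SKIP`, which means a document whose ID already exists in the store is left alone rather than rewritten. The toy sketch below illustrates that behaviour with a plain dictionary; `ToyDocument` and `write_skipping_duplicates` are hypothetical stand-ins, not Haystack or Astra DB code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyDocument:
    id: str
    content: str


def write_skipping_duplicates(store: Dict[str, ToyDocument], docs: List[ToyDocument]) -> int:
    """Write docs whose id is not yet in the store; return how many were written."""
    written = 0
    for doc in docs:
        if doc.id in store:
            continue  # SKIP: keep the existing document untouched
        store[doc.id] = doc
        written += 1
    return written


store: Dict[str, ToyDocument] = {}
docs = [ToyDocument("a", "first"), ToyDocument("b", "second")]
first_run = write_skipping_duplicates(store, docs)
second_run = write_skipping_duplicates(store, docs)
print(first_run, second_run, len(store))  # prints: 2 0 2
```

By the same logic, re-running the indexing pipeline with unchanged documents should leave the count at 3, assuming the documents get the same IDs on each run.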
+## Use the `AstraRetriever` in a Haystack RAG pipeline
+
+In Haystack, every `DocumentStore` is tightly coupled with the `Retriever` that fetches from it. Astra DB is no exception. Here we'll create a RAG pipeline, where the [`AstraRetriever`](https://docs.haystack.deepset.ai/v2.0/docs/astraretriever) will fetch documents relevant to our query.
+
+```python
+from haystack.components.builders.answer_builder import AnswerBuilder
+from haystack.components.builders.prompt_builder import PromptBuilder
+from haystack.components.generators import OpenAIGenerator
+from astra_haystack.retriever import AstraRetriever
+
+prompt_template = """
+                Given these documents, answer the question.
+                Documents:
+                {% for doc in documents %}
+                    {{ doc.content }}
+                {% endfor %}
+                Question: {{question}}
+                Answer:
+                """
+
+rag_pipeline = Pipeline()
+rag_pipeline.add_component(
+    instance=SentenceTransformersTextEmbedder(model=embedding_model_name),
+    name="embedder",
+)
+rag_pipeline.add_component(instance=AstraRetriever(document_store=document_store), name="retriever")
+rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
+rag_pipeline.add_component(instance=OpenAIGenerator(api_key=OPENAI_API_KEY), name="llm")
+rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
+rag_pipeline.connect("embedder", "retriever")
+rag_pipeline.connect("retriever", "prompt_builder.documents")
+rag_pipeline.connect("prompt_builder", "llm")
+rag_pipeline.connect("llm.replies", "answer_builder.replies")
+rag_pipeline.connect("llm.meta", "answer_builder.meta")
+rag_pipeline.connect("retriever", "answer_builder.documents")
+
+# Run the pipeline
+question = "How many languages are there in the world today?"
+result = rag_pipeline.run(
+    {
+        "embedder": {"text": question},
+        "retriever": {"top_k": 2},
+        "prompt_builder": {"question": question},
+        "answer_builder": {"query": question},
+    }
+)
+
+print(result)
+```
+The output should look like this:
+```text
+{'answer_builder': {'answers': [GeneratedAnswer(data='There are over 7,000 languages spoken around the world today.', query='How many languages are there in the world today?', documents=[Document(id=cfe93bc1c274908801e6670440bf2bbba54fad792770d57421f85ffa2a4fcc94, content: 'There are over 7,000 languages spoken around the world today.', score: 0.9267925, embedding: vector of size 384), Document(id=6f20658aeac3c102495b198401c1c0c2bd71d77b915820304d4fbc324b2f3cdb, content: 'Elephants have been observed to behave in a way that indicates a high level of self-awareness, such ...', score: 0.5357444, embedding: vector of size 384)], meta={'model': 'gpt-3.5-turbo-0613', 'index': 0, 'finish_reason': 'stop', 'usage': {'completion_tokens': 14, 'prompt_tokens': 83, 'total_tokens': 97}})]}}
+```
+
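The `score` values in the output come from comparing the query embedding with each stored document embedding, and `top_k` limits how many of the best matches are returned. The toy sketch below ranks three hand-made vectors by cosine similarity. It assumes cosine similarity as the metric and uses made-up 3-dimensional vectors, so it illustrates the idea rather than how Astra DB computes or indexes scores.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and keep the k highest scores."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for the three indexed documents.
candidates = [
    ("languages", [0.9, 0.1, 0.0]),
    ("elephants", [0.1, 0.9, 0.0]),
    ("bioluminescence", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.0, 0.0]
results = top_k(query, candidates, k=2)
print([name for name, _ in results])  # prints: ['languages', 'elephants']
```

With `k=2`, the least similar candidate is dropped, which mirrors how the RAG pipeline above passes only two documents to the prompt.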
+## Wrapping it up
+
+If you've gotten this far, now you know how to use Astra DB as a data source for your Haystack pipeline. To learn more about Haystack, [join us on Discord](https://discord.gg/QMP5jgMH) or [sign up for our monthly newsletter](https://landing.deepset.ai/haystack-community-updates?utm_campaign=developer-relations&utm_source=astradb-haystack-notebook).
diff --git a/content/blog/astradb-haystack-integration/thumbnail.png b/content/blog/astradb-haystack-integration/thumbnail.png