Chroma-haystack 0.20.1 is not compatible with haystack 2.3.0 #904

Closed
springrain opened this issue Jul 16, 2024 · 6 comments · Fixed by #907
Labels
bug Something isn't working integration:chroma P1

Comments

@springrain

Describe the bug
chroma-haystack 0.20.1 is not compatible with haystack 2.3.0.

 raise ValueError(
ValueError: Expected metadata value to be a str, int, float or bool, got [{'doc_id': 'eaf05ac742e3e047b8293cc82333a04f3ba771c19f7450ef876950ae11b1f75f', 'range': (0, 252)}] which is a list

Describe your environment (please complete the following information):

  • OS: Win11
  • Haystack version: 2.3.0
  • Integration version: chroma-haystack 0.20.1
@springrain added the bug label on Jul 16, 2024
@anakin87
Member

Hello!
I ran the example and could not reproduce the error.

Can you add a reproducible code example that triggers the error?

@springrain
Author

> Hello! I ran the example and could not reproduce the error.
>
> Can you add a reproducible code example that triggers the error?

import os
from haystack.components.writers import DocumentWriter
from haystack.components.converters import PyPDFToDocument, TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.routers import FileTypeRouter
from haystack.components.joiners import DocumentJoiner
from haystack import Pipeline
from pathlib import Path
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack_integrations.document_stores.chroma import ChromaDocumentStore
from haystack.document_stores.types import DuplicatePolicy

OPENAI_API_KEY = "sk-dummytoken1234567890abcdef"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY 

document_store = ChromaDocumentStore(persist_path="./chroma")

file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf"])
text_file_converter = TextFileToDocument()
pdf_converter = PyPDFToDocument()
document_joiner = DocumentJoiner()


document_cleaner = DocumentCleaner()
document_splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=30)

document_embedder = OpenAIDocumentEmbedder(model="bge-large-zh-v1.5", api_base_url="http://192.168.1.10:9998/v1")
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

index_pipeline = Pipeline()
index_pipeline.add_component(instance=file_type_router, name="file_type_router")
index_pipeline.add_component(instance=text_file_converter, name="text_file_converter")
index_pipeline.add_component(instance=pdf_converter, name="pypdf_converter")
index_pipeline.add_component(instance=document_joiner, name="document_joiner")
index_pipeline.add_component(instance=document_cleaner, name="document_cleaner")
index_pipeline.add_component(instance=document_splitter, name="document_splitter")
index_pipeline.add_component(instance=document_embedder, name="document_embedder")
index_pipeline.add_component(instance=document_writer, name="document_writer")

index_pipeline.connect("file_type_router.text/plain", "text_file_converter.sources")
index_pipeline.connect("file_type_router.application/pdf", "pypdf_converter.sources")
index_pipeline.connect("text_file_converter", "document_joiner")
index_pipeline.connect("pypdf_converter", "document_joiner")
index_pipeline.connect("document_joiner", "document_cleaner")
index_pipeline.connect("document_cleaner", "document_splitter")
index_pipeline.connect("document_splitter", "document_embedder")
index_pipeline.connect("document_embedder", "document_writer")

result = index_pipeline.run(data={"file_type_router": {"sources": list(Path("./testdata").glob("**/*"))}})
print(result)

@anakin87
Member

Ok, you are right. Thanks for reporting the bug!

Minimal reproducible example

from haystack.components.preprocessors import DocumentSplitter
from haystack import Document
from haystack_integrations.document_stores.chroma import ChromaDocumentStore


document_store = ChromaDocumentStore()

# One document long enough to be split into several overlapping chunks
documents = [Document(content="This is a test document to split " * 10)]

document_splitter = DocumentSplitter(split_by="word", split_length=5, split_overlap=2)

splitted_docs = document_splitter.run(documents=documents)["documents"]

# With split_overlap > 0, the splitter stores a list under "_split_overlap" in meta
print(splitted_docs[0].meta)
# {'source_id': '7c1703594787e30800683d64673880811611051d4444f08d8619f8fba6ab1480', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0,
# '_split_overlap': [{'doc_id': '187258e53ee90d2cf6f67f2e63e17d390c195b929b8721572e61c9e389a23e8d', 'range': (0, 13)}]}

# Fails: Chroma rejects the list-valued "_split_overlap" entry
document_store.write_documents(splitted_docs)

ValueError: Expected metadata value to be a str, int, float or bool, got [{'doc_id': '187258e53ee90d2cf6f67f2e63e17d390c195b929b8721572e61c9e389a23e8d', 'range': (0, 13)}] which is a list

Debugging and potential solutions

In deepset-ai/haystack#7933, we added the _split_overlap list to meta, but Chroma cannot handle lists in metadata.

One solution in ChromaDocumentStore might be to check the type of each entry in meta before writing to Chroma and to discard any entry whose value is not a supported type (str, int, float or bool).
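
To make that idea concrete, here is a minimal sketch of such a type check. It is not the change that was merged for this issue. The function name `filter_supported_meta` and the constant `SUPPORTED_META_TYPES` are hypothetical, and the list of accepted types is taken from the error message above rather than from Chroma's code.

```python
from typing import Any, Dict

# Assumption: the accepted value types are the ones named in the ValueError above.
SUPPORTED_META_TYPES = (str, int, float, bool)


def filter_supported_meta(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of meta without entries whose value type is unsupported.

    Hypothetical helper for illustration; not part of chroma-haystack.
    """
    kept = {}
    for key, value in meta.items():
        if isinstance(value, SUPPORTED_META_TYPES):
            kept[key] = value
        else:
            # Report what is dropped so the data loss is not silent.
            print(f"Discarding meta entry {key!r} of type {type(value).__name__}")
    return kept


# Shortened version of the meta printed in the minimal reproducible example.
meta = {
    "source_id": "abc",
    "page_number": 1,
    "split_id": 0,
    "_split_overlap": [{"doc_id": "def", "range": (0, 13)}],
}
cleaned = filter_supported_meta(meta)
print(cleaned)
```

Discarding the entry loses the overlap information for that document, so reporting each dropped key seems preferable to dropping it silently.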

@anakin87 added the P1 label on Jul 16, 2024
@springrain
Author

Okay, thanks. I'm currently using Haystack 2.2.4.

@anakin87
Member

I reopened the issue because this is a bug we should fix.

@anakin87
Member

@springrain the bug is fixed, and a new version of chroma-haystack has been released: https://pypi.org/project/chroma-haystack/0.21.1/
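
For anyone checking whether their environment already has the fix, a rough sketch follows. It assumes 0.21.1 is the first release that contains the fix, as the link above suggests. The helpers `parse_release` and `has_fix` are hypothetical names, and the parsing is deliberately simple (it ignores pre-release suffixes).

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: 0.21.1 is the first release with the fix (the one linked in this thread).
FIXED_IN = (0, 21, 1)


def parse_release(version_string: str) -> tuple:
    """Turn '0.21.1' into (0, 21, 1); non-numeric suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version_string.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def has_fix(version_string: str) -> bool:
    """True if the given version is at least FIXED_IN."""
    return parse_release(version_string) >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = version("chroma-haystack")
    except PackageNotFoundError:
        print("chroma-haystack is not installed")
    else:
        status = "includes the fix" if has_fix(installed) else "predates the fix"
        print(f"chroma-haystack {installed} {status}")
```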
