Validation error in creating qdrant collection #665

Closed
NILICK opened this issue Apr 15, 2024 · 2 comments

NILICK commented Apr 15, 2024

I want to embed multiple PDF files into a Qdrant vector database using the code below:

from pathlib import Path
from haystack import Pipeline
from haystack.components.routers import FileTypeRouter
from haystack.components.writers import DocumentWriter
from haystack.components.joiners import DocumentJoiner
from haystack.components.builders import PromptBuilder
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
from haystack_integrations.components.generators.ollama import OllamaGenerator
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever
from haystack.components.converters import MarkdownToDocument, PyPDFToDocument, TextFileToDocument
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder

document_store = QdrantDocumentStore(host="localhost",
                                     index="document",
                                     embedding_dim=768,
                                     recreate_index=True,
                                     timeout=500,
                                     hnsw_config={"m": 16, "ef_construct": 64}
                                    )

file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/markdown"])
text_file_converter = TextFileToDocument()
markdown_converter = MarkdownToDocument()
pdf_converter = PyPDFToDocument()
cleaner = DocumentCleaner()
document_joiner = DocumentJoiner()
document_embedder = SentenceTransformersDocumentEmbedder()
splitter = DocumentSplitter(split_by="sentence", split_length=10, split_overlap=0)
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=file_type_router, name="file_type_router")
indexing_pipeline.add_component(instance=pdf_converter, name="pdf_file_converter")
indexing_pipeline.add_component(instance=document_joiner, name="document_joiner")
indexing_pipeline.add_component(instance=cleaner, name="document_cleaner")
indexing_pipeline.add_component(instance=splitter, name="document_splitter")
indexing_pipeline.add_component(instance=document_embedder, name="document_embedder")
indexing_pipeline.add_component(instance=writer, name="document_writer")

indexing_pipeline.connect("file_type_router.application/pdf", "pdf_file_converter.sources")
indexing_pipeline.connect("pdf_file_converter", "document_joiner")
indexing_pipeline.connect("document_joiner", "document_cleaner")
indexing_pipeline.connect("document_cleaner", "document_splitter")
indexing_pipeline.connect("document_splitter", "document_embedder")
indexing_pipeline.connect("document_embedder", "document_writer")

indexing_pipeline.run({"file_type_router": {"sources": ["./data/1.pdf"]}})

But it returns the error below:


ValidationError                           Traceback (most recent call last)
File ~/micromamba/envs/ollama/lib/python3.11/site-packages/qdrant_client/http/api_client.py:94, in ApiClient.send(self, request, type_)
     93 try:
---> 94     return parse_as_type(response.json(), type_)
     95 except ValidationError as e:

File ~/micromamba/envs/ollama/lib/python3.11/site-packages/qdrant_client/http/api_client.py:213, in parse_as_type(obj, type_)
    212 model_type = _get_parsing_type(type_, source=parse_as_type.__name__)
--> 213 return model_type(obj=obj).obj

File ~/.local/lib/python3.11/site-packages/pydantic/main.py:164, in BaseModel.__init__(__pydantic_self__, **data)
    163 __tracebackhide__ = True
--> 164 __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)

ValidationError: 1 validation error for ParsingModel[InlineResponse2005] (for parse_as_type)
obj.result.config.optimizer_config.max_optimization_threads
  Input should be a valid integer [type=int_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.5/v/int_type

During handling of the above exception, another exception occurred:

ResponseHandlingException                 Traceback (most recent call last)
File <timed eval>:2

File ~/micromamba/envs/ollama/lib/python3.11/site-packages/haystack/pipeline.py:85, in Pipeline.run(self, data, debug)
     83 is_nested_component_input = all(isinstance(value, dict) for value in data.values())
     84 if is_nested_component_input:
---> 85     return self._run_internal(data=data, debug=debug)
     86 else:
     87     # flat input, a dict where keys are input names and values are the corresponding values
     88     # we need to convert it to a nested dictionary of component inputs and then run the pipeline
     89     # just like in the previous case
     90     pipeline_inputs, unresolved_inputs = self._prepare_component_input_data(data)

File ~/micromamba/envs/ollama/lib/python3.11/site-packages/haystack/pipeline.py:111, in Pipeline._run_internal(self, data, debug)
    100 """
    101 Runs the pipeline by invoking the underlying run to initiate the pipeline execution.
    102 
   (...)
    108 :raises PipelineRuntimeError: if any of the components fail or return unexpected output.
    109 """
    110 pipeline_running(self)
--> 111 return super().run(data=data, debug=debug)

File ~/micromamba/envs/ollama/lib/python3.11/site-packages/haystack/core/pipeline/pipeline.py:601, in Pipeline.run(self, data, debug)
    597         continue
    599 if name in last_inputs and len(comp.__haystack_input__._sockets_dict) == len(last_inputs[name]):  # type: ignore
    600     # This component has all the inputs it needs to run
--> 601     res = comp.run(**last_inputs[name])
    603     if not isinstance(res, Mapping):
    604         raise PipelineRuntimeError(
    605             f"Component '{name}' didn't return a dictionary. "
    606             "Components must always return dictionaries: check the the documentation."
    607         )

File ~/micromamba/envs/ollama/lib/python3.11/site-packages/haystack/components/writers/document_writer.py:84, in DocumentWriter.run(self, documents, policy)
     81 if policy is None:
     82     policy = self.policy
---> 84 documents_written = self.document_store.write_documents(documents=documents, policy=policy)
     85 return {"documents_written": documents_written}

File ~/micromamba/envs/ollama/lib/python3.11/site-packages/haystack_integrations/document_stores/qdrant/document_store.py:191, in QdrantDocumentStore.write_documents(self, documents, policy)
    189         msg = f"DocumentStore.write_documents() expects a list of Documents but got an element of {type(doc)}."
    190         raise ValueError(msg)
--> 191 self._set_up_collection(self.index, self.embedding_dim, False, self.similarity)
    193 if len(documents) == 0:
    194     logger.warning("Calling QdrantDocumentStore.write_documents() with empty list")

File ~/micromamba/envs/ollama/lib/python3.11/site-packages/haystack_integrations/document_stores/qdrant/document_store.py:351, in QdrantDocumentStore._set_up_collection(self, collection_name, embedding_dim, recreate_collection, similarity)
    346     return
    348 try:
    349     # Check if the collection already exists and validate its
    350     # current configuration with the parameters.
--> 351     collection_info = self.client.get_collection(collection_name)
    352 except (UnexpectedResponse, RpcError, ValueError):
    353     # That indicates the collection does not exist, so it can be
    354     # safely created with any configuration.
   (...)
    357     # with the remote server UnexpectedResponse / RpcError is raised.
    358     # Until that's unified, we need to catch both.
    359     self._recreate_collection(collection_name, distance, embedding_dim)

File ~/micromamba/envs/ollama/lib/python3.11/site-packages/qdrant_client/qdrant_client.py:1530, in QdrantClient.get_collection(self, collection_name, **kwargs)
   1520 """Get detailed information about specified existing collection
   1521 
   1522 Args:
   (...)
   1526     Detailed information about the collection
   1527 """
   1528 assert len(kwargs) == 0, f"Unknown arguments: {list(kwargs.keys())}"
-> 1530 return self._client.get_collection(collection_name=collection_name, **kwargs)

File ~/micromamba/envs/ollama/lib/python3.11/site-packages/qdrant_client/qdrant_remote.py:1963, in QdrantRemote.get_collection(self, collection_name, **kwargs)
   1956 if self._prefer_grpc:
   1957     return GrpcToRest.convert_collection_info(
   1958         self.grpc_collections.Get(
   1959             grpc.GetCollectionInfoRequest(collection_name=collection_name),
   1960             timeout=self._timeout,
   1961         ).result
   1962     )
-> 1963 result: Optional[types.CollectionInfo] = self.http.collections_api.get_collection(
   1964     collection_name=collection_name
   1965 ).result
   1966 assert result is not None, "Get collection returned None"
   1967 return result

File ~/micromamba/envs/ollama/lib/python3.11/site-packages/qdrant_client/http/api/collections_api.py:1262, in SyncCollectionsApi.get_collection(self, collection_name)
   1255 def get_collection(
   1256     self,
   1257     collection_name: str,
   1258 ) -> m.InlineResponse2005:
   1259     """
   1260     Get detailed information about specified existing collection
   1261     """
-> 1262     return self._build_for_get_collection(
   1263         collection_name=collection_name,
   1264     )

File ~/micromamba/envs/ollama/lib/python3.11/site-packages/qdrant_client/http/api/collections_api.py:377, in _CollectionsApi._build_for_get_collection(self, collection_name)
    372 path_params = {
    373     "collection_name": str(collection_name),
    374 }
    376 headers = {}
--> 377 return self.api_client.request(
    378     type_=m.InlineResponse2005,
    379     method="GET",
    380     url="/collections/{collection_name}",
    381     headers=headers if headers else None,
    382     path_params=path_params,
    383 )

File ~/micromamba/envs/ollama/lib/python3.11/site-packages/qdrant_client/http/api_client.py:74, in ApiClient.request(self, type_, method, url, path_params, **kwargs)
     72 url = (self.host or "") + url.format(**path_params)
     73 request = self._client.build_request(method, url, **kwargs)
---> 74 return self.send(request, type_)

File ~/micromamba/envs/ollama/lib/python3.11/site-packages/qdrant_client/http/api_client.py:96, in ApiClient.send(self, request, type_)
     94         return parse_as_type(response.json(), type_)
     95     except ValidationError as e:
---> 96         raise ResponseHandlingException(e)
     97 raise UnexpectedResponse.for_response(response)

ResponseHandlingException: 1 validation error for ParsingModel[InlineResponse2005] (for parse_as_type)
obj.result.config.optimizer_config.max_optimization_threads
  Input should be a valid integer [type=int_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.5/v/int_type

What does this error mean, and how can I fix the code?
I run the Qdrant server with Docker as follows:

docker run -p 6333:6333 -v /mnt/Qdrant_Docker_Collections:/qdrant/storage qdrant/qdrant
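
For context, the final message in the traceback says that the server's response carried a null value at `result.config.optimizer_config.max_optimization_threads`, while the client's response model required an integer there. The sketch below is a standard-library stand-in for that kind of check, not the actual code of qdrant-client or pydantic; the function name `validate_int_field` and the sample payload are hypothetical and only mirror the field path shown in the error.

```python
from typing import Any, Mapping, Optional, Sequence


def validate_int_field(
    payload: Mapping[str, Any],
    path: Sequence[str],
    allow_none: bool = False,
) -> Optional[str]:
    """Return an error message if the value at `path` is not an int, else None."""
    value: Any = payload
    for key in path:
        value = value[key]
    if value is None and allow_none:
        return None
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return None
    dotted = ".".join(path)
    return f"{dotted}: expected int, got {type(value).__name__}"


# Hypothetical response fragment shaped like the path named in the traceback.
RESPONSE = {"result": {"config": {"optimizer_config": {"max_optimization_threads": None}}}}
PATH = ["result", "config", "optimizer_config", "max_optimization_threads"]

# A schema that tolerates null accepts the payload.
print(validate_int_field(RESPONSE, PATH, allow_none=True))

# A schema that insists on an integer rejects it, as the client did here.
print(validate_int_field(RESPONSE, PATH))
```

A disagreement between client and server about whether this field may be null is one plausible cause of the error; the thread itself does not state which versions were involved.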

anakin87 transferred this issue from deepset-ai/haystack on Apr 16, 2024

anakin87 (Member) commented

Hey, @NILICK...

I tried to reproduce the bug with the following setup:

  • started Qdrant with docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant
  • haystack-ai==2.0.1
  • qdrant-haystack==3.3.1
  • qdrant-client==1.8.2

The pipeline runs well and the Documents are correctly written to the DB along with their vector representations.

I would suggest you create a fresh environment, install the latest version of qdrant-haystack and retry.
In case you encounter the same bug, please report all the installed packages (you can get the list with the command pip freeze).
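
As a lighter alternative to a full `pip freeze`, the versions of just the packages named in this thread can be read from inside the same Python environment that runs the pipeline. This is a minimal sketch using only the standard library; the package list is taken from the thread, with `pydantic` added because it appears in the traceback, and the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names mentioned in the thread, plus pydantic from the traceback.
PACKAGES = ["haystack-ai", "qdrant-haystack", "qdrant-client", "pydantic"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Running this in the notebook kernel rather than in a separate shell helps when several environments are present, since the traceback shows packages loaded from both a micromamba environment and `~/.local`.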

NILICK commented Apr 18, 2024

Hey @anakin87
Thanks for your answer. You're right: I uninstalled qdrant-client, reinstalled qdrant-client 1.8.2, and now it works very well.
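
Since reinstalling the client resolved the problem, a quick comparison of client and server versions may help catch a similar mismatch earlier. The sketch below rests on several assumptions: that the Qdrant server's root endpoint returns JSON with a `version` field, that the server listens on `http://localhost:6333` as in the `docker run` command above, and that "same major version, minor versions at most one apart" is a reasonable compatibility rule. All function names are hypothetical.

```python
import json
import re
from urllib.request import urlopen


def major_minor(version_string):
    """Extract (major, minor) from a version such as '1.8.2' or 'v1.9.0'."""
    match = re.match(r"v?(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def looks_compatible(client_version, server_version):
    """Assumed rule: same major version, minor versions at most one apart."""
    c_major, c_minor = major_minor(client_version)
    s_major, s_minor = major_minor(server_version)
    return c_major == s_major and abs(c_minor - s_minor) <= 1


def fetch_server_version(base_url="http://localhost:6333"):
    """Read the 'version' field from the server's root endpoint (assumed shape)."""
    with urlopen(base_url, timeout=5) as response:
        return json.load(response)["version"]
```

With a server running, `looks_compatible(importlib.metadata.version("qdrant-client"), fetch_server_version())` gives a rough yes/no answer. Note that the untagged `qdrant/qdrant` image resolves to the `latest` tag, so the server version depends on when the image was pulled, while the client version is whatever was installed in the Python environment.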

NILICK closed this as completed on Apr 18, 2024