-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Elasticsearch Document store - embedding retrieval #52
Conversation
document_stores/elasticsearch/src/elasticsearch_haystack/document_store.py
Outdated
Show resolved
Hide resolved
document_stores/elasticsearch/src/elasticsearch_haystack/document_store.py
Outdated
Show resolved
Hide resolved
|
||
with pytest.raises( | ||
BadRequestError, | ||
match="search_phase_execution_exception", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not match this error but let's check that a custom error is raised when the received embedding have a different size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried implementing a custom error handling logic, but the code became too messy.
Also, the Elasticsearch exception is quite informative:
elasticsearch.BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'failed to create query: the query vector has a different dimension [2] than the index vectors [4]')
So, I would rather not handle this error explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice 👍
Part of deepset-ai/haystack#5329
Notes for the reviewer
ElasticSearchEmbeddingRetriever
will be added in a subsequent PR.I chose to support only Approximate kNN, which is the suggested approach compared to Exact, brute-force kNN.
This Document Store is compatible with Elasticsearch>=8.11
Configuring vector fields in older versions required more manual effort on the part of the user (e.g., explicitly specifying the vector size at index creation).
I haven't implemented scaling scores in the range [0, 1]. See Elasticsearch Document Store - investigate scaling scores for embedding retrieval #53
I would like to test also other unhappy paths, such as trying to write documents with embeddings of different sizes.
Concerning this point, we should rework error handling during indexing (
write_documents
method).Update: I encountered several erroneous
DuplicateDocumentError
while working on embedding retrieval.If we do not want to spend time reworking this part at the moment, I would propose printing a warning with the obtained errors. It seems less misleading to me. WDYT?