Elasticsearch Document store - embedding retrieval #52

Merged: 11 commits, Nov 16, 2023

@anakin87 (Member) commented Nov 14, 2023

Part of deepset-ai/haystack#5329

  • Add basic support for Embedding retrieval in Elasticsearch Document Store
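
As a rough illustration of what embedding retrieval involves on the Elasticsearch side, the sketch below builds an index mapping with a `dense_vector` field and the body of an approximate kNN search. The helper names (`embedding_mapping`, `knn_search_body`), the field name `embedding`, and the `num_candidates` heuristic are assumptions made for this example; they are not taken from the code in this PR.

```python
from typing import Any, Dict, List, Optional


def embedding_mapping(field: str = "embedding", similarity: str = "cosine") -> Dict[str, Any]:
    """Build a mapping for an indexed vector field.

    `dims` is deliberately omitted so that Elasticsearch can infer the
    vector size from the first indexed document.
    """
    return {
        "properties": {
            field: {"type": "dense_vector", "index": True, "similarity": similarity}
        }
    }


def knn_search_body(
    query_embedding: List[float],
    top_k: int = 10,
    num_candidates: Optional[int] = None,
    field: str = "embedding",
) -> Dict[str, Any]:
    """Build the `knn` section of an approximate kNN search request."""
    if not query_embedding:
        raise ValueError("query_embedding must be a non-empty list of floats")
    if num_candidates is None:
        # Assumed heuristic: inspect 10x more candidates per shard than results requested.
        num_candidates = top_k * 10
    return {
        "knn": {
            "field": field,
            "query_vector": query_embedding,
            "k": top_k,
            "num_candidates": num_candidates,
        }
    }
```

With the official Python client, these dictionaries would typically be passed along the lines of `client.indices.create(index="documents", mappings=embedding_mapping())` and `client.search(index="documents", **knn_search_body(query_embedding))`, where `documents` is a placeholder index name.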

Notes for the reviewer

  • ElasticSearchEmbeddingRetriever will be added in a subsequent PR.

  • I chose to support only approximate kNN, which is the recommended approach over exact, brute-force kNN.

  • This Document Store is compatible with Elasticsearch>=8.11
    Configuring vector fields in older versions required more manual effort on the part of the user (e.g., explicitly specifying the vector size at index creation).

  • I haven't implemented scaling of scores to the range [0, 1]. See issue #53, "Elasticsearch Document Store - investigate scaling scores for embedding retrieval".

  • I would also like to test other unhappy paths, such as writing documents with embeddings of different sizes.
    On this point, we should rework error handling during indexing (the write_documents method).

    Update: I encountered several erroneous DuplicateDocumentError exceptions while working on embedding retrieval.
    If we do not want to spend time reworking this part at the moment, I propose printing a warning that includes the errors obtained. It seems less misleading to me. WDYT?
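
The warning-based alternative proposed above could look roughly like the following sketch. It assumes the error items have the shape returned by `elasticsearch.helpers.bulk(..., raise_on_error=False)`, that is, one dictionary per failed action keyed by the operation type, and that duplicates surface as HTTP status 409. The function name `summarize_bulk_errors` and the logger name are hypothetical.

```python
import logging
from typing import Any, Dict, List, Tuple

logger = logging.getLogger("elasticsearch_document_store_sketch")


def summarize_bulk_errors(errors: List[Dict[str, Any]]) -> Tuple[List[str], List[str]]:
    """Separate duplicate-ID conflicts (status 409) from all other bulk failures.

    Returns (duplicate_ids, other_failures) and logs a warning for the latter,
    so that unrelated failures are not reported as duplicates.
    """
    duplicate_ids: List[str] = []
    other_failures: List[str] = []
    for item in errors:
        # Each item has a single key naming the operation ("index", "create", ...).
        for details in item.values():
            doc_id = str(details.get("_id"))
            if details.get("status") == 409:
                duplicate_ids.append(doc_id)
                continue
            error = details.get("error", "unknown error")
            reason = error.get("reason", error) if isinstance(error, dict) else error
            other_failures.append(f"{doc_id}: {reason}")
    if other_failures:
        logger.warning(
            "Failed to write %d document(s): %s",
            len(other_failures),
            "; ".join(other_failures),
        )
    return duplicate_ids, other_failures
```

Only the IDs in the first list would then be candidates for a DuplicateDocumentError; everything else is surfaced with the reason Elasticsearch reported.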

@anakin87 marked this pull request as ready for review November 14, 2023 17:16
@anakin87 requested a review from a team as a code owner November 14, 2023 17:16
@anakin87 requested review from masci and silvanocerza and removed request for a team November 14, 2023 17:16

with pytest.raises(
    BadRequestError,
    match="search_phase_execution_exception",
Contributor

Let's not match this error; instead, let's check that a custom error is raised when the received embedding has a different size.

Member Author

I tried implementing custom error-handling logic, but the code became too messy.
Also, the Elasticsearch exception is quite informative:
elasticsearch.BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'failed to create query: the query vector has a different dimension [2] than the index vectors [4]')
So, I would rather not handle this error explicitly.
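
For reference, the custom check suggested in the review could be sketched as a client-side validation performed before the query is sent. The exception name `EmbeddingDimensionError` and the helper `check_embedding_dimension` are hypothetical, and the sketch assumes the caller already knows the dimension of the indexed vectors, which may not hold in practice.

```python
from typing import Sequence


class EmbeddingDimensionError(ValueError):
    """Raised when a query embedding does not match the index vector size."""


def check_embedding_dimension(query_embedding: Sequence[float], index_dims: int) -> None:
    """Raise EmbeddingDimensionError if the query vector size differs from `index_dims`."""
    actual = len(query_embedding)
    if actual != index_dims:
        raise EmbeddingDimensionError(
            f"The query embedding has dimension {actual}, "
            f"but the index vectors have dimension {index_dims}."
        )
```

A test could then use `pytest.raises(EmbeddingDimensionError)` rather than matching on the Elasticsearch error string. The cost is that the index dimension has to be obtained and kept in sync somewhere, which is the kind of extra logic the reply above describes as messy.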

Contributor

@silvanocerza silvanocerza left a comment

Nice 👍

@anakin87 merged commit 7d2b824 into main Nov 16, 2023
4 checks passed
@anakin87 deleted the elasticsearch-embedding-retrieval branch November 16, 2023 17:05