Elasticsearch Document store - embedding retrieval #52

anakin87 · 2023-11-14T15:33:33Z

Part of deepset-ai/haystack#5329

Add basic support for Embedding retrieval in Elasticsearch Document Store

Notes for the reviewer

ElasticSearchEmbeddingRetriever will be added in a subsequent PR.
I chose to support only approximate kNN, which is the recommended approach compared to exact, brute-force kNN.
This Document Store is compatible with Elasticsearch >= 8.11.
Configuring vector fields in older versions required more manual effort on the part of the user (e.g., explicitly specifying the vector size at index creation).
I haven't implemented scaling of scores to the range [0, 1]. See "Elasticsearch Document Store - investigate scaling scores for embedding retrieval" (#53).
I would also like to test other unhappy paths, such as trying to write documents with embeddings of different sizes.
On this point, we should rework error handling during indexing (the write_documents method).
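
To make the approximate kNN note concrete, here is a minimal sketch of the request body such a search might send. The helper name `build_knn_query`, the field name `embedding`, and the `num_candidates` heuristic are assumptions for illustration and are not taken from this PR.

```python
from typing import Any, Dict, List, Optional


def build_knn_query(
    query_embedding: List[float],
    top_k: int = 10,
    num_candidates: Optional[int] = None,
    field: str = "embedding",  # assumed name of the dense_vector field
) -> Dict[str, Any]:
    """Build the body of an approximate kNN search (hypothetical helper)."""
    if not query_embedding:
        raise ValueError("query_embedding must be a non-empty list of floats")
    if num_candidates is None:
        # Assumed heuristic: consider more candidates per shard than results returned.
        num_candidates = top_k * 10
    return {
        "knn": {
            "field": field,
            "query_vector": query_embedding,
            "k": top_k,
            "num_candidates": num_candidates,
        },
        "size": top_k,
    }
```

With the official Python client for Elasticsearch 8.x, a body like this could presumably be passed as `client.search(index=..., **build_knn_query(...))`; the actual Document Store method may be structured differently.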

Update: I ran into several spurious DuplicateDocumentError exceptions while working on embedding retrieval.
If we do not want to spend time reworking this part at the moment, I would propose printing a warning with the errors returned by Elasticsearch instead. It seems less misleading to me. WDYT?
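
A rough sketch of the warning-based approach proposed above. It assumes the error items have the shape returned by `elasticsearch.helpers.bulk(..., raise_on_error=False)`, that is, one dict per failed action keyed by the action name; `split_bulk_errors` is a hypothetical helper, not code from this PR.

```python
import logging
from typing import Any, Dict, List, Tuple

logger = logging.getLogger(__name__)


def split_bulk_errors(
    errors: List[Dict[str, Any]],
) -> Tuple[List[Any], List[Dict[str, Any]]]:
    """Separate genuine ID conflicts (HTTP 409) from every other bulk failure."""
    duplicate_ids: List[Any] = []
    other_errors: List[Dict[str, Any]] = []
    for item in errors:
        # Assumed shape: {"create": {"_id": ..., "status": ..., "error": {...}}}
        for result in item.values():
            if result.get("status") == 409:
                duplicate_ids.append(result.get("_id", "<unknown>"))
            else:
                other_errors.append(result)
    # Report non-conflict failures as warnings instead of as duplicates.
    for result in other_errors:
        logger.warning(
            "Failed to write document %s: %s", result.get("_id"), result.get("error")
        )
    return duplicate_ids, other_errors
```

Only the IDs in the first list would then justify a DuplicateDocumentError; everything else (for example a mapping error caused by an embedding of the wrong size) is surfaced as a warning.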

document_stores/elasticsearch/src/elasticsearch_haystack/document_store.py

document_stores/elasticsearch/tests/test_document_store.py

silvanocerza · 2023-11-16T11:12:00Z

document_stores/elasticsearch/tests/test_document_store.py

+
+        with pytest.raises(
+            BadRequestError,
+            match="search_phase_execution_exception",


Let's not match this error; instead, let's check that a custom error is raised when the received embedding has a different size.

I tried implementing a custom error handling logic, but the code became too messy.
Also, the Elasticsearch exception is quite informative:
elasticsearch.BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'failed to create query: the query vector has a different dimension [2] than the index vectors [4]')
So, I would rather not handle this error explicitly.

silvanocerza

Nice 👍
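
For reference, the explicit check discussed in this thread could look roughly like the sketch below. This is not what the PR does: the PR lets the Elasticsearch BadRequestError propagate. `EmbeddingDimensionError` and `check_embedding_dimension` are hypothetical names, and the sketch assumes the caller already knows the dimension of the indexed vectors (in practice this would likely mean reading it from the index mapping).

```python
from typing import List


class EmbeddingDimensionError(ValueError):
    """Hypothetical custom error; not part of this PR."""


def check_embedding_dimension(query_embedding: List[float], index_dims: int) -> None:
    """Raise if the query vector size differs from the indexed vector size."""
    if len(query_embedding) != index_dims:
        msg = (
            f"The query vector has dimension {len(query_embedding)}, "
            f"but the index vectors have dimension {index_dims}."
        )
        raise EmbeddingDimensionError(msg)
```

The extra lookup needed to obtain `index_dims` is part of why handling this client-side adds complexity compared with relying on the server-side error.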

anakin87 added 9 commits November 13, 2023 12:05

set scale_score default to False

b357045

unrelated: replace text w content

8b47688

first implementation

93f377c

test

99741d6

Merge branch 'main' into elasticsearch-embedding-retrieval

8fee9f0

fix some tests

b3d641e

make tests more robust; skip unsupported ones

66782ff

rm unsupported test

1a6eed6

ignore import-not-found

e23036b

anakin87 marked this pull request as ready for review November 14, 2023 17:16

anakin87 requested a review from a team as a code owner November 14, 2023 17:16

anakin87 requested review from masci and silvanocerza and removed request for a team November 14, 2023 17:16

anakin87 mentioned this pull request Nov 15, 2023

Elasticsearch Embedding Retriever #54

Merged

silvanocerza reviewed Nov 16, 2023

document_stores/elasticsearch/src/elasticsearch_haystack/document_store.py (outdated, resolved)

silvanocerza reviewed Nov 16, 2023

document_stores/elasticsearch/src/elasticsearch_haystack/document_store.py (outdated, resolved)

silvanocerza reviewed Nov 16, 2023

document_stores/elasticsearch/tests/test_document_store.py (resolved)

silvanocerza reviewed Nov 16, 2023

first chunk addressing PR feedback

701bc80

github-actions bot added the integration:elasticsearch label Nov 16, 2023

improve tests

a5d0e01

silvanocerza approved these changes Nov 16, 2023


anakin87 merged commit 7d2b824 into main Nov 16, 2023
4 checks passed

anakin87 deleted the elasticsearch-embedding-retrieval branch November 16, 2023 17:05
