Filter ElasticSearch results by min_score #759

t-charura · 2022-03-23T12:17:53Z

Problem:
I want to retrieve all relevant (similar) documents from the ElasticsearchDocumentStore based on the _score, using the EmbeddingRetriever (I am not using the Reader). I don't know in advance how many relevant documents exist. To make sure that I retrieve all of them, I need to set top_k=10000 or higher and filter the results afterwards, keeping only documents with a _score higher than x. Retrieving this many documents takes several seconds.

Solution
Filtering query results by a minimum score value is already implemented in the Python Elasticsearch client. You could add another parameter (min_score), similar to top_k, and add it to the body that you pass to client.search(). See my example:

body = { "size": top_k, "min_score": min_score, "query": self._get_vector_similarity_query(query_emb, top_k) }

I changed the body in the function def query_by_embedding(...) in the file haystack/document_stores/elasticsearch.py. Now the results contain only documents that have a _score higher than min_score.
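As a minimal sketch of the change described above, the request body can be assembled in one place. The helper name build_search_body and the placeholder match_all query are assumptions for illustration only; in the document store the query would come from self._get_vector_similarity_query(query_emb, top_k).

```python
from typing import Any, Dict


def build_search_body(
    similarity_query: Dict[str, Any],
    top_k: int,
    min_score: float = 0.0,
) -> Dict[str, Any]:
    """Assemble a search body that limits hits by count (size) and by score (min_score).

    build_search_body is a hypothetical helper, not part of Haystack.
    """
    return {
        "size": top_k,
        "min_score": min_score,
        "query": similarity_query,
    }


# Placeholder query for illustration; the real one is the vector similarity query.
body = build_search_body({"match_all": {}}, top_k=10, min_score=1.5)
```

The resulting dictionary is what would be passed as the body to client.search().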

Additional context
If the user wants to filter the results by cosine similarity, the min_score parameter needs to be scaled appropriately before it is used in the body.
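One way the scaling could look, assuming the similarity script adds a constant offset to the cosine value so that scores stay non-negative (the Elasticsearch documentation shows cosineSimilarity(...) + 1.0 as an example). The function name and the default offset are assumptions; the offset must match whatever the actual script in the document store adds.

```python
def cosine_to_es_min_score(min_cosine: float, offset: float = 1.0) -> float:
    """Translate a cosine similarity threshold into the raw _score scale.

    Hypothetical helper: `offset` is assumed to be the constant that the
    script_score query adds to the cosine similarity. Check the script
    actually in use before relying on the default.
    """
    if not -1.0 <= min_cosine <= 1.0:
        raise ValueError("cosine similarity must lie in [-1, 1]")
    return min_cosine + offset


# A cosine threshold of 0.5 with an offset of 1.0 becomes a raw score of 1.5.
raw_threshold = cosine_to_es_min_score(0.5)
```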


TuanaCelik · 2022-03-24T11:19:25Z

Hi @Schokomensch - do I understand correctly that you have already made these changes? Would you like to create a PR and we can have a look?

I can imagine this being a useful optional addition. If you provide the min_score parameter, it is used; if not, the search falls back to top_k alone. Does that make sense?

t-charura · 2022-03-26T15:39:39Z

Hi @TuanaCelik, so far I have implemented these changes by creating my own custom class that overrides some methods of the ElasticsearchDocumentStore and EmbeddingRetriever. I will create a proper PR at the beginning of next week.

Within Elasticsearch, the min_score filter is applied only after the top_k (size) filter has already reduced the results.
Therefore, I would suggest that whenever the user provides min_score without setting the top_k parameter, I set top_k=10000, which is the maximum number of results Elasticsearch returns for a single search by default (if you want to set top_k>10000, you would need to paginate your search results). The default value for min_score would be 0, since Elasticsearch does not allow None or False values within the (request) body.
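The proposed defaults could be resolved roughly as follows. This is only a sketch of the behaviour described in this comment: resolve_search_limits and default_top_k=10 are hypothetical, and 10000 is assumed to be the index's result window limit.

```python
from typing import Optional, Tuple

# Assumed limit on results per search request (Elasticsearch's default window).
ES_MAX_RESULT_WINDOW = 10_000


def resolve_search_limits(
    top_k: Optional[int] = None,
    min_score: Optional[float] = None,
    default_top_k: int = 10,  # hypothetical fallback when neither value is given
) -> Tuple[int, float]:
    """Pick the effective top_k and min_score for the request body."""
    if top_k is None:
        # With a score threshold but no explicit top_k, ask for as many hits
        # as one request allows so the threshold decides what is returned.
        top_k = ES_MAX_RESULT_WINDOW if min_score is not None else default_top_k
    if top_k > ES_MAX_RESULT_WINDOW:
        raise ValueError("top_k above 10000 requires paginating the search results")
    if min_score is None:
        # The request body cannot carry None, so fall back to 0.
        min_score = 0.0
    return top_k, min_score


limits = resolve_search_limits(min_score=0.7)
```

Passing an explicit top_k together with min_score keeps both values, so users can still cap the number of hits themselves.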

TuanaCelik · 2022-03-29T12:56:37Z

@Schokomensch This sounds good to me. Once you have the PR we can have a proper look at your implementation too. When you're ready, link the PR to this Issue so that we have a nice timeline of the discussions. Looking forward to it 👍🏾

tstadel · 2022-04-13T11:29:42Z

@Schokomensch feel free to request a review from me and @TuanaCelik on your PR.

liorshk · 2023-03-06T14:14:42Z

+1

TuanaCelik self-assigned this Mar 24, 2022

t-charura mentioned this issue Apr 4, 2022

Add min_score parameter for the EmbeddingRetriever in combination with Elasticsearch deepset-ai/haystack#2389

Closed

ZanSara added the contributions wanted! Looking for external contributions label Jul 19, 2022

masci removed the contributions wanted! Looking for external contributions label Dec 13, 2023

masci transferred this issue from deepset-ai/haystack May 25, 2024

masci added integration:elasticsearch feature request Ideas to improve an integration contributions wanted! Looking for external contributions labels May 25, 2024

github-project-automation bot added this to Haystack - Contributions wanted May 31, 2024

anakin87 unassigned TuanaCelik Nov 15, 2024
