Filter ElasticSearch results by min_score #759

Open
t-charura opened this issue Mar 23, 2022 · 5 comments
Labels
contributions wanted! Looking for external contributions feature request Ideas to improve an integration integration:elasticsearch

Comments

@t-charura

Problem:
I want to retrieve all relevant (similar) documents from the ElasticsearchDocumentStore based on the _score, using the EmbeddingRetriever (I am not using the Reader). Before the search, I don't know how many relevant documents exist. To make sure that I retrieve all relevant entries from the ElasticsearchDocumentStore, I need to set top_k=10000 or higher and filter the results afterwards, keeping only documents with a _score higher than x. Retrieving this many documents takes several seconds.

Solution
Filtering query results by a minimum score value is already supported by the Python Elasticsearch client. You could add another parameter (min_score), similar to top_k, and add it to the body that you pass to client.search(). See my example:

body = { "size": top_k, "min_score": min_score, "query": self._get_vector_similarity_query(query_emb, top_k) }

I changed the body in the function query_by_embedding(...) in the file haystack/document_stores/elasticsearch.py. Now the results contain only documents that have a _score of at least min_score.
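
A minimal sketch of how such a body could be assembled, shown outside the document store for clarity. The helper name build_search_body and its query_clause argument are hypothetical; query_clause stands in for whatever self._get_vector_similarity_query(query_emb, top_k) returns. This is not Haystack's actual implementation.

```python
from typing import Any, Dict, Optional


def build_search_body(
    query_clause: Dict[str, Any],
    top_k: int,
    min_score: Optional[float] = None,
) -> Dict[str, Any]:
    """Build a search request body, adding min_score only when it is set.

    query_clause is a placeholder for the vector similarity query
    (hypothetical argument, introduced for illustration).
    """
    body: Dict[str, Any] = {"size": top_k, "query": query_clause}
    if min_score is not None:
        # Leave the key out entirely instead of sending None in the request.
        body["min_score"] = min_score
    return body


if __name__ == "__main__":
    # Placeholder query; a real call would pass the script_score query.
    print(build_search_body({"match_all": {}}, top_k=10, min_score=1.5))
```

Omitting the key when no threshold is given keeps the request identical to the current behavior for callers who do not use the new parameter.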

Additional context
If the user wants to filter the results by cosine similarity, the min_score value needs to be scaled appropriately before it is used in the body.
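
One way this scaling could look, as a sketch under a stated assumption: the similarity query shifts the raw cosine similarity by a constant so that the Elasticsearch _score is never negative, i.e. _score = cosine + score_offset. The function name scale_min_score and the score_offset parameter are hypothetical; the real offset (or any other transformation) has to be read from the script used in the similarity query.

```python
def scale_min_score(cosine_threshold: float, score_offset: float) -> float:
    """Translate a cosine-similarity threshold into an Elasticsearch min_score.

    Assumes _score = cosine_similarity + score_offset. The offset is an
    assumption and must match the script in the actual similarity query.
    """
    if not -1.0 <= cosine_threshold <= 1.0:
        raise ValueError("cosine_threshold must lie in [-1, 1]")
    return cosine_threshold + score_offset


if __name__ == "__main__":
    # Example with an illustrative offset of 1000.
    print(scale_min_score(0.8, score_offset=1000.0))
```

If the query applies a different transformation (for example a rescaling instead of a shift), the same inverse mapping idea applies, but the formula changes accordingly.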

@TuanaCelik
Contributor

Hi @Schokomensch - do I understand correctly that you have already made these changes? Would you like to create a PR and we can have a look?

I can imagine this being a useful optional addition: if you provide the min_score parameter, it is used; if not, the search falls back to top_k alone. Does that make sense?

@TuanaCelik TuanaCelik self-assigned this Mar 24, 2022
@t-charura
Author

Hi @TuanaCelik, so far I have implemented these changes by creating my own custom class that overrides some methods of the ElasticsearchDocumentStore and EmbeddingRetriever. I will create a proper PR at the beginning of next week.

Within Elasticsearch, the min_score filter is applied only after the top_k (size) filter has already reduced the results.
Therefore, I suggest that whenever the user provides min_score without setting top_k, top_k is set to 10000, which is the maximum number of search results that Elasticsearch allows by default (for top_k>10000 you would need to paginate your search results). The default value for min_score would be 0, since Elasticsearch does not allow None or False values within the request body.
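
The defaulting rule described above could be sketched roughly as follows. The function name resolve_search_params and the DEFAULT_TOP_K value are assumptions for illustration only; 10000 is the default value of Elasticsearch's index.max_result_window setting.

```python
from typing import Optional, Tuple

ES_MAX_RESULT_WINDOW = 10_000  # default of index.max_result_window
DEFAULT_TOP_K = 10  # assumed retriever default, for illustration only


def resolve_search_params(
    top_k: Optional[int] = None,
    min_score: Optional[float] = None,
) -> Tuple[int, float]:
    """Return the (size, min_score) pair to put into the request body."""
    if top_k is None:
        # With a score threshold but no explicit top_k, ask for as many
        # hits as a single non-paginated search can return.
        top_k = ES_MAX_RESULT_WINDOW if min_score is not None else DEFAULT_TOP_K
    if top_k > ES_MAX_RESULT_WINDOW:
        raise ValueError("top_k above 10000 requires paginating the results")
    # Fall back to 0 so the body never contains None.
    return top_k, 0.0 if min_score is None else min_score


if __name__ == "__main__":
    print(resolve_search_params(min_score=1.2))
    print(resolve_search_params(top_k=50))
```

An explicit top_k always wins, so callers who already pass it keep the behavior they have today.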

@TuanaCelik
Contributor

@Schokomensch This sounds good to me. Once you have the PR we can have a proper look at your implementation too. When you're ready, link the PR to this Issue so that we have a nice timeline of the discussions. Looking forward to it 👍🏾

@tstadel
Member

tstadel commented Apr 13, 2022

@Schokomensch feel free to request a review from me and @TuanaCelik on your PR.

@ZanSara ZanSara added the contributions wanted! Looking for external contributions label Jul 19, 2022
@liorshk

liorshk commented Mar 6, 2023

+1

@masci masci removed the contributions wanted! Looking for external contributions label Dec 13, 2023
@masci masci transferred this issue from deepset-ai/haystack May 25, 2024
@masci masci added integration:elasticsearch feature request Ideas to improve an integration contributions wanted! Looking for external contributions labels May 25, 2024

6 participants