Filter ElasticSearch results by min_score #759
Comments
Hi @Schokomensch - do I understand correctly that you have already made these changes? Would you like to create a PR and we can have a look? I can imagine this being a useful optional addition. So if you provide the parameter |
Hi @TuanaCelik, so far I have implemented these changes by creating my own custom class that overrides some methods of the ElasticsearchDocumentStore and EmbeddingRetriever. I will create a proper PR at the beginning of next week. Within Elasticsearch the |
@Schokomensch This sounds good to me. Once you have the PR we can have a proper look at your implementation too. When you're ready, link the PR to this Issue so that we have a nice timeline of the discussions. Looking forward to it 👍🏾 |
@Schokomensch feel free to request a review from me and @TuanaCelik on your PR. |
+1 |
Problem:
I want to retrieve all relevant (similar) documents from the `ElasticsearchDocumentStore` based on the `_score` using the `EmbeddingRetriever` (I am not using the Reader). Prior to the search, I don't know how many relevant documents exist. To make sure that I retrieve all relevant entries from the `ElasticsearchDocumentStore`, I need to set `top_k=10000` or higher and filter the results afterwards, keeping only documents with a `_score` higher than x. Retrieving this many documents takes several seconds.
Solution
Filtering your query results by a minimum score value is already implemented in the Python Elasticsearch client. You could add another parameter (`min_score`) similar to `top_k` and add it to the body that you use in `client.search()`. See my example:

    body = {
        "size": top_k,
        "min_score": min_score,
        "query": self._get_vector_similarity_query(query_emb, top_k),
    }

I changed the body in the function `query_by_embedding(...)` in the file haystack/document_stores/elasticsearch.py. Now the results contain only documents whose `_score` is at least `min_score`.
Additional context
If the user wants to filter the results by cosine similarity, the `min_score` value has to be converted to the same scale as the `_score` that Elasticsearch returns before it is used in the body.
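One way this conversion could look is sketched below. It assumes the similarity script adds a constant offset to the raw cosine value so that the score is never negative (Elasticsearch does not allow negative `script_score` results). The helper name `scale_cosine_min_score` and the `score_offset` parameter are hypothetical, and the offset must match whatever the query built by `_get_vector_similarity_query` actually adds.

```python
def scale_cosine_min_score(min_cosine: float, *, score_offset: float) -> float:
    """Convert a cosine-similarity threshold into an Elasticsearch _score threshold.

    Hypothetical helper: assumes _score == cosine_similarity + score_offset.
    """
    if not -1.0 <= min_cosine <= 1.0:
        raise ValueError("cosine similarity must lie in [-1, 1]")
    return min_cosine + score_offset


# Example with an offset of 1.0, as in a script of the form
# "cosineSimilarity(params.query_vector, 'embedding') + 1.0".
# The offset used by the document store may differ (assumption).
es_min_score = scale_cosine_min_score(0.5, score_offset=1.0)
```

The resulting `es_min_score` is what would go into the `min_score` field of the body, while users keep thinking in terms of plain cosine similarity.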