Hybrid search scoring is dependent on number of results requested #325
Comments
A couple of mitigations I can think of (though maybe not the easiest thing in the world to implement):
@HenryL27 Let me try to understand the bug here: if a document is not returned by a subquery and it is the most relevant document, that document does not surface. Is that a correct understanding? BTW, congrats, you are the first user apart from the dev team who is trying this feature. 🥇
Yes, that's correct. Or, slightly more subtly, if a document is not returned in one of the subqueries, then its score for that subquery is 0 (when in actuality it should be something).
I would argue with that: in reality the score is missing for that sub-query, which is counted as no hit, so the score is 0. That 0 score is then used in score combination; for your scenario the arithmetic mean formula will be (score1 + 0)/2. We can make this behavior configurable, so that a sub-query whose score is 0 is not taken into account. For now you can increase the size so that more hits are consumed by the normalization processor for score post-processing.
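The difference between counting a missing score as 0 and leaving that sub-query out of the mean can be sketched as follows. This is an illustration only: the function name and the `skip_missing` flag are hypothetical and are not part of the plugin.

```python
from typing import Optional, Sequence


def arithmetic_mean(scores: Sequence[Optional[float]], skip_missing: bool = False) -> float:
    """Combine one document's per-sub-query scores.

    None marks a sub-query that did not return the document.
    `skip_missing` is a hypothetical switch, not an existing setting.
    """
    if not scores:
        return 0.0
    if skip_missing:
        # Proposed alternative: average only over sub-queries that returned the doc.
        present = [s for s in scores if s is not None]
        return sum(present) / len(present) if present else 0.0
    # Behavior described in the comment above: a missing score counts as 0.
    filled = [0.0 if s is None else s for s in scores]
    return sum(filled) / len(filled)


print(arithmetic_mean([0.8, None]))                     # 0.4
print(arithmetic_mean([0.8, None], skip_missing=True))  # 0.8
```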
@HenryL27 So, as per my understanding, if a document is returned by one of the subqueries, it will have the score from that subquery, but its score from the other subquery will be 0, and overall it will still be considered in normalization. But because the result was returned by only one subquery, the document may end up far down in the ranking. One of the suggestions from the community was to have a min score value (#150): if a doc is present in one subquery and not in the other, use the min score value for that document. Let's talk about the other use case, where a document is not returned by any of the subqueries: then it is not possible to include that doc in normalization. Also, from a vector search standpoint, if it's not in the top K, we can never know about that document. The same goes for text search.
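A minimal sketch of the min-score idea, assuming each sub-query's results are available as a document-to-score mapping. The helper name is hypothetical and the actual design discussed in #150 may differ.

```python
from typing import Dict, List, Set


def fill_missing_with_min(scores: Dict[str, float], candidates: Set[str]) -> Dict[str, float]:
    """Give every candidate a score for this sub-query.

    Documents the sub-query did not return get the lowest score it did
    return, instead of 0. Hypothetical helper, for illustration only.
    """
    floor = min(scores.values()) if scores else 0.0
    return {doc: scores.get(doc, floor) for doc in candidates}


# Made-up scores for two sub-queries with size=2.
q1 = {"A": 0.9, "C": 0.6}
q2 = {"B": 0.8, "D": 0.5}
candidates = set(q1) | set(q2)

filled: List[Dict[str, float]] = [fill_missing_with_min(q, candidates) for q in (q1, q2)]
combined = {doc: sum(f[doc] for f in filled) / len(filled) for doc in candidates}
print(combined["A"])  # (0.9 + 0.5) / 2, rather than (0.9 + 0) / 2
```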
I am removing the bug tag from this GitHub issue. Feel free to +1 the GitHub issue related to the min score. Also, if you don't want to consider increasing the size, I believe @msfroh was working on a processor named OverSamplingProcessor that can do this job in OpenSearch. From the standpoint of the normalization and score combination feature, I don't think we should oversample on the size.
That is not the correct expectation. The results will be the best result for the query provided. I think there is some mismatch in the understanding of the feature.
@navneet1v I think how we are interpreting this "mismatch", and what we think the correct expectation should be, should be discussed in more detail. Maybe a working session over Zoom/Chime? Let's say Now, if you are asserting that normalization does not have to hold the order-preserving property, then maybe what we need is one that does provide that invariant. We felt this was a bug since such a property would be desirable. Otherwise, wouldn't you have worse results by enabling hybrid search? Intuitively, I think it does make sense to use a higher size (higher recall) and apply hybrid search (higher precision). We just want to make sure it's not the case that our expectation is correct and that the implementation has a bug.
Hi @austintlee
It is good that this topic has been brought up publicly. I remember that @martin-gaievski and I discussed this exact limitation of the current hybrid search before but weren't able to come up with any feasible solution to overcome it. At that time, there wasn't enough customer need or concern about this limitation.
To fix this problem to some extent, we already have a GitHub issue, which I pasted in an earlier comment too: #150
https://dl.acm.org/doi/pdf/10.1145/237661.237715 For queries Q1, Q2, ..., we are trying to collect the top k results. Let si(d) be the scoring function for query Qi. Let Di represent the set of documents returned by Qi. While the intersection of the Di's is smaller than k, add another k top docs to each Di. Once the intersection is sufficiently large, any document that has not been seen is strictly worse than all k documents in the intersection. Then use random access to score each document that has been seen by some but not all of the Qi's on the queries that missed it. Now we have a full ordering and can return the top k. The paper also describes some ways to speed this up while keeping a high (but not perfect) accuracy.
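The procedure described above can be sketched in a few lines. This is a toy version under stated assumptions, not how the plugin works: each query is modeled as an in-memory document-to-score table (so random access is a dictionary lookup), a document absent from a table scores 0 there, the combination function is monotone (here the arithmetic mean), and ties are broken by document id. All names are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence


def fagin_top_k(
    score_tables: Sequence[Dict[str, float]],
    k: int,
    combine: Callable[[List[float]], float] = mean,
) -> List[str]:
    """Return the top-k documents under `combine` (assumed monotone)."""
    if not score_tables or k <= 0:
        return []
    # Sorted access: each query's documents in descending score order.
    ranked = [sorted(t, key=t.get, reverse=True) for t in score_tables]
    max_depth = max(len(r) for r in ranked)
    depth = 0
    while True:
        # Add another k top docs to each D_i.
        depth = min(depth + k, max_depth)
        seen = [set(r[:depth]) for r in ranked]
        if len(set.intersection(*seen)) >= k or depth >= max_depth:
            break
    # Random access: score every seen document on every query.
    candidates = set.union(*seen)
    totals = {d: combine([t.get(d, 0.0) for t in score_tables]) for d in candidates}
    return sorted(candidates, key=lambda d: (-totals[d], d))[:k]


# Made-up scores in the spirit of the A/C/E vs. B/D/E example from the issue.
q1 = {"A": 0.9, "C": 0.8, "E": 0.7, "B": 0.1, "D": 0.1}
q2 = {"B": 0.9, "D": 0.8, "E": 0.7, "A": 0.1, "C": 0.1}
print(fagin_top_k([q1, q2], k=1))  # ['E']
```

The cost is visible in the loop: the sub-queries are consumed deeper and deeper until enough documents overlap, which is the latency concern for repeated k-NN execution.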
@HenryL27 thanks for sharing the gist, it is very helpful. Based on my understanding of OpenSearch and this hybrid query, the point of contention in implementing such a thing is getting more documents to make We can very well inflate the size of K during query execution too, but that will lead to multiple executions of the vector search sub-query, which will add to latency. The alternative way to get a sufficient number of documents is by inflating the I see that maybe we want to implement both of them. This way we can ensure that by default we are considering more documents (if the size value is small), and if users need some customization based on their dataset they can always take advantage of the OverSampling Processor. Please let me know your thoughts.
Hoping that I understand this issue correctly: for the current issue (the original query), how many documents from the match query part enter the normalization and combination process? (is it I think this will also become more evident and introduce inconsistencies when the paging feature for hybrid search becomes available.
@navneet1v Do you happen to know the answer to this?
For both queries, the number of documents that enter normalization is ‘size’, if there are enough documents.
@navneet1v We discussed this here a bit before for the non-normalization case: Another question that I have here: when we have normalization, do we again get the same number of docs as above in the coordinator node but only calculate normalized scores for I think we need another parameter to specify the minimum (or maybe also maximum) number of documents that should enter the normalization process from each subquery. We don't want to increase (Please correct me if I'm mistaken about any of the above.)
The way normalization works is that every query (k-NN or non-k-NN) present in the queries list runs independently on the data node and send if size is, let's say, 10, and there are 2 queries and 3 shards, then on the coordinator node we will have 30 documents per query (10*3), and 60 documents in total (30+30).
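The arithmetic above, written out as a small helper. The function name is hypothetical, and the counts are upper bounds that assume every shard has at least `size` matching documents for every sub-query.

```python
from typing import Tuple


def coordinator_doc_counts(size: int, num_sub_queries: int, num_shards: int) -> Tuple[int, int]:
    """Return (documents per sub-query, documents in total) on the coordinator node."""
    per_query = size * num_shards          # each shard returns up to `size` docs per sub-query
    total = per_query * num_sub_queries    # summed over all sub-queries
    return per_query, total


print(coordinator_doc_counts(size=10, num_sub_queries=2, num_shards=3))  # (30, 60)
```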
Adding a new parameter to the query is a good option, but I was wondering whether it would add complexity to the query clause. The hybrid query is already more complex, and no other query clause has similar behavior, so it's not as if adding a new parameter would align with some other query clause. Hence one of the alternatives proposed here was to use OverSamplingProcessor, which will be a SearchRequestProcessor that will increase the value of
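Conceptually, oversampling means inflating the requested size before the sub-queries run and cutting the combined results back afterwards. The sketch below only illustrates that idea on plain dictionaries and lists; the function names and the `sample_factor` parameter are assumptions for illustration, not the processor's actual interface.

```python
import math
from typing import Any, Dict, List, Tuple


def oversample(request: Dict[str, Any], sample_factor: float) -> Tuple[Dict[str, Any], int]:
    """Return a copy of the request with an inflated size, plus the original size."""
    original_size = request.get("size", 10)  # 10 is the default search size in OpenSearch
    inflated = dict(request, size=math.ceil(original_size * sample_factor))
    return inflated, original_size


def truncate(hits: List[Any], original_size: int) -> List[Any]:
    """Cut the normalized and combined hits back to what the user asked for."""
    return hits[:original_size]


request = {"size": 2, "query": {}}
inflated, original_size = oversample(request, sample_factor=5)
print(inflated["size"], original_size)  # 10 2
```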
Since there is no activity on the GH issue, I am resolving it.
What is the bug?
Documents that would not get surfaced by individual subqueries in the hybrid query, but which would have good hybrid scores, do not get surfaced unless the "size" parameter is sufficiently large.
For example, say I execute a hybrid query =[q1, q2] with size=2. q1's top docs are A C E. q2's top docs are B D E. The hybrid top doc should probably be E, but since the subqueries only surface A, B, C, and D, E does not get returned.
How can one reproduce the bug?
Here's a query I ran against a simple wikipedia index
with this pipeline
Change the size parameter to different values and observe the different top results.
What is the expected behavior?
The top results should be the best over all documents in the index, given the queries, normalization, and combination techniques.
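The A/C/E example from the bug description can be simulated to show how the top result changes with size. This is a toy model under loud assumptions: a single shard, made-up scores that are already normalized to [0, 1], arithmetic-mean combination with a missing score counted as 0, and ties broken by document id. The function name is hypothetical.

```python
from typing import Dict, List, Sequence


def hybrid_top(sub_query_scores: Sequence[Dict[str, float]], size: int) -> List[str]:
    """Keep only each sub-query's top `size` docs, then combine and rank."""
    truncated = []
    for scores in sub_query_scores:
        top = sorted(scores, key=scores.get, reverse=True)[:size]
        truncated.append({doc: scores[doc] for doc in top})
    docs = set().union(*truncated)
    # A doc missing from a truncated list contributes 0 for that sub-query.
    combined = {d: sum(t.get(d, 0.0) for t in truncated) / len(truncated) for d in docs}
    return sorted(combined, key=lambda d: (-combined[d], d))[:size]


# Made-up scores: E is third in both sub-queries.
q1 = {"A": 0.9, "C": 0.8, "E": 0.7}
q2 = {"B": 0.9, "D": 0.8, "E": 0.7}

print(hybrid_top([q1, q2], size=2))  # ['A', 'B']
print(hybrid_top([q1, q2], size=3))  # ['E', 'A', 'B']
```

With size=2 the document E never reaches the combination step, so it cannot be ranked; with size=3 it becomes the top hit.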
What is your host/environment?
opensearch 2.10 release candidate
Do you have any screenshots?
with size=10
with size=100
Do you have any additional context?
Add any other context about the problem.