[FEATURE] Support for setting a default min/max score for the upcoming Normalization and Score Combination feature #150

Open
SeyedAlirezaFatemi opened this issue Apr 6, 2023 · 4 comments
Labels
backlog All the backlog features should be marked with this label Enhancements Increases software capabilities beyond original client specifications Features Introduces a new unit of functionality that satisfies a requirement neural-search

@SeyedAlirezaFatemi
Is your feature request related to a problem?

Related to the RFC. The current problem with the RFC is that when we combine scores from different queries (e.g. BM25 and kNN), we need the min and max score of each query part. However, with approximate kNN we cannot accurately calculate the min score unless we run an exact kNN search on the index, which is not feasible. This leads to inconsistent score normalization, particularly when using pagination.

What solution would you like?

As discussed in detail in the RFC, one solution is to rely on the statistics we get from the documents we see during the current query. However, in specific scenarios where the min score is known in advance, we can do better. For example, when using BM25 or cosine similarity in kNN, the user can optionally define the min score in the query to be 0 or -1, respectively.

By allowing the user to optionally define a min/max score in the query for normalization, we can ensure consistent score normalization across different queries for specific scenarios, particularly when using pagination. This would improve the accuracy and reliability of the search results for users.

Here is an example where we have the issue of pagination inconsistency when we use the general solution:
Let's assume we have a query that consists of a text match query and a kNN query, and we use this formula for score normalization:
x_normalized = (x - min) / (max - min)
and we set the page size to 10. Assume the top 10 kNN scores are between 0.9 and 1, and the scores for the rest of the documents then drop to 0. The min seen for the first page is about 0.9, but once the next page is fetched it becomes 0, so the normalized scores change drastically between pages and we might get pagination inconsistency, with missing or duplicated results.
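To make the numbers concrete, here is a minimal Python sketch of that scenario. The score list and the assumption that each request only sees the documents up to the requested page are made up for illustration; this is not how the plugin is implemented.

```python
def min_max(scores):
    """Min-max normalize using the min and max of the scores that were seen."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate case: all scores equal, so map everything to 1.0.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Hypothetical kNN scores: the top 10 are in [0.9, 1.0], the rest drop to 0.
knn_scores = [1.0, 0.99, 0.98, 0.97, 0.96, 0.95, 0.94, 0.93, 0.92, 0.9] + [0.0] * 10
page_size = 10

# Assumption: a request for page N only sees the top N * page_size documents.
seen_for_page_1 = knn_scores[:page_size]
seen_for_page_2 = knn_scores[: 2 * page_size]

norm_page_1 = min_max(seen_for_page_1)  # min = 0.9, max = 1.0
norm_page_2 = min_max(seen_for_page_2)  # min = 0.0, max = 1.0

# The same document (raw kNN score 0.95) gets two different normalized scores.
doc_index = 5
print(round(norm_page_1[doc_index], 3), round(norm_page_2[doc_index], 3))
```

The document with raw score 0.95 is normalized to roughly 0.5 when only the first page is seen and to 0.95 once the second page is included, so its combined rank relative to the text match results can shift between requests.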

@navneet1v navneet1v added Enhancements Increases software capabilities beyond original client specifications and removed untriaged labels Apr 6, 2023
@navneet1v
Collaborator

@SeyedAlirezaFatemi Thanks for creating the issue.

@navneet1v navneet1v added Features Introduces a new unit of functionality that satisfies a requirement backlog All the backlog features should be marked with this label neural-search labels Apr 6, 2023
@SeyedAlirezaFatemi
Author

@navneet1v @martin-gaievski

I noticed that in the "An Analysis of Fusion Functions for Hybrid Retrieval" paper, they also mention a min-max normalization method ($\phi_{TMM}$, Equation 4) that uses the theoretical minimum of a function.
"As an example, when $f_{LEX}$ is BM25, then its infimum is 0. When $f_{SEM}$ is cosine similarity, then that quantity is −1."

They also mention:
"Interestingly, the behavior of $\phi_{TMM}$ appears to be more robust to the data distribution—its peak remains within a small neighborhood as we move from one dataset to another. We believe the reason $\phi_{TMM}$-normalized scores are more stable is because it has one fewer data-dependent statistic in the transformation (i.e., minimum score in the retrieved set is replaced with minimum feasible value regardless of the candidate set)."

So it would be really nice to be able to define a default min value for the normalization and take the max from the data.
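As a rough sketch of what I mean, based on my reading of the quoted description: the min comes from a user-supplied theoretical value and only the max comes from the retrieved scores. The function name `tmm_normalize` and the `theoretical_min` parameter are hypothetical, not an existing API.

```python
def tmm_normalize(scores, theoretical_min):
    """Min-max normalize with a fixed theoretical min and a data-derived max."""
    hi = max(scores)
    if hi <= theoretical_min:
        # Degenerate case: nothing scores above the theoretical minimum.
        return [0.0 for _ in scores]
    return [(s - theoretical_min) / (hi - theoretical_min) for s in scores]


# BM25 has a theoretical minimum of 0 (hypothetical scores).
bm25_norm = tmm_normalize([12.0, 6.0, 3.0], theoretical_min=0.0)

# Cosine similarity has a theoretical minimum of -1 (hypothetical scores).
cosine_first_page = tmm_normalize([1.0, 0.95, 0.9], theoretical_min=-1.0)
cosine_with_tail = tmm_normalize([1.0, 0.95, 0.9, 0.0], theoretical_min=-1.0)

# The document with raw score 0.95 is normalized identically in both cases,
# because the low-scoring tail no longer moves the min.
print(bm25_norm, cosine_first_page[1], cosine_with_tail[1])
```

Note that the max is still data-dependent here, so the normalized scores are only stable as long as the top score of the candidate set does not change between requests.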

@navneet1v
Collaborator

@SeyedAlirezaFatemi thanks for providing this info. I will look into this. We are still in the development phase of the original scope.

@heemin32
Collaborator

@SeyedAlirezaFatemi, is the inconsistent pagination result the main reason for supporting this? Even with the customer-provided min/max score, the inconsistency in pagination will still occur. There's an ongoing project aimed at improving pagination consistency for hybrid search. It would be great if you could take a look at #933 and share your thoughts on whether this feature would still provide value.
