[RFC] Pagination in Hybrid query #933

vibrantvarun · 2024-10-15T00:53:12Z

Introduction

This document talks about the design to support pagination functionality in hybrid query. OpenSearch supports multiple pagination techniques, and this document primarily focuses on two key techniques: the from and size parameters technique and the scroll operation technique.

Problem Statement

Pagination is an ability to divide a large set of search results into smaller subsets, known as pages, and retrieving them one at a time. Currently, hybrid query users cannot navigate in the result set as per their requirements.

Terminologies and key concepts

Here we define the key terms and concepts that are used throughout this document to ensure clarity and consistency in understanding the design approach later.

The from and size parameters

The from parameter is the document number from which you want to start showing the results. The size parameter is the number of results that you want to show. Together, they let you return a subset of the search results.

Scroll operation

It allows users to paginate over large set of data in a single search request. Think of it like a live cursor on a database.
If you need to request volumes of data larger than 1 PB from, for example, a machine learning job, use the scroll operation instead. The scroll operation allows you to request an unlimited number of results.

The important thing to note about the scroll operation is that scoring of the result set is not performed. It deals with large sets of data, and scoring such a large amount of data will lead to system outage.

Understanding search workflow in context of from and size parameters

The search process in OpenSearch for traditional queries executes primarily in two phases: Query Phase and Fetch Phase. The Query Phase is responsible for retrieving the query results from the shard as objects containing docId and score. The maximum number of query results collected during the Query Phase is determined by the sum of from and size parameters provided in the search request. The Fetch Phase is responsible for firstly trimming the topN query results obtained from the Query Phase where topN is equal to the value of from parameter. The source of the remaining query results are fetched from the shards. By changing the values of from and size params, users can navigate to another page of results.

Current Architecture

Hybrid query workflow

In the current hybrid query workflow, each sub-query executes the search process and returns the from + size equivalent number of results to the co-ordinator node. In the next step, normalization processor combines the sub-query result on each shard by using normalization and score-combination techniques. Later, the Fetch phase will combine all results from the shards and will return the search response. The number of hits present in the search response will be equal to the size parameter sent in the search request. In the entire hybrid query search process, the default value of from is 0.
However, when customized from and size values are sent to retrieve another page of search results, it fails to do so. This failure occurs due to change in ground truth of hybrid query results.

Why ground truth changes when from > 0

Increasing from value will lead to increase in number of results retrieved for each subquery from a shard. Instead of the new results being added to the last page, they tend to appear earlier in the list, which changes the ground truth. This can happen in two scenarios:

The subquery with higher weightage will rank its newly added results above others after the score combination. This means the new entries will appear earlier in the final result set, either on the first few pages or in the middle, instead of being added at the end.
The newly added result can become a duplicate (check appendix section below). The combined score of a duplicate entry after score combination process will be higher thereby increasing the relevance. Here we assumed that the subqueries containing the duplicate result have equal weightage during the score combination process.

Requirements

Functional Requirements

The system must divide large query result sets into manageable pages to ensure efficient data retrieval.
Each page of results should consistently reflect the same underlying data, even when new data is added or updated, to avoid discrepancies when users navigate between pages.
Users should be able to navigate through pages by changing from parameter in the search request.
If the hybrid query returns duplicates from different queries, pagination should manage this to avoid showing the same result multiple times across different pages.
The system should allow dynamic page size configuration, enabling users to choose how many results are shown on each page.

Non Functional Requirements

The system should ensure quick response times for retrieving and rendering each page of results, even when dealing with large datasets from multiple subqueries.

Solution Overview

The proposal is to introduce a new parameter in the hybrid query clause, which will provide information on number of individual sub-query results to be retrieved from the shard. This new parameter will ensure that specified number of subquery results are retrieved, even when from and size values are changed. This process helps in maintaining the consistency by not adding new result entries when from > 0 is sent in the search request. Once the subquery results are retrieved, the rest of the hybrid query process remains unchanged. The from value will navigate to the requested page in the final result, and the size parameter will determine how many search results should be returned in the response.

Why this new parameter can only be added in the hybrid query clause?

Following places have been considered to add this new parameter

Sending it in the URL of the search request

It requires enhancement in SearchSourceBuilder of OpenSearch core to inculcate the changes related to the new parameter. Moreover, there is no use of this new parameter for other traditional query types as they are already supporting pagination functionality. This scope of this change is limited to hybrid query logic, therefore this option is not recommended.

Cluster settings or Index settings

For every hybrid query request, an additional call will be made to the OpenSearch core to get index or cluster settings. This will create overload and resource exhaustion in the cluster. Moreover, customers need to change the setting every time when they want to paginate on higher number of results. Additionally, settings change is captured by event listeners that triggers workflows for other important functionalities in OpenSearch. Therefore, this option is not recommended.

Sending it in the hybrid query clause [RECOMMENDED]

Adding in the hybrid query clause is recommended because the changes done by the new parameter are specific to hybrid query and do not impact other functionalities of OpenSearch. It helps customers to understand the results formation better by explicitly defining how many subquery results they want to process under hybrid query logic.

Changes in customer experience

The from and size parameters technique

From the customer’s perspective, the new parameter will act as an indicator on how deeply they want to apply hybridization of subquery results and support pagination on top of it.

The following example visualizes the addition of new parameter in the hybrid query clause. The query below retrieves up to 100 results of each subquery and applies normalization process on them. Then it returns the search response that contains 10 hits. The first hit in the response is the 5th entry after combining all the subquery results.

http://localhost:9200/my-nlp-index/_search?size=10&search_pipeline=nlp-search-pipeline

{
 "from":5,
 "query":{
   "hybrid":{
     (Name of the parameter will be finalized in the design doc later)
     "pagination_depth": 100, (new parameter where customer define
                               the depth of hybridization of subquery results they 
                               want OpenSearch take the depth as a reference 
                               and apply pagination on top of it)
     "queries":[
        "match":{
          "name":"foo"
        }
     ] 
   }
 }
}

The new parameter will be referenced as “pagination_depth” in the rest of document for explanatory purposes. There is open question below for the appropriate name suggestion.

Note: The default value of the pagination_depth will be 0. If the customer doesn’t want to apply pagination and just wants to perform the standard hybrid search, then they do not have to provide the depth information. In that case, the hybrid query will use the current formula of from + size for retrieving the shard results.

The scroll operation

Pagination using scroll api cannot be performed in hybrid query process. The reason being hybrid query needs scoring of documents and scroll api does not support it. The main use case of scroll operation is when users need to reindex the data to a new index. This use case can be easily achieved by any other traditional query. Therefore, an illegal argument exception will be thrown when user tries to apply scroll operation with hybrid query.

High level design

(red arrows == current behavior, green arrows == new flow)

Low level design

The interfaces in the following classes will be updated to inculcate the addition of new parameter.

As there are no component changes in the current hybrid query workflow, therefore flow diagram is not added.

Testability

Backward Compatibility

The changes made to inculcate the new parameter addition are backward compatible. If pagination_depth is not provided in the search request then the current formula of from+size will be used for the subquery results retrieval.

Unit and Integration Tests

Both the test will be added in the PR.

Open questions

Q) Possible names of the new parameter to be added?

Add the suggested names in the following list.

pagination_depth
subquery_results_retrieval_size
pagination_size

Appendix

What is Pagination?

It is a process to divide a large set of search results into smaller subsets, known as pages, and retrieving them one at a time.
There are four different type of pagination techniques supported by OpenSearch:

The from and size parameters : The from parameter is the document number from which you want to start showing the results. The size parameter is the number of results that you want to show. Together, they let you return a subset of the search results.
scroll search : It allows users to paginate over large set of data in a single search request. Think of it like a live cursor on a database.
search after : The search_after parameter provides a live cursor that uses the previous page’s results to obtain the next page’s results. It is similar to the scroll operation in that it is meant to scroll many queries in parallel. You can use search_after only when sorting is applied.
Point in Time with search_after: The search_after search results are not frozen in time, so they may be inconsistent because of concurrent document indexing or deletion. By creating a point in time reference, users can freeze the search results to guarantee the consistency.

To know more about pagination and its techniques, please read the OpenSearch documentation on it.

Duplicates

If a result entry is part of another subquery result set, it is considered a duplicate result.
For example, consider a scenario where a user sends a hybrid query request. The hybrid query contains two subqueries, Subquery A and Subquery B. If the result entry of Subquery A is also part of Subquery B’s results then it is considered a duplicate result.

Unique results

If a result entry is part of only one subquery, it is considered as unique result.

The text was updated successfully, but these errors were encountered:

navneet1v · 2024-10-16T00:22:17Z

Quite detailed RFC. Waiting for performance benchmarks with new attribute for pagination.

brandon-carag · 2024-10-16T19:35:00Z

These are exciting developments--also looking forward to seeing the benchmarks.

brianf-aws · 2024-11-17T07:20:04Z

This is an amazing PR description!

vibrantvarun added untriaged enhancement Roadmap:Vector Database/GenAI Project-wide roadmap label and removed untriaged labels Oct 15, 2024

opensearch-infra bot added this to OpenSearch Roadmap Oct 15, 2024

github-project-automation bot moved this to New in OpenSearch Roadmap Oct 15, 2024

vibrantvarun added the v2.19.0 label Oct 15, 2024

vibrantvarun self-assigned this Oct 15, 2024

vibrantvarun added this to Vector Search RoadMap Oct 15, 2024

github-project-automation bot moved this to Backlog in Vector Search RoadMap Oct 15, 2024

vibrantvarun moved this from Backlog to 2.19.0 in Vector Search RoadMap Oct 15, 2024

vibrantvarun mentioned this issue Oct 15, 2024

[FEATURE] Implement pagination for Hybrid Search #280

Open

minalsha removed Roadmap:Vector Database/GenAI Project-wide roadmap label v2.19.0 labels Nov 4, 2024

brianf-aws mentioned this issue Nov 17, 2024

Initial commit for RAG pipeline scripts opensearch-project/opensearch-py-ml#427

Open

5 tasks

heemin32 mentioned this issue Nov 20, 2024

[FEATURE] Support for setting a default min/max score for the upcoming Normalization and Score Combination feature #150

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Pagination in Hybrid query #933

[RFC] Pagination in Hybrid query #933

vibrantvarun commented Oct 15, 2024 •

edited

Loading

navneet1v commented Oct 16, 2024

brandon-carag commented Oct 16, 2024

brianf-aws commented Nov 17, 2024

[RFC] Pagination in Hybrid query #933

[RFC] Pagination in Hybrid query #933

Comments

vibrantvarun commented Oct 15, 2024 • edited Loading

Introduction

Problem Statement

Terminologies and key concepts

The from and size parameters

Scroll operation

Understanding search workflow in context of from and size parameters

Current Architecture

Hybrid query workflow

Why ground truth changes when from > 0

Requirements

Functional Requirements

Non Functional Requirements

Solution Overview

Why this new parameter can only be added in the hybrid query clause?

Sending it in the URL of the search request

Cluster settings or Index settings

Sending it in the hybrid query clause [RECOMMENDED]

Changes in customer experience

The from and size parameters technique

The scroll operation

High level design

Low level design

Testability

Backward Compatibility

Unit and Integration Tests

Open questions

Q) Possible names of the new parameter to be added?

Appendix

What is Pagination?

Duplicates

Unique results

navneet1v commented Oct 16, 2024

brandon-carag commented Oct 16, 2024

brianf-aws commented Nov 17, 2024

vibrantvarun commented Oct 15, 2024 •

edited

Loading