Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sharding operators: topk / bottomk #7582

Closed
wants to merge 2 commits into from
Closed

Conversation

wanghaao
Copy link

@wanghaao wanghaao commented Mar 11, 2024

What this PR does

Sharding topk/bottomk operators

@wanghaao wanghaao requested a review from a team as a code owner March 11, 2024 12:07
@CLAassistant
Copy link

CLAassistant commented Mar 11, 2024

CLA assistant check
All committers have signed the CLA.

@wanghaao wanghaao force-pushed the main branch 2 times, most recently from 7ed0ff5 to 8efab08 Compare March 11, 2024 12:17
@duricanikolic
Copy link
Contributor

The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase main and eventually move the CHANGELOG entry added / updated in this PR to the top of the CHANGELOG document. Thanks!

@wanghaao wanghaao reopened this Mar 12, 2024
@wanghaao
Copy link
Author

The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase main and eventually move the CHANGELOG entry added / updated in this PR to the top of the CHANGELOG document. Thanks!

Done.

@wanghaao
Copy link
Author

hello?

@ying-jeanne
Copy link
Contributor

ying-jeanne commented Jul 9, 2024

hello?

Hi @wanghaao, thanks for your contribution. This is a feature we had an internally discussion earlier, but it is not as straight forward as just shard the topk. Here is why originally from @pracucci internally, I would copy it here.

At first those operations seems shardable since each shard would returns the top/bottomk. We could then compute the top/bottomk of all shards.

It's not that easy. The topk (or bottomk) of all series is not necessarily the topk(concat(topk(shard 1), topk(shard2), ...)).

For example, let's consider the following data points:

foo{name=”a”,shard=”1”} 10
foo{name=”b”,shard=”1”} 9
foo{name=”e”,shard=”1”} 8
foo{name=”c”,shard=”2”} 9
foo{name=”d”,shard=”2”} 8
foo{name=”e”,shard=”2”} 8
Let's assume we run the query topk(2, sum by (name) (foo)). The parallel version of it is:

topk(2, sum by (name) (foo{shard=”1”})) -> topk(2, [a:10,b:9,e:8]) = [a:10,b:9]
topk(2, sum by (name) (foo{shard=”2”})) -> topk(2, [c:9,d:8,e:8]) = [c:10,d:9]

topk(2, concat(shard1, shard2)) -> topk(2, [a:10,b:9,c:10,d:9]) = [a:10,c:10]
The result is incorrect. We would have expected [e:16,a:10].

We should investigate if topk(k * shards) for the per-shard query guarantees correct results.

So in order to have the topk after concat, we should have K + B (a buffer) instead of K in the subquery, but I don't have a good mathematic equation to get that B on top of my head, since it would related to how even the data are distributed. That being said we won't accept this PR, but if you figure out a nice way to solve ⬆️ , feel free to comment here and update the PR.

@56quarters 56quarters closed this Oct 15, 2024
@colega
Copy link
Contributor

colega commented Oct 16, 2024

So in order to have the topk after concat, we should have K + B (a buffer) instead of K in the subquery

There's no buffer that could solve if there's a sum aggregation inside of topk.

topk and bottomk can be sharded if its expression is a plain vector selector.

My intuition tells me that topk(n, max by (...) (...)) should also work, and bottomk(n, min by (...) (...)).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants