fix: Prevent balancer from overloading the same QueryNode #38719

Open. weiliu1031 wants to merge 1 commit into master from fix_assign_too_much_task_to_qn.

weiliu1031
Contributor

issue: #38718
The balancer accounts for the workload of executing tasks as an ongoing score on their target nodes. However, GetSegmentTaskDelta and GetChannelTaskDelta incorrectly return zero when called with collectionID=-1.

Because the resulting global score is wrong, the workload of executing tasks is not reflected when each collection is balanced. Each collection then submits its own balance task, and the balancer assigns an excessive number of tasks to the same QueryNode.
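
To illustrate the failure mode described above, here is a minimal Go sketch. It is not the Milvus implementation: the types `deltaKey` and `taskDeltas`, the functions `buggyDelta` and `fixedDelta`, and the `allCollections` constant are hypothetical names, and the exact-match lookup is only one assumed way a collectionID=-1 query could end up returning zero.

```go
package main

import "fmt"

// allCollections is a hypothetical sentinel meaning "every collection",
// standing in for the collectionID=-1 call described in this PR.
const allCollections int64 = -1

// deltaKey identifies pending workload by node and collection (assumed layout).
type deltaKey struct {
	nodeID       int64
	collectionID int64
}

// taskDeltas maps (node, collection) to the workload of executing tasks.
type taskDeltas map[deltaKey]int

// buggyDelta does an exact-match lookup. The sentinel never matches a real
// collection, so a global query always yields zero.
func buggyDelta(d taskDeltas, nodeID, collectionID int64) int {
	return d[deltaKey{nodeID, collectionID}]
}

// fixedDelta treats the sentinel as a wildcard and sums the workload of
// every collection on the node.
func fixedDelta(d taskDeltas, nodeID, collectionID int64) int {
	sum := 0
	for k, v := range d {
		if k.nodeID != nodeID {
			continue
		}
		if collectionID == allCollections || k.collectionID == collectionID {
			sum += v
		}
	}
	return sum
}

func main() {
	d := taskDeltas{
		{nodeID: 1, collectionID: 100}: 3,
		{nodeID: 1, collectionID: 200}: 2,
		{nodeID: 2, collectionID: 100}: 1,
	}
	fmt.Println(buggyDelta(d, 1, allCollections)) // 0: executing tasks are invisible
	fmt.Println(fixedDelta(d, 1, allCollections)) // 5: both collections counted
	fmt.Println(fixedDelta(d, 1, 100))            // 3: single collection still works
}
```

With the zero result, node 1 looks idle to every collection's balance pass, so each pass may pick it again as a target; with the summed result, the tasks already in flight count against it.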

@sre-ci-robot sre-ci-robot added the size/L Denotes a PR that changes 100-499 lines. label Dec 24, 2024
@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: weiliu1031
To complete the pull request process, please assign wxyucs after the PR has been reviewed.
You can assign the PR to them by writing /assign @wxyucs in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mergify mergify bot added dco-passed DCO check passed. kind/bug Issues or changes related a bug labels Dec 24, 2024
Contributor

mergify bot commented Dec 24, 2024

@weiliu1031 go-sdk check failed, comment rerun go-sdk can trigger the job again.

@weiliu1031
Contributor Author

rerun go-sdk

codecov bot commented Dec 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.33%. Comparing base (90de37e) to head (c99cb48).
Report is 19 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (90de37e) and HEAD (c99cb48).

HEAD has 1 upload less than BASE (BASE 90de37e: 2 uploads, HEAD c99cb48: 1 upload).

@@             Coverage Diff             @@
##           master   #38719       +/-   ##
===========================================
- Coverage   81.05%   69.33%   -11.72%     
===========================================
  Files        1381      292     -1089     
  Lines      195090    26187   -168903     
===========================================
- Hits       158131    18158   -139973     
+ Misses      31398     8029    -23369     
+ Partials     5561        0     -5561     
Components  Coverage Δ
Client      ∅ <ø> (∅)
Core        69.33% <ø> (ø)
Go          ∅ <ø> (∅)

see 1089 files with indirect coverage changes

Contributor

mergify bot commented Dec 24, 2024

@weiliu1031 go-sdk check failed, comment rerun go-sdk can trigger the job again.

@weiliu1031
Contributor Author

rerun go-sdk

The balancer accounts for the workload of executing tasks as an ongoing score on their
target nodes. However, GetSegmentTaskDelta and GetChannelTaskDelta incorrectly return
zero when called with collectionID=-1.

Because the resulting global score is wrong, the workload of executing tasks is not
reflected when each collection is balanced. Each collection then submits its own balance
task, and the balancer assigns an excessive number of tasks to the same QueryNode.

Signed-off-by: Wei Liu <[email protected]>
@weiliu1031 weiliu1031 force-pushed the fix_assign_too_much_task_to_qn branch from 34026c1 to c99cb48 Compare December 24, 2024 14:16
Contributor

mergify bot commented Dec 24, 2024

@weiliu1031 go-sdk check failed, comment rerun go-sdk can trigger the job again.

Contributor

mergify bot commented Dec 24, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Labels
dco-passed DCO check passed. kind/bug Issues or changes related a bug size/L Denotes a PR that changes 100-499 lines.