fix: [2.4] Prevent balancer from overloading the same QueryNode #38720

Merged

weiliu1031
Contributor

issue: #38718
pr: #38719
The balancer counts the workload of executing tasks toward each target node's score. However, when GetSegmentTaskDelta or GetChannelTaskDelta is called with collectionID=-1, it incorrectly returns zero.

Because the global score is therefore wrong, the workload of executing tasks is not reflected when each collection is balanced. Each collection then submits its own balance task, and the balancer ends up assigning too many tasks to the same QueryNode.
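
To make the failure mode concrete, here is a minimal sketch of one way a lookup with collectionID=-1 can come back as zero, and of a lookup that treats -1 as "all collections" instead. It is an illustration only: `taskDeltas`, `deltaKey`, `getExact`, and `get` are hypothetical names, and the actual code in internal/querycoordv2/task/scheduler.go may be structured differently.

```go
package main

import "fmt"

// allCollections is the sentinel used in this sketch for "every collection".
// The PR description refers to callers passing collectionID=-1.
const allCollections int64 = -1

// deltaKey and taskDeltas are hypothetical types for this sketch, not Milvus types.
type deltaKey struct {
	nodeID       int64
	collectionID int64
}

type taskDeltas struct {
	byKey map[deltaKey]int
}

func newTaskDeltas() *taskDeltas {
	return &taskDeltas{byKey: make(map[deltaKey]int)}
}

// add records the workload change an executing task will cause on a node.
func (d *taskDeltas) add(nodeID, collectionID int64, delta int) {
	d.byKey[deltaKey{nodeID, collectionID}] += delta
}

// getExact looks up a single (node, collection) pair. No task is ever
// recorded under collection -1, so a global query always yields 0.
func (d *taskDeltas) getExact(nodeID, collectionID int64) int {
	return d.byKey[deltaKey{nodeID, collectionID}]
}

// get treats allCollections as a wildcard and sums every collection's
// delta on the node, so in-flight work is visible to a global query.
func (d *taskDeltas) get(nodeID, collectionID int64) int {
	sum := 0
	for k, v := range d.byKey {
		if k.nodeID != nodeID {
			continue
		}
		if collectionID != allCollections && k.collectionID != collectionID {
			continue
		}
		sum += v
	}
	return sum
}

func main() {
	d := newTaskDeltas()
	d.add(1, 100, 3) // collection 100 is moving 3 units of work to node 1
	d.add(1, 200, 2) // collection 200 is moving 2 units of work to node 1
	d.add(2, 100, 4)

	fmt.Println(d.getExact(1, allCollections)) // 0: node 1 looks idle
	fmt.Println(d.get(1, allCollections))      // 5: pending work on node 1 is counted
	fmt.Println(d.get(1, 100))                 // 3: per-collection lookups still work
}
```

With the exact-match lookup, every collection sees node 1 as carrying no pending work and may pick it as a target, which matches the overload described above.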

@sre-ci-robot sre-ci-robot added the size/L Denotes a PR that changes 100-499 lines. label Dec 24, 2024
@sre-ci-robot sre-ci-robot requested review from sunby and yah01 December 24, 2024 13:39
@mergify mergify bot added dco-passed DCO check passed. kind/bug Issues or changes related a bug labels Dec 24, 2024
Contributor

mergify bot commented Dec 24, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@weiliu1031 weiliu1031 force-pushed the fix_assign_too_much_task_to_qn24 branch 2 times, most recently from fdb3bdf to eaeba70 Compare December 24, 2024 15:22
Contributor

mergify bot commented Dec 24, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@weiliu1031
Contributor Author

/run-cpu-e2e

Contributor

mergify bot commented Dec 24, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

Signed-off-by: Wei Liu <[email protected]>
@weiliu1031 weiliu1031 force-pushed the fix_assign_too_much_task_to_qn24 branch from eaeba70 to 1d5a58d Compare December 24, 2024 23:32
@mergify mergify bot added the ci-passed label Dec 25, 2024

codecov bot commented Dec 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.42%. Comparing base (648078e) to head (6fd0ad3).
Report is 5 commits behind head on 2.4.

@@            Coverage Diff             @@
##              2.4   #38720      +/-   ##
==========================================
+ Coverage   83.40%   83.42%   +0.01%     
==========================================
  Files         809      809              
  Lines      143068   143115      +47     
==========================================
+ Hits       119333   119400      +67     
+ Misses      19243    19229      -14     
+ Partials     4492     4486       -6     

| Files with missing lines | Coverage Δ |
|---|---|
| internal/querycoordv2/task/scheduler.go | 90.27% <100.00%> (+1.50%) ⬆️ |

... and 34 files with indirect coverage changes

Signed-off-by: Wei Liu <[email protected]>
@mergify mergify bot removed the ci-passed label Dec 25, 2024
@xiaofan-luan
Collaborator

/lgtm
/approve

@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: weiliu1031, xiaofan-luan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mergify mergify bot added the ci-passed label Dec 25, 2024
@sre-ci-robot sre-ci-robot merged commit 9e43e55 into milvus-io:2.4 Dec 25, 2024
15 checks passed
Labels
approved ci-passed dco-passed DCO check passed. kind/bug Issues or changes related a bug lgtm size/L Denotes a PR that changes 100-499 lines.
6 participants