fix: Balance channel may stuck at increasing replica number case #37641

Merged on Nov 14, 2024 (1 commit)

weiliu1031
Contributor

issue: #37640
Follow-up fix for PR #36549.
Cause: balance channel waits until the new delegator becomes serviceable. The new delegator must sync the target version before it becomes serviceable, and syncing the target version waits until all replicas have finished loading. So if the replica number is increased while a channel balance is in progress, a logical deadlock occurs.
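
The wait chain described above can be sketched as a toy model. This is a minimal illustration only, not the code in `target_observer.go`: the `Replica` type and all three function names are hypothetical and were chosen to mirror the three conditions in the description. The description does not spell out what keeps the new replica from finishing its load, so the sketch only shows that the balance stays blocked for as long as any replica is still loading.

```go
package main

import "fmt"

// Replica is a hypothetical stand-in for a query replica.
// Loaded reports whether the replica has finished loading.
type Replica struct {
	ID     int
	Loaded bool
}

// canSyncTargetVersion models the rule from the description: the target
// version is synced only once every replica has finished loading.
func canSyncTargetVersion(replicas []Replica) bool {
	for _, r := range replicas {
		if !r.Loaded {
			return false
		}
	}
	return true
}

// delegatorServiceable models "the new delegator becomes serviceable only
// after it has synced the target version".
func delegatorServiceable(replicas []Replica) bool {
	return canSyncTargetVersion(replicas)
}

// balanceChannelCanFinish models "balance channel waits until the new
// delegator is serviceable".
func balanceChannelCanFinish(replicas []Replica) bool {
	return delegatorServiceable(replicas)
}

func main() {
	replicas := []Replica{{ID: 1, Loaded: true}}
	fmt.Println("steady state, balance can finish:", balanceChannelCanFinish(replicas))

	// Increase the replica number: the new replica starts out unloaded.
	replicas = append(replicas, Replica{ID: 2, Loaded: false})
	fmt.Println("replica 2 loading, balance can finish:", balanceChannelCanFinish(replicas))
}
```

In this model the balance is unblocked only after every replica reports `Loaded`, which is why increasing the replica number at the same moment can leave the balance waiting.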

@sre-ci-robot sre-ci-robot requested review from sunby and yah01 November 13, 2024 04:59
@sre-ci-robot sre-ci-robot added the size/XS Denotes a PR that changes 0-9 lines. label Nov 13, 2024
@mergify mergify bot added dco-passed DCO check passed. kind/bug Issues or changes related a bug labels Nov 13, 2024
Contributor

@congqixia congqixia left a comment

/lgtm

sre-ci-robot pushed a commit that referenced this pull request Nov 13, 2024

issue: #37640
pr: #37641
fix the pr #36549

Signed-off-by: Wei Liu <[email protected]>

codecov bot commented Nov 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.55%. Comparing base (a9538f6) to head (5b4f16e).
Report is 6 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff             @@
##           master   #37641       +/-   ##
===========================================
+ Coverage   68.07%   80.55%   +12.48%     
===========================================
  Files         290     1357     +1067     
  Lines       25455   190430   +164975     
===========================================
+ Hits        17328   153410   +136082     
- Misses       8127    31598    +23471     
- Partials        0     5422     +5422     

| Components | Coverage Δ |
|---|---|
| Client | 61.25% <ø> (∅) |
| Core | 68.07% <ø> (ø) |
| Go | 83.19% <100.00%> (∅) |

| Files with missing lines | Coverage Δ |
|---|---|
| internal/querycoordv2/observers/target_observer.go | 85.35% <100.00%> (ø) |

... and 1066 files with indirect coverage changes

Contributor

mergify bot commented Nov 13, 2024

@weiliu1031 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@weiliu1031 weiliu1031 force-pushed the fix_balance_and_replica branch from 589dfda to 5b4f16e Compare November 13, 2024 08:17
@sre-ci-robot sre-ci-robot removed the lgtm label Nov 13, 2024
Contributor

mergify bot commented Nov 13, 2024

@weiliu1031 cpp-unit-test check failed, comment rerun cpp-unit-test can trigger the job again.

@weiliu1031
Contributor Author

rerun cpp-unit-test

@mergify mergify bot added the ci-passed label Nov 13, 2024
Contributor

@congqixia congqixia left a comment

/lgtm

@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: congqixia, weiliu1031

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot merged commit 1304b40 into milvus-io:master Nov 14, 2024
20 checks passed
Labels: approved, ci-passed, dco-passed, kind/bug, lgtm, size/XS