enhance: optimize CPU usage for CheckHealth requests #35589

Merged

jaime0815
Contributor

@jaime0815 jaime0815 commented Aug 20, 2024

issue: #35563

  1. Use an internal health checker to monitor the cluster's health state and store the latest state on the coordinator node. CheckHealth requests from the proxy read the cluster's health from this latest state, which improves cluster stability.
  2. Each health check assesses all collections and channels; detailed failure messages are temporarily saved in the latest state.
  3. Use the CheckHealth request instead of the heavy GetMetrics request on the querynode and datanode.

@sre-ci-robot sre-ci-robot added size/XL Denotes a PR that changes 500-999 lines. area/dependency Pull requests that update a dependency file area/internal-api labels Aug 20, 2024
@mergify mergify bot added dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement labels Aug 20, 2024
Contributor

mergify bot commented Aug 20, 2024

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@jaime0815 jaime0815 force-pushed the refine-checkhealth-cpu-usage branch from 0c31c7d to f055ee7 Compare August 21, 2024 13:15
@sre-ci-robot sre-ci-robot added size/XXL Denotes a PR that changes 1000+ lines. and removed size/XL Denotes a PR that changes 500-999 lines. labels Aug 21, 2024
@jaime0815 jaime0815 force-pushed the refine-checkhealth-cpu-usage branch 8 times, most recently from 05e45c9 to d3708b8 Compare August 23, 2024 04:23
Contributor

mergify bot commented Aug 23, 2024

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@jaime0815
Contributor Author

/run-cpu-e2e

@jaime0815 jaime0815 force-pushed the refine-checkhealth-cpu-usage branch 3 times, most recently from 124c559 to 634e7eb Compare August 24, 2024 12:50
codecov bot commented Aug 24, 2024

Codecov Report

Attention: Patch coverage is 86.36364% with 54 lines in your changes missing coverage. Please review.

Project coverage is 80.92%. Comparing base (9c8c1b3) to head (d6f6ebf).
Report is 5 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/util/healthcheck/checker.go | 74.46% | 29 Missing and 7 partials ⚠️ |
| internal/querycoordv2/session/cluster.go | 75.00% | 2 Missing and 1 partial ⚠️ |
| internal/querynodev2/metrics_info.go | 62.50% | 2 Missing and 1 partial ⚠️ |
| pkg/util/merr/utils.go | 40.00% | 2 Missing and 1 partial ⚠️ |
| internal/datacoord/session/datanode_manager.go | 92.59% | 2 Missing ⚠️ |
| internal/util/mock/grpc_datanode_client.go | 0.00% | 2 Missing ⚠️ |
| internal/util/mock/grpc_querynode_client.go | 0.00% | 2 Missing ⚠️ |
| internal/util/wrappers/qn_wrapper.go | 0.00% | 2 Missing ⚠️ |
| internal/querycoordv2/utils/util.go | 87.50% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #35589      +/-   ##
==========================================
+ Coverage   80.89%   80.92%   +0.03%     
==========================================
  Files        1373     1374       +1     
  Lines      193162   193362     +200     
==========================================
+ Hits       156264   156485     +221     
+ Misses      31369    31361       -8     
+ Partials     5529     5516      -13     

| Components | Coverage Δ |
|---|---|
| Client | 74.58% <ø> (ø) |
| Core | 68.97% <ø> (ø) |
| Go | 83.02% <86.36%> (+0.03%) ⬆️ |

| Files with missing lines | Coverage Δ |
|---|---|
| internal/datacoord/server.go | 73.40% <100.00%> (+0.17%) ⬆️ |
| internal/datacoord/services.go | 85.49% <100.00%> (+0.03%) ⬆️ |
| internal/datacoord/util.go | 98.68% <100.00%> (ø) |
| internal/datanode/metrics_info.go | 96.20% <100.00%> (ø) |
| internal/datanode/services.go | 85.48% <100.00%> (+0.47%) ⬆️ |
| internal/distributed/datanode/client/client.go | 89.93% <100.00%> (+0.25%) ⬆️ |
| internal/distributed/datanode/service.go | 82.64% <100.00%> (+0.14%) ⬆️ |
| internal/distributed/querynode/client/client.go | 91.70% <100.00%> (+0.14%) ⬆️ |
| internal/distributed/querynode/service.go | 83.71% <100.00%> (+0.14%) ⬆️ |
| ...nternal/flushcommon/pipeline/flow_graph_manager.go | 92.07% <100.00%> (+0.87%) ⬆️ |

... and 19 more

... and 23 files with indirect coverage changes

Contributor

mergify bot commented Aug 24, 2024

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@jaime0815
Contributor Author

/run-cpu-e2e

@mergify mergify bot added the ci-passed label Aug 25, 2024
Contributor

mergify bot commented Dec 9, 2024

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@jaime0815 jaime0815 force-pushed the refine-checkhealth-cpu-usage branch 2 times, most recently from 064e956 to bf1d38d Compare December 11, 2024 03:56
@mergify mergify bot added the ci-passed label Dec 11, 2024
@jaime0815 jaime0815 force-pushed the refine-checkhealth-cpu-usage branch from bf1d38d to 49f216c Compare December 13, 2024 08:48
@mergify mergify bot removed the ci-passed label Dec 13, 2024
@jaime0815 jaime0815 added this to the 2.5.0 milestone Dec 13, 2024
Contributor

mergify bot commented Dec 13, 2024

@jaime0815 go-sdk check failed, comment rerun go-sdk can trigger the job again.

Contributor

mergify bot commented Dec 13, 2024

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@jaime0815 jaime0815 force-pushed the refine-checkhealth-cpu-usage branch from 49f216c to 8a8555a Compare December 16, 2024 02:14
Contributor

mergify bot commented Dec 16, 2024

@jaime0815 go-sdk check failed, comment rerun go-sdk can trigger the job again.

Contributor

mergify bot commented Dec 16, 2024

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@jaime0815 jaime0815 force-pushed the refine-checkhealth-cpu-usage branch from 8a8555a to c101d74 Compare December 16, 2024 07:17
Contributor

mergify bot commented Dec 16, 2024

@jaime0815 go-sdk check failed, comment rerun go-sdk can trigger the job again.

Contributor

mergify bot commented Dec 16, 2024

@jaime0815 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@jaime0815 jaime0815 force-pushed the refine-checkhealth-cpu-usage branch from c101d74 to d6f6ebf Compare December 16, 2024 09:21
Contributor

mergify bot commented Dec 16, 2024

@jaime0815 go-sdk check failed, comment rerun go-sdk can trigger the job again.

@jaime0815
Contributor Author

rerun go-sdk

@mergify mergify bot added the ci-passed label Dec 16, 2024
@czs007
Collaborator

czs007 commented Dec 17, 2024

/approve
/lgtm

@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: czs007, jaime0815

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot merged commit 28fdbc4 into milvus-io:master Dec 17, 2024
20 checks passed
@jaime0815 jaime0815 deleted the refine-checkhealth-cpu-usage branch December 17, 2024 03:07
jaime0815 added a commit to jaime0815/milvus that referenced this pull request Dec 18, 2024
jaime0815 added a commit to jaime0815/milvus that referenced this pull request Dec 18, 2024
sre-ci-robot pushed a commit that referenced this pull request Dec 18, 2024
Labels
approved area/dependency Pull requests that update a dependency file area/internal-api ci-passed dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement lgtm size/XXL Denotes a PR that changes 1000+ lines.
5 participants