[Bug]: Service exception #34995

Closed
hotwinters opened this issue Jul 25, 2024 · 10 comments

@hotwinters

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.16
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka):
- SDK version (e.g. pymilvus v2.0.0rc2): Go SDK
- OS (Ubuntu or CentOS): CentOS 7
- CPU/Memory: 8 cores / 32 GB
- GPU:
- Others:

Current Behavior

Every few seconds, the CPU usage of Milvus spikes to about 5 cores and then drops back. The Go SDK shows high latency when creating and deleting collections. I restarted my Go SDK service and the latency was still there; after restarting Milvus, everything returned to normal.
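
For reference (not part of the original report), here is a minimal Go sketch of how the create/load/drop latency could be measured, assuming milvus-sdk-go v2 against a local standalone instance at localhost:19530; the collection name, field names, dimension, and index type are placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
	"github.com/milvus-io/milvus-sdk-go/v2/entity"
)

func main() {
	ctx := context.Background()

	// Assumed address of a local standalone instance.
	c, err := client.NewGrpcClient(ctx, "localhost:19530")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Placeholder collection: an int64 primary key plus a 128-dim vector field.
	name := "latency_probe"
	schema := &entity.Schema{
		CollectionName: name,
		Fields: []*entity.Field{
			{Name: "id", DataType: entity.FieldTypeInt64, PrimaryKey: true},
			{Name: "vec", DataType: entity.FieldTypeFloatVector, TypeParams: map[string]string{"dim": "128"}},
		},
	}

	start := time.Now()
	if err := c.CreateCollection(ctx, schema, 1); err != nil {
		log.Fatal(err)
	}
	log.Printf("CreateCollection took %v", time.Since(start))

	// Milvus 2.x requires an index on the vector field before loading.
	idx, err := entity.NewIndexIvfFlat(entity.L2, 128)
	if err != nil {
		log.Fatal(err)
	}
	if err := c.CreateIndex(ctx, name, "vec", idx, false); err != nil {
		log.Fatal(err)
	}

	start = time.Now()
	if err := c.LoadCollection(ctx, name, false); err != nil {
		log.Fatal(err)
	}
	log.Printf("LoadCollection took %v", time.Since(start))

	start = time.Now()
	if err := c.DropCollection(ctx, name); err != nil {
		log.Fatal(err)
	}
	log.Printf("DropCollection took %v", time.Since(start))
}
```

Timing each call separately would show whether the delay reported later in this thread (8-20 s) comes from CreateCollection, the index build, or LoadCollection.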

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

[2024/07/25 06:17:40.538 +00:00] [ERROR] [retry/retry.go:46] ["retry func failed"] ["retry time"=0] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:104 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:336 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:209 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:238 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:119 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget: rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:104\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:336\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:209\n | github.com/milvus-io/milvus/pkg/util/retry.Do\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:238\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget\n | 
\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:119\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).updateNextTarget\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:281\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).check\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:201\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*taskDispatcher[...]).schedule.func1.1\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/task_dispatcher.go:101\n | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n | github.com/panjf2000/ants/v2.(*goWorker).run.func1\n | \t/go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:67\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:104 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:336 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:209 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n | /go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:238 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:119 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget\nWraps: (3) rpc error: code = DeadlineExceeded desc = context deadline exceeded\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *status.Error"] 
[stack="github.com/milvus-io/milvus/pkg/util/retry.Do\n\t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:46\ngithub.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:238\ngithub.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:119\ngithub.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).updateNextTarget\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:281\ngithub.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).check\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:201\ngithub.com/milvus-io/milvus/internal/querycoordv2/observers.(*taskDispatcher[...]).schedule.func1.1\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/task_dispatcher.go:101\ngithub.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n\t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\ngithub.com/panjf2000/ants/v2.(*goWorker).run.func1\n\t/go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:67"]

Anything else?

No response

@hotwinters hotwinters added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 25, 2024
@yanliang567
Contributor

@hotwinters quick questions:

  1. By how many seconds did each request's latency increase?
  2. What did you do before the latency increased? What are the steps to reproduce?
  3. The error above is a timeout, so please provide the full Milvus logs for us to investigate the issue. Could you please refer to this doc to export the whole Milvus logs? For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.
    /unassign

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 26, 2024
@hotwinters
Author

log.zip
@yanliang567 This is a compressed RAR file containing only a portion of the logs. I'm not sure whether it has the information you need, but I noticed that during the exception period it takes about 8-20 seconds to create and load a new collection.

@xiaofan-luan
Collaborator

I don't have a tool to uncompress RAR files.

How many collections currently exist in your cluster?

@xiaofan-luan
Collaborator

/assign @smellthemoon

Could you help with this?

@hotwinters
Author

log.zip
@xiaofan-luan @smellthemoon I repackaged it as a zip file and re-uploaded it. I use standalone mode with about 130 collections in memory, and I delete 60 temporary collections every 20 minutes and then create another 60.
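
For context (again not from the original report), a rough sketch of the workload pattern described above, assuming milvus-sdk-go v2; the server address, the tmp_col_ name prefix, and the tempSchema helper with its field names and dimension are hypothetical placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
	"github.com/milvus-io/milvus-sdk-go/v2/entity"
)

// tempSchema builds a placeholder schema for a temporary collection
// (hypothetical field names and dimension).
func tempSchema(name string) *entity.Schema {
	return &entity.Schema{
		CollectionName: name,
		Fields: []*entity.Field{
			{Name: "id", DataType: entity.FieldTypeInt64, PrimaryKey: true},
			{Name: "vec", DataType: entity.FieldTypeFloatVector, TypeParams: map[string]string{"dim": "128"}},
		},
	}
}

// churn mimics the described workload: every 20 minutes, drop the previous
// batch of 60 temporary collections and create a new batch.
func churn(ctx context.Context, c client.Client) error {
	ticker := time.NewTicker(20 * time.Minute)
	defer ticker.Stop()

	for {
		for i := 0; i < 60; i++ {
			name := fmt.Sprintf("tmp_col_%d", i)

			// Drop the previous temporary collection if it exists.
			if ok, err := c.HasCollection(ctx, name); err == nil && ok {
				if err := c.DropCollection(ctx, name); err != nil {
					return err
				}
			}
			// Recreate it.
			if err := c.CreateCollection(ctx, tempSchema(name), 1); err != nil {
				return err
			}
		}
		<-ticker.C // wait 20 minutes before the next round
	}
}

func main() {
	ctx := context.Background()
	c, err := client.NewGrpcClient(ctx, "localhost:19530") // assumed address
	if err != nil {
		panic(err)
	}
	defer c.Close()
	if err := churn(ctx, c); err != nil {
		panic(err)
	}
}
```

This kind of steady create/drop churn across roughly 130 resident collections is the pattern that the later comments attribute the QueryNode task-pool buildup to.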

@hotwinters
Author

log-cpu-100.zip
Today I found that CPU usage has stayed at 100% continuously; this is the latest log.

@hotwinters
Author

Why does the CPU stay at 100% after there are too many collections?

@smellthemoon
Contributor

Why does the CPU stay at 100% after there are too many collections?

After restarting, Milvus recovers the metadata and reloads the previous data. However, because collections are created so frequently, too many tasks pile up in the task pool on the QueryNode side; latency increases, CPU usage keeps rising, and eventually Milvus restarts.

@smellthemoon
Contributor

Frequent creation and deletion of collections can cause problems and side effects like this. If necessary, you can switch to a cluster deployment and add machines to improve performance. @hotwinters

stale bot commented Aug 31, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no udpates for 30 days label Aug 31, 2024
@stale stale bot closed this as completed Sep 7, 2024