
[Bug]: Milvus rootcoord always crashes and restarts with an OOMKilled error; its memory has grown to 48G #35230

Closed
1 task done
TonyAnn opened this issue Aug 2, 2024 · 12 comments
Assignees
Labels
kind/bug (Issues or changes related to a bug)
stale (indicates no updates for 30 days)
triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

Comments

@TonyAnn

TonyAnn commented Aug 2, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.2.16
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

Milvus rootcoord always crashes and restarts. The error message is OOMKilled, and its memory usage has grown to 48 GB.

Expected Behavior

The detailed rootcoord logs are attached:
rootcoord_280160cbb3b-json.log
rootcoord_json.log

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@TonyAnn TonyAnn added the kind/bug (Issues or changes related to a bug) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels Aug 2, 2024
@xiaofan-luan
Collaborator

Are you sure it's the coordinator that is hitting the memory limit?

From the log, querynode-11298 is full of memory, and these numbers show that some nodes rebooted many times.

If the coordinators really do have that much memory usage, please provide a pprof file of rootcoord so we can do further analysis.
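
For reference, a minimal sketch of how such a heap profile could be captured, assuming the rootcoord pod exposes the standard Go pprof handlers on its metrics port (9091 by default in Milvus); the host name is a placeholder for your deployment (e.g. an address reachable via kubectl port-forward):

```python
# Sketch: download rootcoord's heap profile so it can later be inspected with `go tool pprof`.
# Assumes the metrics port (default 9091) serves the standard /debug/pprof endpoints;
# ROOTCOORD_HOST is a placeholder for the actual pod/service address.
import urllib.request

ROOTCOORD_HOST = "localhost"
PPROF_URL = f"http://{ROOTCOORD_HOST}:9091/debug/pprof/heap"

# The endpoint returns a gzipped protobuf profile, like the .pb.gz file shared later in this thread.
urllib.request.urlretrieve(PPROF_URL, "rootcoord_heap.pb.gz")
print("heap profile written to rootcoord_heap.pb.gz")
```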

@xiaofan-luan
Collaborator

I don't think it is possible for rootcoord to accumulate that much memory in a short period of time.

@xiaofan-luan
Collaborator

From the log, rootcoord works fine; I don't see any problems.

@yanliang567
Contributor

@TonyAnn it looks like you have created thousands of collections in the cluster, which may already be improved in the latest Milvus release. Could you please retry on Milvus 2.4.7 or 2.3.20?

@yanliang567 yanliang567 added the triage/needs-information (Indicates an issue needs more information in order to work on it.) label and removed the needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) label Aug 5, 2024
@yanliang567 yanliang567 assigned TonyAnn and unassigned yanliang567 Aug 5, 2024
@xiaofan-luan
Collaborator

> @TonyAnn it looks like you have created thousands of collections in the cluster, which may already be improved in the latest Milvus release. Could you please retry on Milvus 2.4.7 or 2.3.20?

Even with 10k collections, this much memory usage is not expected.

@xiaofan-luan
Collaborator

we need pprof to understand why. maybe it's due to some kind of retry

@yanliang567
Contributor

pprof.milvus.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz
Received the pprof above from the user.
/assign @congqixia
Please help to take a look.

@yanliang567 yanliang567 added the triage/accepted (Indicates an issue or PR is ready to be actively worked on.) label and removed the triage/needs-information (Indicates an issue needs more information in order to work on it.) label Aug 5, 2024
@congqixia
Contributor

[pprof screenshot]
Inuse space is very low, but the alloc space is huge.
How many collections do you have in your instance?
@TonyAnn

@congqixia
Contributor

@TonyAnn also, 2.2.16 is a very old version. There might be some optimizations related to this issue in 2.3.x and 2.4.x. It is highly recommended to upgrade your cluster to a more stable version.

@TonyAnn
Author

TonyAnn commented Aug 6, 2024

> [pprof screenshot] Inuse space is very low, but the alloc space is huge. How many collections do you have in your instance? @TonyAnn

@congqixia There are a large number of collections in the cluster. There is a liveness detection script that continuously creates and deletes collections. This script may be the cause.
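
For context, a hypothetical pymilvus sketch of the kind of create/drop liveness probe described above; the collection name, schema, address, and interval are illustrative assumptions, not taken from the user's actual script. Constant collection churn like this produces a steady stream of short-lived allocations in rootcoord even though little memory stays in use, which would be consistent with the low inuse space but huge alloc space seen in the pprof:

```python
# Hypothetical liveness probe: repeatedly create and drop a small collection.
# All names and parameters are illustrative; run only against a test cluster.
import time
from pymilvus import (
    connections, utility, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(alias="default", host="localhost", port="19530")

PROBE_NAME = "liveness_probe"
schema = CollectionSchema(
    fields=[
        FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True),
        FieldSchema(name="vec", dtype=DataType.FLOAT_VECTOR, dim=8),
    ],
    description="temporary collection used only as a liveness check",
)

while True:
    if utility.has_collection(PROBE_NAME):
        utility.drop_collection(PROBE_NAME)
    Collection(name=PROBE_NAME, schema=schema)  # instantiating with a schema creates the collection
    utility.drop_collection(PROBE_NAME)         # ...and it is dropped right away
    time.sleep(30)  # every iteration adds create/drop load on rootcoord
```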

@TonyAnn
Author

TonyAnn commented Aug 6, 2024

why

OK, it will be upgraded in a while.


stale bot commented Sep 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale (indicates no updates for 30 days) label Sep 7, 2024
@stale stale bot closed this as completed Sep 15, 2024