[Bug]: Milvus Standalone keeps restarting/crashing #38171

Open · tanvlt opened this issue Dec 3, 2024 · 20 comments
Labels: help wanted (Extra attention is needed)

Comments

tanvlt commented Dec 3, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.15
- Deployment mode(standalone or cluster): Standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq 
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): Kubernetes
- CPU/Memory: Azure standard_d16_v5 (16/64)
- GPU: 
- Others: 
 + External storage: S3

Current Behavior

Expected Behavior

  • Stop restarting

Steps To Reproduce

No response

Milvus Log

logs.tar.gz

Anything else?

No response

tanvlt added the kind/bug and needs-triage labels on Dec 3, 2024
@yanliang567 (Contributor)

@tanvlt checking the logs, Milvus is restarting because it lost its heartbeat with the etcd service:
Normal Killing 57m (x2 over 25h) kubelet Container standalone failed liveness probe, will be restarted
Please double-check that:

  1. etcd is running on SSD volumes for high performance (a quick latency check is sketched below)
  2. the Milvus pod has enough CPU requested to keep its etcd session alive (this only matters when heavy workloads are running on Milvus)

/assign @tanvlt
/unassign
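
As a rough way to gauge whether the etcd volume is fast enough, here is a minimal fsync-latency probe. It is a generic sketch, not an official etcd tool, and the data path is a placeholder; run it against the mount that backs etcd:

```python
import os
import statistics
import time

# Rough fsync-latency probe for the volume etcd uses.
# etcd is sensitive to fsync latency, so consistently high values here point at the disk.
path = os.path.join("/var/lib/etcd", "fsync-probe.tmp")  # placeholder: the etcd data volume

latencies = []
with open(path, "wb", buffering=0) as f:
    for _ in range(200):
        start = time.perf_counter()
        f.write(b"x" * 2048)
        os.fsync(f.fileno())
        latencies.append((time.perf_counter() - start) * 1000)

os.remove(path)
print(f"mean fsync latency: {statistics.mean(latencies):.2f} ms")
print(f"p99 fsync latency: {sorted(latencies)[int(len(latencies) * 0.99)]:.2f} ms")
```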

sre-ci-robot assigned tanvlt and unassigned yanliang567 on Dec 3, 2024
yanliang567 added the help wanted label and removed the kind/bug and needs-triage labels on Dec 3, 2024
tanvlt (Author) commented Dec 3, 2024

Hi @yanliang567, thanks for checking.

  • About etcd: it runs on an Azure disk, size 5 GB, storage type Standard SSD LRS, 500 IOPS. Is that enough, or should it be 3000 IOPS?
  • About the CPU: that node is dedicated to Milvus and I assumed it could use all of the CPU on the node; it was consuming a lot during the restarts.
    image
    Can you also look into the previous pod log, which I collected before the crash?
    Explore-logs-2024-12-03 14_43_23.txt
    I just want to confirm whether the issue comes from the disk or the CPU setup.

@yanliang567 (Contributor)

  1. The average CPU usage is around 500%, which is too high. How many CPU cores did you request and limit for the Milvus pod? If Milvus runs exclusively on that node, please set the request and the limit to the same value; this helps with Milvus stability and performance.
  2. The previous pod log is all INFO. Please set the Milvus log level to debug so that we can see more detail if it reproduces. A config sketch is shown below.
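
A minimal sketch of both settings, assuming the Milvus Helm chart is used for the standalone deployment; the key names may differ across chart versions, and the resource numbers are placeholders to adapt, not values from this thread:

```yaml
# values.yaml (sketch): pin request == limit on the dedicated node and raise log verbosity
standalone:
  resources:
    requests:
      cpu: "12"        # placeholder: pick a core count that fits the 16-core node
      memory: 48Gi
    limits:
      cpu: "12"        # same value as the request, as suggested above
      memory: 48Gi

log:
  level: debug         # default is info; debug gives more detail around the next crash
```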

tanvlt (Author) commented Dec 3, 2024

Hi @yanliang567,

  • I only requested 1 core and did not set a limit.
  • For example, with 16 cores, how many should I request and set as the limit?
    By the way, there are about 9,000 collections in my Milvus at the moment.
    image

@yanliang567 (Contributor)

@tanvlt would you like to have a call to talk about your scenarios? Please feel free to mail me via [email protected] with your available time and contact info.

@xiaofan-luan (Collaborator)

I guess this might be a too-many-collections issue.
@bigsheeper has been working on it for a while, and the latest Milvus releases include optimizations for it.
But in general, 9,000 collections is still too many for a single Milvus cluster. What is your target collection number?

bigsheeper (Contributor) commented Dec 3, 2024

> I guess this might be a too-many-collections issue. @bigsheeper has been working on it for a while, and the latest Milvus releases include optimizations for it. But in general, 9,000 collections is still too many for a single Milvus cluster. What is your target collection number?

Yes, I suspect this is related to periodic, large-scale metadata transactions triggered by the high number of collections.

@tanvlt Could you please check the meta request rate monitoring and see whether the periods of high transaction rates align with the times when Milvus restarted? This information would help us better understand the issue. The monitoring looks like this:
image

tanvlt (Author) commented Dec 4, 2024

Hi @bigsheeper, I have not enabled that monitoring yet; let me enable it and get back to you soon.

tanvlt (Author) commented Dec 4, 2024

Hi @xiaofan-luan, we do not have a specific target for the number of collections; we have just launched our product and the count will keep increasing.
By the way, @xiaofan-luan, I checked the FAQ: https://milvus.io/docs/v2.4.x/product_faq.md#Is-there-a-limit-to-the-total-number-of-collections-and-partitions-in-Milvus
It mentions that I can create up to 65,000 collections. Is that correct?
We are following the "one collection per tenant" approach, with no shard or partition settings:
https://milvus.io/docs/v2.4.x/multi_tenancy.md

@xiaofan-luan (Collaborator)

We do see severe performance bottlenecks and stability issues when the collection count exceeds 5,000 in Milvus 2.4.x.
@bigsheeper is actively working on improvements, and some recent releases may help.
The goal of the new release is to support 10K collections with 1,000 partitions in each collection; right now that is still challenging.
To implement a multi-tenant app, see https://milvus.io/docs/multi_tenancy.md. Partition key might be what you actually need; a sketch of the approach follows.
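
A minimal pymilvus sketch of the partition-key layout, assuming a VARCHAR tenant_id field; the collection name, field names, and dimension here are placeholders, not values from this thread:

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    # Partition key field: Milvus hashes this value to route each row to a partition,
    # so all tenants share one collection instead of one collection per tenant.
    FieldSchema(name="tenant_id", dtype=DataType.VARCHAR, max_length=64,
                is_partition_key=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields, description="shared multi-tenant collection")

# num_partitions sets how many physical partitions back the partition key.
collection = Collection("tenants_shared", schema, num_partitions=64)
```

With this layout, tenants are isolated by the tenant_id value rather than by separate collections, which keeps the total collection count low.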

zrg-team commented Dec 4, 2024

Hi @xiaofan-luan, it seems we started the project before partition-key-based tenancy was implemented.
Could you share the partition key usage documentation? Do you have any ideas on how to migrate smoothly from the per-collection approach to the partition key approach?

zrg-team commented Dec 4, 2024

With the partition_key_field approach, would we store tenant data in the same collection but with different partition_key_field values?

tanvlt (Author) commented Dec 5, 2024

Hi @bigsheeper, I've created the Grafana dashboard from this link: https://milvus.io/docs/visualize.md
but I cannot see the "Meta request rate" chart.
Can you share the metric name for it?
image

@bigsheeper (Contributor)

> Hi @bigsheeper, I've created the Grafana dashboard from this link: https://milvus.io/docs/visualize.md but I cannot see the "Meta request rate" chart. Can you share the metric name for it?

@tanvlt, sure.

sum(rate(milvus_meta_op_count{app_kubernetes_io_instance=~"$instance", app_kubernetes_io_name="$app_name", namespace="$namespace"}[1m])) by (meta_op_type, status)

image
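
If the panel still shows no data, one way to check whether Prometheus is scraping the metric at all, outside Grafana, is to hit the Prometheus HTTP query API directly. A minimal sketch, assuming Prometheus is reachable at http://prometheus:9090; the URL is a placeholder and the label selectors are dropped for brevity:

```python
import requests

PROM_URL = "http://prometheus:9090"  # placeholder: your Prometheus endpoint

query = 'sum(rate(milvus_meta_op_count[1m])) by (meta_op_type, status)'
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

# Each result is one (meta_op_type, status) series with its current per-second rate.
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"][1])
```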

tanvlt (Author) commented Dec 10, 2024

Hi @bigsheeper, I hit the restart issue once today; here are the monitoring charts.
Resource usage (I have already set the CPU limit to 600%, i.e. 6 cores):
image
Meta request rate during the incident:
image
Meta request rate when normal:
Screenshot 2024-12-10 at 9 59 54 PM

@bigsheeper (Contributor)

Hello @tanvlt,
To help us investigate the issue further, could you please provide the logs from the time when Milvus restarted?
Thank you for your cooperation!

@xiaofan-luan (Collaborator)

I think 9,000 collections is too many for the current Milvus version and not a reasonable number to run.
The high CPU usage is likely caused by the large number of collections; the periodic checks of collection states may be the reason.

@xiaofan-luan (Collaborator)

For partition_key_field, each user uses one partition key value, and putting all the data into one collection should work; see the sketch below.
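
Continuing the hypothetical schema sketched earlier in this thread, a minimal example of per-tenant inserts and tenant-scoped searches; the tenant values and index parameters are placeholders:

```python
# Column-based insert in schema order (tenant_id, embedding); the primary key is auto-generated.
# The tenant_id value determines which partition the rows land in.
collection.insert([
    ["tenant_42", "tenant_42"],
    [[0.1] * 768, [0.2] * 768],
])

# A vector index is required before loading and searching.
collection.create_index(
    "embedding",
    {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)
collection.load()

# Scope the search to a single tenant by filtering on the partition key field;
# Milvus prunes the search down to that tenant's partition.
results = collection.search(
    data=[[0.1] * 768],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
    expr='tenant_id == "tenant_42"',
)
```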

tanvlt (Author) commented Dec 20, 2024

Hi @bigsheeper, sorry for the late reply. The issue happened once yesterday, but the Milvus logs are very large and the export-log script does not capture yesterday's logs, so I exported the period before the issue occurred from my log center. I hope it is helpful.
Explore-logs-2024-12-20 22_23_01.txt

tanvlt (Author) commented Dec 20, 2024

Hi @xiaofan-luan, I understand the collection count issue, but we cannot switch to partition_key at this time; we are still looking for a suitable approach for the migration.
