[Bug]: Milvus Standalone keeps restarting/crashing #38171

Open · tanvlt opened this issue Dec 3, 2024 · 20 comments
Labels: help wanted (Extra attention is needed)

Comments

tanvlt commented Dec 3, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.15
- Deployment mode(standalone or cluster): Standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq 
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): Kubernetes
- CPU/Memory: Azure standard_d16_v5 (16/64)
- GPU: 
- Others: 
 + External storage: S3

Current Behavior

Expected Behavior

  • Stop restarting

Steps To Reproduce

No response

Milvus Log

logs.tar.gz

Anything else?

No response

tanvlt added the kind/bug and needs-triage labels on Dec 3, 2024
@yanliang567 (Contributor)

@tanvlt checking the logs, Milvus is restarting because it lost its heartbeat with the etcd service:
Normal Killing 57m (x2 over 25h) kubelet Container standalone failed liveness probe, will be restarted
Please double-check that:

  1. etcd is running on SSD volumes for high performance (a quick latency check is sketched below)
  2. the Milvus pod has enough CPU requested to keep its etcd session alive (this only matters when heavy workloads are running on Milvus)

/assign @tanvlt
/unassign
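
As a rough way to gauge whether the etcd volume is fast enough, here is a minimal fsync-latency probe. It is a generic sketch, not an official etcd tool, and the data path is a placeholder; run it against the mount that backs etcd:

```python
import os
import statistics
import time

# Rough fsync-latency probe for the volume etcd uses.
# etcd is sensitive to fsync latency, so consistently high values here point at the disk.
path = os.path.join("/var/lib/etcd", "fsync-probe.tmp")  # placeholder: the etcd data volume

latencies = []
with open(path, "wb", buffering=0) as f:
    for _ in range(200):
        start = time.perf_counter()
        f.write(b"x" * 2048)
        os.fsync(f.fileno())
        latencies.append((time.perf_counter() - start) * 1000)

os.remove(path)
print(f"mean fsync latency: {statistics.mean(latencies):.2f} ms")
print(f"p99 fsync latency: {sorted(latencies)[int(len(latencies) * 0.99)]:.2f} ms")
```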

sre-ci-robot assigned tanvlt and unassigned yanliang567 on Dec 3, 2024
yanliang567 added the help wanted label and removed the kind/bug and needs-triage labels on Dec 3, 2024
tanvlt (Author) commented Dec 3, 2024

Hi @yanliang567, thanks for checking.

  • About etcd: it runs on an Azure disk, size 5 GB, storage type Standard SSD LRS, 500 IOPS. Is that enough, or should it be 3000 IOPS?
  • About the CPU: that node is dedicated to Milvus and I assumed it could use all of the CPU on the node; it was consuming a lot during the restarts.
    image
    Can you also look into the previous pod log, which I collected before the crash?
    Explore-logs-2024-12-03 14_43_23.txt
    I just want to confirm whether the issue comes from the disk or the CPU setup.

@yanliang567 (Contributor)

  1. The average CPU usage is around 500%, which is too high. How many CPU cores did you request and limit for the Milvus pod? If Milvus runs exclusively on that node, please set the request and the limit to the same value; this helps with Milvus stability and performance.
  2. The previous pod log is all INFO. Please set the Milvus log level to debug so that we can see more detail if it reproduces. A config sketch is shown below.
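
A minimal sketch of both settings, assuming the Milvus Helm chart is used for the standalone deployment; the key names may differ across chart versions, and the resource numbers are placeholders to adapt, not values from this thread:

```yaml
# values.yaml (sketch): pin request == limit on the dedicated node and raise log verbosity
standalone:
  resources:
    requests:
      cpu: "12"        # placeholder: pick a core count that fits the 16-core node
      memory: 48Gi
    limits:
      cpu: "12"        # same value as the request, as suggested above
      memory: 48Gi

log:
  level: debug         # default is info; debug gives more detail around the next crash
```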

tanvlt (Author) commented Dec 3, 2024

Hi @yanliang567,

  • I only requested 1 core and did not set a limit.
  • For example, with 16 cores, how many should I request and set as the limit?
    By the way, there are about 9,000 collections in my Milvus at the moment.
    image

@yanliang567 (Contributor)

@tanvlt would you like to have a call to talk about your scenarios? Please feel free to mail me via [email protected] with your available time and contact info.

@xiaofan-luan (Collaborator)

I guess this might be a too-many-collections issue.
@bigsheeper has been working on it for a while, and the latest Milvus releases include optimizations for it.
But in general, 9,000 collections is still too many for a single Milvus cluster. What is your target collection number?

bigsheeper (Contributor) commented Dec 3, 2024

> I guess this might be a too-many-collections issue. @bigsheeper has been working on it for a while, and the latest Milvus releases include optimizations for it. But in general, 9,000 collections is still too many for a single Milvus cluster. What is your target collection number?

Yes, I suspect this is related to periodic, large-scale metadata transactions triggered by the high number of collections.

@tanvlt Could you please check the meta request rate monitoring and see whether the periods of high transaction rates align with the times when Milvus restarted? This information would help us better understand the issue. The monitoring looks like this:
image

tanvlt (Author) commented Dec 4, 2024

Hi @bigsheeper, I have not enabled that monitoring yet; let me enable it and get back to you soon.

tanvlt (Author) commented Dec 4, 2024

Hi @xiaofan-luan, we do not have a specific target for the number of collections; we have just launched our product and the count will keep increasing.
By the way, @xiaofan-luan, I checked the FAQ: https://milvus.io/docs/v2.4.x/product_faq.md#Is-there-a-limit-to-the-total-number-of-collections-and-partitions-in-Milvus
It mentions that I can create up to 65,000 collections. Is that correct?
We are following the "one collection per tenant" approach, with no shard or partition settings:
https://milvus.io/docs/v2.4.x/multi_tenancy.md

@xiaofan-luan (Collaborator)

We do see severe performance bottlenecks and stability issues when the collection count exceeds 5,000 in Milvus 2.4.x.
@bigsheeper is actively working on improvements, and some recent releases may help.
The goal of the new release is to support 10K collections with 1,000 partitions in each collection; right now that is still challenging.
To implement a multi-tenant app, see https://milvus.io/docs/multi_tenancy.md. Partition key might be what you actually need; a sketch of the approach follows.
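
A minimal pymilvus sketch of the partition-key layout, assuming a VARCHAR tenant_id field; the collection name, field names, and dimension here are placeholders, not values from this thread:

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    # Partition key field: Milvus hashes this value to route each row to a partition,
    # so all tenants share one collection instead of one collection per tenant.
    FieldSchema(name="tenant_id", dtype=DataType.VARCHAR, max_length=64,
                is_partition_key=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields, description="shared multi-tenant collection")

# num_partitions sets how many physical partitions back the partition key.
collection = Collection("tenants_shared", schema, num_partitions=64)
```

With this layout, tenants are isolated by the tenant_id value rather than by separate collections, which keeps the total collection count low.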

zrg-team commented Dec 4, 2024

Hi @xiaofan-luan, it seems we started the project before partition-key-based tenancy was implemented.
Could you share the partition key usage documentation? Do you have any ideas on how to migrate smoothly from the per-collection approach to the partition key approach?

zrg-team commented Dec 4, 2024

With the partition_key_field approach, would we store tenant data in the same collection but with different partition_key_field values?

tanvlt (Author) commented Dec 5, 2024

Hi @bigsheeper, I've created the Grafana dashboard from this link: https://milvus.io/docs/visualize.md
but I cannot see the "Meta request rate" chart.
Can you share the metric name for it?
image

@bigsheeper (Contributor)

> Hi @bigsheeper, I've created the Grafana dashboard from this link: https://milvus.io/docs/visualize.md but I cannot see the "Meta request rate" chart. Can you share the metric name for it?

@tanvlt, sure.

sum(rate(milvus_meta_op_count{app_kubernetes_io_instance=~"$instance", app_kubernetes_io_name="$app_name", namespace="$namespace"}[1m])) by (meta_op_type, status)

image
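
If the panel still shows no data, one way to check whether Prometheus is scraping the metric at all, outside Grafana, is to hit the Prometheus HTTP query API directly. A minimal sketch, assuming Prometheus is reachable at http://prometheus:9090; the URL is a placeholder and the label selectors are dropped for brevity:

```python
import requests

PROM_URL = "http://prometheus:9090"  # placeholder: your Prometheus endpoint

query = 'sum(rate(milvus_meta_op_count[1m])) by (meta_op_type, status)'
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

# Each result is one (meta_op_type, status) series with its current per-second rate.
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"][1])
```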

tanvlt (Author) commented Dec 10, 2024

Hi @bigsheeper, I hit the restart issue once today; here are the monitoring charts.
Resource usage (I have already set the CPU limit to 600%, i.e. 6 cores):
image
Meta request rate during the incident:
image
Meta request rate when normal:
Screenshot 2024-12-10 at 9 59 54 PM

@bigsheeper (Contributor)

Hello @tanvlt,
To help us investigate the issue further, could you please provide the logs from the time when Milvus restarted?
Thank you for your cooperation!

@xiaofan-luan (Collaborator)

I think 9,000 collections is too many for the current Milvus version and not a reasonable number to run.
The high CPU usage is likely caused by the large number of collections; the periodic checks of collection states may be the reason.

@xiaofan-luan (Collaborator)

For partition_key_field, each user uses one partition key value, and putting all the data into one collection should work; see the sketch below.
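
Continuing the hypothetical schema sketched earlier in this thread, a minimal example of per-tenant inserts and tenant-scoped searches; the tenant values and index parameters are placeholders:

```python
# Column-based insert in schema order (tenant_id, embedding); the primary key is auto-generated.
# The tenant_id value determines which partition the rows land in.
collection.insert([
    ["tenant_42", "tenant_42"],
    [[0.1] * 768, [0.2] * 768],
])

# A vector index is required before loading and searching.
collection.create_index(
    "embedding",
    {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)
collection.load()

# Scope the search to a single tenant by filtering on the partition key field;
# Milvus prunes the search down to that tenant's partition.
results = collection.search(
    data=[[0.1] * 768],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
    expr='tenant_id == "tenant_42"',
)
```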

tanvlt (Author) commented Dec 20, 2024

Hi @bigsheeper, sorry for the late reply. The issue happened once yesterday, but the Milvus logs are very large and the export-log script does not capture yesterday's logs, so I exported the period before the issue occurred from my log center. I hope it is helpful.
Explore-logs-2024-12-20 22_23_01.txt

tanvlt (Author) commented Dec 20, 2024

Hi @xiaofan-luan, I understand the collection count issue, but we cannot switch to partition_key at this time; we are still looking for a suitable approach for the migration.
