
[Bug]: Milvus cannot load collection #38457

Open
gavinshark opened this issue Dec 13, 2024 · 19 comments
Assignees
Labels: help wanted (Extra attention is needed)

Comments

@gavinshark

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.5.0-beta
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2): attu
- OS (Ubuntu or CentOS):
- CPU/Memory: 16C / 64GB
- GPU: NA
- Others:

Current Behavior

An HNSW collection cannot be loaded.

Expected Behavior

The HNSW collection can be loaded.

Steps To Reproduce

VDBBench Cohere 10M, 99% filter, HNSW scenario.

Milvus Log

No response

Anything else?

No response

gavinshark added the kind/bug and needs-triage labels on Dec 13, 2024
@yanliang567
Contributor

@gavinshark I think Milvus is not running healthily. Please refer to this doc to export the whole Milvus logs for investigation.
For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.
/assign @gavinshark
/unassign

yanliang567 added the triage/needs-information label and removed the needs-triage label on Dec 15, 2024
@gavinshark
Author

Hi yanliang, I have sent the related log file to [email protected]. Please help to check it.

@yanliang567
Contributor

After checking the logs, I suggest you add more memory to the query nodes. @gavinshark
[2024/12/13 09:53:25.340 +00:00] [WARN] [meta/failed_load_cache.go:72] ["FailedLoadCache hits failed record"] [collectionID=454534392506528839] [error="load segment failed, OOM if load, maxSegmentSize = 18597.925884246826 MB, memUsage = 41928.635314941406 MB, predictMemUsage = 60526.56119918823 MB, totalMem = 61440 MB thresholdFactor = 0.900000"]
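For reference, the warning's numbers are internally consistent. Below is a minimal sketch of the admission check implied by the log fields; the formula is inferred from the values in the warning itself, not copied from the Milvus source.

```python
# Sketch of the "OOM if load" check implied by the warning above:
# predictMemUsage = memUsage + maxSegmentSize, compared against
# totalMem * thresholdFactor. The formula is inferred from the log values.
def would_oom(mem_usage_mb, segment_size_mb, total_mem_mb, threshold_factor=0.9):
    predict_mem_usage_mb = mem_usage_mb + segment_size_mb
    return predict_mem_usage_mb > total_mem_mb * threshold_factor

# 41928.6 + 18597.9 = 60526.6 MB predicted vs. 61440 * 0.9 = 55296 MB allowed,
# so the segment load is rejected.
print(would_oom(41928.635, 18597.926, 61440))  # True
```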

yanliang567 added the help wanted label and removed the kind/bug and triage/needs-information labels on Dec 17, 2024
@gavinshark
Author

memUsage = 41928.635314941406 MB is not correct. The real memory usage is about 8000 MB.

@gavinshark
Author

By the way, the collection is the 768-dim, 10M Cohere dataset; query node memory is 60 GB, the segment number is 3, and the index is HNSW with M=16.
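For orientation, here is a minimal pymilvus sketch of the setup described above; the endpoint, field names, collection name, metric type, and efConstruction value are assumptions, not the actual VDBBench parameters.

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections
)

connections.connect(host="localhost", port="19530")  # assumed endpoint

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=768),
]
coll = Collection("cohere_10m", CollectionSchema(fields))  # hypothetical name

# HNSW index with M=16 as described; metric type and efConstruction are guesses.
coll.create_index(
    field_name="emb",
    index_params={
        "index_type": "HNSW",
        "metric_type": "COSINE",
        "params": {"M": 16, "efConstruction": 200},
    },
)
coll.load()  # the step that fails in this report
```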

@gavinshark
Author

There are 3 query nodes in the cluster.

@yanliang567
Contributor

According to the sizing tool, 60 GB is just enough for a 10M 768-dim dataset. If you have some scalar fields, it requires more. Milvus also needs some memory for itself. Please add more memory and retry.
[screenshot: sizing tool estimate]
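As a rough cross-check of the sizing estimate, the raw vectors alone take roughly 29 GiB; HNSW graph links, scalar fields, and Milvus's own overhead come on top of that.

```python
# Back-of-envelope footprint for 10M x 768-dim float32 vectors (raw data only).
rows, dim, bytes_per_float = 10_000_000, 768, 4
raw_gib = rows * dim * bytes_per_float / 1024**3
print(f"~{raw_gib:.1f} GiB of raw vectors")  # ~28.6 GiB before index overhead
```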

@gavinshark
Author

I have done the same test on Milvus 2.4.15, and the problem does not happen. The memory usage is about 10 GB per node and 30 GB total; each node has 35 GB of free memory.

@gavinshark
Author

According to the sizing tool, 60 GB is just enough for a 10M 768-dim dataset. If you have some scalar fields, it requires more. Milvus also needs some memory for itself. Please add more memory and retry.

The cluster has 3 query nodes. Each query node has 60 GB, so the query nodes have 180 GB in total.

@yanliang567
Contributor

Mmm... could you please refer to this doc: https://github.com/milvus-io/birdwatcher to back up etcd with birdwatcher?

@gavinshark
Author

Mmm... could you please refer to this doc: https://github.com/milvus-io/birdwatcher to back up etcd with birdwatcher?

Sent an email to you.

@gavinshark
Author

I meet the same issue on version 2.5.0. By the way, the dataset is imported by VDBBench, and the data is compacted into 3 segments to improve performance by changing the segment max size. The compaction is done, but the loading failed (both from VDBBench and from Attu).
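For what it's worth, here is a minimal pymilvus sketch of how one might confirm the compaction finished and then reproduce the load failure; the endpoint and collection name are assumptions.

```python
from pymilvus import Collection, MilvusException, connections

connections.connect(host="localhost", port="19530")  # assumed endpoint
coll = Collection("cohere_10m")  # hypothetical collection name

# Trigger compaction and block until it completes, as described above.
coll.compact()
coll.wait_for_compaction_completed()
print(coll.get_compaction_state())

# The subsequent load is what fails with the "OOM if load" warning.
try:
    coll.load()
except MilvusException as e:
    print("load failed:", e)
```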

@gavinshark
Author

[2024/12/24 05:26:57.041 +00:00] [INFO] [task/executor.go:228] ["load segments..."] [taskID=1735007751540] [collectionID=454803907019330956] [replicaID=454821872001089539] [segmentID=454803907070007119] [node=146] [source=segment_checker] [shardLeader=146]
[2024/12/24 05:26:57.041 +00:00] [INFO] [task/executor.go:228] ["load segments..."] [taskID=1735007751539] [collectionID=454803907019330956] [replicaID=454821872001089539] [segmentID=454803907070007115] [node=146] [source=segment_checker] [shardLeader=146]
[2024/12/24 05:26:57.043 +00:00] [INFO] [task/executor.go:228] ["load segments..."] [taskID=1735007751541] [collectionID=454803907019330956] [replicaID=454821872001089539] [segmentID=454803907070084622] [node=146] [source=segment_checker] [shardLeader=146]

@gavinshark
Author

[2024/12/24 05:26:57.062 +00:00] [WARN] [task/executor.go:232] ["failed to load segment"] [taskID=1735007751541] [collectionID=454803907019330956] [replicaID=454821872001089539] [segmentID=454803907070084622] [node=146] [source=segment_checker] [shardLeader=146] [error="load segment failed, OOM if load, maxSegmentSize = 18597.378524780273 MB, memUsage = 37306.86931705475 MB, predictMemUsage = 55904.24784183502 MB, totalMem = 61440 MB thresholdFactor = 0.900000"]

@gavinshark
Author

It seems all 3 segments are loaded by a single query node (a sketch for checking the segment distribution follows below). So I have 2 questions:

  1. What is the load-balancing strategy for segment loading?
  2. Why did this scenario happen?
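One way to see which query node each loaded segment landed on is pymilvus's query-segment-info helper. This is a sketch under the assumptions that the collection is named "cohere_10m" and that the returned objects expose segmentID/nodeIds/num_rows fields as in recent Milvus protos.

```python
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # assumed endpoint

# Lists the sealed segments currently held by query nodes, including the
# node IDs serving each segment, which shows whether they piled onto one node.
for seg in utility.get_query_segment_info("cohere_10m"):
    print(seg.segmentID, seg.nodeIds, seg.num_rows)
```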

@gavinshark
Author

I configured the shard number to 3.

@yanliang567
Contributor

/assign @XuanYang-cn
Please help take a look.

@xiaofan-luan
Collaborator

I configured the shard number to 3.

I guess this is the reason:

  1. Your segment size is too large, so the index build is slow and cannot finish.
  2. On 2.5, when the index build is not done, a segment cannot be balanced to other machines.

To verify my guess:

  1. Check whether the index build for your collection has finished (a pymilvus sketch for this check follows below).
  2. Offer your index node log and datanode log so we can help you check.

Suggestions:

  1. Segment size should be less than 4 GB unless you fully understand what you are doing. Right now your segment size is 18 GB, and I don't know whether your index node is able to build an index on such a large segment.
  2. Add more index nodes during the index-build stage so the index build can finish faster.
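A minimal sketch of the check in point 1 of "To verify my guess", using pymilvus utility helpers; the endpoint and collection name are assumptions.

```python
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # assumed endpoint

# Reports indexed rows vs. total rows for the collection's index;
# the build is done when the two match.
print(utility.index_building_progress("cohere_10m"))

# Blocks until the index build finishes (raises on failure/timeout).
utility.wait_for_index_building_complete("cohere_10m")
```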

@XuanYang-cn
Contributor

I'd like to sync up some information:

  1. It's a VDBBench test with the 10M Cohere dataset, around 30 GB of data.
  2. With the default segment size there are about 30 segments, so we advised increasing the segment maxSize to end up with 3 * 10 GB segments across three 64 GB query nodes.
  3. The index is not a problem: the query nodes (64 GB * 3) are unable to load the 3-shard, 3 * 10 GB segments (they can be loaded on 2.4.15).
  4. However, with load-then-insert-then-compact, the query nodes work perfectly.
