
[Bug]: Milvus cannot load collection #38457

Open
gavinshark opened this issue Dec 13, 2024 · 19 comments
Assignees
Labels: help wanted (Extra attention is needed)

Comments

@gavinshark

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.5.0-beta
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2): attu
- OS (Ubuntu or CentOS):
- CPU/Memory: 16C / 64GB
- GPU: NA
- Others:

Current Behavior

An HNSW collection cannot be loaded.

Expected Behavior

The HNSW collection can be loaded.

Steps To Reproduce

VDBBench Cohere 10M, 99% filter, HNSW scenario.

Milvus Log

No response

Anything else?

No response

gavinshark added the kind/bug and needs-triage labels on Dec 13, 2024
@yanliang567
Contributor

@gavinshark I think Milvus is not running healthily. Please refer to this doc to export the whole Milvus logs for investigation.
For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.
/assign @gavinshark
/unassign

yanliang567 added the triage/needs-information label and removed the needs-triage label on Dec 15, 2024
@gavinshark
Author

Hi yanliang, I have sent the related log file to [email protected]. Please help to check it.

@yanliang567
Contributor

After checking the logs, I suggest you add more memory to the query nodes. @gavinshark
[2024/12/13 09:53:25.340 +00:00] [WARN] [meta/failed_load_cache.go:72] ["FailedLoadCache hits failed record"] [collectionID=454534392506528839] [error="load segment failed, OOM if load, maxSegmentSize = 18597.925884246826 MB, memUsage = 41928.635314941406 MB, predictMemUsage = 60526.56119918823 MB, totalMem = 61440 MB thresholdFactor = 0.900000"]
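For reference, the warning's numbers are internally consistent. Below is a minimal sketch of the admission check implied by the log fields; the formula is inferred from the values in the warning itself, not copied from the Milvus source.

```python
# Sketch of the "OOM if load" check implied by the warning above:
# predictMemUsage = memUsage + maxSegmentSize, compared against
# totalMem * thresholdFactor. The formula is inferred from the log values.
def would_oom(mem_usage_mb, segment_size_mb, total_mem_mb, threshold_factor=0.9):
    predict_mem_usage_mb = mem_usage_mb + segment_size_mb
    return predict_mem_usage_mb > total_mem_mb * threshold_factor

# 41928.6 + 18597.9 = 60526.6 MB predicted vs. 61440 * 0.9 = 55296 MB allowed,
# so the segment load is rejected.
print(would_oom(41928.635, 18597.926, 61440))  # True
```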

yanliang567 added the help wanted label and removed the kind/bug and triage/needs-information labels on Dec 17, 2024
@gavinshark
Author

memUsage = 41928.635314941406 MB is not correct. The real memory usage is about 8000 MB.

@gavinshark
Author

By the way, the collection is the 768-dim, 10M Cohere dataset; query node memory is 60 GB, the segment number is 3, and the index is HNSW with M=16.
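For orientation, here is a minimal pymilvus sketch of the setup described above; the endpoint, field names, collection name, metric type, and efConstruction value are assumptions, not the actual VDBBench parameters.

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections
)

connections.connect(host="localhost", port="19530")  # assumed endpoint

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=768),
]
coll = Collection("cohere_10m", CollectionSchema(fields))  # hypothetical name

# HNSW index with M=16 as described; metric type and efConstruction are guesses.
coll.create_index(
    field_name="emb",
    index_params={
        "index_type": "HNSW",
        "metric_type": "COSINE",
        "params": {"M": 16, "efConstruction": 200},
    },
)
coll.load()  # the step that fails in this report
```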

@gavinshark
Author

There are 3 query nodes in the cluster.

@yanliang567
Contributor

According to the sizing tool, 60 GB is just enough for a 10M 768-dim dataset. If you have some scalar fields, it requires more. Milvus also needs some memory for itself. Please add more memory and retry.
[screenshot: sizing tool estimate]
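As a rough cross-check of the sizing estimate, the raw vectors alone take roughly 29 GiB; HNSW graph links, scalar fields, and Milvus's own overhead come on top of that.

```python
# Back-of-envelope footprint for 10M x 768-dim float32 vectors (raw data only).
rows, dim, bytes_per_float = 10_000_000, 768, 4
raw_gib = rows * dim * bytes_per_float / 1024**3
print(f"~{raw_gib:.1f} GiB of raw vectors")  # ~28.6 GiB before index overhead
```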

@gavinshark
Author

I have done the same test on Milvus 2.4.15, and the problem does not happen. The memory usage is about 10 GB per node and 30 GB total; each node has 35 GB of free memory.

@gavinshark
Author

According to the sizing tool, 60 GB is just enough for a 10M 768-dim dataset. If you have some scalar fields, it requires more. Milvus also needs some memory for itself. Please add more memory and retry.

The cluster has 3 query nodes. Each query node has 60 GB, so the query nodes have 180 GB in total.

@yanliang567
Contributor

Mmm... could you please refer to this doc: https://github.com/milvus-io/birdwatcher to back up etcd with birdwatcher?

@gavinshark
Author

Mmm... could you please refer to this doc: https://github.com/milvus-io/birdwatcher to back up etcd with birdwatcher?

Sent an email to you.

@gavinshark
Author

I meet the same issue on version 2.5.0. By the way, the dataset is imported by VDBBench, and the data is compacted into 3 segments to improve performance by changing the segment max size. The compaction is done, but the loading failed (both from VDBBench and from Attu).
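For what it's worth, here is a minimal pymilvus sketch of how one might confirm the compaction finished and then reproduce the load failure; the endpoint and collection name are assumptions.

```python
from pymilvus import Collection, MilvusException, connections

connections.connect(host="localhost", port="19530")  # assumed endpoint
coll = Collection("cohere_10m")  # hypothetical collection name

# Trigger compaction and block until it completes, as described above.
coll.compact()
coll.wait_for_compaction_completed()
print(coll.get_compaction_state())

# The subsequent load is what fails with the "OOM if load" warning.
try:
    coll.load()
except MilvusException as e:
    print("load failed:", e)
```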

@gavinshark
Author

[2024/12/24 05:26:57.041 +00:00] [INFO] [task/executor.go:228] ["load segments..."] [taskID=1735007751540] [collectionID=454803907019330956] [replicaID=454821872001089539] [segmentID=454803907070007119] [node=146] [source=segment_checker] [shardLeader=146]
[2024/12/24 05:26:57.041 +00:00] [INFO] [task/executor.go:228] ["load segments..."] [taskID=1735007751539] [collectionID=454803907019330956] [replicaID=454821872001089539] [segmentID=454803907070007115] [node=146] [source=segment_checker] [shardLeader=146]
[2024/12/24 05:26:57.043 +00:00] [INFO] [task/executor.go:228] ["load segments..."] [taskID=1735007751541] [collectionID=454803907019330956] [replicaID=454821872001089539] [segmentID=454803907070084622] [node=146] [source=segment_checker] [shardLeader=146]

@gavinshark
Author

[2024/12/24 05:26:57.062 +00:00] [WARN] [task/executor.go:232] ["failed to load segment"] [taskID=1735007751541] [collectionID=454803907019330956] [replicaID=454821872001089539] [segmentID=454803907070084622] [node=146] [source=segment_checker] [shardLeader=146] [error="load segment failed, OOM if load, maxSegmentSize = 18597.378524780273 MB, memUsage = 37306.86931705475 MB, predictMemUsage = 55904.24784183502 MB, totalMem = 61440 MB thresholdFactor = 0.900000"]

@gavinshark
Author

It seems all 3 segments are loaded by a single query node (a sketch for checking the segment distribution follows below). So I have 2 questions:

  1. What is the load-balancing strategy for segment loading?
  2. Why did this scenario happen?
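One way to see which query node each loaded segment landed on is pymilvus's query-segment-info helper. This is a sketch under the assumptions that the collection is named "cohere_10m" and that the returned objects expose segmentID/nodeIds/num_rows fields as in recent Milvus protos.

```python
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # assumed endpoint

# Lists the sealed segments currently held by query nodes, including the
# node IDs serving each segment, which shows whether they piled onto one node.
for seg in utility.get_query_segment_info("cohere_10m"):
    print(seg.segmentID, seg.nodeIds, seg.num_rows)
```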

@gavinshark
Author

I configured the shard number to 3.

@yanliang567
Contributor

/assign @XuanYang-cn
Please help take a look.

@xiaofan-luan
Collaborator

I configured the shard number to 3.

I guess this is the reason:

  1. Your segment size is too large, so the index build is slow and cannot finish.
  2. On 2.5, when the index build is not done, a segment cannot be balanced to other machines.

To verify my guess:

  1. Check whether the index build for your collection has finished (a pymilvus sketch for this check follows below).
  2. Offer your index node log and datanode log so we can help you check.

Suggestions:

  1. Segment size should be less than 4 GB unless you fully understand what you are doing. Right now your segment size is 18 GB, and I don't know whether your index node is able to build an index on such a large segment.
  2. Add more index nodes during the index-build stage so the index build can finish faster.
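A minimal sketch of the check in point 1 of "To verify my guess", using pymilvus utility helpers; the endpoint and collection name are assumptions.

```python
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # assumed endpoint

# Reports indexed rows vs. total rows for the collection's index;
# the build is done when the two match.
print(utility.index_building_progress("cohere_10m"))

# Blocks until the index build finishes (raises on failure/timeout).
utility.wait_for_index_building_complete("cohere_10m")
```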

@XuanYang-cn
Contributor

I'd like to sync up some information:

  1. It's a VDBBench test with the 10M Cohere dataset, around 30 GB of data.
  2. With the default segment size there are about 30 segments, so we advised increasing the segment maxSize to end up with 3 * 10 GB segments across three 64 GB query nodes.
  3. The index is not a problem: the query nodes (64 GB * 3) are unable to load the 3-shard, 3 * 10 GB segments (they can be loaded on 2.4.15).
  4. However, with load-then-insert-then-compact, the query nodes work perfectly.
