[Bug]: querynode often restart when loading diskann index to local #35038

Closed
1 task done
wangqia0309 opened this issue Jul 26, 2024 · 11 comments
Labels: help wanted (extra attention is needed), stale (no updates for 30 days)

Comments

@wangqia0309

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:2.4.1
- Deployment mode(standalone or cluster):k8s
- MQ type(rocksmq, pulsar or kafka):    pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4
- OS(Ubuntu or CentOS): ubuntu
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The querynode pods frequently restart while loading the index from MinIO. All of the querynodes use the same shared NFS directory (mounted as a local directory) as localStorage to store the DiskANN index.
The total index data is dozens of TB, spread across 10 querynode pods with 50 CPUs and 800 GB of memory.
A log file captured when a restart occurred is attached. Please review the error.
error.log

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@wangqia0309 wangqia0309 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 26, 2024
@xiaofan-luan
Collaborator

You cannot use NFS for a DiskANN index. DiskANN requires at least 100K IOPS, and NFS is usually not performant enough.

You need a local NVMe SSD for caching.
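
As a rough way to compare a candidate directory against the IOPS figure above, the following is a minimal sketch of a single-threaded random-read probe. It is not part of Milvus: the function name `measure_random_read_iops` and its parameters are hypothetical, and because the reads go through the OS page cache, the number it reports is an optimistic upper bound rather than a device benchmark.

```python
import os
import random
import tempfile
import time


def measure_random_read_iops(directory, file_size=64 * 1024 * 1024,
                             block_size=4096, duration=2.0):
    """Rough single-threaded random-read rate (reads/second) under `directory`.

    Caveat: reads are buffered, so a freshly written scratch file is largely
    served from the page cache and the result overestimates the device.
    Requires a Unix-like OS (uses os.pread).
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Fill the scratch file with random bytes, up to 1 MiB per write.
        chunk = os.urandom(1024 * 1024)
        written = 0
        while written < file_size:
            written += os.write(fd, chunk[: file_size - written])
        os.fsync(fd)

        # Issue block-aligned reads at random offsets until time runs out.
        blocks = file_size // block_size
        reads = 0
        start = time.perf_counter()
        while time.perf_counter() - start < duration:
            offset = random.randrange(blocks) * block_size
            os.pread(fd, block_size, offset)
            reads += 1
        elapsed = time.perf_counter() - start
        return reads / elapsed
    finally:
        # Always clean up the scratch file, even if a read fails.
        os.close(fd)
        os.unlink(path)
```

Running it once against the NFS-backed directory and once against a local NVMe path gives a quick relative comparison. For an absolute number, a dedicated tool such as fio with direct I/O and a deeper queue is the better choice.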

@wangqia0309
Author

Can you review the error log and explain in more detail how the error relates to NFS?

@xiaofan-luan
Collaborator

| Time | Message |
| -- | -- |
| 2024-07-26 19:03:06 | PC=0x7fe92f73103b m=55 sigcode=18446744073709551610 |
| 2024-07-26 19:03:06 | SIGABRT: abort |
| 2024-07-26 19:03:06 | what(): Error:RemoveDir:No such file or directory |
| 2024-07-26 19:03:06 | terminate called after throwing an instance of 'milvus::SegcoreError' |

@xiaofan-luan
Collaborator

you don't have a directory specified

@yanliang567
Contributor

/assign @wangqia0309
/unassign

@yanliang567 yanliang567 added help wanted Extra attention is needed and removed kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 5, 2024
@wangqia0309
Author

Does this mean that the Milvus querynode does not actually support configuring the local storage directory as a mounted shared NFS directory?

@xiaofan-luan
Collaborator

Shared NFS is just too slow for Milvus.

@xiaofan-luan
Collaborator

You can configure it as local storage, but with NFS/NAS we usually see very long latency, or errors reported when I/O issues occur.
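
To confirm what actually backs a configured local storage directory, one option is to look the path up in the Linux mount table. The sketch below is an illustration only, not a Milvus feature: `filesystem_type` and `is_network_filesystem` are hypothetical helper names, the example paths are assumptions, and the list of network filesystem types is deliberately short and non-exhaustive.

```python
import os


def filesystem_type(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing `path`.

    `mounts_text` is the content of /proc/mounts (Linux): one mount per line,
    with fields device, mount point, filesystem type, options, dump, pass.
    Symlinks are not resolved, and mount points containing escaped spaces
    (\\040) are not decoded.
    """
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Compare with trailing slashes so /data does not match /database.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # Prefer the deepest mount; on ties, a later entry shadows an earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


def is_network_filesystem(fs_type):
    """Heuristic check against a small, non-exhaustive set of network filesystems."""
    return fs_type.startswith("nfs") or fs_type in {"cifs", "smb3", "ceph"}


if __name__ == "__main__":
    # Example path only; substitute the directory configured for local storage.
    with open("/proc/mounts") as f:
        mount = filesystem_type("/var/lib/milvus/data", f.read())
    if mount is not None:
        print(mount, "network filesystem:", is_network_filesystem(mount[1]))
```

If the reported type is `nfs` or `nfs4`, the directory is on the kind of storage this thread warns about for DiskANN.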

@rohan-puri

@xiaofan-luan where does the 100K IOPS requirement come from?

Can you point me to resources on how Milvus makes use of NVMe SSDs? I am exploring Milvus's use of, and optimizations for, NVMe drives (mainly, what storage optimizations has Milvus made?).

stale bot commented Nov 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no udpates for 30 days label Nov 9, 2024
@xiaofan-luan
Collaborator

We need SSDs mainly on the querynode (if you use mmap or DiskANN), and also for etcd.

@stale stale bot closed this as completed Dec 5, 2024