
[Bug]: [benchmark] diskann index inserts 100 million data, querynode disk usage peaks at over 100G #25163

Closed
1 task done
elstic opened this issue Jun 27, 2023 · 26 comments
Assignees
Labels
kind/bug (Issues or changes related to a bug), test/benchmark (benchmark test), triage/accepted (Indicates an issue or PR is ready to be actively worked on)
Milestone

Comments

@elstic
Contributor

elstic commented Jun 27, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:2.2.0-20230626-eac54cbb
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.4.0.dev36
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task : fouramf-concurrent-n5lrq, id : 2
case: test_concurrent_locust_100m_diskann_ddl_dql_filter_cluster
This is a frequently run test case, and it passed on previous versions.

server:

fouram-45-7069-etcd-0                                             1/1     Running       0               6m4s    10.104.4.106    4am-node11   <none>           <none>
fouram-45-7069-etcd-1                                             1/1     Running       0               6m4s    10.104.20.203   4am-node22   <none>           <none>
fouram-45-7069-etcd-2                                             1/1     Running       0               6m3s    10.104.15.13    4am-node20   <none>           <none>
fouram-45-7069-milvus-datacoord-7685b67fc-pl6r5                   1/1     Running       1 (2m3s ago)    6m4s    10.104.4.87     4am-node11   <none>           <none>
fouram-45-7069-milvus-datanode-f87b86d88-n4xwz                    1/1     Running       1 (2m4s ago)    6m4s    10.104.21.17    4am-node24   <none>           <none>
fouram-45-7069-milvus-indexcoord-79b9795579-jzl68                 1/1     Running       1 (2m3s ago)    6m4s    10.104.9.239    4am-node14   <none>           <none>
fouram-45-7069-milvus-indexnode-86c4d777c4-q4brg                  1/1     Running       0               6m4s    10.104.9.238    4am-node14   <none>           <none>
fouram-45-7069-milvus-proxy-78d5df4cdc-27znx                      1/1     Running       1 (2m3s ago)    6m4s    10.104.4.88     4am-node11   <none>           <none>
fouram-45-7069-milvus-querycoord-7cb6c4ddb8-wstvc                 1/1     Running       1 (2m3s ago)    6m4s    10.104.4.89     4am-node11   <none>           <none>
fouram-45-7069-milvus-querynode-867596d85b-hk6rz                  1/1     Running       0               6m4s    10.104.6.50     4am-node13   <none>           <none>
fouram-45-7069-milvus-rootcoord-d7c486488-kqwhq                   1/1     Running       1 (2m3s ago)    6m4s    10.104.9.240    4am-node14   <none>           <none>
fouram-45-7069-minio-0                                            1/1     Running       0               6m4s    10.104.6.54     4am-node13   <none>           <none>
fouram-45-7069-minio-1                                            1/1     Running       0               6m4s    10.104.4.104    4am-node11   <none>           <none>
fouram-45-7069-minio-2                                            1/1     Running       0               6m4s    10.104.16.227   4am-node21   <none>           <none>
fouram-45-7069-minio-3                                            1/1     Running       0               6m3s    10.104.20.205   4am-node22   <none>           <none>
fouram-45-7069-pulsar-bookie-0                                    1/1     Running       0               6m4s    10.104.4.103    4am-node11   <none>           <none>
fouram-45-7069-pulsar-bookie-1                                    1/1     Running       0               6m4s    10.104.15.11    4am-node20   <none>           <none>
fouram-45-7069-pulsar-bookie-2                                    1/1     Running       0               6m4s    10.104.16.230   4am-node21   <none>           <none>
fouram-45-7069-pulsar-bookie-init-shm2z                           0/1     Completed     0               6m4s    10.104.15.5     4am-node20   <none>           <none>
fouram-45-7069-pulsar-broker-0                                    1/1     Running       0               6m4s    10.104.15.6     4am-node20   <none>           <none>
fouram-45-7069-pulsar-proxy-0                                     1/1     Running       0               6m4s    10.104.16.225   4am-node21   <none>           <none>
fouram-45-7069-pulsar-pulsar-init-8jk8d                           0/1     Completed     0               6m4s    10.104.15.254   4am-node20   <none>           <none>
fouram-45-7069-pulsar-recovery-0                                  1/1     Running       0               6m4s    10.104.4.90     4am-node11   <none>           <none>
fouram-45-7069-pulsar-zookeeper-0                                 1/1     Running       0               6m4s    10.104.21.19    4am-node24   <none>           <none>
fouram-45-7069-pulsar-zookeeper-1                                 1/1     Running       0               4m57s   10.104.6.57     4am-node13   <none>           <none>
fouram-45-7069-pulsar-zookeeper-2                                 1/1     Running       0               4m20s   10.104.5.94     4am-node12   <none>           <none>

client log:

[2023-06-26 12:02:38,308 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_0dreLHRo): 99900000 (base.py:468)
[2023-06-26 12:02:38,459 -  INFO - fouram]: [Base] Start inserting, ids: 99950000 - 99999999, data size: 100,000,000 (base.py:308)
[2023-06-26 12:02:40,008 -  INFO - fouram]: [Time] Collection.insert run in 1.5493s (api_request.py:45)
[2023-06-26 12:02:40,011 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_0dreLHRo): 99900000 (base.py:468)
[2023-06-26 12:02:40,062 -  INFO - fouram]: [Base] Total time of insert: 3187.9628s, average number of vector bars inserted per second: 31367.9946, average time to insert 50000 vectors per time: 1.594s (base.py:379)
[2023-06-26 12:02:40,062 -  INFO - fouram]: [Base] Start flush collection fouram_0dreLHRo (base.py:277)
[2023-06-26 12:02:43,125 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:441)
[2023-06-26 12:02:43,125 -  INFO - fouram]: [Base] Start release collection fouram_0dreLHRo (base.py:288)
[2023-06-26 12:02:43,127 -  INFO - fouram]: [Base] Start build index of DISKANN for collection fouram_0dreLHRo, params:{'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}} (base.py:427)
[2023-06-26 17:34:27,390 -  INFO - fouram]: [Time] Index run in 19904.2613s (api_request.py:45)
[2023-06-26 17:34:27,391 -  INFO - fouram]: [CommonCases] RT of build index DISKANN: 19904.2613s (common_cases.py:96)
[2023-06-26 17:34:27,416 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:441)
[2023-06-26 17:34:27,416 -  INFO - fouram]: [CommonCases] Prepare index DISKANN done. (common_cases.py:99)
[2023-06-26 17:34:27,416 -  INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:107)
[2023-06-26 17:34:27,418 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_0dreLHRo): 100000000 (base.py:468)
[2023-06-26 17:34:27,418 -  INFO - fouram]: [Base] Start load collection fouram_0dreLHRo,replica_number:1,kwargs:{} (base.py:283)
[2023-06-26 18:51:04,491 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=1, message=failed to load segment: follower 12 failed to load segment, reason load segment failed, disk space is not enough, collectionID = 442440457093644855, usedDiskAfterLoad = 100294 MB, totalDisk = 102400 MB, thresholdFactor = 0.950000)>, <Time:{'RPC start': '2023-06-26 18:51:04.489527', 'RPC error': '2023-06-26 18:51:04.491093'}> (decorators.py:108)
[2023-06-26 18:51:04,493 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=1, message=failed to load segment: follower 12 failed to load segment, reason load segment failed, disk space is not enough, collectionID = 442440457093644855, usedDiskAfterLoad = 100294 MB, totalDisk = 102400 MB, thresholdFactor = 0.950000)>, <Time:{'RPC start': '2023-06-26 17:34:27.474903', 'RPC error': '2023-06-26 18:51:04.493072'}> (decorators.py:108)
[2023-06-26 18:51:04,493 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=1, message=failed to load segment: follower 12 failed to load segment, reason load segment failed, disk space is not enough, collectionID = 442440457093644855, usedDiskAfterLoad = 100294 MB, totalDisk = 102400 MB, thresholdFactor = 0.950000)>, <Time:{'RPC start': '2023-06-26 17:34:27.418905', 'RPC error': '2023-06-26 18:51:04.493252'}> (decorators.py:108)
[2023-06-26 18:51:04,494 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=failed to load segment: follower 12 failed to load segment, reason load segment failed, disk space is not enough, collectionID = 442440457093644855, usedDiskAfterLoad = 100294 MB, totalDisk = 102400 MB, thresholdFactor = 0.950000)> (api_request.py:53)
[2023-06-26 18:51:04,495 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=1, message=failed to load segment: follower 12 failed to load segment, reason load segment failed, disk space is not enough, collectionID = 442440457093644855, usedDiskAfterLoad = 100294 MB, totalDisk = 102400 MB, thresholdFactor = 0.950000)> (func_check.py:52)
FAILED

client pod : fouramf-concurrent-n5lrq-1120963268
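
The failure above appears to come from a load-time disk admission check of the form usedDiskAfterLoad > thresholdFactor x totalDisk (inferred from the wording of the error message, not from the Milvus source). Plugging in the numbers from the log as a quick sanity check:

# Numbers are taken verbatim from the error message above; the check itself is
# inferred from the message wording.
used_disk_after_load_mb = 100294
total_disk_mb = 102400
threshold_factor = 0.95

allowed_mb = threshold_factor * total_disk_mb   # 97280.0 MB
print(used_disk_after_load_mb > allowed_mb)     # True -> "disk space is not enough"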

Expected Behavior

Load succeeds.

Steps To Reproduce

1. create a collection or use an existing collection
2. build the DISKANN index on the vector column
3. insert a certain number of vectors  => 100m
4. flush the collection
5. build the index on the vector column again with the same parameters
6. optionally build an index on the scalar columns
7. count the total number of rows
8. load the collection  ==> failed (see the sketch below)
# 9. perform concurrent operations
# 10. clean up all collections or not
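
A minimal pymilvus sketch of steps 1-8 (collection name, dimension, endpoint, and batch size are placeholders; the actual benchmark runs through the fouram harness):

from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType
import numpy as np

connections.connect(host="127.0.0.1", port="19530")    # assumed endpoint

dim = 128                                              # assumed dimension
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=dim),
])
collection = Collection("diskann_100m_repro", schema)  # step 1

# steps 2/5: build the DISKANN index on the vector column (empty params, as in the log)
collection.create_index(
    "float_vector",
    {"index_type": "DISKANN", "metric_type": "L2", "params": {}},
)

# step 3: insert in batches (100m rows in the real case; a single small batch here)
batch = 50_000
collection.insert([
    list(range(batch)),
    np.random.random((batch, dim)).tolist(),
])

collection.flush()                                     # step 4
print(collection.num_entities)                         # step 7

# step 8: load; in the benchmark this is where the querynode's 100Gi
# ephemeral-storage limit is exceeded and the load fails
collection.load(replica_number=1)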

Milvus Log

No response

Anything else?

No response

@elstic elstic added the kind/bug, needs-triage, and test/benchmark labels Jun 27, 2023
@elstic elstic added this to the 2.2.11 milestone Jun 27, 2023
@yanliang567
Contributor

/assign @xige-16
/unassign

@sre-ci-robot sre-ci-robot assigned xige-16 and unassigned yanliang567 Jun 27, 2023
@yanliang567 yanliang567 added the triage/accepted label and removed the needs-triage label Jun 27, 2023
@xiaofan-luan
Collaborator

disk space is not enough

As the error said: disk space is not enough

@xiaofan-luan
Collaborator

@elstic
probably need to check the disk space?

@elstic
Contributor Author

elstic commented Jun 27, 2023

@elstic probably need to check the disk space?

@xiaofan-luan

We do have a parameter that limits the disk in this way, but this case passed on previous versions, and we did not change the case parameters or any other configuration:

 --set queryNode.resources.limits.cpu=8,queryNode.resources.limits.memory=32Gi,queryNode.resources.limits.ephemeral-storage=100Gi

So I assume the current image needs more disk space.
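
For reference, a sketch of the same limits in values-dict form (mirroring the querynode disk setup shown later in this thread; purely illustrative):

# Equivalent of the --set flags above, expressed as a Helm values dict.
query_node_values = {
    'queryNode': {
        'resources': {
            'limits': {
                'cpu': '8',
                'memory': '32Gi',
                'ephemeral-storage': '100Gi',  # the disk limit the querynode is hitting
            }
        }
    }
}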

@xiaofan-luan
Collaborator

@xige-16
pls take a glance at it

@elstic elstic changed the title [Bug]: [benchmark]diskann index load failed after inserting 100 million data [Bug]: [benchmark]diskann index failed to load after inserting 100 million data with excessive disk usage Jun 29, 2023
@elstic
Contributor Author

elstic commented Jun 30, 2023

Comparison of querynode disk usage during DISKANN index load after inserting 100m data, most recent version vs. previous version:

image: 2.2.0-20230627-936ebf32, peak disk usage is about 104G; after stabilization, about 56.3G
(screenshot)

image: v2.2.7, peak disk usage is around 65G; stabilized at around 56G
(screenshot)

For the same case, the peak disk usage in v2.2.7 is around 65G, while the current version occasionally exceeds 100G, i.e. roughly 40G more at peak.

@xige-16

@yanliang567 yanliang567 modified the milestones: 2.2.11, 2.2.12 Jul 3, 2023
@xige-16
Contributor

xige-16 commented Jul 3, 2023

The load process of Milvus has remained unchanged; you could test whether this is caused by the Knowhere upgrade.

@xige-16
Contributor

xige-16 commented Jul 3, 2023

Comparison of querynode disk usage during DISKANN index load after inserting 100m data, most recent version vs. previous version:

image: 2.2.0-20230627-936ebf32, peak disk usage is about 104G; after stabilization, about 56.3G

image: v2.2.7, peak disk usage is around 65G; stabilized at around 56G

For the same case, the peak disk usage in v2.2.7 is around 65G, while the current version occasionally exceeds 100G, i.e. roughly 40G more at peak.

@xige-16

This picture shows that MinIO's disk space is not enough; the MinIO pods in the new version use more disk space than in the old version.

@LoveEachDay
Contributor

Querynodes are evicted because their disk usage exceeds 100GB, the limit set by queryNode.resources.limits.ephemeral-storage=100Gi.

@xiaofan-luan
Collaborator

Might this be related to a compaction issue?

@xige-16
Contributor

xige-16 commented Jul 4, 2023

Might this be related to a compaction issue?

There are two phenomena in this issue: disk usage on both MinIO and the querynode increases during the load process, but disk usage in the final state is unchanged, which indicates the size of the index itself has not changed. Most likely the old segments are not cleaned up in time after compaction. I will check the logs to confirm.

@elstic
Contributor Author

elstic commented Jul 19, 2023

This issue still exists.

Querynode disk usage up to the point where search becomes stable:
(screenshot)

server:
(The querynode was evicted several times before searches could run properly.)

fouramf-p559l-33-1518-etcd-0                                      1/1     Running                  0                 14h     10.104.4.203    4am-node11   <none>           <none>
fouramf-p559l-33-1518-etcd-1                                      1/1     Running                  0                 14h     10.104.13.165   4am-node16   <none>           <none>
fouramf-p559l-33-1518-etcd-2                                      1/1     Running                  0                 14h     10.104.9.234    4am-node14   <none>           <none>
fouramf-p559l-33-1518-milvus-datacoord-6b95b4f45f-bfcsg           1/1     Running                  0                 14h     10.104.17.168   4am-node23   <none>           <none>
fouramf-p559l-33-1518-milvus-datanode-64fdd568d4-2x2km            1/1     Running                  0                 14h     10.104.9.228    4am-node14   <none>           <none>
fouramf-p559l-33-1518-milvus-indexcoord-66df6d745-b9nbk           1/1     Running                  0                 14h     10.104.13.162   4am-node16   <none>           <none>
fouramf-p559l-33-1518-milvus-indexnode-65755cc48-l8tks            1/1     Running                  0                 14h     10.104.12.233   4am-node17   <none>           <none>
fouramf-p559l-33-1518-milvus-proxy-5f45b67f5c-j5znr               1/1     Running                  0                 14h     10.104.9.227    4am-node14   <none>           <none>
fouramf-p559l-33-1518-milvus-querycoord-b495678cd-dfghm           1/1     Running                  0                 14h     10.104.12.232   4am-node17   <none>           <none>
fouramf-p559l-33-1518-milvus-querynode-7cfc45cc74-gtbgj           0/1     Error                    0                 7h4m    10.104.17.89    4am-node23   <none>           <none>
fouramf-p559l-33-1518-milvus-querynode-7cfc45cc74-jcp5c           0/1     Error                    0                 6h57m   10.104.15.74    4am-node20   <none>           <none>
fouramf-p559l-33-1518-milvus-querynode-7cfc45cc74-l9c8g           0/1     Error                    0                 7h10m   10.104.17.88    4am-node23   <none>           <none>
fouramf-p559l-33-1518-milvus-querynode-7cfc45cc74-pfczx           0/1     ContainerStatusUnknown   1                 7h17m   10.104.17.86    4am-node23   <none>           <none>
fouramf-p559l-33-1518-milvus-querynode-7cfc45cc74-qbkcj           0/1     Error                    0                 6h40m   10.104.15.105   4am-node20   <none>           <none>
fouramf-p559l-33-1518-milvus-querynode-7cfc45cc74-r8kl2           0/1     Error                    0                 6h46m   10.104.15.83    4am-node20   <none>           <none>
fouramf-p559l-33-1518-milvus-querynode-7cfc45cc74-wpxpw           1/1     Running                  0                 6h34m   10.104.15.108   4am-node20   <none>           <none>
fouramf-p559l-33-1518-milvus-querynode-7cfc45cc74-z2sd8           0/1     Error                    0                 14h     10.104.17.169   4am-node23   <none>           <none>
fouramf-p559l-33-1518-milvus-querynode-7cfc45cc74-zbnlz           0/1     Error                    0                 6h52m   10.104.15.78    4am-node20   <none>           <none>
fouramf-p559l-33-1518-milvus-rootcoord-7cf6c89488-pqdg2           1/1     Running                  0                 14h     10.104.12.230   4am-node17   <none>           <none>
fouramf-p559l-33-1518-minio-0                                     1/1     Running                  0                 14h     10.104.12.237   4am-node17   <none>           <none>
fouramf-p559l-33-1518-minio-1                                     1/1     Running                  0                 14h     10.104.21.40    4am-node24   <none>           <none>
fouramf-p559l-33-1518-minio-2                                     1/1     Running                  0                 14h     10.104.4.204    4am-node11   <none>           <none>
fouramf-p559l-33-1518-minio-3                                     1/1     Running                  0                 14h     10.104.9.232    4am-node14   <none>           <none>
fouramf-p559l-33-1518-pulsar-bookie-0                             1/1     Running                  0                 14h     10.104.12.238   4am-node17   <none>           <none>
fouramf-p559l-33-1518-pulsar-bookie-1                             1/1     Running                  0                 14h     10.104.21.43    4am-node24   <none>           <none>
fouramf-p559l-33-1518-pulsar-bookie-2                             1/1     Running                  0                 14h     10.104.4.207    4am-node11   <none>           <none>
fouramf-p559l-33-1518-pulsar-bookie-init-7qz9r                    0/1     Completed                0                 14h     10.104.21.38    4am-node24   <none>           <none>
fouramf-p559l-33-1518-pulsar-broker-0                             1/1     Running                  0                 14h     10.104.13.163   4am-node16   <none>           <none>
fouramf-p559l-33-1518-pulsar-proxy-0                              1/1     Running                  0                 14h     10.104.4.200    4am-node11   <none>           <none>
fouramf-p559l-33-1518-pulsar-pulsar-init-qnm5x                    0/1     Completed                0                 14h     10.104.21.37    4am-node24   <none>           <none>
fouramf-p559l-33-1518-pulsar-recovery-0                           1/1     Running                  0                 14h     10.104.12.231   4am-node17   <none>           <none>
fouramf-p559l-33-1518-pulsar-zookeeper-0                          1/1     Running                  0                 14h     10.104.9.230    4am-node14   <none>           <none>
fouramf-p559l-33-1518-pulsar-zookeeper-1                          1/1     Running                  0                 14h     10.104.4.209    4am-node11   <none>           <none>
fouramf-p559l-33-1518-pulsar-zookeeper-2                          1/1     Running                  0                 14h     10.104.23.193   4am-node27   <none>           <none>

MinIO disk monitoring:
(screenshot)

@elstic
Contributor Author

elstic commented Jul 26, 2023

The querynode disk usage exceeding 100G also occurs on the master branch: for the same case on master, querynode disk usage reaches 108G.

@yanliang567 yanliang567 modified the milestones: 2.2.12, 2.2.13 Aug 4, 2023
@xige-16
Contributor

xige-16 commented Aug 7, 2023

@elstic Please check if this PR has any effect #25899

@elstic
Contributor Author

elstic commented Aug 8, 2023

@elstic Please check if this PR has any effect #25899

This issue still exists.
Validated image: 2.2.0-20230807-ef31fe23

(screenshot)

@smellthemoon
Contributor

@elstic Please check if this PR has any effect #25899

Perhaps #25896 is the one you mean?

@elstic
Contributor Author

elstic commented Aug 16, 2023

With the recent image '2.2.0-20230814-27fe2a45', inserting 100 million rows and loading the collection succeeded.

@stale

stale bot commented Sep 19, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale (indicates no updates for 30 days) label Sep 19, 2023
@elstic
Contributor Author

elstic commented Sep 20, 2023

This problem has not occurred recently.

@elstic elstic closed this as completed Sep 20, 2023
@elstic
Contributor Author

elstic commented Oct 26, 2023

diskann index inserts 100 million data, querynode disk usage peaks at over 100g
case: test_concurrent_locust_100m_diskann_ddl_dql_filter_cluster

image: master-20231023-0c33ddb7
querynode disk setup (100G):

{'queryNode': {'resources': {'limits': {'cpu': '8', 'memory': '32Gi', 'ephemeral-storage': '100Gi'}}}}

server:

fouramf-hqsbb-36-5136-etcd-0                                      1/1     Running       0               4m21s   10.104.18.122   4am-node25   <none>           <none>
fouramf-hqsbb-36-5136-etcd-1                                      1/1     Running       0               4m21s   10.104.23.163   4am-node27   <none>           <none>
fouramf-hqsbb-36-5136-etcd-2                                      1/1     Running       0               4m21s   10.104.16.158   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-milvus-datacoord-54c74794c9-6b5xv           1/1     Running       0               4m21s   10.104.16.149   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-milvus-datanode-5f46b6dd95-jlqbj            1/1     Running       0               4m21s   10.104.16.151   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-milvus-indexcoord-55b6fd5f88-9wj5q          1/1     Running       0               4m21s   10.104.16.150   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-milvus-indexnode-5ddd6dd445-pdwtt           1/1     Running       0               4m21s   10.104.24.67    4am-node29   <none>           <none>
fouramf-hqsbb-36-5136-milvus-proxy-5d456bd744-v8p6f               1/1     Running       0               4m21s   10.104.16.152   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-milvus-querycoord-75d48fc58f-grvtt          1/1     Running       0               4m21s   10.104.16.146   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-milvus-querynode-5ddc775ff-wgtwc            1/1     Running       0               4m21s   10.104.19.139   4am-node28   <none>           <none>
fouramf-hqsbb-36-5136-milvus-rootcoord-846bd64d7-4dnvt            1/1     Running       0               4m21s   10.104.16.148   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-minio-0                                     1/1     Running       0               4m21s   10.104.16.155   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-minio-1                                     1/1     Running       0               4m21s   10.104.18.119   4am-node25   <none>           <none>
fouramf-hqsbb-36-5136-minio-2                                     1/1     Running       0               4m21s   10.104.23.159   4am-node27   <none>           <none>
fouramf-hqsbb-36-5136-minio-3                                     1/1     Running       0               4m21s   10.104.15.212   4am-node20   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-bookie-0                             1/1     Running       0               4m21s   10.104.18.123   4am-node25   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-bookie-1                             1/1     Running       0               4m21s   10.104.23.169   4am-node27   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-bookie-2                             1/1     Running       0               4m20s   10.104.16.160   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-bookie-init-hw76w                    0/1     Completed     0               4m21s   10.104.19.137   4am-node28   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-broker-0                             1/1     Running       0               4m21s   10.104.23.156   4am-node27   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-proxy-0                              1/1     Running       0               4m21s   10.104.19.138   4am-node28   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-pulsar-init-74c56                    0/1     Completed     0               4m21s   10.104.19.136   4am-node28   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-recovery-0                           1/1     Running       0               4m21s   10.104.16.147   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-zookeeper-0                          1/1     Running       0               4m21s   10.104.18.120   4am-node25   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-zookeeper-1                          1/1     Running       0               3m41s   10.104.20.110   4am-node22   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-zookeeper-2                          1/1     Running       0               2m58s   10.104.15.225   4am-node20   <none>           <none> (base.py:257)
[2023-10-24 05:29:03,188 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'STATUS|fouramf-hqsbb-36-5136-milvus|fouramf-hqsbb-36-5136-minio|fouramf-hqsbb-36-5136-etcd|fouramf-hqsbb-36-5136-pulsar|fouramf-hqsbb-36-5136-kafka'  (util_cmd.py:14)
[2023-10-24 05:29:12,264 -  INFO - fouram]: [CliClient] pod details of release(fouramf-hqsbb-36-5136): 
I1024 05:29:04.435429     482 request.go:665] Waited for 1.159915509s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/certificates.k8s.io/v1?timeout=32s
NAME                                                              READY   STATUS                   RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-hqsbb-36-5136-etcd-0                                      1/1     Running                  0                18h     10.104.18.122   4am-node25   <none>           <none>
fouramf-hqsbb-36-5136-etcd-1                                      1/1     Running                  0                18h     10.104.23.163   4am-node27   <none>           <none>
fouramf-hqsbb-36-5136-etcd-2                                      1/1     Running                  0                18h     10.104.16.158   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-milvus-datacoord-54c74794c9-6b5xv           1/1     Running                  0                18h     10.104.16.149   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-milvus-datanode-5f46b6dd95-jlqbj            1/1     Running                  0                18h     10.104.16.151   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-milvus-indexcoord-55b6fd5f88-9wj5q          1/1     Running                  0                18h     10.104.16.150   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-milvus-indexnode-5ddd6dd445-pdwtt           1/1     Running                  0                18h     10.104.24.67    4am-node29   <none>           <none>
fouramf-hqsbb-36-5136-milvus-proxy-5d456bd744-v8p6f               1/1     Running                  0                18h     10.104.16.152   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-milvus-querycoord-75d48fc58f-grvtt          1/1     Running                  0                18h     10.104.16.146   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-milvus-querynode-5ddc775ff-5f6nk            0/1     Error                    0                11h     10.104.20.52    4am-node22   <none>           <none>
fouramf-hqsbb-36-5136-milvus-querynode-5ddc775ff-5spk4            0/1     ContainerStatusUnknown   1                11h     10.104.21.161   4am-node24   <none>           <none>
fouramf-hqsbb-36-5136-milvus-querynode-5ddc775ff-fmcps            0/1     Error                    0                12h     10.104.19.207   4am-node28   <none>           <none>
fouramf-hqsbb-36-5136-milvus-querynode-5ddc775ff-hrbs7            1/1     Running                  0                11h     10.104.18.71    4am-node25   <none>           <none>
fouramf-hqsbb-36-5136-milvus-querynode-5ddc775ff-mlbkf            0/1     ContainerStatusUnknown   1                12h     10.104.19.208   4am-node28   <none>           <none>
fouramf-hqsbb-36-5136-milvus-querynode-5ddc775ff-wgtwc            0/1     Error                    0                18h     10.104.19.139   4am-node28   <none>           <none>
fouramf-hqsbb-36-5136-milvus-rootcoord-846bd64d7-4dnvt            1/1     Running                  0                18h     10.104.16.148   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-minio-0                                     1/1     Running                  0                18h     10.104.16.155   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-minio-1                                     1/1     Running                  0                18h     10.104.18.119   4am-node25   <none>           <none>
fouramf-hqsbb-36-5136-minio-2                                     1/1     Running                  0                18h     10.104.23.159   4am-node27   <none>           <none>
fouramf-hqsbb-36-5136-minio-3                                     1/1     Running                  0                18h     10.104.15.212   4am-node20   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-bookie-0                             1/1     Running                  0                18h     10.104.18.123   4am-node25   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-bookie-1                             1/1     Running                  0                18h     10.104.23.169   4am-node27   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-bookie-2                             1/1     Running                  0                18h     10.104.16.160   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-bookie-init-hw76w                    0/1     Completed                0                18h     10.104.19.137   4am-node28   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-broker-0                             1/1     Running                  0                18h     10.104.23.156   4am-node27   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-proxy-0                              1/1     Running                  0                18h     10.104.19.138   4am-node28   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-pulsar-init-74c56                    0/1     Completed                0                18h     10.104.19.136   4am-node28   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-recovery-0                           1/1     Running                  0                18h     10.104.16.147   4am-node21   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-zookeeper-0                          1/1     Running                  0                18h     10.104.18.120   4am-node25   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-zookeeper-1                          1/1     Running                  0                18h     10.104.20.110   4am-node22   <none>           <none>
fouramf-hqsbb-36-5136-pulsar-zookeeper-2                          1/1     Running                  0                18h     10.104.15.225   4am-node20   <none>           <none> (cli_client.py:132)

The querynode was evicted for using more than 100 gigabytes of disk.

(screenshots)

Validated that peak disk usage stays below 140 GB.

@elstic elstic changed the title [Bug]: [benchmark]diskann index failed to load after inserting 100 million data with excessive disk usage [Bug]: [benchmark] diskann index inserts 100 million data, querynode disk usage peaks at over 100G Oct 26, 2023
@elstic elstic modified the milestones: 2.2.13, 2.3.2 Oct 26, 2023
@elstic elstic reopened this Oct 26, 2023
@elstic elstic removed the stale (indicates no updates for 30 days) label Oct 26, 2023
@yanliang567 yanliang567 modified the milestones: 2.3.2, 2.3.3 Nov 7, 2023
@yanliang567 yanliang567 modified the milestones: 2.3.3, 2.3.4 Nov 16, 2023

stale bot commented Dec 16, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale (indicates no updates for 30 days) label Dec 16, 2023
@xiaofan-luan
Collaborator

@elstic
is this still a problem?

@stale stale bot removed the stale (indicates no updates for 30 days) label Dec 17, 2023
@elstic
Contributor Author

elstic commented Dec 18, 2023

@elstic is this still a problem?

This issue has not arisen recently and I will close it.

@elstic elstic closed this as completed Dec 18, 2023
@sre-ci-robot
Contributor

@nikcoderr: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xiaofan-luan
Collaborator

Hi, actually I am using a single-node Milvus standalone. How can I set the memory usage for queries? I am facing an issue indexing 100M vector data. I am using the DISKANN index type with the default configuration in the milvus.yaml file:

...
DiskIndex:
  MaxDegree: 56
  SearchListSize: 100
  PQCodeBudgetGBRatio: 0.125
  SearchCacheBudgetGBRatio: 0.125
  BeamWidthRatio: 4.0
...

I have used docker-compose to install milvus 2.4.x

I indexed the data, but afterwards the describe_index function shows: {'index_type': 'IVF_SQ8', 'metric_type': 'L2', 'params': {'nlist': 1000}, 'field_name': 'emb', 'index_name': 'vector_index', 'total_rows': 100000000, 'indexed_rows': 100000000, 'pending_index_rows': 0, 'state': 'Finished'}

Also, Milvus went down (I assume the connection was destroyed); I had to bring it down with docker and then run docker-compose up again.

Please help me resolve this issue. Thanks

  1. When you create an index, you have to specify that you are using the DiskANN index; please check the code where you create the index, as I guess you are using the IVF_SQ8 index for now. The config on the server is a default config for DiskANN; it is only used if you create a DiskANN index without specifying index params (see the sketch after this list).

  2. I don't think it's reasonable to run 100m (assuming 768-dim) vector search on a single node, especially with DiskANN. This needs a node with more than 32 cores and 128 GB of memory, and index build and failure recovery will be slow as well. IVF_SQ8 is better at index build speed. You also need a high-performance NVMe SSD to run DiskANN.

  3. I would recommend you check https://zilliz.com/pricing for our Serverless tier and Capacity-optimized instances. If you can't use a fully managed service, I'd be glad to take a call and help you set it up if necessary.
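
A short sketch of point 1 (collection name and endpoint are assumed; the field name emb comes from the describe_index output above): request DISKANN explicitly when creating the index instead of relying on defaults, since the DiskIndex section in milvus.yaml only tunes DiskANN builds and does not select the index type.

from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")  # assumed standalone endpoint
collection = Collection("my_collection")             # assumed existing collection

collection.release()     # index changes require the collection to be released
collection.drop_index()  # drop the existing IVF_SQ8 index first
collection.create_index(
    field_name="emb",
    index_params={"index_type": "DISKANN", "metric_type": "L2", "params": {}},
)
print(collection.index().params)  # should now report DISKANN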


8 participants
@LoveEachDay @xige-16 @elstic @sre-ci-robot @smellthemoon @yanliang567 @xiaofan-luan and others