[Bug]: Search failed: no available shard delegator found: service unavailable after scaling-out queryNode #27913

Closed
1 task done
ThreadDao opened this issue Oct 25, 2023 · 9 comments
Assignees
Labels
kind/bug Issues or changes related a bug severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@ThreadDao
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20231023-8c605ca8
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  pulsar  
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.3.1.post1.dev18
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Test steps:

  1. Deploy a Milvus cluster with 1 queryNode
  2. Create a collection named fouram_X7VYT1q9
  3. Build index {'index_type': 'IVF_SQ8', 'metric_type': 'L2', 'params': {'nlist': 1024}}
  4. Insert 1 million 128-dim entities in batches of 50,000 (5w). After the insertion completes, call flush and index
  5. Load the collection
  6. Run concurrent operations for 1 hour: concurrent search and scene_insert_delete_flush (insert -> delete -> flush)
  7. Scale out queryNode from 1 to 2, and wait until the pods are running
  8. Run concurrent search + scene_insert_delete_flush for 1 hour, as before scaling; this time some searches failed:
[2023-10-24 13:08:56,908 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=65538, message=failed to search: attempt #0: failed to search/query delegator 3 for channel by-dev-rootcoord-dml_1_445159425357120190v1: fail to Search, QueryNode ID=3, reason=worker(9) query failed: context canceled: attempt #1: no available shard delegator found: service unavailable)>, <Time:{'RPC start': '2023-10-24 13:08:51.867804', 'RPC error': '2023-10-24 13:08:56.908761'}> (decorators.py:128)
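To make failures like the one above easier to compare, here is a minimal sketch that pulls the error code, QueryNode ID, channel, and elapsed RPC time out of one log line. The regular expressions and the helper name `parse_rpc_error` are assumptions keyed to the exact log format quoted in this report; they are not part of pymilvus or the fouram test framework.

```python
import re
from datetime import datetime

# Patterns keyed to the log line quoted in this issue; other versions may word it differently.
ERROR_RE = re.compile(
    r"code=(?P<code>\d+), message=(?P<message>.*?)\)>"
    r".*?'RPC start': '(?P<start>[^']+)', 'RPC error': '(?P<end>[^']+)'"
)
NODE_RE = re.compile(r"QueryNode ID=(\d+)")
CHANNEL_RE = re.compile(r"for channel (\S+?):")
TIME_FMT = "%Y-%m-%d %H:%M:%S.%f"


def parse_rpc_error(line):
    """Return code, query node, channel and elapsed seconds, or None if the line does not match."""
    m = ERROR_RE.search(line)
    if m is None:
        return None
    elapsed = datetime.strptime(m["end"], TIME_FMT) - datetime.strptime(m["start"], TIME_FMT)
    node = NODE_RE.search(m["message"])
    channel = CHANNEL_RE.search(m["message"])
    return {
        "code": int(m["code"]),
        "query_node": int(node.group(1)) if node else None,
        "channel": channel.group(1) if channel else None,
        "elapsed_s": elapsed.total_seconds(),
    }


# The log line from step 8, split across string literals for readability.
SAMPLE = (
    "[2023-10-24 13:08:56,908 - ERROR - fouram]: RPC error: [search], "
    "<MilvusException: (code=65538, message=failed to search: attempt #0: "
    "failed to search/query delegator 3 for channel "
    "by-dev-rootcoord-dml_1_445159425357120190v1: fail to Search, QueryNode ID=3, "
    "reason=worker(9) query failed: context canceled: attempt #1: "
    "no available shard delegator found: service unavailable)>, "
    "<Time:{'RPC start': '2023-10-24 13:08:51.867804', "
    "'RPC error': '2023-10-24 13:08:56.908761'}> (decorators.py:128)"
)

print(parse_rpc_error(SAMPLE))
```

For the line above, the two timestamps are about 5.04 seconds apart, so the search ran for roughly five seconds before the error was returned.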

Expected Behavior

All searches succeed.

Steps To Reproduce

4am argo link: https://argo-workflows.zilliz.cc/archived-workflows/qa/c7bf2d43-9671-4318-aaff-95acc59e0747?nodeId=fouramf-sshf8

grafana link: https://grafana-4am.zilliz.cc/d/uLf5cJ3Ga/milvus2-0?orgId=1&var-datasource=prometheus&var-cluster=&var-namespace=qa-milvus&var-instance=fouramf-sshf8-7-4493&var-collection=All&var-app_name=milvus&from=1698148318201&to=1698157292893
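
For readers without access to the links above, steps 2 to 5 of the test steps can be roughly sketched with pymilvus as follows. This is not the fouram test code: the field names `id` and `vec`, the random vectors, and the default host and port are assumptions made for illustration, and the concurrent workload and the queryNode scale-out (steps 6 to 8) are not covered.

```python
import random

DIM = 128
TOTAL = 1_000_000
BATCH = 50_000
INDEX_PARAMS = {"index_type": "IVF_SQ8", "metric_type": "L2", "params": {"nlist": 1024}}


def batch_ranges(total=TOTAL, batch=BATCH):
    """Yield (start, end) id ranges covering `total` rows in chunks of `batch`."""
    for start in range(0, total, batch):
        yield start, min(start + batch, total)


def prepare_collection(name, host="localhost", port="19530"):
    """Create, index, fill, flush and load a collection (steps 2 to 5)."""
    # Imported lazily so batch_ranges stays usable without pymilvus installed.
    from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

    connections.connect(host=host, port=port)
    # Hypothetical schema: the actual fouram schema is not shown in this issue.
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("vec", DataType.FLOAT_VECTOR, dim=DIM),
    ])
    collection = Collection(name, schema)
    collection.create_index("vec", INDEX_PARAMS)
    for start, end in batch_ranges():
        ids = list(range(start, end))
        vectors = [[random.random() for _ in range(DIM)] for _ in ids]
        collection.insert([ids, vectors])
    collection.flush()
    collection.load()
    return collection
```

With the numbers from the test steps, `batch_ranges()` yields 20 batches of 50,000 rows each.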

Milvus Log

pods after scaling:

fouramf-sshf8-7-4493-etcd-0                                       1/1     Running     0               61m     10.104.19.26    4am-node28   <none>           <none>
fouramf-sshf8-7-4493-etcd-1                                       1/1     Running     0               62m     10.104.24.194   4am-node29   <none>           <none>
fouramf-sshf8-7-4493-etcd-2                                       1/1     Running     0               64m     10.104.4.149    4am-node11   <none>           <none>
fouramf-sshf8-7-4493-milvus-datacoord-769b5f89b8-5vcjs            1/1     Running     0               132m    10.104.16.7     4am-node21   <none>           <none>
fouramf-sshf8-7-4493-milvus-datanode-7545dbdd7d-c6s6n             1/1     Running     1 (128m ago)    132m    10.104.21.246   4am-node24   <none>           <none>
fouramf-sshf8-7-4493-milvus-indexcoord-9cb5cc9c5-6bcrs            1/1     Running     0               132m    10.104.18.205   4am-node25   <none>           <none>
fouramf-sshf8-7-4493-milvus-indexnode-5979985b47-dz7x8            1/1     Running     0               132m    10.104.16.8     4am-node21   <none>           <none>
fouramf-sshf8-7-4493-milvus-proxy-5f877c8b8c-kg8qc                1/1     Running     0               132m    10.104.16.9     4am-node21   <none>           <none>
fouramf-sshf8-7-4493-milvus-querycoord-644db98648-7h5kc           1/1     Running     0               132m    10.104.20.100   4am-node22   <none>           <none>
fouramf-sshf8-7-4493-milvus-querynode-764649997c-76rvz            1/1     Running     0               64m     10.104.16.40    4am-node21   <none>           <none>
fouramf-sshf8-7-4493-milvus-querynode-764649997c-9ll27            1/1     Running     0               132m    10.104.23.47    4am-node27   <none>           <none>
fouramf-sshf8-7-4493-milvus-rootcoord-7b5c5ffbd-2vkvt             1/1     Running     0               132m    10.104.20.101   4am-node22   <none>           <none>
fouramf-sshf8-7-4493-minio-0                                      1/1     Running     0               132m    10.104.14.205   4am-node18   <none>           <none>
fouramf-sshf8-7-4493-minio-1                                      1/1     Running     0               132m    10.104.19.194   4am-node28   <none>           <none>
fouramf-sshf8-7-4493-minio-2                                      1/1     Running     0               132m    10.104.12.93    4am-node17   <none>           <none>
fouramf-sshf8-7-4493-minio-3                                      1/1     Running     0               132m    10.104.24.98    4am-node29   <none>           <none>
fouramf-sshf8-7-4493-pulsar-bookie-0                              1/1     Running     0               132m    10.104.20.104   4am-node22   <none>           <none>
fouramf-sshf8-7-4493-pulsar-bookie-1                              1/1     Running     0               132m    10.104.19.199   4am-node28   <none>           <none>
fouramf-sshf8-7-4493-pulsar-bookie-2                              1/1     Running     0               132m    10.104.17.102   4am-node23   <none>           <none>
fouramf-sshf8-7-4493-pulsar-broker-0                              1/1     Running     0               132m    10.104.18.204   4am-node25   <none>           <none>
fouramf-sshf8-7-4493-pulsar-proxy-0                               1/1     Running     0               132m    10.104.15.55    4am-node20   <none>           <none>
fouramf-sshf8-7-4493-pulsar-recovery-0                            1/1     Running     0               132m    10.104.17.99    4am-node23   <none>           <none>
fouramf-sshf8-7-4493-pulsar-zookeeper-0                           1/1     Running     0               132m    10.104.14.206   4am-node18   <none>           <none>
fouramf-sshf8-7-4493-pulsar-zookeeper-1                           1/1     Running     0               132m    10.104.24.104   4am-node29   <none>           <none>
fouramf-sshf8-7-4493-pulsar-zookeeper-2                           1/1     Running     0               131m    10.104.20.116   4am-node22   <none> 
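
To pick out which queryNode pod was added by the scale-out, the listing above can be filtered by component and sorted by age. The sketch below assumes the column layout shown here (name, ready, status, restarts, age, and so on), including the `1 (128m ago)` form of the restarts column; the helper names are hypothetical.

```python
import re

UNIT_MINUTES = {"d": 1440, "h": 60, "m": 1, "s": 1 / 60}


def age_minutes(age):
    """Convert an AGE value such as '64m' or '2h5m' to minutes."""
    return sum(int(n) * UNIT_MINUTES[u] for n, u in re.findall(r"(\d+)([dhms])", age))


def pods_by_component(listing, component):
    """Return (pod_name, age_in_minutes) for matching pods, youngest first."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 5 or component not in fields[0]:
            continue
        # RESTARTS may render as "1 (128m ago)", which shifts AGE right by two fields.
        age_idx = 6 if fields[4].startswith("(") else 4
        if age_idx < len(fields):
            rows.append((fields[0], age_minutes(fields[age_idx])))
    return sorted(rows, key=lambda row: row[1])


# Three rows copied from the listing above, with column padding collapsed.
LISTING = """\
fouramf-sshf8-7-4493-milvus-datanode-7545dbdd7d-c6s6n 1/1 Running 1 (128m ago) 132m 10.104.21.246 4am-node24 <none> <none>
fouramf-sshf8-7-4493-milvus-querynode-764649997c-76rvz 1/1 Running 0 64m 10.104.16.40 4am-node21 <none> <none>
fouramf-sshf8-7-4493-milvus-querynode-764649997c-9ll27 1/1 Running 0 132m 10.104.23.47 4am-node27 <none> <none>
"""

for name, age in pods_by_component(LISTING, "querynode"):
    print(name, age)
```

In this listing the queryNode pod ending in `76rvz` is 64 minutes old, while the one ending in `9ll27` is 132 minutes old, the same age as the coordinators, which suggests `76rvz` is the pod added by scaling.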

Anything else?

No response

@ThreadDao ThreadDao added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 25, 2023
@ThreadDao ThreadDao added this to the 2.3.2 milestone Oct 25, 2023
@ThreadDao ThreadDao added the severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. label Oct 25, 2023
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 25, 2023
@yanliang567
Contributor

/assign @weiliu1031
/unassign

@ThreadDao
Contributor Author

Reran the test with a new image: master-20231025-53246b1b
https://argo-workflows.zilliz.cc/archived-workflows/qa/a3dd4ad9-18c6-4371-9d1c-a1dd903a3449
The previous error still occurs:

[2023-10-25 08:06:11,874 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=65538, message=failed to search: attempt #0: failed to search/query delegator 12 for channel by-dev-rootcoord-dml_1_445177182117692044v1: fail to Search, QueryNode ID=12, reason=worker(17) query failed: context canceled: attempt #1: no available shard delegator found: service unavailable)>, <Time:{'RPC start': '2023-10-25 08:06:06.036287', 'RPC error': '2023-10-25 08:06:11.874681'}> (decorators.py:128)

A new search error also appeared, both before and after scaling:

[2023-10-25 06:56:01,351 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=65538, message=failed to search: attempt #0: failed to search/query delegator 3 for channel by-dev-rootcoord-dml_0_445177182117692044v0: fail to Search, QueryNode ID=3, reason=worker(3) query failed: segment not found: segment=445177182119305278: segment not loaded: attempt #1: no available shard delegator found: service unavailable)>, <Time:{'RPC start': '2023-10-25 06:56:00.709731', 'RPC error': '2023-10-25 06:56:01.351202'}> (decorators.py:128)

@ThreadDao
Contributor Author

@ThreadDao ThreadDao modified the milestones: 2.3.2, 2.3.3 Oct 31, 2023
@yanliang567 yanliang567 modified the milestones: 2.3.3, 2.3.4 Nov 16, 2023
@ThreadDao
Contributor Author

  • image: master-20231121-d73dac52
    After scaling out 1 queryNode and upgrading to image master-20231121-2fc74399, all search requests failed.
[2023-11-21 18:14:37,369 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=503, message=failed to search: channel lacks[channel=channel not subscribed]: channel not available[channel=by-dev-rootcoord-dml_0_445798469189239003v0])> (func_check.py:46)

[2023-11-21 18:14:37,552 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=65535, message=empty grpc client: failed to connect 10.104.4.20:21123, reason: context deadline exceeded)>, <Time:{'RPC start': '2023-11-21 18:13:10.191689', 'RPC error': '2023-11-21 18:14:37.552861'}> (decorators.py:128)
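
The errors reported in this thread fall into a few distinct reasons. A small sketch for tallying them across many log messages is shown below; the list of marker substrings is taken only from the messages quoted in this issue and is an assumption, not an exhaustive list of Milvus error reasons.

```python
from collections import Counter

# Marker substrings seen in the error messages quoted in this issue.
KNOWN_REASONS = (
    "context canceled",
    "segment not loaded",
    "channel not available",
    "context deadline exceeded",
    "no available shard delegator found",
)


def failure_reasons(message):
    """Return every known reason marker that appears in one error message."""
    return [reason for reason in KNOWN_REASONS if reason in message]


def tally(messages):
    """Count how often each known reason appears across all messages."""
    counts = Counter()
    for message in messages:
        counts.update(failure_reasons(message))
    return counts


# Message text copied from the 2023-11-21 errors above.
messages = [
    "failed to search: channel lacks[channel=channel not subscribed]: "
    "channel not available[channel=by-dev-rootcoord-dml_0_445798469189239003v0]",
    "empty grpc client: failed to connect 10.104.4.20:21123, "
    "reason: context deadline exceeded",
]
print(tally(messages))
```

A single message can match more than one marker, for example when a retry wraps the original failure, so the counts are per reason rather than per request.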


stale bot commented Dec 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no udpates for 30 days label Dec 22, 2023
@weiliu1031
Contributor

Please verify this with the latest image.

@stale stale bot removed the stale indicates no udpates for 30 days label Dec 25, 2023
@weiliu1031
Contributor

/assign @ThreadDao

@deven298

up

@yanliang567 yanliang567 modified the milestones: 2.3.5, 2.3.6 Jan 22, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.6, 2.3.7 Jan 30, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.7, 2.3.9, 2.3.10 Feb 18, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.10, 2.3.11 Feb 28, 2024
@ThreadDao
Contributor Author

The issue didn't show up again.

Projects
None yet
Development

No branches or pull requests

4 participants