[Bug]: [Nightly] Load collection occasionally failed for timeout #37272

Closed
NicoYuan1986 opened this issue Oct 29, 2024 · 33 comments
Labels: kind/bug (Issues or changes related to a bug), triage/accepted (Indicates an issue or PR is ready to be actively worked on.)
Milestone: 2.5.0

@NicoYuan1986 (Contributor)

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master (1e75a42)
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): kafka
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

Load collection occasionally failed for timeout.

[pytest : test] [2024-10-28 20:33:03 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 180], kwargs: {} (api_request.py:62)
[pytest : test] [2024-10-28 20:33:08 - ERROR - pymilvus.decorators]: RPC error: [load_collection], <MilvusException: (code=65535, message=call query coordinator LoadCollection: stack trace: /workspace/source/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
[pytest : test] /workspace/source/internal/util/grpcclient/client.go:555 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
[pytest : test] /workspace/source/internal/util/grpcclient/client.go:569 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
[pytest : test] /workspace/source/internal/distributed/rootcoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.wrapGrpcCall[...]
[pytest : test] /workspace/source/internal/distributed/rootcoord/client/client.go:275 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.(*Client).ShowPartitions
[pytest : test] /workspace/source/internal/querycoordv2/meta/coordinator_broker.go:185 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetPartitions
[pytest : test] /workspace/source/internal/querycoordv2/job/job_load.go:137 github.com/milvus-io/milvus/internal/querycoordv2/job.(*LoadCollectionJob).Execute
[pytest : test] /workspace/source/internal/querycoordv2/job/scheduler.go:159 github.com/milvus-io/milvus/internal/querycoordv2/job.(*Scheduler).process
[pytest : test] /workspace/source/internal/querycoordv2/job/scheduler.go:135 github.com/milvus-io/milvus/internal/querycoordv2/job.(*Scheduler).processQueue
[pytest : test] /go/pkg/mod/golang.org/[email protected]/src/runtime/asm_amd64.s:1695 runtime.goexit: rpc error: code = DeadlineExceeded desc = context deadline exceeded)>, <Time:{'RPC start': '2024-10-28 20:33:03.007136', 'RPC error': '2024-10-28 20:33:08.013727'}> (decorators.py:140)
[pytest : test] [2024-10-28 20:33:08 - ERROR - ci_test]: Traceback (most recent call last):
[pytest : test]   File "/milvus/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[pytest : test]     res = func(*args, **_kwargs)
[pytest : test]   File "/milvus/tests/python_client/utils/api_request.py", line 63, in api_request
[pytest : test]     return func(*arg, **kwargs)
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 429, in load
[pytest : test]     conn.load_collection(
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 141, in handler
[pytest : test]     raise e from e
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 137, in handler
[pytest : test]     return func(*args, **kwargs)
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 176, in handler
[pytest : test]     return func(self, *args, **kwargs)
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 116, in handler
[pytest : test]     raise e from e
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 86, in handler
[pytest : test]     return func(*args, **kwargs)
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 1171, in load_collection
[pytest : test]     check_status(response)
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/utils.py", line 63, in check_status
[pytest : test]     raise MilvusException(status.code, status.reason, status.error_code)
[pytest : test] pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=call query coordinator LoadCollection: stack trace: /workspace/source/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
[pytest : test] /workspace/source/internal/util/grpcclient/client.go:555 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
[pytest : test] /workspace/source/internal/util/grpcclient/client.go:569 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
[pytest : test] /workspace/source/internal/distributed/rootcoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.wrapGrpcCall[...]
[pytest : test] /workspace/source/internal/distributed/rootcoord/client/client.go:275 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.(*Client).ShowPartitions
[pytest : test] /workspace/source/internal/querycoordv2/meta/coordinator_broker.go:185 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetPartitions
[pytest : test] /workspace/source/internal/querycoordv2/job/job_load.go:137 github.com/milvus-io/milvus/internal/querycoordv2/job.(*LoadCollectionJob).Execute
[pytest : test] /workspace/source/internal/querycoordv2/job/scheduler.go:159 github.com/milvus-io/milvus/internal/querycoordv2/job.(*Scheduler).process
[pytest : test] /workspace/source/internal/querycoordv2/job/scheduler.go:135 github.com/milvus-io/milvus/internal/querycoordv2/job.(*Scheduler).processQueue
[pytest : test] /go/pkg/mod/golang.org/[email protected]/src/runtime/asm_amd64.s:1695 runtime.goexit: rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
[pytest : test]  (api_request.py:45)
[pytest : test] [2024-10-28 20:33:08 - ERROR - ci_test]: (api_response) : <MilvusException: (code=65535, message=call query coordinator LoadCollection: stack trace: /workspace/source/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
[pytest : test] /workspace/source/internal/util/grpcclient/client.go:555 github.com/milvus-io/milvus/internal/util/grpcclient.(*...... (api_request.py:46)

Expected Behavior

pass

Steps To Reproduce

No response

Milvus Log

  1. link: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI(new)/detail/master/162/pipeline/114
  2. log: artifacts-milvus-distributed-kafka-mdk-master-162-py-n-162-e2e-logs.tar.gz
  3. failed time: [pytest : test] [gw0] [ 68%] FAILED testcases/test_search.py::TestSearchString::test_search_with_different_string_expr[True-varchar == "0"|| varchar == "1"|| varchar =="2"]

Anything else?

No response

@NicoYuan1986 added kind/bug (Issues or changes related to a bug) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels Oct 29, 2024
@NicoYuan1986 NicoYuan1986 added this to the 2.5.0 milestone Oct 29, 2024
@yanliang567 (Contributor)

/assign @czs007
/unassign

@sre-ci-robot sre-ci-robot assigned czs007 and unassigned yanliang567 Oct 30, 2024
@yanliang567 added triage/accepted (Indicates an issue or PR is ready to be actively worked on.) and removed needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels Oct 30, 2024
@xiaofan-luan (Collaborator)

/assign @bigsheeper

@bigsheeper (Contributor)

ShowCollection timed out while executing the load job:
[screenshots]

@liliu-z (Member) commented Nov 6, 2024

@bigsheeper Please assign this back for verification.

@bigsheeper (Contributor)

/assign @NicoYuan1986
/unassign

This should be fixed; please help verify.

@NicoYuan1986 (Contributor, Author)

Reproduced:
https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI(new)/detail/master/174/pipeline/138/

[pytest : test] [2024-11-06 19:35:59 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 180], kwargs: {} (api_request.py:62)
[pytest : test] [2024-11-06 19:36:04 - ERROR - pymilvus.decorators]: RPC error: [load_collection], <MilvusException: (code=65535, message=call query coordinator LoadCollection: stack trace: /workspace/source/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
[pytest : test] /workspace/source/internal/util/grpcclient/client.go:555 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
[pytest : test] /workspace/source/internal/util/grpcclient/client.go:569 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
[pytest : test] /workspace/source/internal/distributed/rootcoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.wrapGrpcCall[...]
[pytest : test] /workspace/source/internal/distributed/rootcoord/client/client.go:275 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.(*Client).ShowPartitions
[pytest : test] /workspace/source/internal/querycoordv2/meta/coordinator_broker.go:185 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetPartitions
[pytest : test] /workspace/source/internal/querycoordv2/job/job_load.go:137 github.com/milvus-io/milvus/internal/querycoordv2/job.(*LoadCollectionJob).Execute
[pytest : test] /workspace/source/internal/querycoordv2/job/scheduler.go:159 github.com/milvus-io/milvus/internal/querycoordv2/job.(*Scheduler).process
[pytest : test] /workspace/source/internal/querycoordv2/job/scheduler.go:135 github.com/milvus-io/milvus/internal/querycoordv2/job.(*Scheduler).processQueue
[pytest : test] /go/pkg/mod/golang.org/[email protected]/src/runtime/asm_amd64.s:1695 runtime.goexit: rpc error: code = DeadlineExceeded desc = context deadline exceeded)>, <Time:{'RPC start': '2024-11-06 19:35:59.163745', 'RPC error': '2024-11-06 19:36:04.168344'}> (decorators.py:140)
[pytest : test] [2024-11-06 19:36:04 - ERROR - ci_test]: Traceback (most recent call last):
[pytest : test]   File "/milvus/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[pytest : test]     res = func(*args, **_kwargs)
[pytest : test]   File "/milvus/tests/python_client/utils/api_request.py", line 63, in api_request
[pytest : test]     return func(*arg, **kwargs)
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 429, in load
[pytest : test]     conn.load_collection(
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 141, in handler
[pytest : test]     raise e from e
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 137, in handler
[pytest : test]     return func(*args, **kwargs)
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 176, in handler
[pytest : test]     return func(self, *args, **kwargs)
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 116, in handler
[pytest : test]     raise e from e
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/decorators.py", line 86, in handler
[pytest : test]     return func(*args, **kwargs)
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 1172, in load_collection
[pytest : test]     check_status(response)
[pytest : test]   File "/usr/local/lib/python3.8/site-packages/pymilvus/client/utils.py", line 63, in check_status
[pytest : test]     raise MilvusException(status.code, status.reason, status.error_code)
[pytest : test] pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=call query coordinator LoadCollection: stack trace: /workspace/source/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
[pytest : test] /workspace/source/internal/util/grpcclient/client.go:555 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
[pytest : test] /workspace/source/internal/util/grpcclient/client.go:569 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
[pytest : test] /workspace/source/internal/distributed/rootcoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.wrapGrpcCall[...]
[pytest : test] /workspace/source/internal/distributed/rootcoord/client/client.go:275 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.(*Client).ShowPartitions
[pytest : test] /workspace/source/internal/querycoordv2/meta/coordinator_broker.go:185 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetPartitions
[pytest : test] /workspace/source/internal/querycoordv2/job/job_load.go:137 github.com/milvus-io/milvus/internal/querycoordv2/job.(*LoadCollectionJob).Execute
[pytest : test] /workspace/source/internal/querycoordv2/job/scheduler.go:159 github.com/milvus-io/milvus/internal/querycoordv2/job.(*Scheduler).process
[pytest : test] /workspace/source/internal/querycoordv2/job/scheduler.go:135 github.com/milvus-io/milvus/internal/querycoordv2/job.(*Scheduler).processQueue
[pytest : test] /go/pkg/mod/golang.org/[email protected]/src/runtime/asm_amd64.s:1695 runtime.goexit: rpc error: code = DeadlineExceeded desc = context deadline exceeded)>
[pytest : test]  (api_request.py:45)
[pytest : test] [2024-11-06 19:36:04 - ERROR - ci_test]: (api_response) : <MilvusException: (code=65535, message=call query coordinator LoadCollection: stack trace: /workspace/source/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
[pytest : test] /workspace/source/internal/util/grpcclient/client.go:555 github.com/milvus-io/milvus/internal/util/grpcclient.(*...... (api_request.py:46)

@bigsheeper Can you help take a look?

@bigsheeper (Contributor)

Still caused by the DDL issue:
[screenshots]

@SimFG The switch for concurrent DDL execution is not enabled:
[screenshot]

@bigsheeper (Contributor)

/assign @NicoYuan1986
UseLockScheduler is now enabled; please help verify.

@NicoYuan1986 (Contributor, Author)

Reproduced: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI(new)/detail/master/182/pipeline/114
[pytest : test] FAILED testcases/test_query.py::TestQueryParams::test_query_with_dup_primary_key[8-3]

@bigsheeper

@bigsheeper (Contributor) commented Nov 15, 2024

The nightly CI uses only one database, and when createCollection is slow (p99 reaching 13 seconds), it still blocks operations like ShowPartitions, causing timeouts. PR #37352 cannot optimize this scenario.
[screenshots]

Perhaps we should investigate why createCollection is so slow. Any comments? @xiaofan-luan @SimFG
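
To make the blocking pattern described above concrete, here is a minimal, hypothetical sketch (not Milvus's actual scheduler; names such as dbLocks and lockFor are illustrative only): a DDL holding a database-level write lock stalls every read-locked operation on that database, so a 13-second createCollection easily starves an RPC with a 5-second deadline.

package main

import (
	"fmt"
	"sync"
	"time"
)

// dbLocks maps a database name to its RW lock. With a single database,
// any DDL that takes the write lock serializes everything else.
var dbLocks sync.Map

func lockFor(db string) *sync.RWMutex {
	l, _ := dbLocks.LoadOrStore(db, &sync.RWMutex{})
	return l.(*sync.RWMutex)
}

// createCollection holds the database write lock for its whole duration,
// e.g. while a slow etcd fsync drags out the meta write.
func createCollection(db string, d time.Duration) {
	l := lockFor(db)
	l.Lock()
	defer l.Unlock()
	time.Sleep(d)
}

// showPartitions needs only a read lock, but it must wait for the writer.
func showPartitions(db string) {
	l := lockFor(db)
	l.RLock()
	defer l.RUnlock()
}

func main() {
	go createCollection("default", 13*time.Second)
	time.Sleep(100 * time.Millisecond) // let the writer grab the lock first
	start := time.Now()
	showPartitions("default") // blocks ~13s, far beyond a 5s RPC deadline
	fmt.Printf("showPartitions waited %v\n", time.Since(start))
}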

@xiaofan-luan (Collaborator)

@SimFG @bigsheeper
I think there is definitely a regression here, because we didn't see any problems like this before.

@SimFG, could you help analyze why createCollection has become slow?

@SonyK16 commented Nov 15, 2024 via email

cool work dude

@xiaofan-luan (Collaborator)

We've been working on optimizations for large collection and partition counts; stay tuned!

@yanliang567 (Contributor)

/assign @bigsheeper
/unassign @NicoYuan1986

@xiaofan-luan (Collaborator)

func (t *createCollectionTask) GetLockerKey() LockerKey {
	return NewLockerKeyChain(
		NewClusterLockerKey(false),
		NewDatabaseLockerKey(t.Req.GetDbName(), true),
	)
}

Is there a special reason that create collection needs a database-level write key?
Can we change it to a read key instead? (See the sketch below.)

@SimFG
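
For illustration only, the suggested change would look roughly like this, assuming the second argument of NewDatabaseLockerKey selects a write key when true and a read key when false (which the snippet above suggests but the thread does not confirm):

func (t *createCollectionTask) GetLockerKey() LockerKey {
	return NewLockerKeyChain(
		NewClusterLockerKey(false),
		// Read key: lets metadata reads such as ShowPartitions on the same
		// database proceed concurrently with createCollection.
		NewDatabaseLockerKey(t.Req.GetDbName(), false),
	)
}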

@xiaofan-luan (Collaborator)

In the meantime, 10s might be too long for createCollection; we need to investigate it.

@bigsheeper

@czs007 czs007 assigned SimFG and bigsheeper and unassigned bigsheeper Nov 18, 2024
@SimFG (Contributor) commented Nov 18, 2024

We analyzed a createCollection request with trace ID 6197d3687f64a4753322ddf813083451. The entire request took 9.4s and was stuck writing meta. Monitoring shows the meta-operation delay was relatively high during that period, about 1s. After checking etcd monitoring, we found the delay came from the fsync step. During this period etcd's memory decreased, so we suspect it was running compaction-type work. For reference: https://runbooks.prometheus-operator.dev/runbooks/etcd/etcdhighfsyncdurations/
[screenshots]
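
The linked runbook watches etcd's WAL fsync latency histogram and alerts when the p99 exceeds 0.5s. As a rough sketch of that check against the Prometheus HTTP API (the endpoint address below is an assumption for your deployment):

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// p99 WAL fsync latency over the last 5 minutes; the etcdHighFsyncDurations
	// runbook alerts when this exceeds 0.5s.
	query := `histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))`

	// The Prometheus address is a placeholder for your environment.
	resp, err := http.Get("http://prometheus:9090/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON result; inspect data.result[].value
}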

@liliu-z (Member) commented Nov 18, 2024

Too many collections lead to too much metadata. This doesn't sound like a 2.5-blocking issue.
Let's collect these issues caused by large collection counts together.

@xiaofan-luan (Collaborator)

Which environment is this?
I just can't imagine 300MB of data causing etcd to take 1s to sync it all.

@xiaofan-luan (Collaborator)

Do we have disk utilization monitoring?

@xiaofan-luan (Collaborator)

I know PingCAP runs etcd at 50GB and they don't seem to hit issues like ours. Maybe we should try tuning some etcd parameters and see if it helps.

@SimFG (Contributor) commented Nov 19, 2024

Do we have disk utilization monitoring?

I looked through a series of monitors, but there seems to be no disk-latency information, only disk IOPS, and I didn't find any abnormality in IOPS.

[screenshot]

@xiaofan-luan (Collaborator)

IOPS is not high, but this is the Milvus disk, not etcd's.

@SimFG (Contributor) commented Nov 19, 2024

@xiaofan-luan I checked another drop-collection timeout failure and found the fsync delay was also high. I then checked the monitoring of the physical machine hosting etcd and found that I/O seemed relatively high during that period.
#37209 (comment)

@liliu-z (Member) commented Nov 19, 2024

Checking etcd metrics

@liliu-z (Member) commented Nov 19, 2024

@SimFG check my comments on #37209; please take a look at whether they are the same issue.

@SimFG (Contributor) commented Nov 19, 2024

@liliu-z I think they're consistent.

@xiaofan-luan (Collaborator)

Maybe we should avoid deploying etcd together with MinIO.
It seems etcd is affected by MinIO's traffic.

@liliu-z (Member) commented Nov 20, 2024

/assign @yanliang567
/assign @NicoYuan1986

@liliu-z (Member) commented Nov 20, 2024

/unassign @SimFG
/unassign @bigsheeper

@NicoYuan1986 (Contributor, Author)

Not reproduced these days; closing now.
