[Bug]: During the rolling upgrade process, the datanode experiences a sudden surge in memory usage, leading to OOM kills, and the querynode also continues to crash #37665

Open
1 task done
zhuwenxing opened this issue Nov 14, 2024 · 17 comments
Assignees
Labels
kind/bug Issues or changes related a bug severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@zhuwenxing
Contributor

zhuwenxing commented Nov 14, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:v2.4.14 --> master-20241114-cd181e4c-amd64
- Deployment mode(standalone or cluster):mixcoord
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2024-11-14T06:05:08.223Z] + kubectl get pods -o wide

[2024-11-14T06:05:08.225Z] + grep kafka-mixcoord-5485

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-etcd-0                                       1/1     Running            0                 46m     10.104.17.54    4am-node23   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-etcd-1                                       1/1     Running            0                 46m     10.104.25.192   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-etcd-2                                       1/1     Running            0                 46m     10.104.32.84    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-0                                      2/2     Running            0                 46m     10.104.33.144   4am-node36   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-1                                      2/2     Running            0                 46m     10.104.25.195   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-2                                      2/2     Running            0                 46m     10.104.34.150   4am-node37   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-exporter-598459d88d-s6rnl              1/1     Running            3 (46m ago)       46m     10.104.1.52     4am-node10   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-zookeeper-0                            1/1     Running            0                 46m     10.104.25.194   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-zookeeper-1                            1/1     Running            0                 46m     10.104.16.205   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-kafka-zookeeper-2                            1/1     Running            0                 46m     10.104.32.87    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-5b84fdfbf5-flrm4             1/1     Running            1 (7m34s ago)     24m     10.104.16.233   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-5b84fdfbf5-fqflz             1/1     Running            1 (7m38s ago)     23m     10.104.32.94    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-5b84fdfbf5-p9kbm             1/1     Running            1 (7m35s ago)     23m     10.104.25.200   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-cc7cd9cdf-7mjb6              0/1     Terminating        0                 45m     10.104.1.54     4am-node10   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-cc7cd9cdf-f56rp              0/1     Terminating        0                 45m     10.104.9.10     4am-node14   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-datanode-cc7cd9cdf-s8n8q              0/1     Terminating        0                 45m     10.104.26.215   4am-node32   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-indexnode-7dddb7ff89-77mfr            1/1     Running            0                 36m     10.104.32.91    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-indexnode-7dddb7ff89-ckwq6            1/1     Running            0                 38m     10.104.16.207   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-indexnode-7dddb7ff89-v9wzt            1/1     Running            0                 37m     10.104.26.243   4am-node32   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-mixcoord-785c66488f-cpdxl             1/1     Running            0                 33m     10.104.16.223   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-proxy-5b4469bfd9-qn2q5                1/1     Running            0                 22m     10.104.16.234   4am-node21   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-querynode-0-6c7758b944-6g6bt          0/1     CrashLoopBackOff   9 (3m48s ago)     45m     10.104.26.219   4am-node32   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-querynode-0-6c7758b944-j6ckw          0/1     CrashLoopBackOff   9 (4m18s ago)     45m     10.104.9.11     4am-node14   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-milvus-querynode-0-6c7758b944-kc2nh          0/1     CrashLoopBackOff   10 (3m19s ago)    45m     10.104.1.55     4am-node10   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-minio-0                                      1/1     Running            0                 46m     10.104.33.143   4am-node36   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-minio-1                                      1/1     Running            0                 46m     10.104.25.191   4am-node30   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-minio-2                                      1/1     Running            0                 46m     10.104.32.85    4am-node39   <none>           <none>

[2024-11-14T06:05:08.480Z] kafka-mixcoord-5485-minio-3                                      1/1     Running            0                 46m     10.104.34.148   4am-node37   <none>           <none>
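A listing like the one above can be triaged with a small script. This is only a minimal sketch: it assumes the Jenkins timestamp prefix and the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, ...), and `unhealthy_pods` is a hypothetical helper, not part of any Milvus or CI tooling. Restart counts are reported but not used to decide health.

```python
import re

# Jenkins prefixes each line with "[<ISO timestamp>] "; strip it if present.
PREFIX = re.compile(r"^\[\d{4}-\d{2}-\d{2}T[\d:.]+Z\]\s+")
# NAME, READY (x/y), STATUS, RESTARTS (optionally followed by "(<time> ago)").
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<ready>\d+)/(?P<total>\d+)\s+(?P<status>\S+)\s+"
    r"(?P<restarts>\d+)(?:\s+\([^)]*\))?\s+"
)


def unhealthy_pods(listing: str):
    """Return (name, status, restarts) for pods that are not ready and running."""
    result = []
    for raw in listing.splitlines():
        line = PREFIX.sub("", raw.strip())
        m = ROW.match(line)
        if not m:
            continue  # blank lines, the echoed kubectl command, headers
        running = m["status"] == "Running" and m["ready"] == m["total"]
        if running or m["status"] == "Completed":
            continue
        result.append((m["name"], m["status"], int(m["restarts"])))
    return result
```

Applied to the listing in this report, such a filter would leave only the `Terminating` datanode pods and the `CrashLoopBackOff` querynode pods.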


The sharp rise in memory usage coincided with the moment mixcoord started to upgrade.

Querynode logs at the time of the crash:

[2024/11/14 06:00:50.517 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=DropCollection] [channel=kafka-mixcoord-5485-rootcoord-dml_8_453918478401953022v0] [collectionID=453918478401953022] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/manager.go:198] ["start merging..."] [role=querynode] [nodeID=48] [vchannel="{\"kafka-mixcoord-5485-rootcoord-dml_2_453918478401952460v2\":{}}"]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:178] ["get signal"] [pchannel=kafka-mixcoord-5485-rootcoord-dml_2] [signal=pause] [isMain=true]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:211] ["stop working"] [pchannel=kafka-mixcoord-5485-rootcoord-dml_2] [isMain=true]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:201] ["handle signal done"] [pchannel=kafka-mixcoord-5485-rootcoord-dml_2] [signal=pause] [isMain=true]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:178] ["get signal"] [pchannel=kafka-mixcoord-5485-rootcoord-dml_2] [signal=pause] [isMain=false]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:211] ["stop working"] [pchannel=kafka-mixcoord-5485-rootcoord-dml_2] [isMain=false]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:201] ["handle signal done"] [pchannel=kafka-mixcoord-5485-rootcoord-dml_2] [signal=pause] [isMain=false]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:150] ["add new target"] [vchannel=kafka-mixcoord-5485-rootcoord-dml_2_453918478401952460v2] [isMain=true]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgdispatcher/dispatcher.go:178] ["get signal"] [pchannel=kafka-mixcoord-5485-rootcoord-dml_2] [signal=terminate] [isMain=false]
[2024/11/14 06:00:50.540 +00:00] [INFO] [msgstream/mq_msgstream.go:218] ["start to close mq msg stream"] ["producer num"=0] ["consumer num"=1]
[2024/11/14 06:00:50.541 +00:00] [INFO] [pipeline/delete_node.go:57] ["pipeline fetch delete msg"] [collectionID=453918478401350763] [partitionID=-1] [deleteRowNum=1] [timestampMin=453919126392406017] [timestampMax=453919126392406017]
[2024/11/14 06:00:50.542 +00:00] [DEBUG] [delegator/delegator_data.go:182] ["start to process delete"] [collectionID=453918478401350763] [channel=kafka-mixcoord-5485-rootcoord-dml_0_453918478401350763v0] [replicaID=453918478559870985] [ts=453919126418620419]
[2024/11/14 06:00:50.542 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=CreateCollection] [channel=kafka-mixcoord-5485-rootcoord-dml_10_453918478401350747v0] [collectionID=453918478401350747] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/11/14 06:00:50.542 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=DropCollection] [channel=kafka-mixcoord-5485-rootcoord-dml_10_453918478401350747v0] [collectionID=453918478401350747] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/11/14 06:00:50.542 +00:00] [DEBUG] [gc/gc_tuner.go:90] ["GC Tune done"] ["previous GOGC"=200] ["heapuse "=43] ["total memory"=850] ["next GC"=88] ["new GOGC"=200] [gc-pause=134.246µs] [gc-pause-end=1731564050542301578]
I20241114 06:00:50.547470   527 VectorMemIndex.cpp:405] [SERVER][Load][milvus] construct binary set...
I20241114 06:00:50.547560   527 VectorMemIndex.cpp:408] [SERVER][Load][milvus] add index data to binary set: HNSW
I20241114 06:00:50.547580   527 VectorMemIndex.cpp:421] [SERVER][Load][milvus] load index into Knowhere...
W20241114 06:00:50.547721   527 hnsw.cc:509] [KNOWHERE][Deserialize][milvus] hnsw inner error: Invalid metric type of float type(float32, float16 and bfloat16):551472220233
I20241114 06:00:50.547762   527 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][milvus] Load index: done (0.056755 ms)
 => failed to Deserialize index: hnsw inner error at /workspace/source/internal/core/src/index/VectorMemIndex.cpp:192

[2024/11/14 06:00:50.547 +00:00] [WARN] [segments/load_index_info.go:210] ["CStatus returns err"] [traceID=74963be0f3628827c3e3a2d3c9e1fd06] [error=" => failed to Deserialize index: hnsw inner error at /workspace/source/internal/core/src/index/VectorMemIndex.cpp:192\n"] [extra="AppendIndex failed"]

SIGNAL CATCH BY NON-GO SIGNAL HANDLER
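The excerpt above mixes two line formats: the Go side's bracketed lines (`[timestamp] [LEVEL] [file:line] ...`) and the glog-style lines from the C++ side (`W20241114 06:00:50.547721   527 hnsw.cc:509] ...`). A minimal sketch for pulling out only the warnings and errors follows; the regexes are inferred from the lines shown here rather than from any format specification, and `warnings_and_errors` is a hypothetical helper.

```python
import re

# Go-side lines: [timestamp] [LEVEL] [file:line] rest...
GO_LINE = re.compile(
    r"^\[[^\]]+\] \[(?P<level>[A-Z]+)\] \[(?P<caller>[^\]]+)\] (?P<rest>.*)$"
)
# glog-style lines: <severity letter>yyyymmdd hh:mm:ss.uuuuuu <thread> file:line] rest...
GLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])\d{8} [\d:.]+\s+\d+ (?P<caller>[^\]\s]+)\] (?P<rest>.*)$"
)
GLOG_LEVELS = {"I": "INFO", "W": "WARN", "E": "ERROR", "F": "FATAL"}


def warnings_and_errors(log_text: str):
    """Return (level, caller, rest) for every WARN/ERROR/FATAL line found."""
    hits = []
    for line in log_text.splitlines():
        m = GO_LINE.match(line)
        if m:
            level = m["level"]
        else:
            m = GLOG_LINE.match(line)
            if not m:
                continue  # continuation lines, stack frames, blanks
            level = GLOG_LEVELS[m["sev"]]
        if level in ("WARN", "ERROR", "FATAL"):
            hits.append((level, m["caller"], m["rest"]))
    return hits
```

On the excerpt above this would keep the `hnsw.cc:509` deserialize warning and the `segments/load_index_info.go:210` "CStatus returns err" warning, and drop the DEBUG/INFO noise.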

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5485/pipeline
log:
artifacts-kafka-mixcoord-5485-server-logs.tar.gz

cluster: 4am
ns: chaos-testing

Anything else?

This issue reproduces consistently.

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 14, 2024
@zhuwenxing zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Nov 14, 2024
@zhuwenxing zhuwenxing added this to the 2.5.0 milestone Nov 14, 2024
@yanliang567
Contributor

/assign @xiaocai2333
/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 14, 2024
@congqixia
Contributor

After an offline discussion with @foxspy, this looks like an index engine version issue.
/assign @foxspy
/unassign @xiaocai2333

@zhuwenxing
Contributor Author

@congqixia
What about the datanode memory?
In @weiliu1031's opinion, it was caused by the large number of collections.

sre-ci-robot pushed a commit that referenced this issue Nov 15, 2024
issue: #37665 #37631 #37620 #37587 #36906 
knowhere has added a default nlist value, so some invalid-param unit tests
with no nlist param will now be valid.

Signed-off-by: xianliang.li <[email protected]>
@foxspy
Contributor

foxspy commented Nov 15, 2024

@zhuwenxing please verify with the latest master, thanks~

@foxspy
Contributor

foxspy commented Nov 15, 2024

/assign @zhuwenxing
/unassign

@sre-ci-robot sre-ci-robot assigned zhuwenxing and unassigned foxspy Nov 15, 2024
@zhuwenxing
Contributor Author

querynode crash was fixed in milvus-io-master-d159629-20241115
However, after this fix, search/query interruptions during the rolling upgrade have become much longer: they previously lasted around 10 seconds, but can now last up to 10 minutes.
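For reference, one way to put a number on such an interruption is to take the longest streak of failed requests in a timestamped request log. The sketch below only illustrates that idea; the sample format `(timestamp_seconds, ok)` and the name `longest_interruption` are assumptions, and the rolling-upgrade test job may well measure it differently.

```python
from typing import Iterable, Tuple


def longest_interruption(samples: Iterable[Tuple[float, bool]]) -> float:
    """Longest span in seconds from the first failure of a streak to the next success."""
    longest = 0.0
    start = None    # timestamp of the first failure in the current streak
    last_ts = None  # timestamp of the most recent sample
    for ts, ok in sorted(samples):
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            longest = max(longest, float(ts - start))
            start = None
        last_ts = ts
    if start is not None and last_ts is not None:
        # Still failing when the data ends: count up to the last sample.
        longest = max(longest, float(last_ts - start))
    return longest
```

Under this definition, a streak that begins at t=1 and ends with the first success at t=11 counts as a 10-second interruption.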

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5489/pipeline

log
artifacts-kafka-mixcoord-5489-server-logs.tar.gz


/assign @weiliu1031
/unassign

@zhuwenxing
Contributor Author

@foxspy

querynode crash was fixed in milvus-io-master-d159629-20241115

That statement was wrong: the querynode still crashes during the upgrade process.

[2024/11/15 11:21:42.155 +00:00] [DEBUG] [querynodev2/handlers.go:383] [tr/searchDelegator] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [msg="start reduce query result, traceID = bf1049c9822b3b71d5c551fc5a39372e,  vChannel = pulsar-mixcoord-5493-rootcoord-dml_8_453946215647690701v0, segmentIDs = []"] [duration=388.270221ms]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [querynodev2/handlers.go:399] [tr/searchDelegator] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [msg="do search with channel done , vChannel = pulsar-mixcoord-5493-rootcoord-dml_11_453946215647690701v3, segmentIDs = []"] [duration=388.358458ms]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [segments/result.go:74] ["shard leader get valid search results"] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [numbers=2]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [segments/result.go:77] [reduceSearchResultData] [traceID=bf1049c9822b3b71d5c551fc5a39372e] ["result No."=0] [nq=5] [topk=1]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [segments/result.go:77] [reduceSearchResultData] [traceID=bf1049c9822b3b71d5c551fc5a39372e] ["result No."=1] [nq=5] [topk=1]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [segments/result.go:301] ["skip duplicated search result"] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [count=0]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [querynodev2/handlers.go:399] [tr/searchDelegator] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [msg="do search with channel done , vChannel = pulsar-mixcoord-5493-rootcoord-dml_8_453946215647690701v0, segmentIDs = []"] [duration=388.388494ms]
_ZNKSt14default_deleteIN6milvus5index9IndexBaseEEclEPS2_
	/usr/include/c++/12/bits/unique_ptr.h:95 pc=0x7fadab6cc814
_ZNSt10unique_ptrIN6milvus5index9IndexBaseESt14default_deleteIS2_EED4Ev
	/usr/include/c++/12/bits/unique_ptr.h:396 pc=0x7fadab6cc814
_ZN6milvus7segcore13LoadIndexInfoD4Ev
	/workspace/source/internal/core/src/segcore/Types.h:32 pc=0x7fadab6cc814
DeleteLoadIndexInfo
	/workspace/source/internal/core/src/segcore/load_index_c.cpp:60 pc=0x7fadab6cc814
runtime.asmcgocall
	/usr/local/go/src/runtime/asm_amd64.s:872 pc=0x1eb40c7


SIGSEGV: segmentation violation
PC=0x7fada783f47d m=233 sigcode=1
signal arrived during cgo execution

goroutine 3688 [syscall, locked to thread]:
_ZN7hnswlib15HierarchicalNSWIffLNS_9QuantTypeE0EED4Ev
	/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/thirdparty/hnswlib/hnswlib/hnswalg.h:194 pc=0x7fada783f47d
_ZN7hnswlib15HierarchicalNSWIffLNS_9QuantTypeE0EED4Ev
	/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/thirdparty/hnswlib/hnswlib/hnswalg.h:195 pc=0x7fada783f47d
_ZN8knowhere13HnswIndexNodeIfLN7hnswlib9QuantTypeE0EED4Ev
	/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/src/index/hnsw/hnsw.cc:579 pc=0x7fada783f47d
_ZN8knowhere13HnswIndexNodeIfLN7hnswlib9QuantTypeE0EED0Ev
	/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/src/index/hnsw/hnsw.cc:581 pc=0x7fada783f47d
_ZN8knowhere5IndexINS_9IndexNodeEED4Ev
	/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/include/knowhere/index/index.h:207 pc=0x7fadab39d8b2
_ZN6milvus5index14VectorMemIndexIfED4Ev
	/workspace/source/internal/core/src/index/VectorMemIndex.h:34 pc=0x7fadab39d8b2
_ZN6milvus5index14VectorMemIndexIfED0Ev
	/workspace/source/internal/core/src/index/VectorMemIndex.h:34 pc=0x7fadab39d8b2
_ZNKSt14default_deleteIN6milvus5index9IndexBaseEEclEPS2_
	/usr/include/c++/12/bits/unique_ptr.h:95 pc=0x7fadab6cc814
_ZNSt10unique_ptrIN6milvus5index9IndexBaseESt14default_deleteIS2_EED4Ev
	/usr/include/c++/12/bits/unique_ptr.h:396 pc=0x7fadab6cc814
_ZN6milvus7segcore13LoadIndexInfoD4Ev
	/workspace/source/internal/core/src/segcore/Types.h:32 pc=0x7fadab6cc814
DeleteLoadIndexInfo
	/workspace/source/internal/core/src/segcore/load_index_c.cpp:60 pc=0x7fadab6cc814
runtime.asmcgocall
	/usr/local/go/src/runtime/asm_amd64.s:872 pc=0x1eb40c7
runtime.cgocall(0x4fd22a0, 0xc001a1eed8)
	/usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc001a1eeb0 sp=0xc001a1ee78 pc=0x1e4444b
github.com/milvus-io/milvus/internal/querynodev2/segments._Cfunc_DeleteLoadIndexInfo(0x7faae68a0000)
	_cgo_gotypes.go:510 +0x3f fp=0xc001a1eed8 sp=0xc001a1eeb0 pc=0x4d862ff
github.com/milvus-io/milvus/internal/querynodev2/segments.deleteLoadIndexInfo.func1.1(0x10000c0039ec060?)
	/workspace/source/internal/querynodev2/segments/load_index_info.go:62 +0x34 fp=0xc001a1ef10 sp=0xc001a1eed8 pc=0x4d8d2f4
github.com/milvus-io/milvus/internal/querynodev2/segments.deleteLoadIndexInfo.func1()
	/workspace/source/internal/querynodev2/segments/load_index_info.go:62 +0x17 fp=0xc001a1ef28 sp=0xc001a1ef10 pc=0x4d8d297
github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1()
	/workspace/source/pkg/util/conc/pool.go:81 +0xb3 fp=0xc001a1ef88 sp=0xc001a1ef28 pc=0x4d60953
github.com/panjf2000/ants/v2.(*goWorker).run.func1()
	/go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:67 +0x8d fp=0xc001a1efe0 sp=0xc001a1ef88 pc=0x3b211ad
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc001a1efe8 sp=0xc001a1efe0 pc=0x1eb4441
created by github.com/panjf2000/ants/v2.(*goWorker).run in goroutine 3698
	/go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:48 +0x5c
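The C++ frames in this trace are Itanium-mangled, which makes the call chain hard to read. A tool such as `c++filt` does full demangling; the sketch below is deliberately much smaller and decodes only the leading length-prefixed name components, plus a heuristic check for a trailing destructor marker (`D0`/`D1`/`D2`/`D4` followed by `Ev`). `leading_scope` is a hypothetical helper, and template arguments are simply ignored.

```python
import re

# Heuristic: a symbol ending in D0Ev/D1Ev/D2Ev/D4Ev is treated as a destructor.
DTOR = re.compile(r"D[0124]Ev$")


def leading_scope(symbol: str):
    """Return (scope, looks_like_destructor) for a mangled or plain symbol name."""
    if not symbol.startswith("_ZN"):
        return symbol, False  # e.g. extern "C" names such as DeleteLoadIndexInfo
    i = 3
    if symbol.startswith("K", i):   # const-qualified member function
        i += 1
    parts = []
    if symbol.startswith("St", i):  # abbreviation for the std namespace
        parts.append("std")
        i += 2
    # Each component is "<decimal length><identifier>"; stop at anything else
    # (template arguments, operators, constructor/destructor markers).
    while i < len(symbol) and symbol[i].isdigit():
        j = i
        while j < len(symbol) and symbol[j].isdigit():
            j += 1
        n = int(symbol[i:j])
        parts.append(symbol[j:j + n])
        i = j + n
    return "::".join(parts), bool(DTOR.search(symbol))
```

Read this way, the frames in goroutine 3688 go from `DeleteLoadIndexInfo` through the `milvus::segcore::LoadIndexInfo` and `milvus::index::VectorMemIndex` destructors down to the `hnswlib::HierarchicalNSW` destructor at `hnswalg.h:194`, which is where the SIGSEGV is reported.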

@foxspy
Contributor

foxspy commented Nov 16, 2024

@zhuwenxing please give me the Loki link.

@zhuwenxing
Contributor Author

cluster: 4am
ns: chaos-testing
pod info

[2024-11-15T11:18:20.053Z] [2024-11-15 11:18:19 - INFO - ci_test]: kubectl get pod|grep pulsar-mixcoord-5493
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-etcd-0                                       1/1     Running            0                  38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-etcd-1                                       1/1     Running            0                  38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-etcd-2                                       1/1     Running            0                  38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-datanode-69dc484dd9-7dxhk             1/1     Running            0                  35m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-indexnode-79d978fff7-bvrcn            1/1     Running            0                  28m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-indexnode-79d978fff7-mdlks            1/1     Running            0                  27m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-indexnode-79d978fff7-tgr9b            1/1     Running            0                  26m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-mixcoord-68848ddc5c-h4zkd             1/1     Running            0                  23m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-proxy-6fd6c7d64b-6wt8w                1/1     Running            0                  35m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-querynode-0-78fd8db695-h6mvj          0/1     CrashLoopBackOff   6 (10s ago)        35m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-querynode-0-78fd8db695-lp88q          0/1     CrashLoopBackOff   6 (68s ago)        35m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-querynode-1-77d8965ccc-99hqp          1/1     Running            0                  67s
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-milvus-querynode-1-77d8965ccc-vnfbw          1/1     Running            0                  15m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-minio-0                                      1/1     Running            0                  38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-minio-1                                      1/1     Running            0                  38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-minio-2                                      1/1     Running            0                  38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-minio-3                                      1/1     Running            0                  38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-pulsar-bookie-0                              1/1     Running            0                  38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-pulsar-bookie-1                              1/1     Running            0                  38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-pulsar-bookie-init-q8gb9                     0/1     Completed          0                  38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-pulsar-broker-0                              1/1     Running            4 (86s ago)        38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-pulsar-proxy-0                               1/1     Running            0                  38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-pulsar-pulsar-init-sj4kv                     0/1     Completed          0                  38m
[2024-11-15T11:18:20.053Z] pulsar-mixcoord-5493-pulsar-zookeeper-0                           1/1     Running            0                  38m

here is the pod info
@foxspy

@bigsheeper
Contributor

PR #34278 speeds up dispatcher subscription. However, subscriptions now arrive faster than the dispatcher can merge them, so the number of consumers piles up and the DataNode runs out of memory.

With PR #34278
Consumer num = 1.5k
image

After reverting PR #34278
Consumer num = 214
image

I think we should limit the concurrency of DataNode subscriptions.
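
A minimal sketch of what such a limit could look like, assuming a buffered channel used as a counting semaphore. `subscribeAll` and its `subscribe` callback are hypothetical names for illustration, not the DataNode's real API, and the merged fix may be structured differently.

```go
package main

import "sync"

// subscribeAll calls subscribe once per channel while keeping at most
// `limit` calls in flight. It returns the peak concurrency it observed.
// Hypothetical helper: the names here are not taken from the Milvus code base.
func subscribeAll(channels []string, limit int, subscribe func(channel string)) int {
	if limit < 1 {
		limit = 1 // a zero-capacity semaphore would block forever
	}
	sem := make(chan struct{}, limit) // counting semaphore
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		inFlight int
		peak     int
	)
	for _, name := range channels {
		sem <- struct{}{} // blocks while `limit` subscriptions are running
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			subscribe(name)

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	return peak
}
```

With this shape, the producer loop itself is throttled, so pending subscriptions wait in the loop instead of accumulating as live consumers.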

congqixia added a commit to congqixia/milvus that referenced this issue Nov 16, 2024
Related to milvus-io#37665

The thread count skyrocketed when there were many kafka consumers on
the datanode. Since the internal implementation is CGO, calling it
directly leaks cgo threads.

This PR adds a worker pool for kafka API calls that use CGO, to limit
the thread count.

Signed-off-by: Congqi Xia <[email protected]>
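
For readers unfamiliar with the pattern, here is a rough sketch of a fixed-size worker pool that funnels blocking calls through a bounded number of goroutines. It uses only the standard library, and `cgoPool`, `newCgoPool`, `Do`, and `Close` are hypothetical names; the actual commit may rely on a different pool implementation. The idea is that a blocking cgo call ties up an OS thread, so capping concurrent calls also caps the threads they can hold.

```go
package main

import "sync"

// cgoPool is a hypothetical fixed-size pool: at most `workers` submitted
// functions run at the same time.
type cgoPool struct {
	tasks chan func()
	wg    sync.WaitGroup
}

// newCgoPool starts `workers` goroutines that execute submitted tasks.
func newCgoPool(workers int) *cgoPool {
	if workers < 1 {
		workers = 1
	}
	p := &cgoPool{tasks: make(chan func())}
	p.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer p.wg.Done()
			for task := range p.tasks {
				task()
			}
		}()
	}
	return p
}

// Do runs fn on a pool worker and waits until it has finished.
// In the real use case fn would wrap the CGO-backed client call.
func (p *cgoPool) Do(fn func()) {
	done := make(chan struct{})
	p.tasks <- func() {
		defer close(done)
		fn()
	}
	<-done
}

// Close stops accepting tasks and waits for the workers to exit.
func (p *cgoPool) Close() {
	close(p.tasks)
	p.wg.Wait()
}
```

Callers would wrap each client call as `pool.Do(func() { /* cgo-backed call */ })`, so however many consumers exist, only `workers` of those calls run concurrently.
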
@foxspy
Contributor

foxspy commented Nov 16, 2024

@foxspy

querynode crash was fixed in milvus-io-master-d159629-20241115

That statement is incorrect: the querynode still crashes during the upgrade process.

[2024/11/15 11:21:42.155 +00:00] [DEBUG] [querynodev2/handlers.go:383] [tr/searchDelegator] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [msg="start reduce query result, traceID = bf1049c9822b3b71d5c551fc5a39372e,  vChannel = pulsar-mixcoord-5493-rootcoord-dml_8_453946215647690701v0, segmentIDs = []"] [duration=388.270221ms]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [querynodev2/handlers.go:399] [tr/searchDelegator] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [msg="do search with channel done , vChannel = pulsar-mixcoord-5493-rootcoord-dml_11_453946215647690701v3, segmentIDs = []"] [duration=388.358458ms]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [segments/result.go:74] ["shard leader get valid search results"] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [numbers=2]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [segments/result.go:77] [reduceSearchResultData] [traceID=bf1049c9822b3b71d5c551fc5a39372e] ["result No."=0] [nq=5] [topk=1]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [segments/result.go:77] [reduceSearchResultData] [traceID=bf1049c9822b3b71d5c551fc5a39372e] ["result No."=1] [nq=5] [topk=1]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [segments/result.go:301] ["skip duplicated search result"] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [count=0]
[2024/11/15 11:21:42.155 +00:00] [DEBUG] [querynodev2/handlers.go:399] [tr/searchDelegator] [traceID=bf1049c9822b3b71d5c551fc5a39372e] [msg="do search with channel done , vChannel = pulsar-mixcoord-5493-rootcoord-dml_8_453946215647690701v0, segmentIDs = []"] [duration=388.388494ms]
_ZNKSt14default_deleteIN6milvus5index9IndexBaseEEclEPS2_
	/usr/include/c++/12/bits/unique_ptr.h:95 pc=0x7fadab6cc814
_ZNSt10unique_ptrIN6milvus5index9IndexBaseESt14default_deleteIS2_EED4Ev
	/usr/include/c++/12/bits/unique_ptr.h:396 pc=0x7fadab6cc814
_ZN6milvus7segcore13LoadIndexInfoD4Ev
	/workspace/source/internal/core/src/segcore/Types.h:32 pc=0x7fadab6cc814
DeleteLoadIndexInfo
	/workspace/source/internal/core/src/segcore/load_index_c.cpp:60 pc=0x7fadab6cc814
runtime.asmcgocall
	/usr/local/go/src/runtime/asm_amd64.s:872 pc=0x1eb40c7


SIGSEGV: segmentation violation
PC=0x7fada783f47d m=233 sigcode=1
signal arrived during cgo execution

goroutine 3688 [syscall, locked to thread]:
_ZN7hnswlib15HierarchicalNSWIffLNS_9QuantTypeE0EED4Ev
	/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/thirdparty/hnswlib/hnswlib/hnswalg.h:194 pc=0x7fada783f47d
_ZN7hnswlib15HierarchicalNSWIffLNS_9QuantTypeE0EED4Ev
	/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/thirdparty/hnswlib/hnswlib/hnswalg.h:195 pc=0x7fada783f47d
_ZN8knowhere13HnswIndexNodeIfLN7hnswlib9QuantTypeE0EED4Ev
	/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/src/index/hnsw/hnsw.cc:579 pc=0x7fada783f47d
_ZN8knowhere13HnswIndexNodeIfLN7hnswlib9QuantTypeE0EED0Ev
	/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/src/index/hnsw/hnsw.cc:581 pc=0x7fada783f47d
_ZN8knowhere5IndexINS_9IndexNodeEED4Ev
	/workspace/source/cmake_build/thirdparty/knowhere/knowhere-src/include/knowhere/index/index.h:207 pc=0x7fadab39d8b2
_ZN6milvus5index14VectorMemIndexIfED4Ev
	/workspace/source/internal/core/src/index/VectorMemIndex.h:34 pc=0x7fadab39d8b2
_ZN6milvus5index14VectorMemIndexIfED0Ev
	/workspace/source/internal/core/src/index/VectorMemIndex.h:34 pc=0x7fadab39d8b2
_ZNKSt14default_deleteIN6milvus5index9IndexBaseEEclEPS2_
	/usr/include/c++/12/bits/unique_ptr.h:95 pc=0x7fadab6cc814
_ZNSt10unique_ptrIN6milvus5index9IndexBaseESt14default_deleteIS2_EED4Ev
	/usr/include/c++/12/bits/unique_ptr.h:396 pc=0x7fadab6cc814
_ZN6milvus7segcore13LoadIndexInfoD4Ev
	/workspace/source/internal/core/src/segcore/Types.h:32 pc=0x7fadab6cc814
DeleteLoadIndexInfo
	/workspace/source/internal/core/src/segcore/load_index_c.cpp:60 pc=0x7fadab6cc814
runtime.asmcgocall
	/usr/local/go/src/runtime/asm_amd64.s:872 pc=0x1eb40c7
runtime.cgocall(0x4fd22a0, 0xc001a1eed8)
	/usr/local/go/src/runtime/cgocall.go:157 +0x4b fp=0xc001a1eeb0 sp=0xc001a1ee78 pc=0x1e4444b
github.com/milvus-io/milvus/internal/querynodev2/segments._Cfunc_DeleteLoadIndexInfo(0x7faae68a0000)
	_cgo_gotypes.go:510 +0x3f fp=0xc001a1eed8 sp=0xc001a1eeb0 pc=0x4d862ff
github.com/milvus-io/milvus/internal/querynodev2/segments.deleteLoadIndexInfo.func1.1(0x10000c0039ec060?)
	/workspace/source/internal/querynodev2/segments/load_index_info.go:62 +0x34 fp=0xc001a1ef10 sp=0xc001a1eed8 pc=0x4d8d2f4
github.com/milvus-io/milvus/internal/querynodev2/segments.deleteLoadIndexInfo.func1()
	/workspace/source/internal/querynodev2/segments/load_index_info.go:62 +0x17 fp=0xc001a1ef28 sp=0xc001a1ef10 pc=0x4d8d297
github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1()
	/workspace/source/pkg/util/conc/pool.go:81 +0xb3 fp=0xc001a1ef88 sp=0xc001a1ef28 pc=0x4d60953
github.com/panjf2000/ants/v2.(*goWorker).run.func1()
	/go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:67 +0x8d fp=0xc001a1efe0 sp=0xc001a1ef88 pc=0x3b211ad
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc001a1efe8 sp=0xc001a1efe0 pc=0x1eb4441
created by github.com/panjf2000/ants/v2.(*goWorker).run in goroutine 3698
	/go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:48 +0x5c

@zhuwenxing give me the loki link, pls

image
image

The index was built by faiss_hnsw but loaded with hnswlib. Once the indexNode is upgraded, it builds HNSW indexes with faiss_hnsw, and a queryNode that has not been upgraded yet cannot parse them. The build path needs to be isolated by version, so that for index versions less than or equal to 5 the index is still built through hnswlib. After the upgrade is completed, the version reaches 6 and index construction is done by faiss_hnsw.
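
The version gate described above can be sketched roughly as follows. This is illustrative only: `hnswBackend` and `clusterIndexVersion` are hypothetical helpers, the threshold of 6 is taken from the comment above, and the assumption that the effective version is the lowest one any node reports is mine rather than something stated in this thread. The real fix was published in knowhere.

```go
package main

// faissHNSWMinVersion is the first index version built with faiss_hnsw,
// per the comment above (versions <= 5 stay on hnswlib).
const faissHNSWMinVersion int32 = 6

// hnswBackend picks the HNSW implementation for a given index version.
// Hypothetical helper for illustration.
func hnswBackend(indexVersion int32) string {
	if indexVersion >= faissHNSWMinVersion {
		return "faiss_hnsw"
	}
	return "hnswlib"
}

// clusterIndexVersion returns the lowest version reported by any node,
// assuming an index must stay loadable by the oldest node in the cluster.
// It returns 0 when no node has reported a version.
func clusterIndexVersion(nodeMaxVersions []int32) int32 {
	if len(nodeMaxVersions) == 0 {
		return 0
	}
	lowest := nodeMaxVersions[0]
	for _, v := range nodeMaxVersions[1:] {
		if v < lowest {
			lowest = v
		}
	}
	return lowest
}
```

Under these assumptions, a cluster that is mid-upgrade (some nodes at 5, some at 6) keeps building with hnswlib, and switches to faiss_hnsw only once every node reports 6.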

@foxspy
Contributor

foxspy commented Nov 17, 2024

@zhuwenxing Fixed and republished knowhere, please re-verify, thanks~

sre-ci-robot pushed a commit that referenced this issue Nov 18, 2024
Related to #37665

The thread count skyrocketed when there were many kafka consumers on
the datanode. Since the internal implementation is CGO, calling it
directly leaks cgo threads.

This PR adds a worker pool for kafka API calls that use CGO, to limit
the thread count.

Signed-off-by: Congqi Xia <[email protected]>
@zhuwenxing
Contributor Author

@foxspy
The querynode crash fix was verified in master-00edec2-20241118.

sre-ci-robot pushed a commit that referenced this issue Nov 18, 2024
Limit the maximum concurrency of channel tasks for each DataNode to
prevent excessive subscriptions from causing DataNode OOM.

issue: #37665

Signed-off-by: bigsheeper <[email protected]>
@yanliang567
Contributor

@bigsheeper we still need a fix for datanode high memory usage issue

@liliu-z
Member

liliu-z commented Nov 21, 2024

@bigsheeper we still need a fix for datanode high memory usage issue

As mentioned, this is for the 5K collections scenario.

@yanliang567 yanliang567 removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 25, 2024
@yanliang567
Contributor

pending for 10k collections enhancement

@yanliang567 yanliang567 modified the milestones: 2.5.0, 2.5.1, 2.5.2 Dec 24, 2024