[Bug]: During the rolling upgrade process, the datanode experiences a sudden surge in memory usage, leading to OOM kills, and the querynode also continues to crash #37665
Comments
/assign @xiaocai2333
After offline discussion with @foxspy, it looks like an issue with the index engine version.
@congqixia
@zhuwenxing please verify with the latest master, thanks~
/assign @zhuwenxing
querynode crash was fixed in log
/assign @weiliu1031
/assign @bigsheeper
With 5000 empty collections, the datanode goes OOM with 16G of memory.
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5491/pipeline
There is an error in the statement here: it still crashes during the upgrade process.
@zhuwenxing give me the loki link, pls
cluster: 4am
here is the pod info
This PR (#34278) accelerates the subscription speed of the dispatcher. However, subscriptions now arrive faster than the dispatcher can merge them, which causes the DataNode OOM.
Test with PR #34278
After reverting PR #34278
I think we should limit the concurrency of DataNode subscriptions.
Related to milvus-io#37665. The thread count skyrockets when there are many Kafka consumers on the DataNode. Because the Kafka client is implemented with CGO, calling it directly leaks CGO threads. This PR adds a worker pool for Kafka API calls that go through CGO, to limit the thread count. Signed-off-by: Congqi Xia <[email protected]>
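For readers following along, the worker pool idea can be sketched roughly as below. This is only an illustration, not the code from the PR: `cgoPool`, `newCgoPool`, `Do`, and `peakConcurrency` are hypothetical names, and a short sleep stands in for a blocking CGO call into the Kafka client. The point is that a goroutine blocked in a CGO call holds an OS thread, so funnelling such calls through a fixed set of workers bounds how many threads they can occupy.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cgoPool (hypothetical name) runs submitted functions on a fixed number of
// goroutines, so blocking calls cannot occupy more than `workers` threads.
type cgoPool struct {
	tasks chan func()
	wg    sync.WaitGroup
}

func newCgoPool(workers int) *cgoPool {
	p := &cgoPool{tasks: make(chan func())}
	p.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer p.wg.Done()
			for task := range p.tasks {
				task()
			}
		}()
	}
	return p
}

// Do hands fn to a worker, waits for it to finish, and returns its error.
func (p *cgoPool) Do(fn func() error) error {
	done := make(chan error, 1)
	p.tasks <- func() { done <- fn() }
	return <-done
}

// Close stops the workers after all submitted tasks have been run.
func (p *cgoPool) Close() {
	close(p.tasks)
	p.wg.Wait()
}

// peakConcurrency starts `callers` goroutines that all go through the pool
// and reports the highest number of calls that were in flight at once.
func peakConcurrency(workers, callers int) int {
	pool := newCgoPool(workers)
	defer pool.Close()

	var mu sync.Mutex
	inFlight, peak := 0, 0

	var wg sync.WaitGroup
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = pool.Do(func() error {
				mu.Lock()
				inFlight++
				if inFlight > peak {
					peak = inFlight
				}
				mu.Unlock()

				time.Sleep(time.Millisecond) // stand-in for a blocking CGO call

				mu.Lock()
				inFlight--
				mu.Unlock()
				return nil
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// 64 callers share 4 workers, so at most 4 calls run at the same time.
	fmt.Println("peak concurrent calls:", peakConcurrency(4, 64))
}
```

Callers still block while waiting for a free worker, so this trades latency under load for a bounded thread count.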
The index was built by faiss_hnsw but loaded with hnswlib. Once the indexNode is upgraded, HNSW indexes are built directly by faiss_hnsw, and a queryNode that has not been upgraded yet cannot parse them. Index building needs to be isolated by version, so that indexes with a version less than or equal to 5 are still built through hnswlib. After the upgrade is completed, the version reaches 6 and index construction is done by faiss_hnsw.
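A rough sketch of the version gate described above, for illustration only. The function and constant names are hypothetical and this is not the knowhere or Milvus code. The 5/6 threshold comes from the comment above; taking the minimum version across nodes during a rolling upgrade is an assumption made here to show why un-upgraded queryNodes keep the builder on hnswlib.

```go
package main

import "fmt"

// faissHNSWMinVersion is the first index engine version built by faiss_hnsw
// (version 6, per the comment above). Versions <= 5 stay on hnswlib.
const faissHNSWMinVersion int32 = 6

// hnswBuilderFor (hypothetical helper) picks the HNSW implementation
// for a given index engine version.
func hnswBuilderFor(indexEngineVersion int32) string {
	if indexEngineVersion >= faissHNSWMinVersion {
		return "faiss_hnsw"
	}
	return "hnswlib"
}

// clusterIndexVersion returns the lowest version reported by any node.
// ASSUMPTION: the version used for building is the minimum across nodes,
// so one un-upgraded queryNode keeps the whole cluster on the old format.
func clusterIndexVersion(nodeVersions []int32) (int32, bool) {
	if len(nodeVersions) == 0 {
		return 0, false
	}
	lowest := nodeVersions[0]
	for _, v := range nodeVersions[1:] {
		if v < lowest {
			lowest = v
		}
	}
	return lowest, true
}

func main() {
	// Mid-upgrade: one queryNode still reports version 5.
	if v, ok := clusterIndexVersion([]int32{6, 5, 6}); ok {
		fmt.Println("mid-upgrade builder:", hnswBuilderFor(v))
	}
	// Upgrade finished: every node reports version 6.
	if v, ok := clusterIndexVersion([]int32{6, 6, 6}); ok {
		fmt.Println("post-upgrade builder:", hnswBuilderFor(v))
	}
}
```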
@zhuwenxing Fixed and republished knowhere, please re-verify, thanks~
Related to #37665. The thread count skyrockets when there are many Kafka consumers on the DataNode. Because the Kafka client is implemented with CGO, calling it directly leaks CGO threads. This PR adds a worker pool for Kafka API calls that go through CGO, to limit the thread count. Signed-off-by: Congqi Xia <[email protected]>
@foxspy
Limit the maximum concurrency of channel tasks for each DataNode to prevent excessive subscriptions from causing DataNode OOM. issue: #37665 Signed-off-by: bigsheeper <[email protected]>
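As a minimal sketch of what such a limit could look like (hypothetical names throughout, not the code from the PR): a buffered channel acts as a counting semaphore, and channel tasks that cannot get a slot are deferred instead of all subscribing at once. The channel names `ch-1` to `ch-3` and the limit of 2 are made up for the example.

```go
package main

import "fmt"

// taskLimiter (hypothetical) caps how many channel tasks run at once.
type taskLimiter struct {
	slots chan struct{}
}

func newTaskLimiter(maxConcurrent int) *taskLimiter {
	return &taskLimiter{slots: make(chan struct{}, maxConcurrent)}
}

// TryAcquire takes a slot without blocking; false means the limit is reached.
func (l *taskLimiter) TryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees one slot; it is a no-op if no slot is currently held.
func (l *taskLimiter) Release() {
	select {
	case <-l.slots:
	default:
	}
}

// admit starts as many pending channel tasks as the limiter allows and
// returns the rest so they can be retried later.
func admit(l *taskLimiter, pending []string) (started, deferred []string) {
	for _, ch := range pending {
		if l.TryAcquire() {
			started = append(started, ch)
		} else {
			deferred = append(deferred, ch)
		}
	}
	return started, deferred
}

func main() {
	limiter := newTaskLimiter(2)

	started, deferred := admit(limiter, []string{"ch-1", "ch-2", "ch-3"})
	fmt.Println("started:", started, "deferred:", deferred)

	// One running task finishes, so a deferred task can be admitted.
	limiter.Release()
	started, deferred = admit(limiter, deferred)
	fmt.Println("started:", started, "deferred:", deferred)
}
```

A non-blocking `TryAcquire` fits a scheduler that retries deferred tasks on its next round; a blocking acquire would be the alternative if callers should simply wait.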
@bigsheeper we still need a fix for the datanode high memory usage issue.
As mentioned, this is for the 5K collections scenario.
Pending the 10k collections enhancement.
Is there an existing issue for this?
Environment
Current Behavior
The time point when the memory usage surged drastically coincided with the time when mixcoord started to upgrade.
Logs when the query node crashes
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/5485/pipeline
log:
artifacts-kafka-mixcoord-5485-server-logs.tar.gz
cluster: 4am
ns: chaos-testing
pod info
Anything else?
It is a stably reproducible issue.