global OOM issue (69M messages in global - his_the_locker and cancel with distinct tags)
Describe the bug
Multiple nodes ran out of memory on a 45-node cluster.
The memory_data entries in the logs indicate that the worst offender was the same process on all of the affected nodes: the global locker (global:locker). A process backtrace is available for one of the affected nodes.
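For anyone wanting to confirm the same pattern on their own cluster, here is a minimal sketch (not taken from the incident itself, and assuming the os_mon application is running so that memsup is available) of how the worst offender can be read back at runtime:

```erlang
%% Minimal sketch, assuming os_mon (and therefore memsup) is running.
%% memsup reports the process it considers the worst memory offender;
%% process_info then shows whether that process is the global locker.
{_Total, _Allocated, Worst} = memsup:get_memory_data(),
case Worst of
    {Pid, PidBytes} ->
        io:format("worst offender ~p (~p bytes): ~p~n",
                  [Pid, PidBytes,
                   erlang:process_info(Pid, [current_function,
                                             initial_call,
                                             message_queue_len,
                                             memory])]);
    undefined ->
        io:format("memsup has no worst-offender data yet~n")
end.
```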
To Reproduce
The issue has occurred once and we have not been able to reproduce it. Our guess is that it is triggered by repeated nodedown events. One node (node85) was failed over and then added back to the cluster. While it was being added back, the dist connections from node85 to every other node in the cluster were repeatedly shut down and recreated (~500 times), and there are numerous net_kernel nodedown messages in the logs. Immediately after node85 successfully establishes the dist connection to node74, global on node74 starts consuming memory, and the consumption grows linearly.
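As a rough illustration of what we were watching for, a small process like the one below (the module and names are ours, not part of OTP) counts the nodeup/nodedown flaps and samples the registered global_name_server. The locker itself is an unregistered helper process of global, so this only serves as a proxy for its growth:

```erlang
-module(flap_watch).
-export([start/0]).

%% Sketch only: counts nodeup/nodedown events per node and samples
%% global_name_server's memory and message queue length whenever a
%% node comes back up.
start() ->
    spawn(fun() ->
              ok = net_kernel:monitor_nodes(true),
              loop(#{})
          end).

loop(Counts0) ->
    receive
        {nodedown, Node} ->
            loop(bump({Node, down}, Counts0));
        {nodeup, Node} ->
            Counts = bump({Node, up}, Counts0),
            GlobalPid = whereis(global_name_server),
            io:format("~p up (~p times so far), global_name_server: ~p~n",
                      [Node, maps:get({Node, up}, Counts),
                       erlang:process_info(GlobalPid,
                                           [memory, message_queue_len])]),
            loop(Counts)
    end.

bump(Key, Counts) ->
    maps:update_with(Key, fun(N) -> N + 1 end, 1, Counts).
```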
Logs on node74 indicate 69M messages:
From <0.55.0>'s process dictionary:
It appears there have been 576460752302537370 - 576460752267720345 ≈ 34.8M distinct tags (with one his_the_locker and one cancel message per tag, i.e. roughly 69M messages in total).
Affected versions
OTP 25.3
Additional context
Note that prevent_overlapping_partitions is set to false on all nodes (not the default; the default is true as of OTP 25).
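For completeness, this is the kernel application parameter; in our deployment it is turned off roughly like this (sys.config fragment shown as an illustration, not copied verbatim from our config):

```erlang
%% sys.config fragment: disables the OTP 25 default of true.
[
 {kernel, [{prevent_overlapping_partitions, false}]}
].
```

The effective value on a running node can be checked with application:get_env(kernel, prevent_overlapping_partitions).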
I'll extract and upload the global_(locks|pid_id|node_resources|...) ETS tables on the nodes for which logs are available.
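A minimal sketch of one way to do that extraction, dumping every ETS table whose name starts with global_ rather than hard-coding table names (the exact set differs between OTP versions):

```erlang
%% Run on each affected node; writes one .ets file per global_* table
%% and returns [{Table, ok | {error, Reason}}].
[{T, ets:tab2file(T, atom_to_list(T) ++ ".ets")}
 || T <- ets:all(), is_atom(T),
    lists:prefix("global_", atom_to_list(T))].
```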