
global OOM issue (69M messages in global - his_the_locker and cancel with distinct tags) #9117

Open
neelima32 opened this issue Nov 26, 2024 · 1 comment
Labels
bug Issue is reported as a bug team:VM Assigned to OTP team VM

Comments


neelima32 commented Nov 26, 2024

Describe the bug
Multiple nodes in a 45-node cluster ran out of memory.
The memory_data in the logs indicates that the worst offender was the same process (global:locker) on all affected nodes. A process backtrace is available for one of the affected nodes.

To Reproduce
It has occurred once and we have not been able to reproduce it. Our guess is that it happens when there are multiple nodedown messages. One node (node85) was failed over and then added back to the cluster. While it was being added back, the dist connections from node85 to every other node in the cluster were repeatedly shut down and recreated (~500 times); there are numerous net_kernel nodedown messages in the logs. Immediately after node85 successfully established the dist connection to node74, global on node74 started consuming memory, and the consumption increased linearly.

Logs on node74 indicate 69M messages:

   {messages,
       [<<"{cancel,'ns_1@node-85',-576460752302537370,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387318,-576460752302537369}">>,
        <<"{cancel,'ns_1@node-85',-576460752302537369,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387317,-576460752302537368}">>,
        <<"{cancel,'ns_1@node-85',-576460752302537368,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387316,-576460752302537367}">>,
        <<"{cancel,'ns_1@node-85',-576460752302537367,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387315,-576460752302537366}">>,
        <<"{cancel,'ns_1@node-85',-576460752302537366,no_fun}">>,
        <<"{his_the_locker,<18602.56.0>,{8,[]},-576460752303387314,-576460752302537365}">>]},
   {message_queue_len,69628685},

From <0.55.0>'s process dictionary:

    {{sync_tag_my,
         'ns_1@node-85'},
     -576460752267720345},
    {{sync_tag_his,
         'ns_1@node-85'},
     -576460752268570427},

It appears there have been 576460752302537370 - 576460752267720345 = ~34.8M tags (with two messages per tag, his_the_locker and cancel, accounting for the ~69M messages in the queue).
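The arithmetic above can be sanity-checked against the reported message_queue_len. This is a minimal sketch; the tag values are copied from the backtrace and process dictionary quoted earlier, and the "two messages per tag" assumption comes from the his_the_locker/cancel pairs visible in the sampled messages.

```python
# Earliest cancel tag seen in the sampled messages (backtrace above).
first_tag = -576460752302537370
# sync_tag_my for 'ns_1@node-85' from <0.55.0>'s process dictionary.
last_tag = -576460752267720345

distinct_tags = last_tag - first_tag      # tags are monotonically increasing
messages = 2 * distinct_tags              # one his_the_locker + one cancel per tag

print(distinct_tags)  # ~34.8M distinct tags
print(messages)       # ~69.6M, close to the reported message_queue_len of 69628685
```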

Affected versions
OTP 25.3

Additional context
Note that prevent_overlapping_partitions is false on all nodes (not the default, which is true on OTP 25).
I'll extract and upload the global_(locks|pid_id|node_resources|...) ETS tables on the nodes for which logs are available.
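For context, the non-default setting mentioned above is a kernel application parameter; an illustrative sys.config fragment (an assumption about how it is configured in this deployment, not taken from the logs) would look like:

```erlang
%% sys.config: disable the default overlapping-partition protection
%% introduced in OTP 25 (default is true).
[{kernel, [{prevent_overlapping_partitions, false}]}].
```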

@neelima32 neelima32 added the bug Issue is reported as a bug label Nov 26, 2024
@neelima32 neelima32 changed the title global OOM issue (64M messages in global - his_the_locker and cancel with distinct tags) global OOM issue (69M messages in global - his_the_locker and cancel with distinct tags) Nov 26, 2024
neelima32 (Author) commented:

logs.tar.gz

@IngelaAndin IngelaAndin added the team:VM Assigned to OTP team VM label Nov 27, 2024