global OOM issue (69M messages in global - his_the_locker and cancel with distinct tags)
Describe the bug
Multiple nodes ran out of memory on a 45-node cluster.
The memory_data entries in the logs indicate that the worst offender was the same process on all of the affected nodes: the global locker (global:locker). A process backtrace is available for one of the affected nodes.
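For anyone wanting to confirm the same pattern on their own cluster, here is a minimal sketch (not taken from the incident itself, and assuming the os_mon application is running so that memsup is available) of how the worst offender can be read back at runtime:

```erlang
%% Minimal sketch, assuming os_mon (and therefore memsup) is running.
%% memsup reports the process it considers the worst memory offender;
%% process_info then shows whether that process is the global locker.
{_Total, _Allocated, Worst} = memsup:get_memory_data(),
case Worst of
    {Pid, PidBytes} ->
        io:format("worst offender ~p (~p bytes): ~p~n",
                  [Pid, PidBytes,
                   erlang:process_info(Pid, [current_function,
                                             initial_call,
                                             message_queue_len,
                                             memory])]);
    undefined ->
        io:format("memsup has no worst-offender data yet~n")
end.
```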
To Reproduce
The issue has occurred once and we have not been able to reproduce it. Our guess is that it is triggered by repeated nodedown events. One node (node85) was failed over and then added back to the cluster. While it was being added back, the dist connections from node85 to every other node in the cluster were repeatedly shut down and recreated (~500 times), and there are numerous net_kernel nodedown messages in the logs. Immediately after node85 successfully establishes the dist connection to node74, global on node74 starts consuming memory, and the consumption grows linearly.
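As a rough illustration of what we were watching for, a small process like the one below (the module and names are ours, not part of OTP) counts the nodeup/nodedown flaps and samples the registered global_name_server. The locker itself is an unregistered helper process of global, so this only serves as a proxy for its growth:

```erlang
-module(flap_watch).
-export([start/0]).

%% Sketch only: counts nodeup/nodedown events per node and samples
%% global_name_server's memory and message queue length whenever a
%% node comes back up.
start() ->
    spawn(fun() ->
              ok = net_kernel:monitor_nodes(true),
              loop(#{})
          end).

loop(Counts0) ->
    receive
        {nodedown, Node} ->
            loop(bump({Node, down}, Counts0));
        {nodeup, Node} ->
            Counts = bump({Node, up}, Counts0),
            GlobalPid = whereis(global_name_server),
            io:format("~p up (~p times so far), global_name_server: ~p~n",
                      [Node, maps:get({Node, up}, Counts),
                       erlang:process_info(GlobalPid,
                                           [memory, message_queue_len])]),
            loop(Counts)
    end.

bump(Key, Counts) ->
    maps:update_with(Key, fun(N) -> N + 1 end, 1, Counts).
```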
Logs on node74 indicate 69M messages:
From <0.55.0>'s process dictionary:
It appears there have been 576460752302537370 - 576460752267720345 ≈ 34.8M distinct tags (with one his_the_locker and one cancel message per tag, i.e. roughly 69M messages in total).
Affected versions
OTP 25.3
Additional context
Note that prevent_overlapping_partitions is set to false on all nodes (not the default; the default is true as of OTP 25).
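For completeness, this is the kernel application parameter; in our deployment it is turned off roughly like this (sys.config fragment shown as an illustration, not copied verbatim from our config):

```erlang
%% sys.config fragment: disables the OTP 25 default of true.
[
 {kernel, [{prevent_overlapping_partitions, false}]}
].
```

The effective value on a running node can be checked with application:get_env(kernel, prevent_overlapping_partitions).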
I'll extract and upload the global_(locks|pid_id|node_resources|...) ETS tables on the nodes for which logs are available.
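A minimal sketch of one way to do that extraction, dumping every ETS table whose name starts with global_ rather than hard-coding table names (the exact set differs between OTP versions):

```erlang
%% Run on each affected node; writes one .ets file per global_* table
%% and returns [{Table, ok | {error, Reason}}].
[{T, ets:tab2file(T, atom_to_list(T) ++ ".ets")}
 || T <- ets:all(), is_atom(T),
    lists:prefix("global_", atom_to_list(T))].
```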