
Global lock stuck, blocking new nodes to get synced #9112

Open
sirihansen opened this issue Nov 26, 2024 · 2 comments
Labels
bug (Issue is reported as a bug), team:VM (Assigned to OTP team VM)

Comments

@sirihansen
Contributor

sirihansen commented Nov 26, 2024

Describe the bug
We have a cluster of up to 80 Erlang nodes, and sometimes when we try to add a new node to the cluster we notice that it does not get connected to all nodes, but only to the node it pings or to a subset of the cluster. After a lot of tracing I found out that one of the nodes in the cluster has a global lock set which never seems to be released:

ets:tab2list(global_locks).
[{global,[<7352.59.0>,<9811.59.0>],
         [{<7352.59.0>,#Ref<0.2322150621.116391938.1946>}]}]
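
As a minimal sketch, one way to check whether the pids holding such a lock are still alive on their owning nodes, assuming the {ResourceId, LockRequesterId, PidRefs} entry shape shown above:

%% For each lock entry, ask the owning node whether the holding process
%% is still alive (assumes the entry layout shown above).
[{Pid, rpc:call(node(Pid), erlang, is_process_alive, [Pid])}
 || {_Resource, _Requesters, PidRefs} <- ets:tab2list(global_locks),
    {Pid, _Ref} <- PidRefs].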

This lock blocks all other nodes from taking a lock and thereby from syncing with new nodes.

When I look at the state of 'global' on different nodes in the cluster, I see that many of them have pending node syncs and several resolvers. Also, the number of synced nodes known by 'global' often differs from the number of nodes returned by the nodes() BIF.
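
As a minimal sketch, one way to look at this on a node, assuming sys:get_state/1 is usable on global_name_server (the state record layout is version dependent, so the state is only dumped, not parsed):

%% Dump global's internal state and the node list side by side.
%% The record layout differs between OTP versions, so it is only printed.
io:format("global state: ~p~nnodes(): ~p~n",
          [sys:get_state(global_name_server), nodes()]).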

To Reproduce
Not sure. I suspect that we end up in this state after network outages. The last time I saw it was after a network maintenance session, but since I did not have any monitoring in place I cannot say for sure that this was what triggered it.

Expected behavior
When connecting a new node to a cluster I expect the new node to get connected to all other nodes in the cluster.

Affected versions
We are currently running OTP-26.2. We have seen the same behaviour on earlier versions as well - at least on OTP-24.

Additional context
I am trying to add some monitoring of the internal state of 'global' and of the global_locks table to see if I can pinpoint when things start going wrong, hoping this might help us get a better understanding of what happens. I currently log a warning if more than two samples in a row show the exact same global lock, or if the number of nodes that 'global' sees as synced differs from the number of nodes returned by the nodes() BIF.
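
A minimal sketch of such a sampler, assuming only ets:tab2list/1, nodes/0, global:info/0 (OTP 25.1 or later) and logger; the sampling interval is an assumption, global:info() is logged as an opaque term, and the synced-node comparison is not included here since the shape of the global:info() output is version dependent:

-module(global_lock_monitor).
-export([start/1]).

%% Sample the global_locks table every IntervalMs milliseconds and log a
%% warning when the exact same (non-empty) lock entries are seen in more
%% than two consecutive samples. nodes() and global:info() are logged as
%% context but not interpreted.
start(IntervalMs) ->
    spawn(fun() -> loop(IntervalMs, undefined, 0) end).

loop(IntervalMs, Prev, Repeats0) ->
    timer:sleep(IntervalMs),
    Locks = lists:sort(ets:tab2list(global_locks)),
    Repeats = case Locks of
                  Prev when Locks =/= [] -> Repeats0 + 1;
                  _ -> 0
              end,
    case Repeats >= 2 of
        true ->
            logger:warning("global lock unchanged for ~p consecutive samples: ~p~n"
                           "nodes(): ~p~nglobal:info(): ~p",
                           [Repeats + 1, Locks, nodes(), global:info()]);
        false -> ok
    end,
    loop(IntervalMs, Locks, Repeats).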

Do you have any suggestions of other things to monitor that might help?

We currently have this bad state with a stuck global lock in our test environment. I will at some point need to recover from it (by restarting at least the node holding the stuck lock), but if it could be of any help I can extract data from the cluster before I do that. Please let me know what could be interesting to look at.

@sirihansen sirihansen added the bug Issue is reported as a bug label Nov 26, 2024
@IngelaAndin IngelaAndin added the team:VM Assigned to OTP team VM label Nov 26, 2024
@rickard-green
Contributor

Hi Siri!

Please post or mail me the results of these expressions from the nodes identified by the pids in the global_locks table, from the node that still holds that lock, and from another suitable node (a sketch for collecting them follows the list):

  • global:info().
  • ets:tab2list(global_locks).
  • ets:tab2list(global_pid_ids).
  • ets:tab2list(global_node_resources).
  • Ref=make_ref(), global:get_locker() ! {get_state, self(), Ref}, receive {Ref, _, _} = Res -> Res end.
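
As a convenience, a sketch for collecting the first four items from a remote node via rpc (the Collect fun is hypothetical, not something defined by global); the last expression talks to the local locker process, so it is easiest to run in a shell on each node:

%% Hypothetical helper: gather the first four items from Node over rpc so
%% the results can be pasted into the issue.
Collect = fun(Node) ->
              [{info,                  rpc:call(Node, global, info, [])},
               {global_locks,          rpc:call(Node, ets, tab2list, [global_locks])},
               {global_pid_ids,        rpc:call(Node, ets, tab2list, [global_pid_ids])},
               {global_node_resources, rpc:call(Node, ets, tab2list, [global_node_resources])}]
          end.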

@rickard-green
Contributor

Please run this on the above nodes as well:

  • process_info(whereis(global_name_server), dictionary).
