
Global lock stuck, blocking new nodes to get synced #9112

Open
sirihansen opened this issue Nov 26, 2024 · 2 comments
Labels
bug (Issue is reported as a bug), team:VM (Assigned to OTP team VM)

Comments

@sirihansen
Contributor

sirihansen commented Nov 26, 2024

Describe the bug
We have a cluster of up to 80 Erlang nodes, and sometimes when we try to add a new node to the cluster we notice that it does not get connected to all nodes, but only to the node it pings or to a subset of the cluster. After a lot of tracing I found out that one of the nodes in the cluster has a global lock set which never seems to be released:

ets:tab2list(global_locks).
[{global,[<7352.59.0>,<9811.59.0>],
         [{<7352.59.0>,#Ref<0.2322150621.116391938.1946>}]}]
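
As a minimal sketch, one way to check whether the pids holding such a lock are still alive on their owning nodes, assuming the {ResourceId, LockRequesterId, PidRefs} entry shape shown above:

%% For each lock entry, ask the owning node whether the holding process
%% is still alive (assumes the entry layout shown above).
[{Pid, rpc:call(node(Pid), erlang, is_process_alive, [Pid])}
 || {_Resource, _Requesters, PidRefs} <- ets:tab2list(global_locks),
    {Pid, _Ref} <- PidRefs].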

This lock blocks all other nodes from taking a lock and thereby from syncing with new nodes.

When I look at the state of 'global' on different nodes in the cluster, I see that many of them have pending node syncs and several resolvers. Also, the number of synced nodes known by 'global' often differs from the number of nodes returned by the nodes() BIF.
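
As a minimal sketch, one way to look at this on a node, assuming sys:get_state/1 is usable on global_name_server (the state record layout is version dependent, so the state is only dumped, not parsed):

%% Dump global's internal state and the node list side by side.
%% The record layout differs between OTP versions, so it is only printed.
io:format("global state: ~p~nnodes(): ~p~n",
          [sys:get_state(global_name_server), nodes()]).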

To Reproduce
Not sure. I suspect that we end up in this state after network outages. The last time I saw it was after a network maintenance session, but since I did not have any monitoring in place I cannot say for sure that this was what triggered it.

Expected behavior
When connecting a new node to a cluster I expect the new node to get connected to all other nodes in the cluster.

Affected versions
We are currently running OTP-26.2. We have seen the same behaviour on earlier versions as well - at least on OTP-24.

Additional context
I am trying to add some monitoring of the internal state of 'global' and of the global_locks table to see if I can pinpoint when things start going wrong, hoping this might help us get a better understanding of what happens. I currently log a warning if more than two samples in a row show the exact same global lock, or if the number of nodes that 'global' sees as synced differs from the number of nodes returned by the nodes() BIF.
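
A minimal sketch of such a sampler, assuming only ets:tab2list/1, nodes/0, global:info/0 (OTP 25.1 or later) and logger; the sampling interval is an assumption, global:info() is logged as an opaque term, and the synced-node comparison is not included here since the shape of the global:info() output is version dependent:

-module(global_lock_monitor).
-export([start/1]).

%% Sample the global_locks table every IntervalMs milliseconds and log a
%% warning when the exact same (non-empty) lock entries are seen in more
%% than two consecutive samples. nodes() and global:info() are logged as
%% context but not interpreted.
start(IntervalMs) ->
    spawn(fun() -> loop(IntervalMs, undefined, 0) end).

loop(IntervalMs, Prev, Repeats0) ->
    timer:sleep(IntervalMs),
    Locks = lists:sort(ets:tab2list(global_locks)),
    Repeats = case Locks of
                  Prev when Locks =/= [] -> Repeats0 + 1;
                  _ -> 0
              end,
    case Repeats >= 2 of
        true ->
            logger:warning("global lock unchanged for ~p consecutive samples: ~p~n"
                           "nodes(): ~p~nglobal:info(): ~p",
                           [Repeats + 1, Locks, nodes(), global:info()]);
        false -> ok
    end,
    loop(IntervalMs, Locks, Repeats).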

Do you have any suggestions of other things to monitor that might help?

We currently have this bad state with a stuck global lock in our test environment. I will at some point need to recover from it (by restarting at least the node holding the stuck lock), but if it could be of any help I can extract data from the cluster before I do that. Please let me know what could be interesting to look at.

@sirihansen sirihansen added the bug Issue is reported as a bug label Nov 26, 2024
@IngelaAndin IngelaAndin added the team:VM Assigned to OTP team VM label Nov 26, 2024
@rickard-green
Contributor

Hi Siri!

Please post or mail me the results of these expressions from the nodes identified by the pids in the global_locks table, from the node that still holds that lock, and from another suitable node (a sketch for collecting them follows the list):

  • global:info().
  • ets:tab2list(global_locks).
  • ets:tab2list(global_pid_ids).
  • ets:tab2list(global_node_resources).
  • Ref=make_ref(), global:get_locker() ! {get_state, self(), Ref}, receive {Ref, _, _} = Res -> Res end.
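
As a convenience, a sketch for collecting the first four items from a remote node via rpc (the Collect fun is hypothetical, not something defined by global); the last expression talks to the local locker process, so it is easiest to run in a shell on each node:

%% Hypothetical helper: gather the first four items from Node over rpc so
%% the results can be pasted into the issue.
Collect = fun(Node) ->
              [{info,                  rpc:call(Node, global, info, [])},
               {global_locks,          rpc:call(Node, ets, tab2list, [global_locks])},
               {global_pid_ids,        rpc:call(Node, ets, tab2list, [global_pid_ids])},
               {global_node_resources, rpc:call(Node, ets, tab2list, [global_node_resources])}]
          end.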

@rickard-green
Contributor

Please run this on the above nodes as well:

  • process_info(whereis(global_name_server), dictionary).
