Describe the bug
We have a cluster of up to 80 Erlang nodes. Sometimes, when we try to add a new node to the cluster, it does not get connected to all nodes but only to the one it pings, or to a subset of the cluster. After a lot of tracing I found that one of the nodes in the cluster holds a global lock that never seems to be released.
This lock blocks all other nodes from taking the lock, and thereby from syncing with new nodes.
When I look at the state of 'global' on different nodes in the cluster, I see that many of them have pending node syncs and several resolvers. Also, the number of synced nodes known by 'global' often differs from the number of nodes returned by the nodes() BIF.
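Something like the following can be used to spot which node currently holds a global lock (a minimal sketch; find_global_locks/0 is just an illustrative helper, not part of 'global' itself):

%% Dump the global_locks ETS table on the local node and on every connected
%% node; any non-empty result shows a currently held global lock.
find_global_locks() ->
    [{N, erpc:call(N, ets, tab2list, [global_locks])}
     || N <- [node() | nodes()]].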
To Reproduce
Not sure. I suspect that we end up in this state after network outages. The last time I saw it was after a network maintenance session, but since I did not have any monitoring in place I do not know for sure that this was what triggered it.
Expected behavior
When connecting a new node to a cluster I expect the new node to get connected to all other nodes in the cluster.
Affected versions
We are currently running OTP-26.2. We have seen the same behaviour on earlier versions as well - at least on OTP-24.
Additional context
I am trying to add some monitoring of the internal state of 'global' and of the global_locks table to see if I can pinpoint when things start going wrong, hoping this might help give a better understanding of what happens. I'm currently logging if more than two samples in a row show the exact same global lock, and if the number of nodes seen as synced by 'global' differs from the number of nodes returned by the nodes() BIF.
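Roughly, the stuck-lock part of that check looks like this (a sketch only; check_global_lock/1 is a hypothetical helper, and the synced-node comparison is omitted here since it reads the internal, version-dependent state of 'global'):

%% PrevSamples holds the two most recent ets:tab2list(global_locks) results.
%% Warn if the current sample is non-empty and equals both previous samples,
%% i.e. more than two samples in a row show the exact same global lock.
check_global_lock(PrevSamples) ->
    Current = ets:tab2list(global_locks),
    case Current =/= [] andalso length(PrevSamples) >= 2 andalso
         lists:all(fun(S) -> S =:= Current end, PrevSamples) of
        true  -> logger:warning("global lock appears stuck: ~p", [Current]);
        false -> ok
    end,
    [Current | lists:sublist(PrevSamples, 1)].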
Do you have any suggestions of other things to monitor that might help?
We currently have this bad state with a stuck global lock in our test environment. I will at some point need to clear it (by restarting at least the node holding the stuck lock), but if it could be of any help I can extract data from the cluster before I do. Please let me know what would be interesting to look at.
Please post or mail me the results of these expressions from the nodes identified by the pids in the global_locks table, the node that still holds that lock, and another suitable node:
global:info().
ets:tab2list(global_locks).
ets:tab2list(global_pid_ids).
ets:tab2list(global_node_resources).
Ref=make_ref(), global:get_locker() ! {get_state, self(), Ref}, receive {Ref, _, _} = Res -> Res end.
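If a node is not convenient to attach a shell to, the first four expressions can be collected remotely along these lines (just a sketch; gather_global_info/1 is an ad-hoc helper, not an OTP API). The last expression sends a message from self(), so it has to run in a process on the node itself, e.g. from a remote shell.

%% Collect the global diagnostics above from Node and return them in a map.
gather_global_info(Node) ->
    #{info           => erpc:call(Node, global, info, []),
      locks          => erpc:call(Node, ets, tab2list, [global_locks]),
      pid_ids        => erpc:call(Node, ets, tab2list, [global_pid_ids]),
      node_resources => erpc:call(Node, ets, tab2list, [global_node_resources])}.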