
Fix: sbd-cluster: stop dispatching cmap if disconnected #80

Open

wants to merge 2 commits into main
Conversation

@skazi0 commented May 20, 2019

If the cmap socket is in HUP state, an attempt to dispatch incoming events
will trigger the callback again and cause an infinite loop with high
CPU load.
The added check solves this by destroying the cmap connection and
removing it from the main loop.
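A minimal glib-based sketch of that idea is below. It is not the actual sbd-cluster code (which uses its own mainloop helpers); the callback name and wiring are assumptions, only the corosync cmap and glib calls are real.

```c
#include <glib.h>
#include <glib-unix.h>
#include <corosync/corotypes.h>
#include <corosync/cmap.h>

/* Sketch of a cmap dispatch callback registered with g_unix_fd_add().
 * If the socket reports HUP/ERR, finalize the cmap connection and
 * return G_SOURCE_REMOVE so the fd is taken out of the main loop;
 * without that check the callback keeps firing and spins the CPU. */
static gboolean
cmap_dispatch_cb(gint fd, GIOCondition condition, gpointer user_data)
{
    cmap_handle_t *cmap = user_data;

    if (condition & (G_IO_HUP | G_IO_ERR | G_IO_NVAL)) {
        cmap_finalize(*cmap);      /* drop the dead connection */
        return G_SOURCE_REMOVE;    /* remove the fd from the main loop */
    }

    if (cmap_dispatch(*cmap, CS_DISPATCH_ALL) != CS_OK) {
        cmap_finalize(*cmap);
        return G_SOURCE_REMOVE;
    }
    return G_SOURCE_CONTINUE;
}
```

Registration would pair this with cmap_fd_get() to obtain the fd and something like g_unix_fd_add(fd, G_IO_IN | G_IO_HUP | G_IO_ERR, cmap_dispatch_cb, &handle).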

@wenningerk

Thanks for the update. Unfortunately GitHub doesn't create an email notification for an amended commit, which is why I only stumbled over the update now by polling manually. So it usually makes sense to add a short comment when pushing an amended commit.

@wenningerk

Ah, one more question arises looking at that error handling. If we just shut down the cmap tracking, we wouldn't be updated about changes anymore but would otherwise proceed, which might be dangerous when the cluster switches from more than 2 nodes to 2 nodes via configuration.
If this only happens when we lose the corosync connection, implying the cluster connection is lost as well, we would get a reconnection attempt. But if it can happen without the pacemaker CPG connection going away as well, we would probably have to trigger a full reconnection.
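For context, the cmap tracking in question is a corosync key watch, roughly as in the illustrative sketch below; the function and callback names are made up, only the cmap API calls are real.

```c
#include <corosync/cmap.h>

/* Illustrative sketch: watch "quorum.two_node" so that a configuration
 * switch between a 2-node and a >2-node cluster is noticed.  Silently
 * dropping the cmap connection means this notification never fires. */
static void
two_node_notify(cmap_handle_t handle, cmap_track_handle_t track,
                int32_t event, const char *key_name,
                struct cmap_notify_value new_value,
                struct cmap_notify_value old_value, void *user_data)
{
    /* re-read quorum.two_node here and adjust behaviour accordingly */
}

static cs_error_t
track_two_node(cmap_handle_t handle, cmap_track_handle_t *track)
{
    return cmap_track_add(handle, "quorum.two_node",
                          CMAP_TRACK_ADD | CMAP_TRACK_DELETE | CMAP_TRACK_MODIFY,
                          two_node_notify, NULL, track);
}
```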

@skazi0 (Author) commented May 27, 2019

@wenningerk I was thinking about this, but currently all CMAP failures (see https://github.com/ClusterLabs/sbd/blob/master/src/sbd-cluster.c#L197) are just warning level and don't trigger a reconnect.

@wenningerk commented May 27, 2019

> @wenningerk I was thinking about this, but currently all CMAP failures (see https://github.com/ClusterLabs/sbd/blob/master/src/sbd-cluster.c#L197) are just warning level and don't trigger a reconnect.

That isn't entirely true, as up to now we only have connection issues.
Those will lead to connected never becoming 'true', so sbd_membership_connect is stuck in a loop that prevents notify_parent from being called, and the failed steps get repeated.
The errors themselves are indeed warnings, but the watchdog/inquisitor should be taking care of the issue.
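The control flow being described is roughly the following; this is a hypothetical sketch with stub helpers, not the real sbd_membership_connect().

```c
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-ins for the real connect/notify steps. */
static bool try_cluster_connect(void) { return false; }
static void notify_parent(void) { }

/* Sketch of the retry behaviour: while connection attempts fail, the
 * loop keeps retrying, notify_parent() is never reached, the servant
 * never reports healthy, and the watchdog/inquisitor acts on that. */
static void
membership_connect_sketch(void)
{
    bool connected = false;

    do {
        connected = try_cluster_connect();
        if (!connected) {
            sleep(5);   /* retry delay */
        }
    } while (!connected);

    notify_parent();    /* only reached once connected */
}
```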

@skazi0 (Author) commented May 28, 2019

@wenningerk OK, so shall we really do a full reconnect (i.e. sbd_membership_destroy()) or just try to connect to cmap again? I'm thinking about creating another timer when the cmap connection is lost and keeping it running until we get a successful connection. WDYT?

@wenningerk

> @wenningerk OK, so shall we really do a full reconnect (i.e. sbd_membership_destroy()) or just try to connect to cmap again? I'm thinking about creating another timer when the cmap connection is lost and keeping it running until we get a successful connection. WDYT?

Good question ... multiple timers and the like introduce another source of raciness, so I'm leaning toward a simple solution. Another question is whether we should try corosync reconnects at all, or for robustness rather consider losing the corosync connection as fatal for the node (except for the shutdown cases, of course, but those should be handled within the watchdog timeout).

@skazi0 (Author) commented May 28, 2019

@wenningerk We experienced this CMAP connection loss a lot in our CI but couldn't identify the exact conditions under which it happens. I think it might have been caused by a quick start+restart when the sbd services hadn't fully started yet (CMAP connected but CPG not yet?) when the corosync restart was triggered. In such cases the old sbd-cluster process kept trying to use the old socket and didn't terminate correctly. Maybe the best solution would be to just exit sbd-cluster if such a situation is detected and let the watchdog do the restart?
The CMAP and CPG connections are served by the same corosync process, right? How is it even possible (besides the restart case) that the CMAP connection is lost while CPG is still working fine?

@wenningerk

> The CMAP and CPG connections are served by the same corosync process, right? How is it even possible (besides the restart case) that the CMAP connection is lost while CPG is still working fine?

wenningerk closed this May 28, 2019
@wenningerk

Sorry, that was the wrong button ...

wenningerk reopened this May 28, 2019
@wenningerk commented May 28, 2019

> @wenningerk We experienced this CMAP connection loss a lot in our CI but couldn't identify the exact conditions under which it happens. I think it might have been caused by a quick start+restart when the sbd services hadn't fully started yet (CMAP connected but CPG not yet?) when the corosync restart was triggered. In such cases the old sbd-cluster process kept trying to use the old socket and didn't terminate correctly. Maybe the best solution would be to just exit sbd-cluster if such a situation is detected and let the watchdog do the restart?

That is a good idea. It keeps things simple and is good for testability: with the process going away, there is just one way to get into the running-and-connected state, instead of all the reconnection cases that might end up in something slightly different. I've just recently modified pacemaker-watcher to work in a similar way.

> The CMAP and CPG connections are served by the same corosync process, right? How is it even possible (besides the restart case) that the CMAP connection is lost while CPG is still working fine?

The only thing I can imagine is restart issues. Yes, it should be the same process, but corosync is multithreaded, and I'm not familiar with the details there.

@skazi0 (Author) commented May 28, 2019

@wenningerk I added this new idea as a separate commit. I can squash it later if needed.

Review comment on src/sbd.h (outdated, resolved)
@skazi0 (Author) commented May 28, 2019

@wenningerk the base "exit-based reconnect" PR is #81. I will rebase this one on top of it once it's agreed on.

@jfriesse (Member) commented May 28, 2019

@skazi0 Just to answer your question about CMAP HUP with CPG still working: Corosync uses multiple sockets per service, and on exit it closes the sockets one by one in order of service IDs (an internal detail). CMAP is closed before CPG, so if sbd (or any other IPC client) is lucky enough, it will catch the CMAP HUP before the CPG HUP. But that state should really be temporary.

As you can see in the pacemaker code, the function pcmk_cpg_dispatch in lib/cluster/cpg.c handles cpg_dispatch errors. Then (please correct me if I'm wrong, I'm not really an expert in either the pcmk code or the gio code) it removes cpg_fd from the loop, and gio calls cluster->destroy. The code in SBD doesn't do that directly. What I don't understand is why cmap_destroy was not called by sbd_membership_destroy, which should delete the cmap fd from the main loop. But maybe it really was this quick start-restart that caused sbd to keep watching the old cmap fd and the new cpg fd?
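The pattern described here (dispatch error, fd removed from the loop, destroy callback fired) can be sketched in plain glib terms as below; this illustrates the mechanism only and is neither pacemaker's nor sbd's actual code.

```c
#include <glib.h>
#include <glib-unix.h>

/* If dispatch fails, return G_SOURCE_REMOVE: glib removes the fd from
 * the main loop and then runs the GDestroyNotify supplied at
 * registration time, which is where the "destroy" cleanup happens. */
static gboolean
cpg_fd_dispatch(gint fd, GIOCondition condition, gpointer user_data)
{
    if (condition & (G_IO_HUP | G_IO_ERR)) {
        return G_SOURCE_REMOVE;    /* triggers the destroy notify below */
    }
    /* ... call cpg_dispatch() here and check its return value ... */
    return G_SOURCE_CONTINUE;
}

static void
cpg_fd_destroy(gpointer user_data)
{
    /* cleanup, full reconnect, or exit decision goes here */
}

/* Registration: the destroy notify runs when the source is removed. */
static guint
watch_cpg_fd(int cpg_fd, gpointer data)
{
    return g_unix_fd_add_full(G_PRIORITY_DEFAULT, cpg_fd,
                              G_IO_IN | G_IO_HUP | G_IO_ERR,
                              cpg_fd_dispatch, data, cpg_fd_destroy);
}
```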

@skazi0 (Author) commented May 28, 2019

@jfriesse I've never seen the Lost connection to ... message from sbd_membership_destroy in my logs, so I think the cluster disconnect never happened in my tests.

Take a look at this piece of log:

maj 20 15:21:27 dfa-16-3e-29-57-31 systemd[1]: Stopping Corosync Cluster Engine...
maj 20 15:21:27 dfa-16-3e-29-57-31 systemd[1]: Starting Shared-storage based fencing daemon...
maj 20 15:21:28 dfa-16-3e-29-57-31 systemd[1]: Stopped Corosync Cluster Engine.
maj 20 15:21:28 dfa-16-3e-29-57-31 systemd[1]: Starting Corosync Cluster Engine...
maj 20 15:21:28 dfa-16-3e-29-57-31 systemd[1]: Started Shared-storage based fencing daemon.

It seems the corosync process got replaced somewhere during the sbd startup procedure. The CMAP connection is opened inside sbd_get_two_node, which is called from sbd_membership_connect before the actual cluster connection is established. Maybe the actual cluster connection already hit the new corosync process and worked correctly, and only the CMAP socket(s) were pointing at the old process.

To avoid problems with a lost CMAP connection, just exit and let the
inquisitor fix the situation by restarting the servant.
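In that spirit, the error branch collapses to something like the sketch below (assumed shape, not the literal patch):

```c
#include <glib.h>
#include <glib-unix.h>
#include <stdlib.h>

/* Sketch of the exit-based approach: on any cmap problem, do not try
 * to recover in place; exit the servant and let the inquisitor
 * restart it from a clean state. */
static gboolean
cmap_dispatch_exit_cb(gint fd, GIOCondition condition, gpointer user_data)
{
    if (condition & (G_IO_HUP | G_IO_ERR | G_IO_NVAL)) {
        g_warning("lost cmap connection, exiting so the inquisitor can restart us");
        exit(EXIT_FAILURE);
    }
    /* ... otherwise dispatch cmap notifications as usual ... */
    return G_SOURCE_CONTINUE;
}
```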
@jfriesse (Member)

@skazi0 Yep, sbd using the old corosync cmap fd and the new cpg fd seems to be the most probable reason. Nice catch!

jfriesse removed their request for review May 29, 2019 06:51
@skazi0 (Author) commented May 29, 2019

@wenningerk as I understand it, the cpg reconnect via restart is not as simple to implement as it seemed (certainly not with my level of expertise). How shall we proceed with this cmap fix?

@wenningerk

> @wenningerk as I understand it, the cpg reconnect via restart is not as simple to implement as it seemed (certainly not with my level of expertise). How shall we proceed with this cmap fix?

I would still like to see if we can sort out the startup/shutdown issues and exit in more or less all cases of disconnection, to trigger an immediate suicide.
