
Fix: sbd-cluster: stop dispatching cmap if disconnected #80

Open

wants to merge 2 commits into main
Conversation

@skazi0 commented May 20, 2019

If the cmap socket is in HUP state, an attempt to dispatch incoming events
will trigger the callback again and cause an infinite loop with high
CPU load.
The added check solves this by destroying the cmap connection and
removing it from the main loop.
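A minimal glib-based sketch of that idea is below. It is not the actual sbd-cluster code (which uses its own mainloop helpers); the callback name and wiring are assumptions, only the corosync cmap and glib calls are real.

```c
#include <glib.h>
#include <glib-unix.h>
#include <corosync/corotypes.h>
#include <corosync/cmap.h>

/* Sketch of a cmap dispatch callback registered with g_unix_fd_add().
 * If the socket reports HUP/ERR, finalize the cmap connection and
 * return G_SOURCE_REMOVE so the fd is taken out of the main loop;
 * without that check the callback keeps firing and spins the CPU. */
static gboolean
cmap_dispatch_cb(gint fd, GIOCondition condition, gpointer user_data)
{
    cmap_handle_t *cmap = user_data;

    if (condition & (G_IO_HUP | G_IO_ERR | G_IO_NVAL)) {
        cmap_finalize(*cmap);      /* drop the dead connection */
        return G_SOURCE_REMOVE;    /* remove the fd from the main loop */
    }

    if (cmap_dispatch(*cmap, CS_DISPATCH_ALL) != CS_OK) {
        cmap_finalize(*cmap);
        return G_SOURCE_REMOVE;
    }
    return G_SOURCE_CONTINUE;
}
```

Registration would pair this with cmap_fd_get() to obtain the fd and something like g_unix_fd_add(fd, G_IO_IN | G_IO_HUP | G_IO_ERR, cmap_dispatch_cb, &handle).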

@wenningerk

Thanks for the update. Unfortunately GitHub doesn't create an email notification for an amended commit, which is why I only stumbled over the update now by polling manually. So it usually makes sense to add a short comment when pushing an amended commit.

@wenningerk

Ah, one more question arises looking at that error handling. If we just shut down the cmap tracking, we wouldn't be updated about changes anymore but would otherwise proceed, which might be dangerous when the cluster switches from more than 2 nodes to 2 nodes via configuration.
If this only happens when we lose the corosync connection, implying the cluster connection is lost as well, we would get a reconnection attempt. But if it can happen without the pacemaker CPG connection going away as well, we would probably have to trigger a full reconnection.
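For context, the cmap tracking in question is a corosync key watch, roughly as in the illustrative sketch below; the function and callback names are made up, only the cmap API calls are real.

```c
#include <corosync/cmap.h>

/* Illustrative sketch: watch "quorum.two_node" so that a configuration
 * switch between a 2-node and a >2-node cluster is noticed.  Silently
 * dropping the cmap connection means this notification never fires. */
static void
two_node_notify(cmap_handle_t handle, cmap_track_handle_t track,
                int32_t event, const char *key_name,
                struct cmap_notify_value new_value,
                struct cmap_notify_value old_value, void *user_data)
{
    /* re-read quorum.two_node here and adjust behaviour accordingly */
}

static cs_error_t
track_two_node(cmap_handle_t handle, cmap_track_handle_t *track)
{
    return cmap_track_add(handle, "quorum.two_node",
                          CMAP_TRACK_ADD | CMAP_TRACK_DELETE | CMAP_TRACK_MODIFY,
                          two_node_notify, NULL, track);
}
```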

@skazi0 (Author) commented May 27, 2019

@wenningerk I was thinking about this, but currently all CMAP failures (see https://github.com/ClusterLabs/sbd/blob/master/src/sbd-cluster.c#L197) are just warning level and don't trigger a reconnect.

@wenningerk commented May 27, 2019

> @wenningerk I was thinking about this, but currently all CMAP failures (see https://github.com/ClusterLabs/sbd/blob/master/src/sbd-cluster.c#L197) are just warning level and don't trigger a reconnect.

That isn't entirely true, as up to now we only have connection issues.
Those will lead to connected never becoming 'true', so sbd_membership_connect is stuck in a loop that prevents notify_parent from being called, and the failed steps get repeated.
The errors themselves are indeed warnings, but the watchdog/inquisitor should be taking care of the issue.
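The control flow being described is roughly the following; this is a hypothetical sketch with stub helpers, not the real sbd_membership_connect().

```c
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-ins for the real connect/notify steps. */
static bool try_cluster_connect(void) { return false; }
static void notify_parent(void) { }

/* Sketch of the retry behaviour: while connection attempts fail, the
 * loop keeps retrying, notify_parent() is never reached, the servant
 * never reports healthy, and the watchdog/inquisitor acts on that. */
static void
membership_connect_sketch(void)
{
    bool connected = false;

    do {
        connected = try_cluster_connect();
        if (!connected) {
            sleep(5);   /* retry delay */
        }
    } while (!connected);

    notify_parent();    /* only reached once connected */
}
```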

@skazi0 (Author) commented May 28, 2019

@wenningerk OK, so shall we really do a full reconnect (i.e. sbd_membership_destroy()) or just try to connect to cmap again? I'm thinking about creating another timer when the cmap connection is lost and keeping it running until we get a successful connection. WDYT?

@wenningerk

> @wenningerk OK, so shall we really do a full reconnect (i.e. sbd_membership_destroy()) or just try to connect to cmap again? I'm thinking about creating another timer when the cmap connection is lost and keeping it running until we get a successful connection. WDYT?

Good question ... multiple timers and the like introduce another source of raciness, so I'm leaning toward a simple solution. Another question is whether we should try corosync reconnects at all, or for robustness rather consider losing the corosync connection as fatal for the node (except for the shutdown cases, of course, but those should be handled within the watchdog timeout).

@skazi0 (Author) commented May 28, 2019

@wenningerk We experienced this CMAP connection loss a lot in our CI but couldn't identify the exact conditions under which it happens. I think it might have been caused by a quick start+restart when the sbd services hadn't fully started yet (CMAP connected but CPG not yet?) when the corosync restart was triggered. In such cases the old sbd-cluster process kept trying to use the old socket and didn't terminate correctly. Maybe the best solution would be to just exit sbd-cluster if such a situation is detected and let the watchdog do the restart?
The CMAP and CPG connections are served by the same corosync process, right? How is it even possible (besides the restart case) that the CMAP connection is lost while CPG is still working fine?

@wenningerk

> The CMAP and CPG connections are served by the same corosync process, right? How is it even possible (besides the restart case) that the CMAP connection is lost while CPG is still working fine?

wenningerk closed this May 28, 2019
@wenningerk

Sorry, that was the wrong button ...

wenningerk reopened this May 28, 2019
@wenningerk commented May 28, 2019

> @wenningerk We experienced this CMAP connection loss a lot in our CI but couldn't identify the exact conditions under which it happens. I think it might have been caused by a quick start+restart when the sbd services hadn't fully started yet (CMAP connected but CPG not yet?) when the corosync restart was triggered. In such cases the old sbd-cluster process kept trying to use the old socket and didn't terminate correctly. Maybe the best solution would be to just exit sbd-cluster if such a situation is detected and let the watchdog do the restart?

That is a good idea. It keeps things simple and is good for testability: with the process going away, there is just one way to get into the running-and-connected state, instead of all the reconnection cases that might end up in something slightly different. I've just recently modified pacemaker-watcher to work in a similar way.

> The CMAP and CPG connections are served by the same corosync process, right? How is it even possible (besides the restart case) that the CMAP connection is lost while CPG is still working fine?

The only thing I can imagine is restart issues. Yes, it should be the same process, but corosync is multithreaded, and I'm not familiar with the details there.

@skazi0 (Author) commented May 28, 2019

@wenningerk I added this new idea as a separate commit. I can squash it later if needed.

Review comment on src/sbd.h (outdated, resolved)
@skazi0 (Author) commented May 28, 2019

@wenningerk the base "exit-based reconnect" PR is #81. I will rebase this one on top of it once it's agreed on.

@jfriesse (Member) commented May 28, 2019

@skazi0 Just to answer your question about CMAP HUP with CPG still working: Corosync uses multiple sockets per service, and on exit it closes the sockets one by one in order of service IDs (an internal detail). CMAP is closed before CPG, so if sbd (or any other IPC client) is lucky enough, it will catch the CMAP HUP before the CPG HUP. But that state should really be temporary.

As you can see in the pacemaker code, the function pcmk_cpg_dispatch in lib/cluster/cpg.c handles cpg_dispatch errors. Then (please correct me if I'm wrong, I'm not really an expert in either the pcmk code or the gio code) it removes cpg_fd from the loop, and gio calls cluster->destroy. The code in SBD doesn't do that directly. What I don't understand is why cmap_destroy was not called by sbd_membership_destroy, which should delete the cmap fd from the main loop. But maybe it really was this quick start-restart that caused sbd to keep watching the old cmap fd and the new cpg fd?
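The pattern described here (dispatch error, fd removed from the loop, destroy callback fired) can be sketched in plain glib terms as below; this illustrates the mechanism only and is neither pacemaker's nor sbd's actual code.

```c
#include <glib.h>
#include <glib-unix.h>

/* If dispatch fails, return G_SOURCE_REMOVE: glib removes the fd from
 * the main loop and then runs the GDestroyNotify supplied at
 * registration time, which is where the "destroy" cleanup happens. */
static gboolean
cpg_fd_dispatch(gint fd, GIOCondition condition, gpointer user_data)
{
    if (condition & (G_IO_HUP | G_IO_ERR)) {
        return G_SOURCE_REMOVE;    /* triggers the destroy notify below */
    }
    /* ... call cpg_dispatch() here and check its return value ... */
    return G_SOURCE_CONTINUE;
}

static void
cpg_fd_destroy(gpointer user_data)
{
    /* cleanup, full reconnect, or exit decision goes here */
}

/* Registration: the destroy notify runs when the source is removed. */
static guint
watch_cpg_fd(int cpg_fd, gpointer data)
{
    return g_unix_fd_add_full(G_PRIORITY_DEFAULT, cpg_fd,
                              G_IO_IN | G_IO_HUP | G_IO_ERR,
                              cpg_fd_dispatch, data, cpg_fd_destroy);
}
```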

@skazi0 (Author) commented May 28, 2019

@jfriesse I've never seen the Lost connection to ... message from sbd_membership_destroy in my logs, so I think the cluster disconnect never happened in my tests.

Take a look at this piece of log:

maj 20 15:21:27 dfa-16-3e-29-57-31 systemd[1]: Stopping Corosync Cluster Engine...
maj 20 15:21:27 dfa-16-3e-29-57-31 systemd[1]: Starting Shared-storage based fencing daemon...
maj 20 15:21:28 dfa-16-3e-29-57-31 systemd[1]: Stopped Corosync Cluster Engine.
maj 20 15:21:28 dfa-16-3e-29-57-31 systemd[1]: Starting Corosync Cluster Engine...
maj 20 15:21:28 dfa-16-3e-29-57-31 systemd[1]: Started Shared-storage based fencing daemon.

It seems the corosync process got replaced somewhere during the sbd startup procedure. The CMAP connection is opened inside sbd_get_two_node, which is called from sbd_membership_connect before the actual cluster connection is established. Maybe the actual cluster connection already hit the new corosync process and worked correctly, and only the CMAP socket(s) were pointing at the old process.

To avoid problems with a lost CMAP connection, just exit and let the
inquisitor fix the situation by restarting the servant.
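In that spirit, the error branch collapses to something like the sketch below (assumed shape, not the literal patch):

```c
#include <glib.h>
#include <glib-unix.h>
#include <stdlib.h>

/* Sketch of the exit-based approach: on any cmap problem, do not try
 * to recover in place; exit the servant and let the inquisitor
 * restart it from a clean state. */
static gboolean
cmap_dispatch_exit_cb(gint fd, GIOCondition condition, gpointer user_data)
{
    if (condition & (G_IO_HUP | G_IO_ERR | G_IO_NVAL)) {
        g_warning("lost cmap connection, exiting so the inquisitor can restart us");
        exit(EXIT_FAILURE);
    }
    /* ... otherwise dispatch cmap notifications as usual ... */
    return G_SOURCE_CONTINUE;
}
```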
@jfriesse (Member)

@skazi0 Yep, sbd using the old corosync cmap fd and the new cpg fd seems to be the most probable reason. Nice catch!

jfriesse removed their request for review May 29, 2019 06:51
@skazi0 (Author) commented May 29, 2019

@wenningerk as I understand it, the cpg reconnect via restart is not as simple to implement as it seemed (certainly not with my level of expertise). How shall we proceed with this cmap fix?

@wenningerk

> @wenningerk as I understand it, the cpg reconnect via restart is not as simple to implement as it seemed (certainly not with my level of expertise). How shall we proceed with this cmap fix?

I would still like to see if we can sort out the startup/shutdown issues and exit in more or less all cases of disconnection, to trigger an immediate suicide.
