Fix: sbd-cluster: stop dispatching cmap if disconnected #80
base: main
Conversation
If the cmap socket is in HUP state, an attempt to dispatch incoming events will trigger the callback again and cause an infinite loop with high CPU load. The added check solves this by destroying the cmap connection and removing it from the main loop.
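For illustration, here is a minimal sketch of the kind of check described above, written against the public corosync cmap API and a plain glib fd watch rather than the actual sbd-cluster internals; the callback and registration names are placeholders, not sbd functions:

```c
#include <corosync/cmap.h>
#include <glib.h>

static cmap_handle_t cmap_handle;

/* Watch callback for the cmap fd: stop dispatching once the socket is in
 * HUP state instead of re-entering this callback forever. */
static gboolean
cmap_io_cb(GIOChannel *channel, GIOCondition condition, gpointer user_data)
{
    if (condition & (G_IO_HUP | G_IO_ERR)) {
        /* Connection is gone: dispatching again would only re-trigger this
         * callback, so destroy the connection and drop the watch. */
        cmap_finalize(cmap_handle);
        return FALSE;   /* FALSE removes the source from the main loop */
    }

    if (cmap_dispatch(cmap_handle, CS_DISPATCH_ALL) != CS_OK) {
        cmap_finalize(cmap_handle);
        return FALSE;
    }

    return TRUE;        /* keep watching */
}

/* Registration, assuming cmap_handle is already initialized and tracking keys. */
static void
watch_cmap_fd(void)
{
    int fd = -1;

    if (cmap_fd_get(cmap_handle, &fd) == CS_OK) {
        GIOChannel *channel = g_io_channel_unix_new(fd);

        g_io_add_watch(channel, G_IO_IN | G_IO_HUP | G_IO_ERR,
                       cmap_io_cb, NULL);
        g_io_channel_unref(channel);
    }
}
```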
Thanks for the update. Unfortunately GitHub doesn't create an email notification for an amended commit, which is why I only stumbled over the update just now by manually polling. So it usually makes sense to add an extra comment when pushing an amended commit.
Aah, one more question arises looking at that error handling. If we just close down the cmap tracking, we wouldn't be updated about changes anymore but would otherwise proceed, which might be dangerous when the cluster switches from more than 2 nodes to 2 nodes due to a configuration change.
@wenningerk I was thinking about this, but currently all CMAP failures (see https://github.com/ClusterLabs/sbd/blob/master/src/sbd-cluster.c#L197) are just "warning level" and don't trigger a reconnect.
That isn't entirely true, as up to now we just have connection issues.
@wenningerk OK, so shall we really do a full reconnect (i.e. …)?
Good question ... multiple timers and stuff introduce another source of raciness, so I'm leaning toward a simple solution. Another question is whether we should try corosync reconnects at all, or for robustness rather consider connection loss to corosync as fatal for the node (except the shutdown cases of course, but those should be handled within the watchdog timeout).
@wenningerk We experienced this CMAP connection loss a lot in our CI but couldn't identify the exact conditions under which it happened. I think it might have been caused by a quick start+restart when the sbd services hadn't fully started yet (CMAP connected but CPG not yet?) at the moment the corosync restart was triggered. In such cases the old sbd-cluster process kept trying to use the old socket and didn't terminate correctly. Maybe the best solution would be to just exit sbd-cluster if such a situation is detected and let the watchdog do the restart?
Sorry, that was the wrong button ...
That is a good idea. It keeps things simple and is good for testability: with the process going away, there is just one way to get into the running-and-connected state, instead of all the reconnection cases that might each end up in something slightly different. I've just recently modified pacemaker-watcher to work in a similar way.
The only thing I can imagine is restart issues. Yes, it should be the same process, but corosync is multithreaded and I'm not familiar with the details there.
@wenningerk I added this new idea as a separate commit. I can squash it later if needed.
@wenningerk the base "exit based reconnect" PR is #81. I will rebase this one on top of that once it's agreed on.
@skazi0 Just to answer your question about CMAP HUP with CPG still working: Corosync uses multiple sockets per service, and on exit it closes the sockets one by one in order of service IDs (an internal detail). CMAP is closed before CPG, so if sbd (or any other IPC client) is lucky enough, it will catch the CMAP HUP before the CPG HUP. But that state should be really temporary. As you can see in the pacemaker code
@jfriesse I've never seen … Take a look at this piece of log:
It seems that the corosync process got replaced somewhere during the sbd startup procedure. The CMAP connection is opened inside
To avoid problems with a lost CMAP connection, just exit and let the inquisitor fix the situation by restarting the servant.
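As a rough sketch of that approach (illustrative names, not the actual sbd servant code), the dispatch callback from the sketch above could simply treat any sign of a lost cmap connection as fatal and exit, relying on the inquisitor (or ultimately the watchdog) to restart the servant:

```c
#include <corosync/cmap.h>
#include <glib.h>
#include <stdlib.h>

/* cmap_handle and the fd watch are assumed to be set up as in the earlier sketch. */
extern cmap_handle_t cmap_handle;

static gboolean
cmap_io_cb(GIOChannel *channel, GIOCondition condition, gpointer user_data)
{
    if ((condition & (G_IO_HUP | G_IO_ERR)) ||
        cmap_dispatch(cmap_handle, CS_DISPATCH_ALL) != CS_OK) {
        /* Treat loss of the cmap connection as fatal for this servant:
         * exit and let the parent restart it cleanly rather than trying
         * to reconnect in place. */
        cmap_finalize(cmap_handle);
        exit(EXIT_FAILURE);
    }
    return TRUE;
}
```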
@skazi0 Yep, sbd using the old corosync cmap fd and the new cpg fd seems to be the most probable reason. Nice catch!
@wenningerk As I understand it, the cpg reconnect via restart is not as simple to implement as it seemed (for sure not with my level of expertise). How shall we proceed with this cmap fix?
I would still like to see if we can sort out the startup/shutdown issues and exit in more or less all cases of disconnection to trigger an immediate suicide.
If the cmap socket is in HUP state, an attempt to dispatch incoming events
will trigger the callback again and cause an infinite loop with high
CPU load.
The added check solves this by destroying the cmap connection and
removing it from the main loop.