sbd-cluster: Simplify cluster connection loss handling #81
base: main
Conversation
If the cluster connection is lost, exit sbd-cluster and let the inquisitor handle the reconnect by restarting it.
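A minimal sketch of what that approach could look like, assuming the watcher is driven by a mainloop callback on the CPG file descriptor; the function name, the exit code, and the use of clplumbing logging are illustrative assumptions, not the actual sbd-cluster code:

```c
/*
 * Sketch only: on any dispatch error, give up the connection and exit so
 * the inquisitor can restart the servant within the configured timeout.
 */
#include <stdlib.h>
#include <syslog.h>
#include <glib.h>
#include <corosync/cpg.h>
#include <clplumbing/cl_log.h>

#define EXIT_CLUSTER_DISCONNECT 1   /* hypothetical exit code for illustration */

static gboolean
cluster_connection_dispatch(gpointer user_data)
{
    cpg_handle_t handle = *(cpg_handle_t *) user_data;
    cs_error_t rc = cpg_dispatch(handle, CS_DISPATCH_ALL);

    if (rc != CS_OK) {
        /* The connection to corosync is gone or unusable.  Do not try to
         * reconnect in place; exit and let the inquisitor notice the dead
         * servant and restart it. */
        cl_log(LOG_ERR, "Lost corosync connection (rc=%d), exiting", rc);
        exit(EXIT_CLUSTER_DISCONNECT);
    }
    return TRUE;   /* stay registered with the mainloop */
}
```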
I guess the full resetting should just be done on some kind of graceful disconnection of corosync (triggered by systemd having ensured that pacemaker is down already; currently we don't have a way to really tell - possibly a future improvement in corosync). Otherwise a reconnection attempt (if done at all) has to ensure that everything happens within the configured timeout interval. But I'll have to have a closer look ...
I expected this not to be as simple as the CMAP disconnect, which likely only happens during a restart anyway.
@wenningerk Yep, adding the ability for corosync to inform clients about its exit is something I'm considering. Just keep in mind it's really not that easy. Also, I don't really think corosync can tell that pcmk is really down. What corosync could do is add an API which sends a message to some IPC clients (clients which registered to receive such a message - probably the CPG service?) and waits (probably with some configurable timeout) for a client ACK.
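Purely as an illustration of the register / notify / wait-for-ACK pattern being described (not corosync code in any way - see corosync/corosync#475 for the actual feature request), a standalone toy model could look like this:

```c
/*
 * Toy model: a "server" thread announces shutdown and waits, with a timeout,
 * for a registered "client" to acknowledge.  All names are made up.
 */
#include <stdio.h>
#include <stdbool.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t acked = PTHREAD_COND_INITIALIZER;
static bool client_acked = false;

/* What a registered client would do on receiving the shutdown message. */
static void *client_thread(void *arg)
{
    (void) arg;
    /* ... flush state, close resources, then acknowledge ... */
    pthread_mutex_lock(&lock);
    client_acked = true;
    pthread_cond_signal(&acked);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t client;
    struct timespec deadline;

    pthread_create(&client, NULL, client_thread, NULL);

    /* "Server" side: wait up to 5 seconds for the ACK before shutting down. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 5;

    pthread_mutex_lock(&lock);
    while (!client_acked) {
        if (pthread_cond_timedwait(&acked, &lock, &deadline) != 0) {
            break;  /* timed out: shut down anyway */
        }
    }
    printf("shutting down (%s)\n", client_acked ? "client acked" : "timeout");
    pthread_mutex_unlock(&lock);

    pthread_join(client, NULL);
    return 0;
}
```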
Maybe not that critical, as the pacemaker watcher still offers some degree of a safety net, since it detects pacemaker going down without running local resources. We just have to be careful, as CIB updates might be stalled with corosync being down.
@wenningerk Ok, perfect. I've filed corosync/corosync#475 so it is not forgotten. I would like to ask you to let me know if you decide not to use such a feature, because I don't think it makes sense to spend time implementing a "useless" feature.
Closing for now as it's not as simple as it seemed.
We could already introduce this graceful-disconnection return value here (even if we wouldn't be using it as long as corosync doesn't signal a graceful shutdown) and close the watcher with a different return value in all other cases of disconnection, which would basically lead to immediate suicide.
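A sketch of that return-value split; the names, values, and helper below are assumptions for illustration, not sbd's actual constants or code:

```c
/*
 * The cluster watcher would exit with EXIT_CLUSTER_GRACEFUL only if corosync
 * signalled a graceful shutdown (which it currently cannot), and with
 * EXIT_CLUSTER_DISCONNECT in all other cases of disconnection.
 */
#include <sys/wait.h>

#define EXIT_CLUSTER_DISCONNECT 1   /* unexpected loss of the connection   */
#define EXIT_CLUSTER_GRACEFUL   2   /* corosync announced a clean shutdown */

/* Hypothetical helper in the inquisitor, given the status from waitpid()
 * after the cluster servant exits. */
static int
cluster_servant_needs_suicide(int status)
{
    if (WIFEXITED(status) && WEXITSTATUS(status) == EXIT_CLUSTER_GRACEFUL) {
        /* Graceful corosync shutdown: just restart the servant later,
         * no need to treat the node as unhealthy. */
        return 0;
    }
    /* Any other disconnection: fall through to the usual outdated-servant
     * handling, which leads to self-fencing if it persists. */
    return 1;
}
```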
So you mean the current version of this PR is still worth something?
Well, it implements the basic desired structure. Let me have a look later this evening.
@wenningerk did you have time to take a look at this? Or maybe you came up with some better solution in the meantime?
Guess simplifying it in the way I've done with the CIB connection would require being able to determine whether a corosync disconnect was a graceful shutdown or not. As I don't see how that would be possible atm, I haven't yet found the time to think about it further.
@wenningerk as we're still experiencing this HUP/100% CPU problem on our systems, I think I need to prepare some local/private patch to fix this for us. As I understand it, the simplest solution is to exit in case of such failures and let the inquisitor handle the restart (even without a special exit code). Is that correct?
Is there progress on this patch?
Can one of the admins verify this patch?
As a simple solution (like the one for pacemaker, at least in its basic pattern; the pacemaker graceful-shutdown detection is still a hack) would require knowing whether corosync went away gracefully, this still requires some syncing between the projects.
Time to revive this ...