
sbd-cluster: Simplify cluster connection loss handling #81

Open · wants to merge 1 commit into base: main

Conversation

@skazi0 commented May 28, 2019

If the cluster connection is lost, exit sbd-cluster and let the inquisitor handle the reconnect by restarting it.

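For illustration, a minimal sketch of the pattern the description asks for, assuming a plain poll() loop around a CPG file descriptor as a stand-in for whatever corosync service the servant actually tracks; the real sbd-cluster servant is built on mainloop helpers, so everything below is simplified and is not the actual sbd code.

```c
/* Simplified servant-side loop: on any sign of a lost corosync
 * connection, stop and exit so the parent (the inquisitor) can
 * restart us.  Only documented libcpg calls are used; the rest is a
 * stand-in for the real servant. */
#include <corosync/cpg.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>

static cpg_callbacks_t cpg_cbs;      /* deliver/confchg handlers omitted */

int main(void)
{
    cpg_handle_t handle;
    int fd;

    if (cpg_initialize(&handle, &cpg_cbs) != CS_OK ||
        cpg_fd_get(handle, &fd) != CS_OK) {
        exit(1);                     /* can't even connect: let the parent retry */
    }

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        if (poll(&pfd, 1, 1000) < 0) {
            break;                                  /* poll error */
        }
        if (pfd.revents & (POLLHUP | POLLERR)) {
            break;                                  /* connection gone - don't spin */
        }
        if ((pfd.revents & POLLIN) &&
            cpg_dispatch(handle, CS_DISPATCH_ALL) != CS_OK) {
            break;                                  /* dispatch failed: treat as lost */
        }
    }

    fprintf(stderr, "cluster connection lost, exiting so the inquisitor restarts us\n");
    return 1;
}
```
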
@wenningerk

I guess the full resetting should just be done on some kind of graceful disconnection from corosync (triggered by systemd having assured that pacemaker is already down; currently we don't have a way to really tell - possibly a future improvement in corosync). Otherwise a reconnection attempt (if done at all) has to assure that everything happens within the configured timeout interval. But I'll have to have a closer look ...
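
To make that timing constraint concrete: any reconnect attempt would have to give up well inside the configured timeout so the inquisitor still has room to act. A rough sketch, where the helper name, the `timeout_ms` parameter, and the 'spend at most half the budget' rule are illustrative assumptions rather than sbd behaviour:

```c
/* Hypothetical bounded-reconnect helper: retry cpg_initialize() for a
 * while, but give up before the servant's timeout budget is used up so
 * the inquisitor can still react in time. */
#include <corosync/cpg.h>
#include <time.h>
#include <unistd.h>

static cpg_callbacks_t cpg_cbs;      /* handlers omitted in this sketch */

static int reconnect_within(cpg_handle_t *handle, long timeout_ms)
{
    struct timespec start, now;

    clock_gettime(CLOCK_MONOTONIC, &start);

    for (;;) {
        if (cpg_initialize(handle, &cpg_cbs) == CS_OK) {
            return 0;                        /* reconnected in time */
        }
        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed_ms = (now.tv_sec - start.tv_sec) * 1000
                        + (now.tv_nsec - start.tv_nsec) / 1000000;
        if (elapsed_ms > timeout_ms / 2) {   /* keep half the budget spare */
            return -1;                       /* give up, let the inquisitor act */
        }
        usleep(200 * 1000);                  /* brief pause between attempts */
    }
}
```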

@skazi0 (Author) commented May 29, 2019

I expected this not to be as simple as the CMAP disconnect, which likely only happens during a restart anyway.

@jfriesse removed their request for review May 29, 2019 06:50
@jfriesse (Member)

@wenningerk Yep, adding the ability for corosync to inform clients about its exit is something I'm considering. Just keep in mind it's really not that easy. Also, I don't really think corosync can tell that pcmk is really down. What corosync can do is add an API which sends a message to some IPC clients (clients which registered for receiving such a message - probably the CPG service?) and waits (probably with some configurable timeout) for a client ACK.
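
Purely to illustrate the "notify registered clients, then wait (with a configurable timeout) for their ACK" flow described above - this is not corosync IPC code, and the socketpair plus the message strings are stand-ins:

```c
/* Toy model of "announce exit, wait for ACK with a timeout".  Both
 * ends live in one process here only to keep the example runnable. */
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    char buf[16];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return 1;
    }

    /* "corosync" side announces that it is about to exit ... */
    write(sv[0], "EXITING", 8);

    /* ... the registered "client" sees the notice and ACKs it ... */
    read(sv[1], buf, sizeof(buf));
    write(sv[1], "ACK", 4);

    /* ... and the "corosync" side waits up to 2s (the configurable
     * timeout) for that ACK before going away. */
    struct pollfd pfd = { .fd = sv[0], .events = POLLIN };
    if (poll(&pfd, 1, 2000) > 0 && (pfd.revents & POLLIN)) {
        read(sv[0], buf, sizeof(buf));
        printf("got ACK, exiting cleanly\n");
    } else {
        printf("no ACK within the timeout, exiting anyway\n");
    }

    close(sv[0]);
    close(sv[1]);
    return 0;
}
```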

@wenningerk

> @wenningerk Yep, adding the ability for corosync to inform clients about its exit is something I'm considering. Just keep in mind it's really not that easy. Also, I don't really think corosync can tell that pcmk is really down. What corosync can do is add an API which sends a message to some IPC clients (clients which registered for receiving such a message - probably the CPG service?) and waits (probably with some configurable timeout) for a client ACK.

Maybe it's not that critical, as the pacemaker-watcher still offers some degree of a safety net by detecting pacemaker going down without running local resources. We just have to be careful, as CIB updates might be stalled with corosync being down.

@jfriesse (Member)

@wenningerk Ok, perfect. I've filed corosync/corosync#475 so it is not forgotten. I would like to ask you to let me know if you decide not to use such a feature, because I don't think it makes sense to spend time implementing a "useless" feature.

@skazi0 (Author) commented May 29, 2019

Closing for now as it's not as simple as it seemed.

@skazi0 closed this May 29, 2019
@wenningerk

> Closing for now as it's not as simple as it seemed.

We could already introduce this graceful-disconnection return value here (even if we wouldn't be using it as long as corosync doesn't signal a graceful shutdown) and close the watcher with a different return value in all the other cases of disconnection that would basically lead to immediate suicide.
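
A sketch of how such a return-value split could look; the exit-code values, macro names, and the inquisitor-side reaction below are assumptions for illustration, not sbd's actual conventions:

```c
/* Illustrative exit-code split - both values and names are made up. */
#include <sys/wait.h>

#define EXIT_COROSYNC_GRACEFUL 2   /* corosync signalled a clean shutdown */
#define EXIT_COROSYNC_LOST     3   /* connection dropped unexpectedly */

/* Servant side, where the dispatch loop ends:
 *     exit(graceful ? EXIT_COROSYNC_GRACEFUL : EXIT_COROSYNC_LOST);
 *
 * Inquisitor side, after waitpid() on the cluster servant: */
static int handle_servant_exit(int status)
{
    if (!WIFEXITED(status)) {
        return -1;                 /* killed by a signal: not graceful */
    }
    switch (WEXITSTATUS(status)) {
    case EXIT_COROSYNC_GRACEFUL:
        return 0;                  /* expected: just restart the servant */
    case EXIT_COROSYNC_LOST:
    default:
        return -1;                 /* unexpected: treat as a real failure */
    }
}
```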

@skazi0 (Author) commented May 29, 2019

So you mean that the current version of this PR is still worth something?

@wenningerk

> So you mean that the current version of this PR is still worth something?

Well, it implements the basic desired structure. Let me have a look later this evening.

@skazi0 (Author) commented Jun 19, 2019

@wenningerk did you have time to take a look at this? Or maybe you came up with some better solution in the meantime?

@wenningerk

> @wenningerk did you have time to take a look at this? Or maybe you came up with some better solution in the meantime?

I guess simplifying it the way I've done with the cib-connection would require being able to determine whether a corosync disconnect was a graceful shutdown or not. As I don't see how that would be possible atm, I haven't yet found the time to think about it further.

@skazi0 (Author) commented Jun 19, 2019

@wenningerk as we're still experiencing this HUP/100% CPU problem on our systems, I think I need to prepare some local/private patch to fix this for us. As I understand it, the simplest solution is to exit in case of such failures and let the inquisitor handle the restart (even without a special exit code). Is that correct?
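
For what it's worth, the "no special exit code" variant boils down to the supervising side restarting the servant on any exit; a hypothetical sketch (the function names are not the actual inquisitor code):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the servant body: connect to corosync, dispatch, and
 * exit once the connection is lost (see the earlier sketch). */
static void run_cluster_servant(void)
{
    _exit(1);
}

/* Hypothetical supervising loop: restart the cluster servant whenever
 * it exits, regardless of the exit code it used. */
static void supervise_cluster_servant(void)
{
    for (;;) {
        pid_t pid = fork();

        if (pid == 0) {
            run_cluster_servant();   /* never returns */
        } else if (pid < 0) {
            return;                  /* fork failed - bail out here */
        }

        int status;
        waitpid(pid, &status, 0);    /* servant died, for whatever reason */
        sleep(1);                    /* brief pause, then restart it */
    }
}
```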

@curvygrin commented Sep 13, 2019

Is there progress on this patch?
I'm seeing the same issue.

@knet-ci-bot

Can one of the admins verify this patch?

@wenningerk

> Is there progress on this patch?
> I'm seeing the same issue.

As a simple solution (like the one for pacemaker - at least in its basic pattern; the pacemaker graceful-shutdown detection is still a hack) would require knowing whether corosync went away gracefully, this still needs some syncing between projects.
If you are just after the HUP issue that kicked off all of this, we might think of having that as an interim improvement.
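
If the interim goal is just the HUP / busy-loop symptom, the key piece is to stop treating a failing dispatch as retryable; a simplified sketch (the real servant is driven from a mainloop, and the helper name is made up):

```c
#include <corosync/cpg.h>
#include <stdbool.h>

/* Treat any dispatch result other than CS_OK / CS_ERR_TRY_AGAIN as a
 * dead connection so the caller stops polling (and exits) instead of
 * spinning on a hung-up descriptor. */
static bool cluster_dispatch_ok(cpg_handle_t handle)
{
    cs_error_t rc = cpg_dispatch(handle, CS_DISPATCH_ONE_NONBLOCKING);

    switch (rc) {
    case CS_OK:
    case CS_ERR_TRY_AGAIN:
        return true;        /* connection still usable */
    default:
        return false;       /* e.g. CS_ERR_LIBRARY: connection is gone */
    }
}
```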

@wenningerk

> As a simple solution (like the one for pacemaker - at least in its basic pattern; the pacemaker graceful-shutdown detection is still a hack) would require knowing whether corosync went away gracefully, this still needs some syncing between projects.

Time to revive this ...
Meanwhile the graceful-shutdown detection towards pacemaker has been replaced by a robust IPC implementation that also syncs on startup (pacemakerd waits to be contacted by sbd, which will not happen if sbd fails to detect the existence of pacemaker due to e.g. a wrong SELinux config).
On the corosync front, https://github.com/corosync/corosync/pull/615 gives us an interface that we can use to see whether the connection loss to corosync happened gracefully or not.
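
Assuming that interface is cfg-level shutdown tracking (to be verified against the merged corosync change), the consumer side could look roughly like this; the structure around the corosync calls is illustrative and not the actual sbd code:

```c
#include <corosync/cfg.h>
#include <stdbool.h>

/* Set once corosync announces that it is going down, so a subsequent
 * loss of the connection can be treated as a graceful disconnect. */
static bool corosync_leaving_gracefully = false;

static void cfg_shutdown_cb(corosync_cfg_handle_t handle,
                            corosync_cfg_shutdown_flags_t flags)
{
    (void) handle;
    (void) flags;
    /* A real consumer would also have to acknowledge the shutdown
     * request (corosync_cfg_replyto_shutdown()); omitted here. */
    corosync_leaving_gracefully = true;
}

static corosync_cfg_callbacks_t cfg_callbacks = {
    .corosync_cfg_shutdown_callback = cfg_shutdown_cb,
};

static int start_shutdown_tracking(corosync_cfg_handle_t *handle)
{
    if (corosync_cfg_initialize(handle, &cfg_callbacks) != CS_OK) {
        return -1;
    }
    /* Ask to be told about shutdown; callbacks then arrive via
     * corosync_cfg_dispatch() driven from the servant's mainloop. */
    if (corosync_cfg_trackstart(*handle, 0) != CS_OK) {
        return -1;
    }
    return 0;
}
```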
