
sbd-cluster: Simplify cluster connection loss handling #81

Open · wants to merge 1 commit into base: main

Conversation

@skazi0 commented May 28, 2019

If the cluster connection is lost, exit sbd-cluster and let the inquisitor handle the reconnect by restarting it.

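For illustration, a minimal sketch of the pattern the description asks for, assuming a plain poll() loop around a CPG file descriptor as a stand-in for whatever corosync service the servant actually tracks; the real sbd-cluster servant is built on mainloop helpers, so everything below is simplified and is not the actual sbd code.

```c
/* Simplified servant-side loop: on any sign of a lost corosync
 * connection, stop and exit so the parent (the inquisitor) can
 * restart us.  Only documented libcpg calls are used; the rest is a
 * stand-in for the real servant. */
#include <corosync/cpg.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>

static cpg_callbacks_t cpg_cbs;      /* deliver/confchg handlers omitted */

int main(void)
{
    cpg_handle_t handle;
    int fd;

    if (cpg_initialize(&handle, &cpg_cbs) != CS_OK ||
        cpg_fd_get(handle, &fd) != CS_OK) {
        exit(1);                     /* can't even connect: let the parent retry */
    }

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        if (poll(&pfd, 1, 1000) < 0) {
            break;                                  /* poll error */
        }
        if (pfd.revents & (POLLHUP | POLLERR)) {
            break;                                  /* connection gone - don't spin */
        }
        if ((pfd.revents & POLLIN) &&
            cpg_dispatch(handle, CS_DISPATCH_ALL) != CS_OK) {
            break;                                  /* dispatch failed: treat as lost */
        }
    }

    fprintf(stderr, "cluster connection lost, exiting so the inquisitor restarts us\n");
    return 1;
}
```
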
@wenningerk

I guess the full resetting should just be done on some kind of graceful disconnection from corosync (triggered by systemd having assured that pacemaker is already down; currently we don't have a way to really tell - possibly a future improvement in corosync). Otherwise a reconnection attempt (if done at all) has to assure that everything happens within the configured timeout interval. But I'll have to have a closer look ...
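
To make that timing constraint concrete: any reconnect attempt would have to give up well inside the configured timeout so the inquisitor still has room to act. A rough sketch, where the helper name, the `timeout_ms` parameter, and the 'spend at most half the budget' rule are illustrative assumptions rather than sbd behaviour:

```c
/* Hypothetical bounded-reconnect helper: retry cpg_initialize() for a
 * while, but give up before the servant's timeout budget is used up so
 * the inquisitor can still react in time. */
#include <corosync/cpg.h>
#include <time.h>
#include <unistd.h>

static cpg_callbacks_t cpg_cbs;      /* handlers omitted in this sketch */

static int reconnect_within(cpg_handle_t *handle, long timeout_ms)
{
    struct timespec start, now;

    clock_gettime(CLOCK_MONOTONIC, &start);

    for (;;) {
        if (cpg_initialize(handle, &cpg_cbs) == CS_OK) {
            return 0;                        /* reconnected in time */
        }
        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed_ms = (now.tv_sec - start.tv_sec) * 1000
                        + (now.tv_nsec - start.tv_nsec) / 1000000;
        if (elapsed_ms > timeout_ms / 2) {   /* keep half the budget spare */
            return -1;                       /* give up, let the inquisitor act */
        }
        usleep(200 * 1000);                  /* brief pause between attempts */
    }
}
```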

@skazi0 (Author) commented May 29, 2019

I expected this not to be as simple as the CMAP disconnect, which likely only happens during a restart anyway.

@jfriesse removed their request for review May 29, 2019 06:50
@jfriesse (Member)

@wenningerk Yep, adding the ability for corosync to inform clients about its exit is something I'm considering. Just keep in mind it's really not that easy. Also, I don't really think corosync can tell that pcmk is really down. What corosync can do is add an API which sends a message to some IPC clients (clients which registered for receiving such a message - probably the CPG service?) and waits (probably with some configurable timeout) for a client ACK.
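
Purely to illustrate the "notify registered clients, then wait (with a configurable timeout) for their ACK" flow described above - this is not corosync IPC code, and the socketpair plus the message strings are stand-ins:

```c
/* Toy model of "announce exit, wait for ACK with a timeout".  Both
 * ends live in one process here only to keep the example runnable. */
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    char buf[16];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return 1;
    }

    /* "corosync" side announces that it is about to exit ... */
    write(sv[0], "EXITING", 8);

    /* ... the registered "client" sees the notice and ACKs it ... */
    read(sv[1], buf, sizeof(buf));
    write(sv[1], "ACK", 4);

    /* ... and the "corosync" side waits up to 2s (the configurable
     * timeout) for that ACK before going away. */
    struct pollfd pfd = { .fd = sv[0], .events = POLLIN };
    if (poll(&pfd, 1, 2000) > 0 && (pfd.revents & POLLIN)) {
        read(sv[0], buf, sizeof(buf));
        printf("got ACK, exiting cleanly\n");
    } else {
        printf("no ACK within the timeout, exiting anyway\n");
    }

    close(sv[0]);
    close(sv[1]);
    return 0;
}
```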

@wenningerk

> @wenningerk Yep, adding the ability for corosync to inform clients about its exit is something I'm considering. Just keep in mind it's really not that easy. Also, I don't really think corosync can tell that pcmk is really down. What corosync can do is add an API which sends a message to some IPC clients (clients which registered for receiving such a message - probably the CPG service?) and waits (probably with some configurable timeout) for a client ACK.

Maybe it's not that critical, as the pacemaker-watcher still offers some degree of a safety net by detecting pacemaker going down without running local resources. We just have to be careful, as CIB updates might be stalled with corosync being down.

@jfriesse (Member)

@wenningerk Ok, perfect. I've filed corosync/corosync#475 so it is not forgotten. I would like to ask you to let me know if you decide not to use such a feature, because I don't think it makes sense to spend time implementing a "useless" feature.

@skazi0 (Author) commented May 29, 2019

Closing for now as it's not as simple as it seemed.

@skazi0 closed this May 29, 2019
@wenningerk

> Closing for now as it's not as simple as it seemed.

We could already introduce this graceful-disconnection return value here (even if we wouldn't be using it as long as corosync doesn't signal a graceful shutdown) and close the watcher with a different return value in all the other cases of disconnection that would basically lead to immediate suicide.
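
A sketch of how such a return-value split could look; the exit-code values, macro names, and the inquisitor-side reaction below are assumptions for illustration, not sbd's actual conventions:

```c
/* Illustrative exit-code split - both values and names are made up. */
#include <sys/wait.h>

#define EXIT_COROSYNC_GRACEFUL 2   /* corosync signalled a clean shutdown */
#define EXIT_COROSYNC_LOST     3   /* connection dropped unexpectedly */

/* Servant side, where the dispatch loop ends:
 *     exit(graceful ? EXIT_COROSYNC_GRACEFUL : EXIT_COROSYNC_LOST);
 *
 * Inquisitor side, after waitpid() on the cluster servant: */
static int handle_servant_exit(int status)
{
    if (!WIFEXITED(status)) {
        return -1;                 /* killed by a signal: not graceful */
    }
    switch (WEXITSTATUS(status)) {
    case EXIT_COROSYNC_GRACEFUL:
        return 0;                  /* expected: just restart the servant */
    case EXIT_COROSYNC_LOST:
    default:
        return -1;                 /* unexpected: treat as a real failure */
    }
}
```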

@skazi0 (Author) commented May 29, 2019

So you mean that the current version of this PR is still worth something?

@wenningerk

> So you mean that the current version of this PR is still worth something?

Well, it implements the basic desired structure. Let me have a look later this evening.

@skazi0 (Author) commented Jun 19, 2019

@wenningerk did you have time to take a look at this? Or maybe you came up with some better solution in the meantime?

@wenningerk

> @wenningerk did you have time to take a look at this? Or maybe you came up with some better solution in the meantime?

I guess simplifying it the way I've done with the cib-connection would require being able to determine whether a corosync disconnect was a graceful shutdown or not. As I don't see how that would be possible atm, I haven't yet found the time to think about it further.

@skazi0 (Author) commented Jun 19, 2019

@wenningerk as we're still experiencing this HUP/100% CPU problem on our systems, I think I need to prepare some local/private patch to fix this for us. As I understand it, the simplest solution is to exit in case of such failures and let the inquisitor handle the restart (even without a special exit code). Is that correct?
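
For what it's worth, the "no special exit code" variant boils down to the supervising side restarting the servant on any exit; a hypothetical sketch (the function names are not the actual inquisitor code):

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the servant body: connect to corosync, dispatch, and
 * exit once the connection is lost (see the earlier sketch). */
static void run_cluster_servant(void)
{
    _exit(1);
}

/* Hypothetical supervising loop: restart the cluster servant whenever
 * it exits, regardless of the exit code it used. */
static void supervise_cluster_servant(void)
{
    for (;;) {
        pid_t pid = fork();

        if (pid == 0) {
            run_cluster_servant();   /* never returns */
        } else if (pid < 0) {
            return;                  /* fork failed - bail out here */
        }

        int status;
        waitpid(pid, &status, 0);    /* servant died, for whatever reason */
        sleep(1);                    /* brief pause, then restart it */
    }
}
```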

@curvygrin commented Sep 13, 2019

Is there progress on this patch?
I'm seeing the same issue.

@knet-ci-bot

Can one of the admins verify this patch?

@wenningerk

> Is there progress on this patch?
> I'm seeing the same issue.

As a simple solution (like the one for pacemaker - at least in its basic pattern; the pacemaker graceful-shutdown detection is still a hack) would require knowing whether corosync went away gracefully, this still needs some syncing between projects.
If you are just after the HUP issue that kicked off all of this, we might think of having that as an interim improvement.
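
If the interim goal is just the HUP / busy-loop symptom, the key piece is to stop treating a failing dispatch as retryable; a simplified sketch (the real servant is driven from a mainloop, and the helper name is made up):

```c
#include <corosync/cpg.h>
#include <stdbool.h>

/* Treat any dispatch result other than CS_OK / CS_ERR_TRY_AGAIN as a
 * dead connection so the caller stops polling (and exits) instead of
 * spinning on a hung-up descriptor. */
static bool cluster_dispatch_ok(cpg_handle_t handle)
{
    cs_error_t rc = cpg_dispatch(handle, CS_DISPATCH_ONE_NONBLOCKING);

    switch (rc) {
    case CS_OK:
    case CS_ERR_TRY_AGAIN:
        return true;        /* connection still usable */
    default:
        return false;       /* e.g. CS_ERR_LIBRARY: connection is gone */
    }
}
```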

@wenningerk

> As a simple solution (like the one for pacemaker - at least in its basic pattern; the pacemaker graceful-shutdown detection is still a hack) would require knowing whether corosync went away gracefully, this still needs some syncing between projects.

Time to revive this ...
Meanwhile the graceful-shutdown detection towards pacemaker has been replaced by a robust IPC implementation that also syncs on startup (pacemakerd waits to be contacted by sbd, which will not happen if sbd fails to detect the existence of pacemaker due to e.g. a wrong SELinux config).
On the corosync front, https://github.com/corosync/corosync/pull/615 gives us an interface that we can use to see whether the connection loss to corosync happened gracefully or not.
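
Assuming that interface is cfg-level shutdown tracking (to be verified against the merged corosync change), the consumer side could look roughly like this; the structure around the corosync calls is illustrative and not the actual sbd code:

```c
#include <corosync/cfg.h>
#include <stdbool.h>

/* Set once corosync announces that it is going down, so a subsequent
 * loss of the connection can be treated as a graceful disconnect. */
static bool corosync_leaving_gracefully = false;

static void cfg_shutdown_cb(corosync_cfg_handle_t handle,
                            corosync_cfg_shutdown_flags_t flags)
{
    (void) handle;
    (void) flags;
    /* A real consumer would also have to acknowledge the shutdown
     * request (corosync_cfg_replyto_shutdown()); omitted here. */
    corosync_leaving_gracefully = true;
}

static corosync_cfg_callbacks_t cfg_callbacks = {
    .corosync_cfg_shutdown_callback = cfg_shutdown_cb,
};

static int start_shutdown_tracking(corosync_cfg_handle_t *handle)
{
    if (corosync_cfg_initialize(handle, &cfg_callbacks) != CS_OK) {
        return -1;
    }
    /* Ask to be told about shutdown; callbacks then arrive via
     * corosync_cfg_dispatch() driven from the servant's mainloop. */
    if (corosync_cfg_trackstart(*handle, 0) != CS_OK) {
        return -1;
    }
    return 0;
}
```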
