Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: service: add pre-start configuration validator #99

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

gao-yan
Copy link
Member

@gao-yan gao-yan commented Sep 26, 2019

  1. Currently the only hard errors are:
  • sbd binary is not existing or not executable.

  • 'TimeoutStartSec' of service is shorter than 'msgwait' if
    'SBD_DELAY_START' is enabled.

  • 'TimeoutStartSec' of service is shorter than the delay value
    explicitly configured with 'SBD_DELAY_START'.

Hard errors prevent service from starting.

  1. Warnings are given if:
  • sbd devices are not existing.

  • meta-data cannot be read from sbd devices.

On start-up, since sbd daemon waits for sbd devices to appear and get
ready within start-up timeout, such situations shouldn't prevent service
from starting. Besides, for the latter situation, it doesn't necessarily
mean sbd devices are not correctly initialized. But anyway warnings are
given.

  1. Notices are given if:
  • '/etc/sysconfig/sbd' is not existing.

  • 'SBD_DEVICE' is not configured in case it's unintentional.

  • 'TimeoutStartUSec'/'TimeoutStartSec' setting of service somehow cannot
    be retrieved or recognized.

  • 'TimeoutStartSec' of service is shorter than sbd start-up timeout.

  • 'SBD_DELAY_START' is enabled but not by being specified an explicit
    delay value (It's recommended to set it to be longer than 'corosync
    token timeout + consensus timeout + pcmk_delay_max/base + msgwait').

  • 'SBD_DELAY_START' is configured with an explicit delay value but
    shorter than the configured 'msgwait'

These are not strictly or necessarily mis-configuration. They shouldn't
prevent service from starting either.

1) Currently the only hard errors are:

- sbd binary is not existing or not executable.

- 'TimeoutStartSec' of service is shorter than 'msgwait' if
'SBD_DELAY_START' is enabled.

- 'TimeoutStartSec' of service is shorter than the delay value
explicitly configured with 'SBD_DELAY_START'.

Hard errors prevent service from starting.

2) Warnings are given if:

- sbd devices are not existing.

- meta-data cannot be read from sbd devices.

On start-up, since sbd daemon waits for sbd devices to appear and get
ready within start-up timeout, such situations shouldn't prevent service
from starting. Besides, for the latter situation, it doesn't necessarily
mean sbd devices are not correctly initialized. But anyway warnings are
given.

3) Notices are given if:

- '/etc/sysconfig/sbd' is not existing.

- 'SBD_DEVICE' is not configured in case it's unintentional.

- 'TimeoutStartUSec'/'TimeoutStartSec' setting of service somehow cannot
be retrieved or recognized.

- 'TimeoutStartSec' of service is shorter than sbd start-up timeout.

- 'SBD_DELAY_START' is enabled but not by being specified an explicit
delay value (It's recommended to set it to be longer than 'corosync
token timeout + consensus timeout + pcmk_delay_max/base + msgwait').

- 'SBD_DELAY_START' is configured with an explicit delay value but
shorter than the configured 'msgwait'

These are not strictly or necessarily mis-configuration. They shouldn't
prevent service from starting either.
@knet-ci-bot
Copy link

Can one of the admins verify this patch?

@wenningerk
Copy link

add to whitelist

@wenningerk
Copy link

ppc64le on travis sucks on travis recently - let's give it another try ...

@wenningerk
Copy link

Ooh that grew into a complex thing ... I know I was the one to come up with the pre-exec-script ;-)
My main-idea was keeping systemd specifics out of the C-codebase/dependency.
What I see is that we are duplicating a lot of code in bash that is there in C already and unfortunately some logic (like how parameters and environment are combined making one take precedence over the other and stuff) with the inherent danger that the implementations don't align (or drift apart in the future - just playing with a change in '-d' vs. 'SBD_DEVICES' handling in https://github.com/wenningerk/sbd/commits/intercept_aio for instance).
And we are opening the devices twice - not too bad probably just mentioning.
Checking for the existence and putting out some logs if the binary doesn't exist should already be there in systemd (In your elaborate version of the pre-exec it makes sense to check/log before using of course ...).
I guess the pre-exec-script already has the environment the binary is executed in.
Did you check if it would be possible to export additional environment-variables from the pre-exec to be used by the binary?
Don't get me wrong ... not saying I don't like it ... just thinking ... give me a bit to think a little more ...

Klaus

@gao-yan
Copy link
Member Author

gao-yan commented Oct 1, 2019

My main-idea was keeping systemd specifics out of the C-codebase/dependency.

I think so too. It's not for sbd to understand what mechanism starts its daemon.

What I see is that we are duplicating a lot of code in bash that is there in C already and unfortunately some logic (like how parameters and environment are combined making one take precedence over the other and stuff) with the inherent danger that the implementations don't align (or drift apart in the future - just playing with a change in '-d' vs. 'SBD_DEVICES' handling in https://github.com/wenningerk/sbd/commits/intercept_aio for instance).

But I don't think we are unnecessarily duplicating code here. Although the main purpose is for checking TimeoutStartSec accordingly, which is beyond sbd's knowledge/tasks. But in order to achieve that, it's probably unavoidable to parse and check the prerequisite configuration information here. Indeed it should align with the logic in the C code, for example it should probably parse more relevant options like "-d "from SBD_OPTS.

And we are opening the devices twice - not too bad probably just mentioning.
Checking for the existence and putting out some logs if the binary doesn't exist should already be there in systemd (In your elaborate version of the pre-exec it makes sense to check/log before using of course ...).
I guess the pre-exec-script already has the environment the binary is executed in.

It does, but it's not harmful to get them again. I've actually been thinking the script could be more than just a pre-start checker called by systemd. It could be directly run by users/bootstraps as a configuration validator doing sanity-check during setup/bootstraping without or before starting cluster. That's the reason why I added a "validate-all" action for it, which for now apparently does the exactly same as "pre-start" though :-)

Did you check if it would be possible to export additional environment-variables from the pre-exec to be used by the binary?

I haven't figured out a way either. OTOH, I don't think it's for the binary to care about systemd settings as mentioned :-)

@wenningerk
Copy link

wenningerk commented Oct 1, 2019

I haven't figured out a way either. OTOH, I don't think it's for the binary to care about systemd settings as mentioned :-)

I would have called it something like SBD_MAX_STARTUP_DELAY coming from /etc/sysconfig/sbd or wherever possibly overwritten by the pre-exec (if that works) without sbd having to explicitly know where it is coming from just to be fed into a consistency check.
Your idea of a consistency-check before any service-start sounds appealing though. But as the sbd binary is meant to run as cmdline-tool already having that doesn't necessarily mean it has to be implemented as an extra script.

@gao-yan
Copy link
Member Author

gao-yan commented Oct 1, 2019

I haven't figured out a way either. OTOH, I don't think it's for the binary to care about systemd settings as mentioned :-)
I would have called it something like SBD_MAX_STARTUP_DELAY coming from /etc/sysconfig/sbd

Not another option from sbd sysconfig please ;-) The existing options are already tough enough for users to understand and make right.

or wherever possibly overwritten by the pre-exec (if that works) without sbd having to explicitly know where it is coming from just to be fed into a consistency check.

No matter where the checking is done, as long as the checker doesn't explicitly say a misconfiguration is exactly because of "TimeoutStartSec" from sbd.service, the error message won't be helpful but introducing confusions.

@wenningerk
Copy link

Not another option from sbd sysconfig please ;-) The existing options are already tough enough for users to understand and make right.

Would've mentioned it in sbd sysconfig but usually one wouldn't have to touch it because the startup is setting it. And I guess a sentence in the remarks of that file could state that in a comprehensive way.

No matter where the checking is done, as long as the checker doesn't explicitly say a misconfiguration is exactly because of "TimeoutStartSec" from sbd.service, the error message won't be helpful but introducing confusions.

There you've really got a point ...
Doesn't make me enthusiastic about the complicated pre-start but I'm getting that my suggestion isn't as helpful as I thought.
atm I don't see a real alternative to the pre-script-approach then. Won't have much time to think/play till end of the week unfortunately.

@wenningerk
Copy link

wenningerk commented Oct 7, 2019

Not saying this idea is ready for implementation but maybe thinking along these lines might lead to a general improvement:
Whenever stuff is coming in via environment it is not 100% sure where the info is coming from.
So if we would optionally mark config coming via environment we would be able to make messages more targeting. Or if the config came in via some unforeseen path we might as well spot that easier.

Was thinking something like ...
SBD_WATCHDOG_DEV=sysconfig:/dev/watchdog
SBD_MAX_STARTUP_DELAY=systemd-unit-TimeoutStartSec:30s

Implementation wise that approach would just require some string-filtering done on all environment-variables. Everything else can be done over time.

@kgaillot
Copy link

kgaillot commented Oct 7, 2019

Have you considered putting the shared bits in a private C library, and doing the pre-start script in C?

@wenningerk
Copy link

Have you considered putting the shared bits in a private C library, and doing the pre-start script in C?

Hmm ... interesting idea. We could probably do a split that keeps the systemd-library-dependencies out of the basic sbd code base.
But if we still call it as pre-start some parts like opening the disks for getting the timeouts from the headers would still have to be done twice.
So when splitting that out I wouldn't go with a pre-start after all. Just start and bail out.
The environment (systemd) - specific code would go into a library and the environment-specific messages as well.
The library would get an interface that could potentially be used for other startup-environments as well.
Called as cmdline-tool sbd would get an additional command like 'consistency-check' that would do more or less the same as 'watch' but instead of daemonizing exit with success.
If somebody really wants to he might still be using that as pre-start.

@gao-yan
Copy link
Member Author

gao-yan commented Oct 8, 2019

Indeed putting the checking parts into a library, and systemd-specific things even into a separate library would be beneficial for implementing a pre-start in C.

I might be misunderstanding anything in here, but if a pre-start is not the way to go, I don't see any way of preventing sbd binary from depending on systemd library, right?

And I'm not sure if it'd be worth it to make it depend on systemd just because of this, although systemd is commonly used and the dependency on systemd could be determined at compiling time...

OTOH generally, a daemon may have its systemd service file, but how could it be sure that it was started by systemd and is being limited by systemd timeouts? It could be started by an init-script or purely manually even if there's systemd.

Technically I think other things would be facing the similar topic. For example for pacemaker, 'shutdown-escalation' is not recommended to be configured, but still it's configurable. pacemaker-controld knows the value of 'shutdown-escalation' , but not actually anything about TimeoutStopSec of pacemaker.service.

What pacemaker does for now is give TimeoutStopSec=30min in the default pacemaker.serivce file, which is longer that the default 'shutdown-escalation' (20min).

So similarly, what we could first easily do for now is probably give the default sbd.service file a generally long enough TimeoutStartSec value for example 10min, as suggested before? That'd likely resolve 99% of "mis-configuration"/"forgotten-configuration". I don't see much drawback of that. And it wouldn't conflict/overlap with any further improvements for configuration validation.

@gao-yan
Copy link
Member Author

gao-yan commented Oct 8, 2019

@gao-yan
Copy link
Member Author

gao-yan commented Oct 8, 2019

Hrm, this could be interesting:
https://www.freedesktop.org/software/systemd/man/systemd.environment-generator.html

It doesn't seem like it can be done per unit/service ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants