[1.1] Feature: priority-fencing-delay #2043

gao-yan · 2020-04-24T22:11:01Z

Backports of #2012 and #2027 for 1.1 branch.

This feature addresses the relevant topics and implements the ideas
brought up from: ClusterLabs/fence-agents#308

Apply the specified delay to fencing actions that target the lost
nodes with the highest total resource priority when our cluster
partition does not have a majority of the nodes, so that the more
significant nodes are more likely to win any fencing match. This is
especially meaningful under split-brain in a 2-node cluster. A
promoted resource instance counts as its base priority + 1 in the
calculation if the base priority is not 0. Any static/random delays
introduced by pcmk_delay_base/max configured for the corresponding
fencing resources are added to this delay. This delay should be
significantly greater than the maximum pcmk_delay_base/max (to be
safe, twice as large). By default, priority fencing delay is
disabled.
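
To make the rule above concrete, here is a minimal sketch of the calculation, not the scheduler code itself. The `Resource` and `Node` classes and the `priority_delay_for()` helper are hypothetical names used only for illustration. Two assumptions are baked in: a delay of 0 or less is treated as "disabled", and "highest" is taken across all cluster nodes, so that only the partition facing the top-priority node waits. How ties between equally ranked nodes are handled is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    priority: int           # base priority configured for the resource
    promoted: bool = False  # True if this instance runs in the promoted role


@dataclass
class Node:
    name: str
    online: bool  # False means the node is lost from our partition
    resources: list = field(default_factory=list)


def node_priority(node):
    """Sum the resource priorities on a node. A promoted instance counts
    as base priority + 1, unless its base priority is 0."""
    total = 0
    for rsc in node.resources:
        total += rsc.priority
        if rsc.promoted and rsc.priority != 0:
            total += 1
    return total


def priority_delay_for(target, nodes, priority_fencing_delay):
    """Return the priority delay (seconds) to request when fencing
    `target`, or 0 if no priority delay applies."""
    if priority_fencing_delay <= 0:
        return 0  # assumption: 0 or less means the feature is disabled
    if target.online:
        return 0  # only lost nodes are candidates for the delay
    online_count = sum(1 for n in nodes if n.online)
    if 2 * online_count > len(nodes):
        return 0  # our partition holds the majority, no need to wait
    # Assumption: "highest" is evaluated across every cluster node.
    top = max(node_priority(n) for n in nodes)
    if node_priority(target) < top:
        return 0
    return priority_fencing_delay
```

In a 2-node split-brain where node `a` holds a promoted instance with base priority 10 and node `b` holds a resource with priority 5, the partition containing `a` gets 0 when it targets `b`, while the partition containing `b` gets the configured delay when it targets `a`, so `a` is the likely survivor.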

This feature addresses the relevant topics and implements the ideas brought up from: ClusterLabs/fence-agents#308

This commit adds priority-fencing-delay option (just the option, not the feature itself).

Enforce specified delay for the fencings that are targeting the lost nodes with the highest total resource priority in case we don't have the majority of the nodes in our cluster partition, so that the more significant nodes potentially win any fencing match, which is especially meaningful under split-brain of 2-node cluster. A promoted resource instance takes the base priority + 1 on calculation if the base priority is not 0.

If all the nodes have equal priority, then any pcmk_delay_base/max configured for the corresponding fencing resources will be applied. Otherwise as long as it's set, even if to 0, it takes precedence over any configured pcmk_delay_base/max. By default, priority fencing delay is disabled.

Enforced fencing delay takes precedence over any pcmk_delay_base/max configured for the corresponding fencing resources. Enforced fencing delay is applied only for the first device in the first fencing topology level. Consistently use g_timeout_add_seconds() for pcmk_delay_base/max as well.

…bled This commit also documents the upcoming new behavior as discussed from: ClusterLabs#2012 Any static/random delays that are introduced by `pcmk_delay_base/max` configured for the corresponding fencing resources will be added to this delay. This delay should be significantly greater than, safely twice, the maximum `pcmk_delay_base/max`. By default, priority fencing delay is disabled.

…uested fencing delay

Requested fencing delay doesn't take precedence over any configured pcmk_delay_base/max. A delay value of -1 now also disables any static/random fencing delays from pcmk_delay_base/max. No consumer uses that value for now.

This commit also documents the current behavior in the help:
- Any static/random delays from pcmk_delay_base/max will be added to the requested fencing delay.
- A delay value of -1 now also disables any static/random fencing delays from pcmk_delay_base/max.
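
The two rules above can be sketched as a small helper. This is an illustration under stated assumptions, not the fencer's code: `total_fencing_delay()` is a hypothetical name, `delay_base` and `delay_max` stand in for pcmk_delay_base and pcmk_delay_max, and the random part is assumed to be drawn uniformly between the base and the maximum.

```python
import random


def total_fencing_delay(requested_delay, delay_base, delay_max, rng=random):
    """Seconds to wait before executing a fencing device.

    requested_delay: delay passed in by the requester (for example the
        priority fencing delay). -1 turns off every delay, including
        the device's own static/random one.
    delay_base, delay_max: stand-ins for pcmk_delay_base/pcmk_delay_max.
    """
    if requested_delay < 0:
        return 0  # -1 disables all delays
    device_delay = delay_base
    if delay_max > delay_base:
        # Assumption: the random part is uniform between base and max.
        device_delay = rng.randint(delay_base, delay_max)
    # The device's own delay is added to the requested one; neither
    # takes precedence over the other.
    return requested_delay + device_delay
```

With a requested delay of 15 and a static base of 5 (no maximum), the device waits 20 seconds. This additive behavior is why the priority delay should be well above the largest pcmk_delay_base/max in use: otherwise a long random delay on one side could cancel out the priority advantage on the other.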

kgaillot · 2020-05-27T13:55:03Z

Getting ready for the 1.1.23 release, it just now occurred to me that it would have been better not to backport this upstream. We can't bump the feature set in 1.1 (we need to guarantee rolling upgrades from any 1.1 to any 2.0). That means in a mixed version cluster, this feature could start or stop working depending on which node is elected DC.

It's a tough call -- I can either let this be part of the release, and let that be a known problem, or revert it upstream (distros of course could still backport it). What are your thoughts?

gao-yan · 2020-05-27T15:51:53Z

> Getting ready for the 1.1.23 release, it just now occurred to me that it would have been better not to backport this upstream. We can't bump the feature set in 1.1 (we need to guarantee rolling upgrades from any 1.1 to any 2.0).

1.1 branch has the feature set 3.0.14, while 2.0 versions have something >= 3.1.0, right? We could bump 1.1 branch to 3.0.15, no?

gao-yan · 2020-05-27T15:54:05Z

> > Getting ready for the 1.1.23 release, it just now occurred to me that it would have been better not to backport this upstream. We can't bump the feature set in 1.1 (we need to guarantee rolling upgrades from any 1.1 to any 2.0).
>
> 1.1 branch has the feature set 3.0.14, while 2.0 versions have something >= 3.1.0, right? We could bump 1.1 branch to 3.0.15, no?

Or you mean we should support rolling upgrade to any old 2.0? Do we have to?

kgaillot · 2020-05-27T16:02:35Z

> Or you mean we should support rolling upgrade to any old 2.0? Do we have to?

Right, we currently guarantee rolling upgrades from any mix of versions 1.1.11 or later to any higher version.

kgaillot · 2020-05-27T16:25:05Z

Rather than revert it entirely, I can ifdef the key sections with a new constant that users can define if they want to enable it, at the cost of upgrade compatibility.

gao-yan · 2020-05-27T16:30:26Z

Hard to tell why one would upgrade from the latest 1.1 to an outdated 2.0 :-) But of course they would lose the feature when upgrading to a 2.0 version that doesn't support it.

OTOH, it's probably not really an incompatible change. It's just that cluster nodes incapable of handling a delay simply ignore it.

Hrm, not sure if it's really an unacceptable limitation, compared with the benefits for 1.1 users of having the feature ...

gao-yan · 2020-05-27T16:30:38Z

> Rather than revert it entirely, I can ifdef the key sections with a new constant that users can define if they want to enable it, at the cost of upgrade compatibility.

Ah, good idea!

kgaillot · 2020-05-27T16:40:44Z

> > Rather than revert it entirely, I can ifdef the key sections with a new constant that users can define if they want to enable it, at the cost of upgrade compatibility.
>
> Ah, good idea!

I'll do that then.

It's unlikely a user would upgrade from a newer 1.1 to an older 2.0, but an example of where that might happen is if someone switches from compiling their own on an older platform, to stock packages on a newer platform with an older 2.0.

Also, unless we did a feature set bump, a mixed-version cluster is allowed (in this case anything since 1.1.18). With such a cluster, the feature would start or stop working depending on which node is DC. (It's not wise to run such a cluster outside a rolling upgrade, but it is supported.)
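
As a rough illustration of the mixed-version concern, the toy model below shows two things: the feature is in effect only while the DC is a node whose build carries it, and with an unchanged feature set the cluster has no version signal to tell such builds apart. It does not model how Pacemaker elects a DC. `Member`, `feature_active()` and `distinguishable_by_feature_set()` are hypothetical names, and the `(3, 0, 14)` value is the 1.1 feature set quoted earlier in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    feature_set: tuple        # e.g. (3, 0, 14)
    has_priority_delay: bool  # whether this build carries the backport


def feature_active(members, dc_name):
    """The feature works only if the current DC's build implements it."""
    dc = next(m for m in members if m.name == dc_name)
    return dc.has_priority_delay


def distinguishable_by_feature_set(members):
    """True if the feature set alone separates builds with the feature
    from builds without it."""
    with_feature = {m.feature_set for m in members if m.has_priority_delay}
    without = {m.feature_set for m in members if not m.has_priority_delay}
    return with_feature.isdisjoint(without)


members = [
    Member("node1", (3, 0, 14), True),   # newer 1.1 build with the backport
    Member("node2", (3, 0, 14), False),  # older 1.1 build without it
]
```

Here `feature_active(members, "node1")` is true and `feature_active(members, "node2")` is false, while `distinguishable_by_feature_set(members)` is false because both builds report the same feature set.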

We could revise our support guarantees, but I'd rather do that at 3.0.0 than break a promise we've already made. I do like the current policy though, it avoids needing a compatibility matrix, and is intuitive enough to minimize surprises for anyone unfamiliar with the policy.

kgaillot · 2020-05-27T16:46:22Z

BTW I see I overlooked this issue with fence-reaction in 1.1.22. :( It looks like everything else since 1.1.18 is acceptable.

kgaillot · 2020-05-27T17:14:56Z

See #2082 for -DENABLE_PRIORITY_FENCING_DELAY

gao-yan · 2020-05-27T18:22:37Z

> We could revise our support guarantees, but I'd rather do that at 3.0.0 than break a promise we've already made. I do like the current policy though, it avoids needing a compatibility matrix, and is intuitive enough to minimize surprises for anyone unfamiliar with the policy.

Makes sense.

> BTW I see I overlooked this issue with fence-reaction in 1.1.22. :(

Didn't think of that either...

> It looks like everything else since 1.1.18 is acceptable.

Alright.

gao-yan added 15 commits April 17, 2020 11:46

Feature: scheduler: implement priority-fencing-delay

05c6800

Test: scheduler: add regression test for priority-fencing-delay

fcd987d

This is based on the existing test whitebox-imply-stop-on-fence.

Feature: libstonithd: introduce fence_with_delay() operation

47cfa38

A parameter value -1 disables enforced fencing delay. Operation fence() is now a wrapper for fence_with_delay().
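
The wrapper arrangement described in this commit can be pictured roughly as follows. This is a toy sketch of the pattern only, not the libstonithd API: `FencerClient`, its argument lists, and the `DELAY_DISABLED` constant are hypothetical, and the client merely records requests instead of talking to a fencer.

```python
DELAY_DISABLED = -1  # hypothetical name for the "no enforced delay" value


class FencerClient:
    """Toy stand-in for a fencing client; it only records requests."""

    def __init__(self):
        self.requests = []

    def fence_with_delay(self, node, action, timeout, delay):
        """Request fencing of `node`, carrying an explicit delay."""
        self.requests.append(
            {"node": node, "action": action, "timeout": timeout, "delay": delay}
        )
        return 0  # 0 stands for success in this sketch

    def fence(self, node, action, timeout):
        """Existing entry point: same arguments as before, delegating to
        fence_with_delay() with the enforced delay disabled."""
        return self.fence_with_delay(node, action, timeout, DELAY_DISABLED)
```

Keeping `fence()` as a thin wrapper means existing callers are unaffected, while new callers such as the controller can pass a delay explicitly.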

Feature: controller: request fencing with any enforced priority fencing delay

b42f6e3

Feature: stonith_admin: add --delay option to support enforced fencing delay

6c3039a

It can be specified with --fence, --reboot or --unfence commands. The default value -1 disables enforced fencing delay.

Test: fencer: add cpg_topology_delay test to verify enforced fencing delay with fencing topology

432ee3a

Doc: Pacemaker Explained: document priority-fencing-delay cluster option

d7bab0e

Feature: scheduler: do not differentiate the case where all the nodes have equal priority

69cc3f6

In any case, priority-fencing-delay won't take precedence over any configured pcmk_delay_base/max.

Feature: controller: requested priority fencing delay defaults to 0

ec67feb

Test: fencer: update cpg_topology_delay test to also verify pcmk_delay_base is added

7367658

This commit also updates log patterns for the log changes.

kgaillot merged commit 2b620fa into ClusterLabs:1.1 Apr 27, 2020
