fence_heuristics_resource: add new fence-agent for dynamic delay fencing #308
base: main
Conversation
Can one of the admins verify this patch?
Signed-off-by: Kumabuchi Kenji <[email protected]>
Force-pushed from b8c5478 to c928a80
ok to test
standby_wait = int(options["--standby-wait"])
p = None
cmd = "%s -r %s -W" % (crm_resource_path, resource)
search_str = re.compile(r"\s%s$" % os.uname()[1])
As pacemaker doesn't necessarily use uname as the name for the node, it would be preferable here to use crm_node --name or something similar.
Fixed to use crm_node --name
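A minimal sketch of how a fence agent might resolve the local node name, preferring crm_node --name over uname as suggested above (the helper name and the uname fallback are assumptions for illustration, not the PR's actual code):

```python
import os
import subprocess

def get_node_name():
    """Return the local node name as pacemaker knows it.

    Pacemaker node names don't have to match uname -n, so prefer
    `crm_node --name`; fall back to uname only if crm_node is
    unavailable (e.g. when running outside a cluster node).
    """
    try:
        out = subprocess.check_output(["crm_node", "--name"],
                                      stderr=subprocess.DEVNULL)
        name = out.decode().strip()
        if name:
            return name
    except (OSError, subprocess.CalledProcessError):
        pass
    return os.uname()[1]
```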
Actually the idea of heuristics was rather to enable/disable another fence agent than to introduce a delay. You should at least adapt the description accordingly.
And the logic should cover the case of self-fencing, I guess, where additional waiting doesn't make sense regardless of whether the resource is running or not.
To make it more generic it would probably make sense to think of promotable/master-slave resources as well.
@wenningerk Thank you for your feedback. I think it's not so difficult to fix the description and consider the promotable resources, but I have no idea how to cover the self-fence case. Is there a good way to detect it? Certainly, in the case of self-fence, a wasteful delay occurs, but honestly, I think it's no different from a delay parameter fixed to a node.
Well you know who you are (crm_node --name) and you know the target ;-)
Was talking as well to Ken yesterday about that approach and he is sharing my feeling that doing something complicated like determining the location of a resource should be considered carefully as part of a fence-agent. At least a note in the fence-agent should state that this maybe puts more stress on keeping the one instance running than on reliable recovery.
Signed-off-by: Kumabuchi Kenji <[email protected]>
@wenningerk & KEN
I certainly think that the above approach is safer.
If you feel unpleasant with it leave the default as is and make returning FALSE an option ...
Hi,
Thanks for contributing a significant new capability! I do have some concerns that I've commented on.
logging.error("Command failed. rc=%s, stderr=%s" % (rc, err))
return False

search_str = re.compile(r"\s%s%s$" % (mynodename, '\sMaster' if promotable else ''))
Have you tested this? I think there's a space at the end of the line if "Master" isn't printed.
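For illustration, assuming the output line does carry a trailing space when "Master" isn't printed, a quick check shows how the anchored pattern misses (the sample line is hypothetical, not verbatim crm_resource output):

```python
import re

mynodename, promotable = "node1", False
pattern = re.compile(r"\s%s%s$" % (mynodename, r"\sMaster" if promotable else ""))

# Hypothetical `crm_resource -W` line with a trailing space.
line = "resource dummy is running on: node1 "

print(pattern.search(line))            # the trailing space defeats the $ anchor
print(pattern.search(line.rstrip()))   # matches once the line is stripped
```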
I'd be a little concerned about introducing a dependency on the output format. We regularly have problems crop up when other software (e.g. pcs) looks for specific strings in the output, since we don't always remember that something requires it when we need to update the pacemaker output for some reason.
We actually are in the middle of a project to have all the pacemaker tools support XML output, so external code can use a format that we will try to keep backward-compatible while still allowing us to change text output as needed. Unfortunately crm_resource has not yet been adapted, and you'd want to support older pacemaker versions anyway.
Another concern I have is that crm_resource -W doesn't distinguish between running cleanly and failed. The output would be the same in either case. If the resource is failed and the node needs to be fenced, this will unnecessarily delay recovery.
One possibility would be to use crm_mon --as-xml instead, which would avoid both the magic string problem and the success vs failure problem. crm_mon's XML output has been supported "forever". It will generate a bunch of output including something like:
<resource id="R" ... failed="true" failure_ignored="false" ... >
<node name="N" ... />
</resource>
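A sketch of how an agent could consume that XML with the standard library (the fragment below is a trimmed, hypothetical sample shaped like crm_mon --as-xml output, not a verbatim capture):

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical fragment; real output has more attributes
# and an outer wrapper element.
xml_doc = """
<resources>
  <resource id="R" role="Started" active="true"
            failed="true" failure_ignored="false">
    <node name="N" id="1" cached="false"/>
  </resource>
</resources>
"""

root = ET.fromstring(xml_doc)
res = root.find("resource")
# Treat the resource as failed only if the failure isn't being ignored.
really_failed = (res.get("failed") == "true"
                 and res.get("failure_ignored") == "false")
active_nodes = [n.get("name") for n in res.findall("node")]
print(really_failed, active_nodes)
```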
Another corner case is that a resource can be (wrongly) active on more than one node. In a 2-node cluster, it would be simple to skip the delay here if the resource is active on the local node, even if it is also active on the other node. In a larger cluster the best course of action is not as obvious, but a safe choice would be to skip the delay if the resource is active on multiple nodes.
Side note: the upcoming release of pacemaker will have a slight change in how crm_mon does XML output. --as-xml will still behave the same, but it will be deprecated (however it will remain supported for years to come). A new --output-as=xml option will generate nearly identical XML output, but with a different outermost tag. That is part of the project to make all tools support XML -- that option and outermost tag will be identical for all tools.
I think there's a space at the end of the line if "Master" isn't printed.
At this point, the implementation removed the trailing space with strip().
One possibility would be to use crm_mon --as-xml instead, which would avoid both the magic string problem and the success vs failure problem. crm_mon's XML output has been supported "forever". It will generate a bunch of output including something like:
I added a new commit to use crm_mon --as-xml.
Is it necessary and sufficient to detect resource failure by checking the following two attributes?
failed="true" failure_ignored="false"
If even one of the resources has failed, including resources inside the resource group or promotable resource, the delay is skipped.
Another corner case is that a resource can be (wrongly) active on more than one node. In a 2-node cluster, it would be simple to skip the delay here if the resource is active on the local node, even if it is also active on the other node. In a larger cluster the best course of action is not as obvious, but a safe choice would be to skip the delay if the resource is active on multiple nodes.
Regardless of the number of nodes in the cluster, the delay is skipped when the resource is started on multiple nodes.
In the latest commit, the delay occurs only when the specified resource, resource group, promotable master resource is started on another single node (not remote) and has not failed. The delay is also skipped for self-fencing.
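The decision logic described above could be sketched like this (a hypothetical helper for illustration, not the commit's actual code):

```python
def should_delay(mynodename, active_nodes, failed, remote_nodes=()):
    """Delay only when the resource is cleanly active on exactly one
    node, that node is not us (no point delaying our own self-fencing),
    and it is a full cluster node rather than a Pacemaker Remote node.
    """
    if failed:
        return False           # failed resource: recover fast, skip delay
    if len(active_nodes) != 1:
        return False           # inactive or multi-active: skip delay
    node = active_nodes[0]
    if node == mynodename:
        return False           # self-fencing: waiting gains nothing
    if node in remote_nodes:
        return False           # running on a remote node: skip delay
    return True
```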
"getopt" : "r:",
"longopt" : "resource",
"required" : "1",
"help" : "-r, --resource=[resource-id] ID of the resource that should be running in the ACT node",
I would document that it does not make sense to specify a cloned or bundled resource unless it is promotable and has only a single master instance.
Ok. I documented it.
"shortdesc" : "Handle the promotable resource. The node on which the master resource is running is considered as ACT.",
"default" : "False",
"order" : 1
}
If you use crm_mon --as-xml, you can detect this automatically. Cloned resources are wrapped in clone tags that include a multi_state true/false attribute that is equivalent to promotable.
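A sketch of that detection (the XML fragment is a trimmed, hypothetical sample of what a promotable clone looks like in crm_mon --as-xml output):

```python
import xml.etree.ElementTree as ET

# Hypothetical trimmed fragment for a promotable clone;
# multi_state="true" marks the clone as promotable.
xml_doc = """
<resources>
  <clone id="R-clone" multi_state="true" unique="false">
    <resource id="R" role="Master" active="true"
              failed="false" failure_ignored="false">
      <node name="N1" id="1" cached="false"/>
    </resource>
    <resource id="R" role="Slave" active="true"
              failed="false" failure_ignored="false">
      <node name="N2" id="2" cached="false"/>
    </resource>
  </clone>
</resources>
"""

root = ET.fromstring(xml_doc)
clone = root.find("clone")
promotable = clone.get("multi_state") == "true"
# Collect the nodes hosting the promoted (Master) instance.
masters = [n.get("name")
           for res in clone.findall("resource") if res.get("role") == "Master"
           for n in res.findall("node")]
print(promotable, masters)
```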
Fixed to distinguish promotable resources based on clone tags and attributes in the xml output.
"longopt" : "standby-wait",
"required" : "0",
"help" : "-w, --standby-wait=[seconds] Wait X seconds on SBY node. If a positive number is specified, fencing action of this agent will always succeed after waits.",
"shortdesc" : "Wait X seconds on SBY node. If a positive number is specified, fencing action of this agent will always succeed after waits.",
The agent will delay but not succeed
Fixed.
Sorry for the late response. Thank you for many important comments.
Delaying fencing actions that target the nodes hosting more significant resources or resource instances (promoted), so that those nodes win fencing matches, has been a very interesting topic indeed for a lot of use cases, especially in regard to 2-node clusters. Thanks for bringing up the topic as well with such an implementation via a special fence agent ... Actually, I've been thinking about the possibility of achieving this in pacemaker without requiring an additional fence agent or resource agent. How about we introduce a couple of meta attributes for resources, for example:
-- Delay fencing actions targeting the nodes that are hosting the promoted instances of this resource for the specified period of time, so that they survive fencing matches. Fencer could either combine these delays together with pcmk_delay_base/max to calculate the eventual delay, or pick the longest delay among them as the actual delay. What do you think?
That's an interesting idea -- I was hesitant about agents getting that involved in pacemaker's internal state, and that would avoid any such concerns. We already have a "priority" meta-attribute for resources -- maybe it could be involved somehow. Instead of a per-resource fencing-delay, maybe a cluster-wide priority-fencing-delay that would get applied to the node with the highest total resource priority. I'm not sure how to prefer promoted instances, maybe a "promoted-priority" meta-attribute for clones that would get added to the base priority of promoted instances. The fencer uses the "quick location" flag with the scheduler, so I'm not sure offhand whether it has all the information needed.
Guess this is bringing up the old issue again that the view fenced has of the state of the cluster
I like the idea of the scheduler calculating the delay and adding it to the graph action. However that would mean it would only apply to fencing initiated by the cluster (and not e.g. stonith_admin or dlm).
Probably somehow let fencer check the cluster status and calculate the delay only when fencing is actually requested?
Good thinking. If we just involve the "priority" meta-attribute, that'd be fine. Not sure if it'd become confusing to introduce another "promoted-priority" attribute... A simpler rule of course could be "promoted instances only take a little higher priority (+0.000001) than the base priority on calculation, not higher than any other resources with real higher priority ( >= +1)".
Up to now we've tried to avoid the overhead. The fencer checks the status at noncritical times like when the CIB is changed. When fencing is requested, we kind of want that to be as streamlined as possible, both to avoid unnecessary delays and to reduce the chance that something could go wrong. If we want status then, we have to get the current CIB, so worst case we have to wait for that to timeout before giving up. But it would make it easier to have dynamic capabilities like rule processing and this. If we did check status at fence time, we'd have to define what to do when reading the current CIB fails. Do we go with the most recent copy we have, or fail? If we use the most recent copy, do we disable the priority feature to avoid using outdated information?
That makes sense. Scores are integer so I'd just go with +1 rather than convert back and forth from floating-point. The +1 for implicit clone stickiness establishes a precedent. In my experience users tend to use bigger jumps between configured scores anyway (at least 10). Overall I lean to doing the calculations in the scheduler (and losing the feature for external fencing) rather than making the fencer more complex, but I can see arguments both ways.
It'd be a difficult situation indeed ...
Makes sense.
Agreed. There probably always have to be trade-offs. I'll start with something on this.
And again a question is: should the fencer combine priority-fencing-delay with the configured pcmk_delay_base/max, or should one take precedence?
My first reaction was that they should combine, but thinking about possible scenarios, I think priority-fencing-delay should always have precedence. The idea is that the user could configure priority-fencing-delay, which would always be used for cluster-initiated fencing, and then configure a static and/or random delay as a fallback to be used with external fencing. If the fencer receives a request with a priority delay specified, it would use that, otherwise it would use the normal delay parameters. That means the scheduler should always add it to fencing ops, even if it's 0 (i.e. targeting a lower-priority node), if the option is enabled.

The only scenario I can think of where combining them would be useful would be for when two nodes have equal priority. Someone could configure a static and/or random delay totalling smaller than the priority delay, which would only matter in that situation. But since by definition the user doesn't have a preference which node gets fenced in that case, it should be acceptable to pick one node algorithmically (lowest ID or whatever) to get a +1 in that situation. A mixed-version cluster might be another scenario where they're useful, but that should be a temporary situation.

I wonder if it would be worthwhile to have some exceptions, e.g. don't use a delay if we're fencing a node in our partition, and/or use the delay only if there's an even number of nodes. But presumably the user wouldn't configure a delay for an odd number of nodes.
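The precedence rule being discussed could be sketched as follows (a hypothetical helper for illustration, not pacemaker code; pcmk_delay_max is treated here as the upper bound of a random extra delay):

```python
import random

def effective_delay(priority_delay, pcmk_delay_base=0, pcmk_delay_max=0):
    """If a priority delay was attached to the fencing op, even a 0,
    it takes precedence; otherwise fall back to the static/random
    delay parameters of the fencing resource.
    """
    if priority_delay is not None:
        return priority_delay
    delay = pcmk_delay_base
    if pcmk_delay_max > 0:
        delay += random.uniform(0, pcmk_delay_max)
    return delay
```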
This feature addresses the relevant topics and implements the ideas brought up from: ClusterLabs/fence-agents#308 This commit adds priority-fencing-delay option (just the option, not the feature itself). Enforce specified delay for the fencings that are targeting the lost nodes with the highest total resource priority in case we don't have the majority of the nodes in our cluster partition, so that the more significant nodes potentially win any fencing match, which is especially meaningful under split-brain of 2-node cluster. If all the nodes have equal priority, then any pcmk_delay_base/max configured for the corresponding fencing resources will be applied. Otherwise as long as it's set, even if to 0, it takes precedence over any configured pcmk_delay_base/max. By default, priority fencing delay is disabled.
This feature addresses the relevant topics and implements the ideas brought up from: ClusterLabs/fence-agents#308 This commit adds priority-fencing-delay option (just the option, not the feature itself). Enforce specified delay for the fencings that are targeting the lost nodes with the highest total resource priority in case we don't have the majority of the nodes in our cluster partition, so that the more significant nodes potentially win any fencing match, which is especially meaningful under split-brain of 2-node cluster. A promoted resource instance takes the base priority + 1 on calculation if the base priority is not 0. If all the nodes have equal priority, then any pcmk_delay_base/max configured for the corresponding fencing resources will be applied. Otherwise as long as it's set, even if to 0, it takes precedence over any configured pcmk_delay_base/max. By default, priority fencing delay is disabled.
Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/fence-agents-pipeline/job/PR-308/1/input
In a two-node cluster, if a fence race occurs while the service is running on the standby node (i.e. the node on which the fence-agent delay is set) after a failover due to, e.g., a monitoring failure on the active node, the standby node will be fenced and the service cannot be continued.
This is because the fencing delay setting is fixed to a particular node.
proposal (this PR)
Implement a new heuristics agent that determines the node where the service (resource) is running and dynamically delays fencing.
Signed-off-by: Kumabuchi Kenji [email protected]