fence_heuristics_resource: add new fence-agent for dynamic delay fencing #308
base: main
Conversation
Can one of the admins verify this patch?
Signed-off-by: Kumabuchi Kenji <[email protected]>
Force-pushed from b8c5478 to c928a80
ok to test
standby_wait = int(options["--standby-wait"])
p = None
cmd = "%s -r %s -W" % (crm_resource_path, resource)
search_str = re.compile(r"\s%s$" % os.uname()[1])
As pacemaker doesn't necessarily use uname as the name for the node, it would be preferable here to use crm_node --name or something similar.
Fixed to use crm_node --name
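A minimal sketch of how a fence agent might resolve the local node name, preferring crm_node --name over uname as suggested above (the helper name and the uname fallback are assumptions for illustration, not the PR's actual code):

```python
import os
import subprocess

def get_node_name():
    """Return the local node name as pacemaker knows it.

    Pacemaker node names don't have to match uname -n, so prefer
    `crm_node --name`; fall back to uname only if crm_node is
    unavailable (e.g. when running outside a cluster node).
    """
    try:
        out = subprocess.check_output(["crm_node", "--name"],
                                      stderr=subprocess.DEVNULL)
        name = out.decode().strip()
        if name:
            return name
    except (OSError, subprocess.CalledProcessError):
        pass
    return os.uname()[1]
```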
Actually the idea of heuristics was rather to enable/disable another fence agent than to introduce a delay. You should at least adapt the description accordingly.
And the logic should cover the case of self-fencing, I guess, where additional waiting doesn't make sense regardless of whether the resource is running or not.
To make it more generic it would probably make sense to think of promotable/master-slave resources as well.
@wenningerk Thank you for your feedback. I think it's not so difficult to fix the description and consider the promotable resources, but I have no idea how to cover the self-fence case. Is there a good way to detect it? Certainly, in the case of self-fence, a wasteful delay occurs, but honestly, I think it's no different from a delay parameter fixed to a node.
Well you know who you are (crm_node --name) and you know the target ;-)
Was talking as well to Ken yesterday about that approach and he is sharing my feeling that doing something complicated like determining the location of a resource should be considered carefully as part of a fence-agent. At least a note in the fence-agent should state that this maybe puts more stress on keeping the one instance running than on reliable recovery.
Signed-off-by: Kumabuchi Kenji <[email protected]>
@wenningerk & KEN
I certainly think that the above approach is safer.
If you feel unpleasant with it leave the default as is and make returning FALSE an option ...
Hi,
Thanks for contributing a significant new capability! I do have some concerns that I've commented on.
logging.error("Command failed. rc=%s, stderr=%s" % (rc, err))
return False

search_str = re.compile(r"\s%s%s$" % (mynodename, '\sMaster' if promotable else ''))
Have you tested this? I think there's a space at the end of the line if "Master" isn't printed.
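For illustration, assuming the output line does carry a trailing space when "Master" isn't printed, a quick check shows how the anchored pattern misses (the sample line is hypothetical, not verbatim crm_resource output):

```python
import re

mynodename, promotable = "node1", False
pattern = re.compile(r"\s%s%s$" % (mynodename, r"\sMaster" if promotable else ""))

# Hypothetical `crm_resource -W` line with a trailing space.
line = "resource dummy is running on: node1 "

print(pattern.search(line))            # the trailing space defeats the $ anchor
print(pattern.search(line.rstrip()))   # matches once the line is stripped
```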
I'd be a little concerned about introducing a dependency on the output format. We regularly have problems crop up when other software (e.g. pcs) looks for specific strings in the output, since we don't always remember that something requires it when we need to update the pacemaker output for some reason.
We actually are in the middle of a project to have all the pacemaker tools support XML output, so external code can use a format that we will try to keep backward-compatible while still allowing us to change text output as needed. Unfortunately crm_resource has not yet been adapted, and you'd want to support older pacemaker versions anyway.
Another concern I have is that crm_resource -W doesn't distinguish between running cleanly and failed. The output would be the same in either case. If the resource is failed and the node needs to be fenced, this will unnecessarily delay recovery.
One possibility would be to use crm_mon --as-xml instead, which would avoid both the magic string problem and the success vs failure problem. crm_mon's XML output has been supported "forever". It will generate a bunch of output including something like:
<resource id="R" ... failed="true" failure_ignored="false" ... >
<node name="N" ... />
</resource>
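A sketch of how an agent could consume that XML with the standard library (the fragment below is a trimmed, hypothetical sample shaped like crm_mon --as-xml output, not a verbatim capture):

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical fragment; real output has more attributes
# and an outer wrapper element.
xml_doc = """
<resources>
  <resource id="R" role="Started" active="true"
            failed="true" failure_ignored="false">
    <node name="N" id="1" cached="false"/>
  </resource>
</resources>
"""

root = ET.fromstring(xml_doc)
res = root.find("resource")
# Treat the resource as failed only if the failure isn't being ignored.
really_failed = (res.get("failed") == "true"
                 and res.get("failure_ignored") == "false")
active_nodes = [n.get("name") for n in res.findall("node")]
print(really_failed, active_nodes)
```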
Another corner case is that a resource can be (wrongly) active on more than one node. In a 2-node cluster, it would be simple to skip the delay here if the resource is active on the local node, even if it is also active on the other node. In a larger cluster the best course of action is not as obvious, but a safe choice would be to skip the delay if the resource is active on multiple nodes.
Side note: the upcoming release of pacemaker will have a slight change in how crm_mon does XML output. --as-xml will still behave the same, but it will be deprecated (however it will remain supported for years to come). A new --output-as=xml option will generate nearly identical XML output, but with a different outermost tag. That is part of the project to make all tools support XML -- that option and outermost tag will be identical for all tools.
I think there's a space at the end of the line if "Master" isn't printed.
At this point, the implementation removed the trailing space with strip().
One possibility would be to use crm_mon --as-xml instead, which would avoid both the magic string problem and the success vs failure problem. crm_mon's XML output has been supported "forever". It will generate a bunch of output including something like:
I added a new commit to use crm_mon --as-xml.
Is it necessary and sufficient to detect resource failure by checking the following two attributes?
failed="true" failure_ignored="false"
If even one of the resources has failed, including resources inside the resource group or promotable resource, the delay is skipped.
Another corner case is that a resource can be (wrongly) active on more than one node. In a 2-node cluster, it would be simple to skip the delay here if the resource is active on the local node, even if it is also active on the other node. In a larger cluster the best course of action is not as obvious, but a safe choice would be to skip the delay if the resource is active on multiple nodes.
Regardless of the number of nodes in the cluster, the delay is skipped when the resource is started on multiple nodes.
In the latest commit, the delay occurs only when the specified resource, resource group, promotable master resource is started on another single node (not remote) and has not failed. The delay is also skipped for self-fencing.
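The decision logic described above could be sketched like this (a hypothetical helper for illustration, not the commit's actual code):

```python
def should_delay(mynodename, active_nodes, failed, remote_nodes=()):
    """Delay only when the resource is cleanly active on exactly one
    node, that node is not us (no point delaying our own self-fencing),
    and it is a full cluster node rather than a Pacemaker Remote node.
    """
    if failed:
        return False           # failed resource: recover fast, skip delay
    if len(active_nodes) != 1:
        return False           # inactive or multi-active: skip delay
    node = active_nodes[0]
    if node == mynodename:
        return False           # self-fencing: waiting gains nothing
    if node in remote_nodes:
        return False           # running on a remote node: skip delay
    return True
```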
"getopt" : "r:",
"longopt" : "resource",
"required" : "1",
"help" : "-r, --resource=[resource-id] ID of the resource that should be running in the ACT node",
I would document that it does not make sense to specify a cloned or bundled resource unless it is promotable and has only a single master instance.
Ok. I documented it.
"shortdesc" : "Handle the promotable resource. The node on which the master resource is running is considered as ACT.",
"default" : "False",
"order" : 1
}
If you use crm_mon --as-xml, you can detect this automatically. Cloned resources are wrapped in clone tags that include a multi_state true/false attribute that is equivalent to promotable.
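A sketch of that detection (the XML fragment is a trimmed, hypothetical sample of what a promotable clone looks like in crm_mon --as-xml output):

```python
import xml.etree.ElementTree as ET

# Hypothetical trimmed fragment for a promotable clone;
# multi_state="true" marks the clone as promotable.
xml_doc = """
<resources>
  <clone id="R-clone" multi_state="true" unique="false">
    <resource id="R" role="Master" active="true"
              failed="false" failure_ignored="false">
      <node name="N1" id="1" cached="false"/>
    </resource>
    <resource id="R" role="Slave" active="true"
              failed="false" failure_ignored="false">
      <node name="N2" id="2" cached="false"/>
    </resource>
  </clone>
</resources>
"""

root = ET.fromstring(xml_doc)
clone = root.find("clone")
promotable = clone.get("multi_state") == "true"
# Collect the nodes hosting the promoted (Master) instance.
masters = [n.get("name")
           for res in clone.findall("resource") if res.get("role") == "Master"
           for n in res.findall("node")]
print(promotable, masters)
```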
Fixed to distinguish promotable resources based on clone tags and attributes in the xml output.
"longopt" : "standby-wait",
"required" : "0",
"help" : "-w, --standby-wait=[seconds] Wait X seconds on SBY node. If a positive number is specified, fencing action of this agent will always succeed after waits.",
"shortdesc" : "Wait X seconds on SBY node. If a positive number is specified, fencing action of this agent will always succeed after waits.",
The agent will delay but not succeed
Fixed.
Sorry for the late response. Thank you for many important comments.
Delaying fencing actions that target the nodes hosting more significant resources or resource instances (promoted), so that those nodes win fencing matches, has been a very interesting topic indeed for a lot of use cases, especially in regard to 2-node clusters. Thanks for bringing up the topic as well with such an implementation via a special fence agent ... Actually, I've been thinking about the possibility of achieving this in pacemaker without requiring an additional fence agent or resource agent. How about we introduce a couple of meta attributes for resources, for example:
-- Delay fencing actions targeting the nodes that are hosting the promoted instances of this resource for the specified period of time, so that they survive fencing matches. Fencer could either combine these delays together with pcmk_delay_base/max to calculate the eventual delay, or pick the longest delay among them as the actual delay. What do you think?
That's an interesting idea -- I was hesitant about agents getting that involved in pacemaker's internal state, and that would avoid any such concerns. We already have a "priority" meta-attribute for resources -- maybe it could be involved somehow. Instead of a per-resource fencing-delay, maybe a cluster-wide priority-fencing-delay that would get applied to the node with the highest total resource priority. I'm not sure how to prefer promoted instances, maybe a "promoted-priority" meta-attribute for clones that would get added to the base priority of promoted instances. The fencer uses the "quick location" flag with the scheduler, so I'm not sure offhand whether it has all the information needed.
Guess this is bringing up the old issue again that the view fenced has of the state of the cluster
I like the idea of the scheduler calculating the delay and adding it to the graph action. However that would mean it would only apply to fencing initiated by the cluster (and not e.g. stonith_admin or dlm).
Probably somehow let fencer check the cluster status and calculate the delay only when fencing is actually requested?
Good thinking. If we just involve the "priority" meta-attribute, that'd be fine. Not sure if it'd become confusing to introduce another "promoted-priority" attribute... A simpler rule of course could be "promoted instances only take a little higher priority (+0.000001) than the base priority on calculation, not higher than any other resources with real higher priority ( >= +1)".
Up to now we've tried to avoid the overhead. The fencer checks the status at noncritical times like when the CIB is changed. When fencing is requested, we kind of want that to be as streamlined as possible, both to avoid unnecessary delays and to reduce the chance that something could go wrong. If we want status then, we have to get the current CIB, so worst case we have to wait for that to timeout before giving up. But it would make it easier to have dynamic capabilities like rule processing and this. If we did check status at fence time, we'd have to define what to do when reading the current CIB fails. Do we go with the most recent copy we have, or fail? If we use the most recent copy, do we disable the priority feature to avoid using outdated information?
That makes sense. Scores are integer so I'd just go with +1 rather than convert back and forth from floating-point. The +1 for implicit clone stickiness establishes a precedent. In my experience users tend to use bigger jumps between configured scores anyway (at least 10). Overall I lean to doing the calculations in the scheduler (and losing the feature for external fencing) rather than making the fencer more complex, but I can see arguments both ways.
It'd be a difficult situation indeed ...
Makes sense.
Agreed. There probably always have to be trade-offs. I'll start with something on this.
And again a question is: should the fencer combine priority-fencing-delay with the configured pcmk_delay_base/max, or should one take precedence?
My first reaction was that they should combine, but thinking about possible scenarios, I think priority-fencing-delay should always have precedence. The idea is that the user could configure priority-fencing-delay, which would always be used for cluster-initiated fencing, and then configure a static and/or random delay as a fallback to be used with external fencing. If the fencer receives a request with a priority delay specified, it would use that, otherwise it would use the normal delay parameters. That means the scheduler should always add it to fencing ops, even if it's 0 (i.e. targeting a lower-priority node), if the option is enabled.

The only scenario I can think of where combining them would be useful would be for when two nodes have equal priority. Someone could configure a static and/or random delay totalling smaller than the priority delay, which would only matter in that situation. But since by definition the user doesn't have a preference which node gets fenced in that case, it should be acceptable to pick one node algorithmically (lowest ID or whatever) to get a +1 in that situation. A mixed-version cluster might be another scenario where they're useful, but that should be a temporary situation.

I wonder if it would be worthwhile to have some exceptions, e.g. don't use a delay if we're fencing a node in our partition, and/or use the delay only if there's an even number of nodes. But presumably the user wouldn't configure a delay for an odd number of nodes.
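The precedence rule being discussed could be sketched as follows (a hypothetical helper for illustration, not pacemaker code; pcmk_delay_max is treated here as the upper bound of a random extra delay):

```python
import random

def effective_delay(priority_delay, pcmk_delay_base=0, pcmk_delay_max=0):
    """If a priority delay was attached to the fencing op, even a 0,
    it takes precedence; otherwise fall back to the static/random
    delay parameters of the fencing resource.
    """
    if priority_delay is not None:
        return priority_delay
    delay = pcmk_delay_base
    if pcmk_delay_max > 0:
        delay += random.uniform(0, pcmk_delay_max)
    return delay
```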
This feature addresses the relevant topics and implements the ideas brought up from: ClusterLabs/fence-agents#308 This commit adds priority-fencing-delay option (just the option, not the feature itself). Enforce specified delay for the fencings that are targeting the lost nodes with the highest total resource priority in case we don't have the majority of the nodes in our cluster partition, so that the more significant nodes potentially win any fencing match, which is especially meaningful under split-brain of 2-node cluster. If all the nodes have equal priority, then any pcmk_delay_base/max configured for the corresponding fencing resources will be applied. Otherwise as long as it's set, even if to 0, it takes precedence over any configured pcmk_delay_base/max. By default, priority fencing delay is disabled.
This feature addresses the relevant topics and implements the ideas brought up from: ClusterLabs/fence-agents#308 This commit adds priority-fencing-delay option (just the option, not the feature itself). Enforce specified delay for the fencings that are targeting the lost nodes with the highest total resource priority in case we don't have the majority of the nodes in our cluster partition, so that the more significant nodes potentially win any fencing match, which is especially meaningful under split-brain of 2-node cluster. A promoted resource instance takes the base priority + 1 on calculation if the base priority is not 0. If all the nodes have equal priority, then any pcmk_delay_base/max configured for the corresponding fencing resources will be applied. Otherwise as long as it's set, even if to 0, it takes precedence over any configured pcmk_delay_base/max. By default, priority fencing delay is disabled.
Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/fence-agents-pipeline/job/PR-308/1/input
In a two-node cluster, if a fence race occurs while the service is running on the standby node (i.e. the node on which the fence-agent delay is set) after a failover due to, e.g., a monitoring failure on the active node, the standby node will be fenced and the service cannot be continued.
This is because the fencing delay setting is fixed to a particular node.
proposal (this PR)
Implement a new heuristics agent that determines the node where the service (resource) is running and dynamically delays fencing.
Signed-off-by: Kumabuchi Kenji [email protected]