Fix: agent: prevent fencing from hanging on sbd commands if any of the devices is silently blocked #119
…ng on list and dump commands if any of the devices is silently blocked
If any of the configured SBD devices is silently blocked without an explicit I/O error from the kernel, fencing gets stuck and times out, even if the majority of the devices are still available. On fencing, the list and dump commands are called first. In this situation, the commands print their output but then get stuck in exit_aio() on exit and enter D state. With this commit, the sbd fence agent calls the commands asynchronously, once per device, waits for any successful return, and collects the output, which prevents the fence agent from hanging.
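The approach in this commit message can be illustrated with a minimal sketch. It is not the code from this PR: `first_success` and `build_cmd` are hypothetical names, and the sketch only assumes that each device gets its own child process and that the first successful return is enough.

```python
import queue
import subprocess
import threading
import time


def first_success(devices, build_cmd, timeout):
    """Run one command per device in parallel and return (device, stdout)
    of the first command that exits with status 0, or None on timeout."""
    results = queue.Queue()

    def worker(dev):
        # Each worker may block indefinitely if its child never exits.
        try:
            proc = subprocess.run(build_cmd(dev), capture_output=True, text=True)
            results.put((dev, proc.returncode, proc.stdout))
        except OSError as exc:
            results.put((dev, -1, str(exc)))

    for dev in devices:
        # Daemon threads do not keep the caller alive if a child stays stuck.
        threading.Thread(target=worker, args=(dev,), daemon=True).start()

    deadline = time.monotonic() + timeout
    pending = len(devices)
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            dev, rc, out = results.get(timeout=remaining)
        except queue.Empty:
            break
        pending -= 1
        if rc == 0:
            return dev, out  # do not wait for the remaining devices
    return None
```

In a fence agent, `build_cmd` could be something like `lambda dev: ["sbd", "-d", dev, "list"]`. A child stuck in D state is simply left behind; the caller returns as soon as any other device answers.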
…if any of the devices is silently blocked
Unlike the list and dump commands, the message command is already effectively asynchronous. Rather than accessing the devices directly, it spawns one writing child process per device in parallel, waits for a majority of them to finish writing the poison pill, and then returns, even if a minority gets stuck in D state. But if it is called by the stonith command, the sbd fence agent process becomes "defunct" and gets stuck. This commit prevents that by calling the message command asynchronously in a subshell.
… of the devices is silently blocked
If any of the configured SBD devices is silently blocked without an explicit I/O error from the kernel, the status action gets stuck on the list command, which hangs in exit_aio() and enters D state. With this commit, the sbd fence agent calls the list command asynchronously, once per device, and does not wait for devices that have already been reported as failed, which prevents it from hanging in this situation.
Let's first confirm the concept. I'll ask the user to test it as well.
Definitely something that looks as if it needs improvement. @kgaillot What do you think?
@gao-yan what are you using to simulate the stall behavior? Maybe we can add it to the tests then.
Indeed... I'll take a look at what could be done with sbd itself.
I haven't figured out how to simulate it in a test environment. So far I've been testing it with an iSCSI setup and iptables :-)
Maybe something simple in the testbed I've written would be enough.
Not sure what you mean. Are you suggesting using something like topology to manage multiple sbd devices, so that sbd and fence_sbd only ever deal with a single device?
Yes. I wanted to see whether thinking in that direction leads somewhere useful, as pacemaker already has the logic for individual timeouts and for combining results. Parallel fencing is already available, though I guess not in the case of topologies. Being content with a quorate number of positive results would have to be added.
@gao-yan
You happened to ask :-) I actually just got the chance to get back to this one. I'm working on a solution in the C code that makes several sbd commands execute in sub-processes for the respective devices, to prevent the main process from hanging. It should be universally beneficial for both sbd fence agents, so that we can avoid changing the fence agents. I'm going to open a PR soon to show you the draft so that we can talk about the details.
That's cool! Good that I asked before starting a parallel project ;-)
…or respective devices This is a universal solution to prevent fencing from hanging on silently blocked devices, as originally brought up in ClusterLabs#119.
…es for respective devices This is a universal solution to prevent fencing from hanging on silently blocked devices, as originally brought up in ClusterLabs#119.
If any of the configured SBD devices is silently blocked without an explicit I/O error from the kernel, fencing gets stuck and times out, even if the majority of the devices are still available.
In this situation, the gethosts, off/reset and status actions get stuck on the sbd list, dump, or message command, which hangs in exit_aio() on exit and enters D state.
With these commits, the sbd fence agent calls the commands asynchronously, once per device, and returns as soon as the purpose of each command is achieved, which prevents the fence agent from hanging unnecessarily in this situation.
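For the message command, one plausible reading of "purpose achieved" is that a majority of devices accepted the write. The sketch below illustrates that criterion under stated assumptions; `majority_succeeds` and `build_cmd` are hypothetical names and this is not the code from the commits.

```python
import queue
import subprocess
import threading
import time


def majority_succeeds(devices, build_cmd, timeout):
    """Run one command per device in parallel and return True as soon as
    more than half of them exit with status 0. Return False on timeout or
    once a majority can no longer be reached."""
    needed = len(devices) // 2 + 1
    results = queue.Queue()

    def worker(dev):
        try:
            rc = subprocess.run(
                build_cmd(dev),
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            ).returncode
        except OSError:
            rc = -1
        results.put(rc == 0)

    for dev in devices:
        # Daemon threads: a child stuck on a blocked device cannot hold us up.
        threading.Thread(target=worker, args=(dev,), daemon=True).start()

    ok = failed = 0
    deadline = time.monotonic() + timeout
    # Keep waiting only while a majority is neither reached nor impossible.
    while ok < needed and failed <= len(devices) - needed:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            success = results.get(timeout=remaining)
        except queue.Empty:
            break
        if success:
            ok += 1
        else:
            failed += 1
    return ok >= needed
```

With three devices, two successful writes are enough to return, regardless of whether the third child ever exits.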