Abort "running" builds when the pod agent is disconnected #1362
Conversation
If true, this is a real bug, but this is not the right fix. It is OK for an agent pod to temporarily disconnect and then reconnect later; the build should proceed without interruption. If the pod is deleted (as should normally happen when you drain a node), then the API server should send an event to that effect, and there is already logic to handle that. If the agent ceases to respond and never comes back, maybe because the node crashes and the API server does not notice, then a timeout should take effect after about 5m.

Test cases using Mockito are basically useless for evaluating the appropriateness of a fix; they prove only that the Java code you wrote does what it superficially claims to do. To really deal with the problem, it would be much better to have a test case running against a cluster (in CI, this runs via Kind) which demonstrates the indefinite hang. That would make it possible to pinpoint which component is not behaving the way it is supposed to, and fix it. Note that there are already tests relating to pod deletion, and they pass.
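A cluster-backed test along those lines might look roughly like the sketch below. This is only an illustration of the shape of such a test, not actual plugin test code: the pipeline script, namespace, label selector, and log messages are all assumptions, and it presumes a fabric8 client pointed at the same (Kind) cluster the agent runs in.

```java
import hudson.model.Result;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition;
import org.jenkinsci.plugins.workflow.job.WorkflowJob;
import org.jenkinsci.plugins.workflow.job.WorkflowRun;
import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsRule;

public class PodDisconnectTest {

    @Rule public JenkinsRule j = new JenkinsRule();

    @Test public void buildEndsWhenPodGoesAway() throws Exception {
        WorkflowJob p = j.createProject(WorkflowJob.class, "p");
        // Park a build inside a pod agent (illustrative pipeline).
        p.setDefinition(new CpsFlowDefinition(
                "podTemplate { node(POD_LABEL) { sh 'sleep 600' } }", true));
        WorkflowRun b = p.scheduleBuild2(0).waitForStart();
        j.waitForMessage("sleep 600", b); // the step is now running in the pod

        // Simulate the node being reclaimed: delete the pod out from under the build.
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            client.pods().inNamespace("kubernetes-plugin-test") // hypothetical namespace
                    .withLabel("jenkins/label") // hypothetical pod label
                    .delete();
        }

        // Without a fix this is where the build hangs indefinitely; correct
        // behavior is for it to terminate (here: abort) in bounded time.
        j.assertBuildStatus(Result.ABORTED, j.waitForCompletion(b));
    }
}
```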
I mean, there's a screenshot in the post right now. I can share another 3 or 4 pods that are "stuck" on my Jenkins instance at this very moment; their builds were literally already cleaned by the log rotator, but they still count as "running".
I realized that, indeed, a temporary disconnect should not affect the pod, but in my experience the agent just terminates when that happens. Is there some way for us to detect this?
The pod was removed from Kubernetes about 5 days ago, a little after the pod started. In this case, the Node that was running it was reclaimed by AWS. As you can see, the build is still going, so self-abort is not working.
I agree. Unfortunately, I had a hard time finding any documentation or help on how to test this properly: I'd prefer to simply check that a "dead" agent doesn't have any running builds. I'd love some pointers to either some docs or examples of how to do this. For now, I'll close this PR until I can rethink the solution.
I believe you. 😁
Hmm, jenkinsci/workflow-durable-task-step-plugin#226 or jenkinsci/workflow-durable-task-step-plugin#259 or jenkinsci/workflow-durable-task-step-plugin#299 might deal with this. Make sure your other plugins are up to date.
If true, something is broken in …
@jglick Before I leave this for another time (unfortunately I only have so much time I can dedicate to this issue), what is Jenkins' behavior on …?

Either way, I'll see if I can round up some more information and create a proper issue to track this. In our case, "If the pod is deleted (as should normally happen when you drain a node)" might not necessarily be true: the node is never drained, as the underlying EC2 instance is forcefully terminated, but either way the API server should send an event about that.
https://github.com/jenkinsci/workflow-durable-task-step-plugin/blob/ee71f675dee64a2527c3d5ef14aa11d2961ffd30/src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepExecution.java#L328-L369 and note that https://github.com/jenkinsci/workflow-durable-task-step-plugin/blob/ee71f675dee64a2527c3d5ef14aa11d2961ffd30/src/main/java/org/jenkinsci/plugins/workflow/support/steps/AgentErrorCondition.java#L61 means that https://github.com/jenkinsci/kubernetes-plugin/#retrying-after-infrastructure-outages should permit the build to succeed if fresh K8s nodes become available.
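For reference, the opt-in retry that README section describes looks roughly like this in Scripted Pipeline (the pod configuration, retry count, and build step here are illustrative):

```groovy
podTemplate(/* ... pod configuration ... */) {
    // Retry the whole node block if the agent is lost mid-build,
    // e.g. because the underlying K8s node went away.
    retry(count: 2, conditions: [kubernetesAgent(), nonresumable()]) {
        node(POD_LABEL) {
            sh 'make test' // illustrative build step
        }
    }
}
```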
Pod agents are sometimes disconnected unexpectedly. The most common reason is their Node being shut down. Frequently, when this happens, builds become "stuck" waiting for the agent to reconnect. Since Pods can't really "reconnect", the builds stay there indefinitely.
This fix follows the logic used by the EC2 Fleet Plugin to abort any running builds at the moment the agent is disconnected.
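As a rough sketch of what that approach amounts to (this listener is illustrative, not the actual PR code; it assumes the plugin's real `KubernetesComputer` class plus core Jenkins listener APIs):

```java
import hudson.Extension;
import hudson.model.Computer;
import hudson.model.Executor;
import hudson.model.Result;
import hudson.slaves.ComputerListener;
import hudson.slaves.OfflineCause;
import jenkins.model.CauseOfInterruption;
import org.csanchez.jenkins.plugins.kubernetes.KubernetesComputer;

/** Illustrative sketch: abort builds still running on a pod agent that went offline. */
@Extension
public class PodDisconnectListener extends ComputerListener {

    @Override
    public void onOffline(Computer c, OfflineCause cause) {
        if (!(c instanceof KubernetesComputer)) {
            return; // only react to pod agents
        }
        for (Executor e : c.getExecutors()) {
            if (e.isBusy()) {
                // Abort rather than leave the build waiting on a pod that cannot reconnect.
                e.interrupt(Result.ABORTED, new CauseOfInterruption() {
                    @Override
                    public String getShortDescription() {
                        return "Pod agent was disconnected: " + cause;
                    }
                });
            }
        }
    }
}
```

(As the discussion above notes, the hard part is not this interruption logic itself but deciding when a disconnect is permanent rather than transient.)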
This particular PR would also benefit from #1348 being merged, as it can provide better information in the logs.
While there is a test implemented, I don't particularly like it. I tried to find examples of how to properly create a fake `Executor` and pod agent, and at the same time verify that the builds are indeed killed, but I failed at doing so, so instead I simply mocked it. However, this means that the test fails if any of the other `interrupt` methods are called. I welcome any suggestions on how to better test this.