
Abort "running" builds when the pod agent is disconnected #1362

Closed

Conversation

Angelin01

Pod agents are sometimes disconnected unexpectedly, most commonly because their Node was shut down. Frequently, when this happens, builds become "stuck" waiting for the agent to reconnect. Since Pods can't really "reconnect", the builds stay there indefinitely:

[screenshot: builds stuck waiting on a disconnected pod agent]

This fix follows the logic used by the EC2 Fleet Plugin to abort any running builds at the moment the agent is disconnected.
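The pattern borrowed from the EC2 Fleet Plugin can be illustrated with a minimal, Jenkins-free sketch. All names here (`AgentSession`, `RunningTask`, `launch`, `onDisconnected`) are hypothetical stand-ins for the plugin's real `Computer`/`Executor` API; the point is only the shape of the fix: when the disconnect callback fires, interrupt every executor still busy on that agent instead of waiting for a reconnect that cannot happen.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only, not the plugin's actual classes.
public class AgentSession {
    static class RunningTask {
        final String name;
        boolean aborted = false;
        RunningTask(String name) { this.name = name; }
        void interrupt(String reason) { aborted = true; }
    }

    private final List<RunningTask> executors = new ArrayList<>();

    void launch(String build) { executors.add(new RunningTask(build)); }

    // Called from the "agent disconnected" callback: abort everything
    // still running, since a pod agent will never reconnect.
    int onDisconnected() {
        int aborted = 0;
        for (RunningTask t : executors) {
            if (!t.aborted) {
                t.interrupt("agent pod disconnected");
                aborted++;
            }
        }
        return aborted;
    }

    public static void main(String[] args) {
        AgentSession s = new AgentSession();
        s.launch("job #1");
        s.launch("job #2");
        System.out.println(s.onDisconnected()); // prints 2
    }
}
```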

This particular PR would also benefit from #1348 being merged, as it can provide better information in the logs.

While there is a test implemented, I don't particularly like it. I tried to find examples of how to properly create a fake Executor and pod agent while also verifying that the builds are indeed killed, but I failed at doing so, so instead I simply mocked it. However, this means the test fails if any of the other interrupt methods are called. I welcome any suggestions on how to test this better.

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@Angelin01 Angelin01 requested a review from a team as a code owner April 24, 2023 14:54
@Angelin01 Angelin01 marked this pull request as draft April 24, 2023 16:14
@jglick
Member

jglick commented Apr 24, 2023

builds become "stuck" waiting for the agent to reconnect. Since Pods can't really "reconnect", the builds stay there indefinitely

If true, this is a real bug, but this is not the right fix.

It is OK for an agent pod to temporarily disconnect and then reconnect later—the build should proceed without interruption.

If the pod is deleted (as should normally happen when you drain a node), then the API server should send an event to that effect, and there is already logic (Reaper) to then send an abort signal to the build.
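The existing Reaper behavior described above can be sketched, again with hypothetical stand-in names (`ReaperSketch`, `BuildRegistry`, `onPodEvent`) rather than the plugin's real watch machinery: subscribe to pod events from the API server and, on a `DELETED` event, abort the corresponding build, because a deleted pod can never come back.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of the Reaper's abort-on-deletion logic,
// not the kubernetes-plugin's actual implementation.
public class ReaperSketch {
    enum EventType { ADDED, MODIFIED, DELETED }

    static class BuildRegistry {
        final Map<String, Boolean> running = new HashMap<>();
        void start(String pod) { running.put(pod, true); }
        void abort(String pod) { running.put(pod, false); }
        boolean isRunning(String pod) { return running.getOrDefault(pod, false); }
    }

    final BuildRegistry registry = new BuildRegistry();

    // Watch callback: only a DELETED event triggers an abort; transient
    // MODIFIED events (e.g. a brief disconnect) leave the build alone.
    void onPodEvent(EventType type, String podName) {
        if (type == EventType.DELETED) {
            registry.abort(podName);
        }
    }

    public static void main(String[] args) {
        ReaperSketch r = new ReaperSketch();
        r.registry.start("agent-pod-1");
        r.onPodEvent(EventType.DELETED, "agent-pod-1");
        System.out.println(r.registry.isRunning("agent-pod-1")); // prints false
    }
}
```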

If the agent ceases to respond and never comes back, maybe because the node crashes and API server does not notice, then after about 5m a sh step should notice that it is not getting expected updates and also self-abort.
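The self-abort behavior described in that last case amounts to a heartbeat watchdog. The sketch below is an illustrative model under that assumption, not the durable-task plugin's implementation; the ~5 minute figure comes from the comment above.

```java
// Illustrative heartbeat watchdog: a step checks periodically whether
// it has gone too long without an update, and aborts itself if so.
public class HeartbeatWatchdog {
    private final long timeoutMillis;
    private long lastHeartbeat;

    HeartbeatWatchdog(long timeoutMillis, long now) {
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeat = now;
    }

    void onHeartbeat(long now) { lastHeartbeat = now; }

    // True once the silence exceeds the timeout (~5m in the scenario above).
    boolean shouldSelfAbort(long now) {
        return now - lastHeartbeat > timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatWatchdog w = new HeartbeatWatchdog(5 * 60_000L, 0);
        System.out.println(w.shouldSelfAbort(6 * 60_000L)); // prints true
    }
}
```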

Test cases using Mockito are basically useless for evaluating the appropriateness of a fix; it proves only that the Java code you wrote does what it superficially claims to do. To really deal with the problem, it would be much better to have a test case running against a cluster (in CI, this runs via Kind) which demonstrates the indefinite hang. That would make it possible to pinpoint which component is not behaving the way it is supposed to, and fix it. Note that there are already tests relating to pod deletion—which pass.

@Angelin01
Author

@jglick

builds become "stuck" waiting for the agent to reconnect. Since Pods can't really "reconnect", the builds stay there indefinitely

If true, this is a real bug, but this is not the right fix.

I mean, there's a screenshot on the post right now. I can share another 3 or 4 pods that are "stuck" on my Jenkins instance at this very moment; their builds were literally already cleaned by the log rotator, but they still count as "running".

It is OK for an agent pod to temporarily disconnect and then reconnect later—the build should proceed without interruption.

If the pod is deleted (as should normally happen when you drain a node), then the API server should send an event to that effect, and there is already logic (Reaper) to then send an abort signal to the build.

I realize that, indeed, a temporary disconnect should not affect the pod, but in my experience the agent just terminates when that happens. Is there some way for us to detect this?

If the agent ceases to respond and never comes back, maybe because the node crashes and API server does not notice, then after about 5m a sh step should notice that it is not getting expected updates and also self-abort.

The pod was removed from Kubernetes about 5 days ago, a little after the pod started. In this case, the Node that was running it was reclaimed by AWS. As you can see, the build is still going, so self-abort is not working.

Test cases using Mockito are basically useless for evaluating the appropriateness of a fix; it proves only that the Java code you wrote does what it superficially claims to do. To really deal with the problem, it would be much better to have a test case running against a cluster (in CI, this runs via Kind) which demonstrates the indefinite hang. That would make it possible to pinpoint which component is not behaving the way it is supposed to, and fix it. Note that there are already tests relating to pod deletion—which pass.

I agree; unfortunately I had a hard time finding any documentation or help on how to test this properly. I'd prefer to simply check that, given a "dead" agent, it doesn't have any running builds. I'd love some pointers to docs or examples of how to do this.

For now, I'll close this PR until I can rethink the solution.

@Angelin01 Angelin01 closed this Apr 24, 2023
@jglick
Member

jglick commented Apr 24, 2023

there's a screenshot on the post right now

I believe you. 😁

their builds were literally already cleaned by the log rotator, but they still count as "running"

Hmm, jenkinsci/workflow-durable-task-step-plugin#226 or jenkinsci/workflow-durable-task-step-plugin#259 or jenkinsci/workflow-durable-task-step-plugin#299 might deal with this. Make sure your other plugins are up to date.

the agent just terminates when [a temporary disconnect] happens

If true, something is broken in remoting. It should wait 10s then retry.
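The expected remoting behavior described here (wait, then retry, rather than terminate on the first failure) can be modeled with a small sketch. The retry structure is from the comment above; the `Connector` interface and bounded attempt count are illustrative assumptions, and real remoting waits 10 seconds (here shortened so the example runs quickly).

```java
// Illustrative reconnect loop, not Jenkins remoting itself.
public class ReconnectLoop {
    interface Connector { boolean connect(); }

    // Returns the attempt number that succeeded, or -1 if all attempts fail.
    static int reconnect(Connector c, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (c.connect()) return attempt;
            try {
                Thread.sleep(10); // real remoting waits ~10_000 ms between tries
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(reconnect(() -> true, 3)); // prints 1
    }
}
```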

@Angelin01
Author

@jglick
Our plugins/versions were updated fairly recently, but I'll have another look; that last PR might be on to something.

Before I leave this for another time (unfortunately I only have so much time I can dedicate to this issue), what is Jenkins' behavior on removeNode (called here: https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pod/retention/Reaper.java#L172)? Apparently the Reaper trusts that Jenkins will kill whatever job is running, but the ec2-fleet-plugin didn't seem to trust that.

Either way, I'll see if I can round up some more information and create a proper Issue to track this. In our case, "if the pod is deleted (as should normally happen when you drain a node)" might not necessarily hold: the node is never drained, since the underlying EC2 instance is forcefully terminated. But either way, the API server should still send an event about that.
