
Abort "running" builds when the pod agent is disconnected #1362

Closed

Conversation

Angelin01

Pod agents are sometimes disconnected unexpectedly, most commonly because their Node was shut down. Frequently, when this happens, builds become "stuck" waiting for the agent to reconnect. Since Pods can't really "reconnect", the builds stay there indefinitely:

[screenshot: builds stuck waiting on a disconnected pod agent]

This fix follows the logic used by the EC2 Fleet Plugin to abort any running builds at the moment the agent is disconnected.
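The pattern borrowed from the EC2 Fleet Plugin can be illustrated with a minimal, Jenkins-free sketch. All names here (`AgentSession`, `RunningTask`, `launch`, `onDisconnected`) are hypothetical stand-ins for the plugin's real `Computer`/`Executor` API; the point is only the shape of the fix: when the disconnect callback fires, interrupt every executor still busy on that agent instead of waiting for a reconnect that cannot happen.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only, not the plugin's actual classes.
public class AgentSession {
    static class RunningTask {
        final String name;
        boolean aborted = false;
        RunningTask(String name) { this.name = name; }
        void interrupt(String reason) { aborted = true; }
    }

    private final List<RunningTask> executors = new ArrayList<>();

    void launch(String build) { executors.add(new RunningTask(build)); }

    // Called from the "agent disconnected" callback: abort everything
    // still running, since a pod agent will never reconnect.
    int onDisconnected() {
        int aborted = 0;
        for (RunningTask t : executors) {
            if (!t.aborted) {
                t.interrupt("agent pod disconnected");
                aborted++;
            }
        }
        return aborted;
    }

    public static void main(String[] args) {
        AgentSession s = new AgentSession();
        s.launch("job #1");
        s.launch("job #2");
        System.out.println(s.onDisconnected()); // prints 2
    }
}
```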

This particular PR would also benefit from #1348 being merged, as it can provide better information in the logs.

While there is a test implemented, I don't particularly like it. I tried to find examples of how to properly create a fake Executor and pod agent while also verifying that the builds are indeed killed, but I failed at doing so, so instead I simply mocked it. However, this means the test fails if any of the other interrupt methods are called. I welcome any suggestions on how to test this better.

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@Angelin01 Angelin01 requested a review from a team as a code owner April 24, 2023 14:54
@Angelin01 Angelin01 marked this pull request as draft April 24, 2023 16:14
@jglick
Member

jglick commented Apr 24, 2023

builds become "stuck" waiting for the agent to reconnect. Since Pods can't really "reconnect", the builds stay there indefinitely

If true, this is a real bug, but this is not the right fix.

It is OK for an agent pod to temporarily disconnect and then reconnect later—the build should proceed without interruption.

If the pod is deleted (as should normally happen when you drain a node), then the API server should send an event to that effect, and there is already logic (Reaper) to then send an abort signal to the build.
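The existing Reaper behavior described above can be sketched, again with hypothetical stand-in names (`ReaperSketch`, `BuildRegistry`, `onPodEvent`) rather than the plugin's real watch machinery: subscribe to pod events from the API server and, on a `DELETED` event, abort the corresponding build, because a deleted pod can never come back.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of the Reaper's abort-on-deletion logic,
// not the kubernetes-plugin's actual implementation.
public class ReaperSketch {
    enum EventType { ADDED, MODIFIED, DELETED }

    static class BuildRegistry {
        final Map<String, Boolean> running = new HashMap<>();
        void start(String pod) { running.put(pod, true); }
        void abort(String pod) { running.put(pod, false); }
        boolean isRunning(String pod) { return running.getOrDefault(pod, false); }
    }

    final BuildRegistry registry = new BuildRegistry();

    // Watch callback: only a DELETED event triggers an abort; transient
    // MODIFIED events (e.g. a brief disconnect) leave the build alone.
    void onPodEvent(EventType type, String podName) {
        if (type == EventType.DELETED) {
            registry.abort(podName);
        }
    }

    public static void main(String[] args) {
        ReaperSketch r = new ReaperSketch();
        r.registry.start("agent-pod-1");
        r.onPodEvent(EventType.DELETED, "agent-pod-1");
        System.out.println(r.registry.isRunning("agent-pod-1")); // prints false
    }
}
```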

If the agent ceases to respond and never comes back, maybe because the node crashes and API server does not notice, then after about 5m a sh step should notice that it is not getting expected updates and also self-abort.
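The self-abort behavior described in that last case amounts to a heartbeat watchdog. The sketch below is an illustrative model under that assumption, not the durable-task plugin's implementation; the ~5 minute figure comes from the comment above.

```java
// Illustrative heartbeat watchdog: a step checks periodically whether
// it has gone too long without an update, and aborts itself if so.
public class HeartbeatWatchdog {
    private final long timeoutMillis;
    private long lastHeartbeat;

    HeartbeatWatchdog(long timeoutMillis, long now) {
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeat = now;
    }

    void onHeartbeat(long now) { lastHeartbeat = now; }

    // True once the silence exceeds the timeout (~5m in the scenario above).
    boolean shouldSelfAbort(long now) {
        return now - lastHeartbeat > timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatWatchdog w = new HeartbeatWatchdog(5 * 60_000L, 0);
        System.out.println(w.shouldSelfAbort(6 * 60_000L)); // prints true
    }
}
```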

Test cases using Mockito are basically useless for evaluating the appropriateness of a fix; it proves only that the Java code you wrote does what it superficially claims to do. To really deal with the problem, it would be much better to have a test case running against a cluster (in CI, this runs via Kind) which demonstrates the indefinite hang. That would make it possible to pinpoint which component is not behaving the way it is supposed to, and fix it. Note that there are already tests relating to pod deletion—which pass.

@Angelin01
Author

@jglick

builds become "stuck" waiting for the agent to reconnect. Since Pods can't really "reconnect", the builds stay there indefinitely

If true, this is a real bug, but this is not the right fix.

I mean, there's a screenshot on the post right now. I can share another 3 or 4 pods that are "stuck" on my Jenkins instance at this very moment; their builds were literally already cleaned by the log rotator, but they still count as "running".

It is OK for an agent pod to temporarily disconnect and then reconnect later—the build should proceed without interruption.

If the pod is deleted (as should normally happen when you drain a node), then the API server should send an event to that effect, and there is already logic (Reaper) to then send an abort signal to the build.

I realize that, indeed, a temporary disconnect should not affect the pod, but in my experience the agent just terminates when that happens. Is there some way for us to detect this?

If the agent ceases to respond and never comes back, maybe because the node crashes and API server does not notice, then after about 5m a sh step should notice that it is not getting expected updates and also self-abort.

The pod was removed from Kubernetes about 5 days ago, a little after the pod started. In this case, the Node that was running it was reclaimed by AWS. As you can see, the build is still going, so self-abort is not working.

Test cases using Mockito are basically useless for evaluating the appropriateness of a fix; it proves only that the Java code you wrote does what it superficially claims to do. To really deal with the problem, it would be much better to have a test case running against a cluster (in CI, this runs via Kind) which demonstrates the indefinite hang. That would make it possible to pinpoint which component is not behaving the way it is supposed to, and fix it. Note that there are already tests relating to pod deletion—which pass.

I agree; unfortunately I had a hard time finding any documentation or help on how to test this properly. I'd prefer to simply check that, given a "dead" agent, it doesn't have any running builds. I'd love some pointers to docs or examples of how to do this.

For now, I'll close this PR until I can rethink the solution.

@Angelin01 Angelin01 closed this Apr 24, 2023
@jglick
Member

jglick commented Apr 24, 2023

there's a screenshot on the post right now

I believe you. 😁

their builds were literally already cleaned by the log rotator, but they still count as "running"

Hmm, jenkinsci/workflow-durable-task-step-plugin#226 or jenkinsci/workflow-durable-task-step-plugin#259 or jenkinsci/workflow-durable-task-step-plugin#299 might deal with this. Make sure your other plugins are up to date.

the agent just terminates when [a temporary disconnect] happens

If true, something is broken in remoting. It should wait 10s then retry.
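The expected remoting behavior described here (wait, then retry, rather than terminate on the first failure) can be modeled with a small sketch. The retry structure is from the comment above; the `Connector` interface and bounded attempt count are illustrative assumptions, and real remoting waits 10 seconds (here shortened so the example runs quickly).

```java
// Illustrative reconnect loop, not Jenkins remoting itself.
public class ReconnectLoop {
    interface Connector { boolean connect(); }

    // Returns the attempt number that succeeded, or -1 if all attempts fail.
    static int reconnect(Connector c, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (c.connect()) return attempt;
            try {
                Thread.sleep(10); // real remoting waits ~10_000 ms between tries
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(reconnect(() -> true, 3)); // prints 1
    }
}
```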

@Angelin01
Author

@jglick
Our plugins/versions were updated fairly recently, but I'll have another look; that last PR might be on to something.

Before I leave this for another time (unfortunately I only have so much time I can dedicate to this issue), what is Jenkins' behavior on removeNode (called here: https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pod/retention/Reaper.java#L172)? Apparently the Reaper trusts that Jenkins will kill whatever job is running, but the ec2-fleet-plugin didn't seem to trust that.

Either way, I'll see if I can round up some more information and create a proper Issue to track this. In our case, "if the pod is deleted (as should normally happen when you drain a node)" might not necessarily hold: the node is never drained, since the underlying EC2 instance is forcefully terminated. But either way, the API server should still send an event about that.
