-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-49707] Agent missing after controller restart to fail resumption of node
step, not kill whole build
#180
Merged
Merged
Changes from 85 commits
Commits
Show all changes
92 commits
Select commit
Hold shift + click to select a range
fec8b95
Sketch of non-`Pickle`-based resumption of `ExecutorStepExecution`
jglick 5bb6378
Amending test from #34
jglick aeb67be
Fixing `Callback`
jglick aa00f8d
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 57e1ed0
Exploring where to detect a failure eligible for retry
jglick b43f084
Making `normalNodeDisappearance` pass under somewhat altered circumst…
jglick 8796113
First successful `node` block retry
jglick 802accd
`FilePathDynamicContext` must take precedence over `ExecutorStepDynam…
jglick 4f0878c
`ExecutorStepTest.unloadableExecutorPickle` as currently conceived no…
jglick bfac529
Fixing `ExecutorStepTest.nodeDisconnectMissingContextVariableExceptio…
jglick d03e043
Fixing `ExecutorStepTest.contextualizeFreshFilePathAfterAgentReconnec…
jglick 98fe3e8
Marking `ExecutorStepRetryEligibility` beta
jglick f149cdb
Javadoc imports
jglick f798bfa
SpotBugs: does not make sense to allow `ExecutorStepDynamicContext.no…
jglick f8bcbe6
Reasonable behavior of `ExecutorPickleTest.canceledQueueItem`
jglick 3a6bad4
User-visible logging about 5m timeout now that `ExecutorPickle.printW…
jglick 2c0f41b
`ExecutorPickleTest` → `ExecutorStepDynamicContextTest`
jglick 1cc20e6
Move `TIMEOUT_WAITING_FOR_NODE_MILLIS` from `ExecutorPickle` to `Exec…
jglick 65a5749
Some user-visible logging when `ExecutorStepRetryEligibility` is _not…
jglick 2d03a46
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick b65ff1c
Do not block on `FutureImpl`.
jglick 1edef32
A `sh` step could fail to find a workspace at the moment it starts.
jglick 6cd1cf2
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 2455149
Also handling `ClosedChannelException` as commonly thrown by `SimpleB…
jglick 64c123a
Also retry after `MissingContextVariableException` on `FilePath`
jglick a54afe2
Remove `super.onResume` left over in #46; see https://github.com/jenk…
jglick 8a49c9e
Documenting need for JENKINS-30383
jglick a7f21b5
https://github.com/jenkinsci/workflow-step-api-plugin/pull/77 allows …
jglick 0c554d3
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 5d83e63
https://github.com/jenkinsci/workflow-step-api-plugin/pull/77 released
jglick e138f6f
Expanding `MissingContextVariableException` type list after noticing …
jglick 8beb608
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 8800d55
Never log a message with only `PlaceholderTask.cookie`; also need `.r…
jglick 9c71521
Suppressing uninteresting log message
jglick b5880ca
Maybe need to call `StepContext.saveState`?
jglick b78373e
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick e1b4bb7
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 3f0e140
Deleting comment; for now it seems clearest to keep `ExecutorStepDyna…
jglick dbc3387
Resolve longstanding tech debt from #104 & #117 about `BodyExecution`.
jglick 3223668
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick a968adf
Sketch of updating `WorkspaceStepExecution`: https://github.com/jenki…
jglick 6a9854c
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 75b2375
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 79d4ff0
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 8b08398
Allowing `FilePathDynamicContext` to block waiting for an agent to re…
jglick 46821ed
Adding a comment about 8b08398
jglick db3d043
8b08398b19b3f5da7ab3915dc62ebe9871f95b84 seems to allow a968adfd12835…
jglick f4660d6
Refining 8b08398b19b3f5da7ab3915dc62ebe9871f95b84 to avoid tail call
jglick 242f829
`org.jenkinsci.plugins.workflow.support.concurrent.Futures` deprecated
jglick 8e25c8f
SpotBugs reminds me that `WorkspaceStepExecution.Callback` needs an S…
jglick 1c42f39
Adding `NodeTranslator`
jglick 5abe7a4
Beginning rewrite to `AgentErrorCondition`
jglick 39b0bd3
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 6b82cc9
Extracted `RetryExecutorStepTest` from `ExecutorStepTest`
jglick 56e8777
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 3035961
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 02fbe3a
Rely on blocking `ExecutorStepExecution.onResume`
jglick 7231de1
Merge branch 'stale-placeholdertask' into retry-JENKINS-49707
jglick c663479
Removing bogus `waitForMessage` (and unnecessary `interrupt`) from `u…
jglick 7cd4bdd
`PlaceholderTask` may not hold a reference to `ExecutorStepExecution`
jglick c134cc4
Equivalent of #184 for `ExecutorStepDynamicContext`
jglick 2a40b11
Assert https://github.com/jenkinsci/workflow-api-plugin/commit/b1778a…
jglick f46dd04
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick f76de40
Merge branch 'retry-JENKINS-49707-base' into retry-JENKINS-49707
jglick 974a71e
https://github.com/jenkinsci/workflow-cps-plugin/pull/534 & https://g…
jglick 67f2180
Merge branch 'retry-JENKINS-49707-base' into retry-JENKINS-49707
jglick 0a995a3
Windows tests https://github.com/jenkinsci/workflow-basic-steps-plugi…
jglick 9a63dd9
Avoiding `powershell` https://github.com/jenkinsci/workflow-basic-ste…
jglick 59ce004
Better usage of Hamcrest; got a flake https://github.com/jenkinsci/wo…
jglick 7b49dcb
In fact the flake was in a different test (`canceledQueueItem`)
jglick 0a5c299
Forgot to make `canceledQueueItem` not use `sh`, though it should not…
jglick b9618c5
Merge branch 'retry-JENKINS-49707-base' into retry-JENKINS-49707
jglick 05c77e0
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 48950fd
Test flake: https://github.com/jenkinsci/workflow-durable-task-step-p…
jglick b2bf524
Merge branch 'retry-JENKINS-49707-conditions' into retry-JENKINS-49707
jglick 3ebbf94
Use `AgentOfflineException` from `ExecutorStepDynamicContext` as well
jglick 4a8e223
Can now run all of `AgentErrorConditionTest`
jglick 2699896
Responding to `CancellationException` in `ExecutorStepDynamicContext`…
jglick 17476f7
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 9865461
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick a437c24
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 26175b3
Permit `canceledQueueItem` to pass on `RemovedNodeCause`
jglick 8d39d15
Normalizing indentation
jglick 2e95e3a
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick 9eaca4a
Merge branch 'master' into retry-JENKINS-49707
jglick efdb6ee
https://github.com/jenkinsci/workflow-step-api-plugin/pull/124 deprec…
jglick 97365d4
Refined `ExecutorStepDynamicContext` behavior in case of `quietingDow…
jglick edbd309
Merge branch 'master' of https://github.com/jenkinsci/workflow-durabl…
jglick e022b6b
Add an explicit `serialVersionUID` to `ExecutorStepExecution.Placehol…
jglick 4b48432
Optimizing `withExecution` to only consider steps in the current buil…
jglick 84036f7
Pick up https://github.com/jenkinsci/workflow-api-plugin/pull/256 htt…
jglick fcea63a
https://github.com/jenkinsci/workflow-api-plugin/pull/256 released
jglick File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
246 changes: 246 additions & 0 deletions
246
src/main/java/org/jenkinsci/plugins/workflow/support/steps/ExecutorStepDynamicContext.java
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,246 @@ | ||
/* | ||
* The MIT License | ||
* | ||
* Copyright 2021 CloudBees, Inc. | ||
* | ||
* Permission is hereby granted, free of charge, to any person obtaining a copy | ||
* of this software and associated documentation files (the "Software"), to deal | ||
* in the Software without restriction, including without limitation the rights | ||
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
* copies of the Software, and to permit persons to whom the Software is | ||
* furnished to do so, subject to the following conditions: | ||
* | ||
* The above copyright notice and this permission notice shall be included in | ||
* all copies or substantial portions of the Software. | ||
* | ||
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN | ||
* THE SOFTWARE. | ||
*/ | ||
|
||
package org.jenkinsci.plugins.workflow.support.steps; | ||
|
||
import edu.umd.cs.findbugs.annotations.NonNull; | ||
import edu.umd.cs.findbugs.annotations.Nullable; | ||
import hudson.Extension; | ||
import hudson.FilePath; | ||
import hudson.Util; | ||
import hudson.init.InitMilestone; | ||
import hudson.model.Computer; | ||
import hudson.model.Executor; | ||
import hudson.model.Node; | ||
import hudson.model.Queue; | ||
import hudson.model.Result; | ||
import hudson.model.TaskListener; | ||
import hudson.remoting.VirtualChannel; | ||
import hudson.slaves.OfflineCause; | ||
import hudson.slaves.WorkspaceList; | ||
import java.io.IOException; | ||
import java.io.Serializable; | ||
import java.util.concurrent.CancellationException; | ||
import java.util.concurrent.TimeUnit; | ||
import java.util.concurrent.TimeoutException; | ||
import java.util.logging.Level; | ||
import java.util.logging.Logger; | ||
import jenkins.model.Jenkins; | ||
import org.jenkinsci.plugins.workflow.FilePathUtils; | ||
import org.jenkinsci.plugins.workflow.steps.DynamicContext; | ||
import org.jenkinsci.plugins.workflow.steps.FlowInterruptedException; | ||
import org.jenkinsci.plugins.workflow.steps.StepContext; | ||
import org.jenkinsci.plugins.workflow.support.DefaultStepContext; | ||
import org.kohsuke.accmod.Restricted; | ||
import org.kohsuke.accmod.restrictions.NoExternalUse; | ||
|
||
/** | ||
* Persistent representation for context of {@link ExecutorStepExecution}. | ||
* Supersedes {@link FilePathDynamicContext} (never mind {@link org.jenkinsci.plugins.workflow.support.pickles.FilePathPickle}), | ||
* {@link org.jenkinsci.plugins.workflow.support.pickles.ExecutorPickle}, | ||
* {@link org.jenkinsci.plugins.workflow.support.pickles.ComputerPickle}, | ||
* and {@link org.jenkinsci.plugins.workflow.support.pickles.WorkspaceListLeasePickle}. | ||
*/ | ||
@Restricted(NoExternalUse.class) | ||
public final class ExecutorStepDynamicContext implements Serializable { | ||
|
||
private static final Logger LOGGER = Logger.getLogger(ExecutorStepDynamicContext.class.getName()); | ||
|
||
private static final long serialVersionUID = 1; | ||
|
||
final @NonNull ExecutorStepExecution.PlaceholderTask task; | ||
final @NonNull String node; | ||
private final @NonNull String path; | ||
/** Non-null after {@link #resume} if all goes well. */ | ||
private transient @Nullable Executor executor; | ||
/** Non-null after {@link #resume} if all goes well. */ | ||
private transient @Nullable WorkspaceList.Lease lease; | ||
|
||
ExecutorStepDynamicContext(ExecutorStepExecution.PlaceholderTask task, WorkspaceList.Lease lease, Executor executor) { | ||
this.task = task; | ||
this.node = FilePathUtils.getNodeName(lease.path); | ||
this.path = lease.path.getRemote(); | ||
this.executor = executor; | ||
this.lease = lease; | ||
} | ||
|
||
void resume(StepContext context) throws Exception { | ||
if (executor != null) { | ||
throw new IllegalStateException("Already resumed"); | ||
} | ||
while (Jenkins.get().getInitLevel() != InitMilestone.COMPLETED) { | ||
jglick marked this conversation as resolved.
Show resolved
Hide resolved
|
||
LOGGER.fine(() -> "waiting to schedule task for " + path + " on " + node + " until Jenkins completes startup"); | ||
Thread.sleep(100); | ||
} | ||
Queue.Item item = Queue.getInstance().schedule2(task, 0).getItem(); | ||
if (item == null) { | ||
// TODO should also report when !ScheduleResult.created, since that is arguably an error | ||
throw new IllegalStateException("queue refused " + task); | ||
} | ||
LOGGER.fine(() -> "scheduled " + item + " for " + path + " on " + node); | ||
TaskListener listener = context.get(TaskListener.class); | ||
if (!node.isEmpty()) { // unlikely to be any delay for built-in node anyway | ||
listener.getLogger().println("Waiting for reconnection of " + node + " before proceeding with build"); | ||
} | ||
Queue.Executable exec; | ||
try { | ||
exec = item.getFuture().getStartCondition().get(ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS, TimeUnit.MILLISECONDS); | ||
jglick marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} catch (TimeoutException x) { | ||
listener.getLogger().println(node + " has been removed for " + Util.getTimeSpanString(ExecutorStepExecution.TIMEOUT_WAITING_FOR_NODE_MILLIS) + ", assuming it is not coming back"); | ||
throw new FlowInterruptedException(Result.ABORTED, new ExecutorStepExecution.RemovedNodeCause()); | ||
} catch (CancellationException x) { | ||
LOGGER.log(Level.FINE, "ceased to wait for " + node, x); | ||
throw new FlowInterruptedException(Result.ABORTED, new ExecutorStepExecution.QueueTaskCancelled()); | ||
} | ||
executor = Executor.of(exec); | ||
if (executor == null) { | ||
// TODO this could happen as a race condition if the executable takes <1s to run; how could that be prevented? | ||
// Or can we schedule a placeholder Task whose Executable does nothing but return Executor.currentExecutor and then end? | ||
throw new IOException(exec + " was scheduled but no executor claimed it"); | ||
} | ||
Computer computer = executor.getOwner(); | ||
VirtualChannel channel = computer.getChannel(); | ||
if (channel == null) { | ||
throw new IOException(computer + " is offline"); | ||
} | ||
FilePath fp = new FilePath(channel, path); | ||
// Since there is no equivalent to Lock.tryLock for WorkspaceList (.record would work but throws AssertionError and swaps the holder): | ||
WorkspaceList.Lease _lease = computer.getWorkspaceList().allocate(fp); | ||
if (_lease.path.equals(fp)) { | ||
lease = _lease; | ||
} else { // @2 or other variant, not what we expected to be able to lock without contention | ||
_lease.release(); | ||
throw new IOException("JENKINS-37121: something already locked " + fp); | ||
} | ||
LOGGER.fine(() -> "fully restored for " + path + " on " + node); | ||
} | ||
|
||
private static abstract class Translator<T> extends DynamicContext.Typed<T> { | ||
|
||
@Override protected T get(DelegatedContext context) throws IOException, InterruptedException { | ||
ExecutorStepDynamicContext c = context.get(ExecutorStepDynamicContext.class); | ||
if (c == null || c.lease == null) { | ||
return null; | ||
} | ||
return get(c); | ||
} | ||
|
||
abstract T get(ExecutorStepDynamicContext c) throws IOException, InterruptedException; | ||
|
||
} | ||
|
||
@Extension public static final class FilePathTranslator extends Translator<FilePath> { | ||
|
||
@Override protected Class<FilePath> type() { | ||
return FilePath.class; | ||
} | ||
|
||
@Override FilePath get(ExecutorStepDynamicContext c) throws IOException { | ||
if (c.lease.path.toComputer() == null) { | ||
FilePath f = FilePathUtils.find(c.node, c.path); | ||
if (f != null) { | ||
LOGGER.fine(() -> c.node + " disconnected and reconnected; getting a new FilePath on " + c.path + " with the new Channel"); | ||
return f; | ||
} | ||
String message = "Unable to create live FilePath for " + c.node; | ||
Computer comp = Jenkins.get().getComputer(c.node); | ||
if (comp != null) { | ||
OfflineCause oc = comp.getOfflineCause(); | ||
if (oc != null) { | ||
message += "; " + comp.getDisplayName() + " was marked offline: " + oc; | ||
} | ||
} | ||
AgentOfflineException e = new AgentOfflineException(message); | ||
if (comp != null) { | ||
for (Computer.TerminationRequest tr : comp.getTerminatedBy()) { | ||
e.addSuppressed(tr); | ||
} | ||
} | ||
throw e; | ||
} | ||
return c.lease.path; | ||
} | ||
|
||
} | ||
|
||
@Extension public static final class WorkspaceListLeaseTranslator extends Translator<WorkspaceList.Lease> { | ||
|
||
@Override protected Class<WorkspaceList.Lease> type() { | ||
return WorkspaceList.Lease.class; | ||
} | ||
|
||
@Override WorkspaceList.Lease get(ExecutorStepDynamicContext c) { | ||
// Do not do a liveness check as in FilePathTranslator. | ||
// We could not do anything about a stale .path even if we found out about it. | ||
return c.lease; | ||
} | ||
|
||
} | ||
|
||
@Extension public static final class ExecutorTranslator extends Translator<Executor> { | ||
|
||
@Override protected Class<Executor> type() { | ||
return Executor.class; | ||
} | ||
|
||
@Override Executor get(ExecutorStepDynamicContext c) { | ||
return c.executor; | ||
} | ||
|
||
} | ||
|
||
@Extension public static final class ComputerTranslator extends Translator<Computer> { | ||
|
||
@Override protected Class<Computer> type() { | ||
return Computer.class; | ||
} | ||
|
||
@Override Computer get(ExecutorStepDynamicContext c) { | ||
return c.executor.getOwner(); | ||
} | ||
|
||
} | ||
|
||
/** | ||
* Need not use {@link Translator} since we can serve a {@link Node} even when offline. | ||
* Overrides default behavior in {@link DefaultStepContext} which would delegate to {@link ComputerTranslator}. | ||
*/ | ||
@Extension public static final class NodeTranslator extends DynamicContext.Typed<Node> { | ||
|
||
@Override protected Class<Node> type() { | ||
return Node.class; | ||
} | ||
|
||
@Override protected Node get(DelegatedContext context) throws IOException, InterruptedException { | ||
ExecutorStepDynamicContext c = context.get(ExecutorStepDynamicContext.class); | ||
if (c == null) { | ||
return null; | ||
} | ||
Jenkins j = Jenkins.get(); | ||
return c.node.isEmpty() ? j : j.getNode(c.node); | ||
} | ||
|
||
} | ||
|
||
} |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the factories for these pickles be deleted or replaced with an implementation that just throws an exception if it encounters an object of the specified type? Or do you want to keep them around in case some other step or wild Groovy code is relying on them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to retain them for compatibility. Weird Pipeline script is one possibility. They could also be used in other Pipeline steps; I know
PushdStep
bindsFilePathDynamicContext
, so that is not affected, but there may be others. Seems harmless enough to leave them here.