Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent StepExecutionIterator from leaking memory in cases where a single processed execution has a stuck CPS VM thread #347

Merged
merged 6 commits into from
Aug 9, 2024
Original file line number Diff line number Diff line change
Expand Up @@ -253,24 +253,26 @@

for (FlowExecution e : FlowExecutionList.get()) {
ListenableFuture<List<StepExecution>> execs = e.getCurrentExecutions(false);
all.add(execs);
Futures.addCallback(execs,new FutureCallback<List<StepExecution>>() {
@Override
public void onSuccess(@NonNull List<StepExecution> result) {
for (StepExecution e : result) {
try {
f.apply(e);
} catch (RuntimeException x) {
LOGGER.log(Level.WARNING, null, x);
}
// We transform the futures that return List<StepExecution> into futures that return Void before
// passing them to Futures.allAsList so that the combined future only holds strong references to each
// FlowExecution until its StepExecutions have been loaded and applied to the function. This prevents
// us from leaking references to all processed executions in the case where a single build has a stuck
// CpsVmExecutorService that prevents the future returned by getCurrentExecutions from completing.
ListenableFuture<Void> results = Futures.transform(execs, (List<StepExecution> result) -> {
for (StepExecution se : result) {
try {
f.apply(se);
} catch (RuntimeException x) {
LOGGER.log(Level.WARNING, null, x);
}
}

@Override
public void onFailure(@NonNull Throwable t) {
LOGGER.log(Level.WARNING, null, t);
}
return null;
}, MoreExecutors.directExecutor());
ListenableFuture<Void> resultsWithWarningsLogged = Futures.catching(results, Throwable.class, t -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes. Project Loom would be very welcome in code like this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking more carefully through Guava's Javadoc I think I can use FluentFuture to simplify this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually FluentFuture doesn't really make things any clearer, so I will leave it.

LOGGER.log(Level.WARNING, null, t);
return null;

Check warning on line 273 in src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java

View check run for this annotation

ci.jenkins.io / Code Coverage

Not covered lines

Lines 272-273 are not covered by tests
}, MoreExecutors.directExecutor());
all.add(resultsWithWarningsLogged);
}

return Futures.allAsList(all);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -38,12 +38,14 @@
import hudson.model.TaskListener;
import hudson.model.queue.QueueTaskFuture;
import java.io.Serializable;
import java.lang.ref.WeakReference;
import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import java.util.Set;
import java.util.function.Supplier;
import java.util.logging.Level;
import jenkins.model.Jenkins;
import org.hamcrest.Matcher;
import org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition;
import org.jenkinsci.plugins.workflow.job.WorkflowJob;
Expand All @@ -60,6 +62,7 @@
import org.jvnet.hudson.test.Issue;
import org.jvnet.hudson.test.LoggerRule;
import org.jvnet.hudson.test.JenkinsSessionRule;
import org.jvnet.hudson.test.MemoryAssert;
import org.jvnet.hudson.test.TestExtension;
import org.kohsuke.stapler.DataBoundConstructor;

Expand Down Expand Up @@ -160,6 +163,32 @@
});
}

@Test public void stepExecutionIteratorDoesNotLeakBuildsWhenOneIsStuck() throws Throwable {
sessions.then(r -> {
var notStuck = r.createProject(WorkflowJob.class, "not-stuck");
notStuck.setDefinition(new CpsFlowDefinition("semaphore('wait')", true));
var notStuckBuild = notStuck.scheduleBuild2(0).waitForStart();
SemaphoreStep.waitForStart("wait/1", notStuckBuild);
WeakReference<Object> notStuckBuildRef = new WeakReference<>(notStuckBuild);
// Create a Pipeline that runs a long-lived task on its CpsVmExecutorService, causing it to get stuck.
var stuck = r.createProject(WorkflowJob.class, "stuck");
stuck.setDefinition(new CpsFlowDefinition("echo 'test message'; Thread.sleep(Integer.MAX_VALUE)", false));
var stuckBuild = stuck.scheduleBuild2(0).waitForStart();
r.waitForMessage("test message", stuckBuild);
Thread.sleep(1000); // We need Thread.sleep to be running in the CpsVmExecutorService.
// Make FlowExecutionList$StepExecutionIteratorImpl.applyAll submit a task to the CpsVmExecutorService
// for stuck #1 that will never complete, so the resulting future will never complete.
StepExecution.applyAll(e -> null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this return value be kept in a local variable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessary for the leak, if that's what you mean. Even if the StepExecution.applyAll caller completely ignores the return value, the stuck CPS VM thread holds a strong reference to the aggregate future because the SingleLaneExecutorService's task queue has a reference to the SettableFuture for the getCurrentExecutions call, and then that future references the aggregate future via a listener just due to the implementation of Futures.allAsList.

// Let notStuckBuild complete and check that it can be GC'd.
SemaphoreStep.success("wait/1", null);
r.waitForCompletion(notStuckBuild);
notStuckBuild = null; // Clear out the local variable in this thread.
Jenkins.get().getQueue().clearLeftItems(); // We don't want to wait 5 minutes.
MemoryAssert.assertGC(notStuckBuildRef, true);
Copy link
Member Author

@dwnusbaum dwnusbaum Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert the fix and this will fail. The PR description shows the reference path preventing the build from being cleaned up.

// TODO: Test cleanup hangs for 1 minute in CpsFlowExecution.suspendAll because the checkpoint task can't run.

Check warning on line 188 in src/test/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionListTest.java

View check run for this annotation

ci.jenkins.io / Open Tasks Scanner

TODO

NORMAL: Test cleanup hangs for 1 minute in CpsFlowExecution.suspendAll because the checkpoint task can't run.
dwnusbaum marked this conversation as resolved.
Show resolved Hide resolved
});
}

public static class NonResumableStep extends Step implements Serializable {
public static final long serialVersionUID = 1L;
@DataBoundConstructor
Expand Down
Loading