
chore(pipeline) switch to podTemplate instead of label to use the new node pool #1969

Conversation

@dduportal (Contributor) commented Apr 17, 2023

This PR is a work in progress to switch the bom builds to a new pool of agent resources, as per jenkins-infra/helpdesk#3521

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or fix the issue

@dduportal dduportal changed the title chore(pipeline) switch to podTemplate instead of label to use the new… chore(pipeline) switch to podTemplate instead of label to use the new node pool Apr 17, 2023
@dduportal (Contributor, Author) commented:

The first attempt (build #2) was unsuccessful but showed the following:

  • Pod allocation:
    • It takes ~5 min for the initial scale-up phase, after prep.sh.
    • There are a lot of pod scheduling requests during this initial phase (which includes scheduling + autoscaling + pulling the base image, which is ~1.8 GB). I guess tuning the Jenkins pod startup timeout could avoid hammering the scheduler.
    • The build seems to peak at 280 running agents.

[Screenshot: 2023-04-17 20:58:10]

  • CPU usage:
    • High usage during the first 30 min, then it was a complete mess (CPU throttling?)
    • We see pods going over the CPU limit => let's remove it

[Screenshot: 2023-04-17 20:58:33]

  • Memory Usage:
    • No pod used more than 7.5 GB (limit of 8 GB)
    • After the initial 30 min period, memory seems to stay at 4 GB allocated. Don't forget that memory usage includes /tmp in this case.

[Screenshot: 2023-04-17 20:58:50]

  • The flowGraphTable shows that a lot of steps, such as sh 'mvn -version', took minutes (2 to 4 minutes) across many stages. It feels like there is contention somewhere: CPU throttling? An issue with the ci.jenkins.io Kubernetes plugin? Something else?

@dduportal (Contributor, Author) commented Apr 17, 2023

Retrying (build #4) with the following changes:

  • podTemplate moved to a YAML file for clarity (a rough sketch of the idea is shown below)
  • Removed the CPU resource limit to ensure no throttling can happen
  • Grouped sh expressions when possible (only the obvious ones)
  • Removed the expensive deleteDir() step (which is not needed as agents are ephemeral)
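For illustration, a minimal sketch of what such a YAML-based pod template can look like with the Kubernetes plugin's podTemplate step. The image name is the one mentioned later in this thread, but the resource values and container structure are assumptions for this example, not the actual podTemplate.yaml from this PR:

// Sketch only: illustrative values, not the real podTemplate.yaml of this PR.
podTemplate(yaml: '''
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: jnlp
    image: jenkinsciinfra/jenkins-agent-ubuntu-22.04
    resources:
      requests:
        cpu: "4"
        memory: "8Gi"
      limits:
        # Memory limit kept as a safety net; no CPU limit, to avoid throttling.
        memory: "8Gi"
''') {
    node(POD_LABEL) {
        sh 'mvn -version'
    }
}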

@jetersen (Member) commented:

Could the base image be loaded into the AMI that is spun up, to save some build time? 🤔

@dduportal (Contributor, Author) commented:

Could the base image be loaded into the AMI that is spun up, to save some build time? 🤔

Technically yes. But it would not be sustainable (rebuilding the AMI on each Docker image rebuild, e.g. 3-7 times a week, then redeploying the node pool, which takes 4 to 20 min).
Not sure it would be worth it.

But it sounds like that's not really the core challenge (at least not yet): there is clearly contention on the Jenkins side: it takes minutes to execute the sh steps. The massive parallelization is clearly blocked at different levels.

  • The infra team has a first wave of proposals (ready to apply in the short term) to work on the controller setup (VM hardware including the system disk, removing the Job Configuration History plugin, using a proper network, cleaning up JCasC for the Azure setup)
  • But I wonder if building the megawar 280 times isn't creating more problems than it solves (don't get me wrong: @jglick's work was really helpful and proved that we either have to pay for compute or for bandwidth). Maybe implementing a "custom stash/unstash" to an S3 bucket specifically here + moving the megawar build back to prep.sh could be a nice thing to try.

@jetersen (Member) commented:

Could rebuild the AMI once a week or month. The Docker pull would then only download the bits that have changed since the last pull, so instead of 2 GB it could be a few MB.

@jetersen (Member) commented:

But I wonder if building the megawar 280 times isn't creating more problems than it solves (don't get me wrong: @jglick's work was really helpful and proved that we either have to pay for compute or for bandwidth). Maybe implementing a "custom stash/unstash" to an S3 bucket specifically here + moving the megawar build back to prep.sh could be a nice thing to try.

There is the possibility of using volume binding on the Kubernetes nodes to mount a volume for the megawar 🤔. There are some options at least:
https://kubernetes.io/blog/2021/09/13/read-write-once-pod-access-mode-alpha/
https://dev.to/otomato_io/mount-s3-objects-to-kubernetes-pods-12f5

@dduportal (Contributor, Author) commented:

Thanks @jetersen for the ideas and recommendations! Alas, we've already studied these elements:

Could rebuild the AMI once a week or month. The Docker pull would then only download the bits that have changed since the last pull, so instead of 2 GB it could be a few MB.

In a scenario with Docker images built from layers, for pure container runtime usage, that would be true. It's not the case for the Jenkins infra, as we generate the Docker image using Packer to ensure the same environment between VM and container agents (rationale: maintaining a multitude of tiny Docker images in CI environments, with all the possible combinations of tools, is an unsustainable task). It means that Packer generates a single-layered image.

=> the image pull is a minor optimization: at best we'll be able to avoid 5 min of waiting on 120+ minute jobs. If we have to solve this problem, a local Docker registry (e.g. on the same network) would clearly be more efficient.

There is the possibility of using volume binding on the Kubernetes nodes to mount a volume for the megawar 🤔. There are some options at least:
https://kubernetes.io/blog/2021/09/13/read-write-once-pod-access-mode-alpha/
https://dev.to/otomato_io/mount-s3-objects-to-kubernetes-pods-12f5

That is a good idea that could help!
Alas, our experience with Kubernetes, whether on AWS EKS, Azure AKS or DigitalOcean, shows that adding PVCs creates contention at the Kubernetes scheduler, even with S3 or Azure Files drivers.

Worth giving it a try though, if it does not require a heavy cluster change. Otherwise, aws s3 cp as a pipeline step (sketched below) could clearly be easy to implement (only constraint: a credential, which is easily solvable).
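For illustration, a rough sketch of that idea; the helper names, bucket and credentials ID are hypothetical, and the aws credentials binding assumes the AWS Credentials plugin is installed. As discussed further down in this thread, handing a credential to the pipeline like this has safety implications.

// Hypothetical sketch of a custom stash/unstash to S3; bucket and credentials ID are made up.
def s3Stash(String name, String includes) {
    withCredentials([aws(credentialsId: 'example-bom-stash-creds',
                         accessKeyVariable: 'AWS_ACCESS_KEY_ID',
                         secretKeyVariable: 'AWS_SECRET_ACCESS_KEY')]) {
        sh """
            tar -czf '${name}.tgz' ${includes}
            aws s3 cp '${name}.tgz' 's3://example-bom-stash/${env.BUILD_TAG}/${name}.tgz'
        """
    }
}

def s3Unstash(String name) {
    withCredentials([aws(credentialsId: 'example-bom-stash-creds',
                         accessKeyVariable: 'AWS_ACCESS_KEY_ID',
                         secretKeyVariable: 'AWS_SECRET_ACCESS_KEY')]) {
        sh """
            aws s3 cp 's3://example-bom-stash/${env.BUILD_TAG}/${name}.tgz' '${name}.tgz'
            tar -xzf '${name}.tgz'
        """
    }
}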

@jglick jglick added the chore Reduces future maintenance label Apr 19, 2023
@jglick (Member) commented Apr 19, 2023

implementing a "custom stash/unstash" to an S3 bucket

Why would we not simply revert #1955 and enable artifact-manager-s3?

Regarding base image discussions, that all sounds like overthinking it. If an image is published to an appropriate registry local to the cloud provider, its nodes ought to be able to pull that image in a reasonable amount of time, and cache it as well. I think your issue is that you are pulling from Docker Hub (jenkinsciinfra/jenkins-agent-ubuntu-22.04). You should rather pull from ECR if running on AWS, etc.

sh 'mvn -version' took minutes (2 to 4 minutes)

That does not sound right—mvn -version should not require downloading anything, so it should complete within a second or so. What specifically is going on at this time? Does the actual process in the agent container take that long, or does the controller just take a long time to launch it, or a long time to register that it has completed and process the output? Do pipeline and/or native thread dumps from the controller offer any information at this time?

Comment on podTemplate.yaml, lines +14 to +15:
- name: "ARTIFACT_CACHING_PROXY_PROVIDER"
value: "aws"
Member:
Does this get overridden?

Contributor (Author):
What do you mean?

Member:
IIUC

infra.withArtifactCachingProxy(env.ARTIFACT_CACHING_PROXY_PROVIDER != 'do') {
should set the same variable?

@lemeurherve (Member) commented Apr 20, 2023:

We're setting this ARTIFACT_CACHING_PROXY_PROVIDER variable value on every agent (except the Azure VM agents, as it's not possible to set an env var on them AFAIK) in our infrastructure-as-code definitions.
This variable is then used in the infra.withArtifactCachingProxy function to determine which ACP provider to use.
This variable is not set in the pipeline library.
Putting it in this podTemplate allows the correct selection of the AWS ACP provider; without it, it would default to the Azure one (cf. https://github.com/jenkins-infra/pipeline-library/blob/9e0f0a2f09a7c91b26824e706a8963598b685742/vars/infra.groovy#L101).
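Illustrative sketch only of the selection logic described above (not the actual pipeline-library code; see the linked infra.groovy for the real implementation):

// Sketch: the provider comes from the agent-level env var and falls back to Azure when unset.
String acpProvider = env.ARTIFACT_CACHING_PROXY_PROVIDER ?: 'azure'
echo "Using artifact caching proxy provider: ${acpProvider}"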

privileged: false
tty: false
volumeMounts:
- mountPath: "/home/jenkins/.m2/repository"
Member:
Should not be used, given

"MAVEN_ARGS=${env.MAVEN_ARGS != null ? MAVEN_ARGS : ''} -B -ntp -Dmaven.repo.local=${WORKSPACE_TMP}/m2repo"

Contributor (Author):
This is a copy-and-paste of the pod YAML from the existing config defined by the admins of ci.jenkins.io. I would like to stick to this reality as closely as possible.

@jglick (Member) commented Apr 19, 2023

credential, which is easily solvable

Directly granting access to an S3 bucket to prep.sh or pct.sh is a non-starter here, since anyone filing a PR could just edit those to do something else.

@dduportal (Contributor, Author) commented:

Why would we not simply revert #1955 and enable artifact-manager-s3?

@jglick if we enable artifact-manager-s3, can we state "only for bom"? Or is it controller-wide?
(and corollary: if we also enable the Azure Artifact Manager, what is the expected behavior?)

@dduportal (Contributor, Author) commented:

That does not sound right—mvn -version should not require downloading anything, so it should complete within a second or so.

That's also what we understand. The unexpected (slow) behavior is the same for any sh step (the curl to the ACP, for instance).

What specifically is going on at this time?

When build #2 ran, only 6 builds (on VMs only) happened on ci.jenkins.io during the whole build.

Does the actual process in the agent container take that long, or does the controller just take a long time to launch it, or a long time to register that it has completed and process the output?

  • Based on the build logs, the process on the agent is really fast to execute (mvn --version reports less than 1s most of the time, curl command reports ~2s).

  • I'm not sure if the whole pipeline step was slow, or if it is only the reporting. The flow graph table shows unexpectedly long timings, so I assume it's not only reporting but real contention.

Do pipeline and/or native thread dumps from the controller offer any information at this time?

Need to check again and export a thread dump + check whether we can see this on the main branch of the BOM builds.

@jglick (Member) commented Apr 19, 2023

if we enable artifact-manager-s3, can we state "only for bom"? Or is it controller-wide?

It is controller-wide. But are there other jobs using artifacts and/or stashes which you would want to use the built-in storage for?

@timja (Member) commented Apr 20, 2023

implementing a "custom stash/unstash" to an S3 bucket

There's also this plugin https://plugins.jenkins.io/s3/.
or just using shell steps.

@dduportal (Contributor, Author) commented Apr 20, 2023

I'm trying to summarize the possible options for the "stash/unstash" with their pros and cons. Let me know if it makes sense to you:

  • Use Azure Artifact Manager (Azure Blob Storage)

    • ✅ Close to the ci.jenkins.io controller
    • ✅ Outbound bandwidth costs less from blob storage than from a VM
    • ✅ Using Azure is less prone to sudden migration (due to an ended sponsorship) compared to AWS or DigitalOcean
    • No GC of stashed elements (see comment below)
    • ❌ BOM builds mostly run in AWS (and eventually in DO), so there is network latency AWS <-> Azure (but both are us-east-1)
    • ❌ The Azure Artifact Manager failed randomly after the first test, without enough debugging information, so there's a risk it could happen again
    • ❌ Unable to scope to only the bom
  • Use the S3 Artifact Manager with AWS

    • ✅ Perfect network performance level for bom builds (assuming they should stick to AWS)
    • ✅ Works out of the box with an AWS S3 bucket (of course)
    • ✅ Nice safety level (operations are checked by the controller, so no risk from malicious PRs and no credential in the pipeline)
    • ✅ Automatic GC of stashed artifacts (there is an option in its configuration UI)
    • 🚧 Costs to evaluate: it could impact the AWS bill, which we are trying to decrease, but if it helps drastically decrease the cost of bom builds then this cost can be compensated
    • ❌ Creates a dependency on AWS (which is only a sponsor that can stop at any moment)
    • ❌ Network latency for builds in Azure (ATH, jobs requiring Docker CE with a VM) and DO (plugins and a bit of BOM today)
    • ❌ Unable to scope to only the bom
  • Step doing the custom stash/unstash in the Jenkinsfile, using an AWS S3 bucket:

    • ✅ Scoped to only the bom
    • ✅ Perfect network performance level for bom builds (assuming they should stick to AWS)
    • 🚧 Costs to evaluate: it could impact the AWS bill, which we are trying to decrease, but if it helps drastically decrease the cost of bom builds then this cost can be compensated
    • ❌ No automatic GC of stashed artifacts
    • ❌ Safety level (needs a pipeline-level credential), but could rely on the "trusted" Jenkinsfile (only maintainers can edit it)
  • Step doing the custom stash/unstash in the jenkins-infra/pipeline-library:

    • ✅ Could be scoped to each job (allows selecting the proper cloud in most cases if we stick a build to a single cloud)
    • ✅ Perfect network performance level for bom builds (assuming they should stick to AWS)
    • ❌ No automatic GC of stashed artifacts
    • ❌ Safety level is better, but it still requires a credential accessed by the pipeline (top level)

Are there other solutions or things I missed?

@timja (Member) commented Apr 20, 2023

If it's done in the pipeline library you could also abstract away the cloud provider: depending on which provider the agent is running on, you could use that one, although it would assume that other agents in the build use the same provider...


The artifact managers do clean up stashes.
see https://github.com/jenkinsci/azure-artifact-manager-plugin/blob/3506da42a953e7631b4bbb272f43de097aa84c27/src/main/java/com/microsoft/jenkins/artifactmanager/AzureArtifactManager.java#L475

@jglick (Member) commented Apr 20, 2023

artifact-manager-s3 by default does not delete anything, by design: so you can give the controller lesser permissions. Automatic deletion of artifacts past a certain age can easily be configured in S3 itself, which would be more reliable anyway. (And stashes and artifacts are kept in separate top-level folders, so you can apply distinct policies.) FWIW

@jglick (Member) commented Apr 20, 2023

Safety level (needs a pipeline-level credential) but could rely on the "trusted" Jenkinsfile
Safety level is better but still requires a credential accessed by the pipeline (top level)

Since in either of these cases you are letting some sh step run on an agent with blob store credentials, you could not safely pass in bucket-scope credentials because the command could be intercepted or overridden with various tricks. You would have to do a bunch of IAM magic to pass in session credentials scoped to a per-PR folder, or something. This is the sort of concern that is taken care of for you by an artifact manager plugin.

@dduportal (Contributor, Author) commented:

artifact-manager-s3 by default does not delete anything, by design: so you can give the controller lesser permissions. Automatic deletion of artifacts past a certain age can easily be configured in S3 itself, which would be more reliable anyway. (And stashes and artifacts are kept in separate top-level folders, so you can apply distinct policies.) FWIW

I was only thinking about stash deletion driven by Jenkins (as per https://plugins.jenkins.io/artifact-manager-s3/#plugin-content-delete-stash).

But it's true that garbage collection could be done on the S3 side to avoid pressuring the Jenkins controller 👍

This is the sort of concern that is taken care of for you by an artifact manager plugin.

Absolutely, that is the main reason why I'm reluctant to use the "pipeline step" / "pipeline library" solution.

@jglick (Member) commented Apr 21, 2023

stash deletion driven by Jenkins

As noted there, this is deliberately not done by default, and must be enabled via system property.

@jglick jglick mentioned this pull request Apr 21, 2023
@dduportal dduportal force-pushed the chore/use-jenkins-infra-new-nodepool branch from 57116da to 23d6104 Compare April 22, 2023 12:11
@dduportal (Contributor, Author) commented:

The build ran in 2 hours and 09 minutes => still the same issue with "long steps".

The ci.jenkins.io machine seems to be "waiting", as do the agents; not sure where to look.

Got a thread dump when the phenomenon was observed:

Archive.zip

@jglick (Member) commented Apr 25, 2023

most of the shell steps are reported in Jenkins as way longer than the process-reported time in the build output

A few seconds of overhead is normal enough, but not this. It is difficult to pinpoint the issue from general thread dumps; need a thread dump of the controller and that agent taken at the moment that a sh step is taking too long to run.

From Thread dump [Jenkins].html I see nothing obviously amiss. There is

"org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#25]: checking /home/jenkins/agent/workspace/Tools_bom_PR-1969 on tools-bom-pr-1969-13-vdz4c-1g361-nmk5m / waiting for JNLP4-connect connection from ec2-3-18-53-88.us-east-2.compute.amazonaws.com/3.18.53.88:30580 id=244896" Id=620 Group=main TIMED_WAITING on hudson.remoting.UserRequest@5f1ba24e
	at [email protected]/java.lang.Object.wait(Native Method)
	-  waiting on hudson.remoting.UserRequest@5f1ba24e
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:999)
	at hudson.FilePath.act(FilePath.java:1192)
	at hudson.FilePath.act(FilePath.java:1181)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:340)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:600)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:557)

so looking for output. There is also an agent thread dump from related code, busy doing remote class loading

"pool-1-thread-14 for JNLP4-connect connection to ci.jenkins.io/104.208.238.39:50000 id=244896 / waiting for JNLP4-connect connection to ci.jenkins.io/104.208.238.39:50000 id=281" Id=41 Group=main TIMED_WAITING on hudson.remoting.RemoteInvocationHandler$RPCRequest@bc1a0b
	at [email protected]/java.lang.Object.wait(Native Method)
	-  waiting on hudson.remoting.RemoteInvocationHandler$RPCRequest@bc1a0b
	at app//hudson.remoting.Request.call(Request.java:177)
	at app//hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:288)
	at app//com.sun.proxy.$Proxy7.fetch3(Unknown Source)
	at app//hudson.remoting.RemoteClassLoader.prefetchClassReference(RemoteClassLoader.java:354)
	at app//hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:259)
	at app//hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:229)
	at [email protected]/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
	-  locked hudson.remoting.RemoteClassLoader@514a1d94
	at [email protected]/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
	at com.google.common.io.Closer.<clinit>(Closer.java:99)
	at com.google.common.io.CharSource.readFirstLine(CharSource.java:314)
	at com.google.common.io.Files.readFirstLine(Files.java:535)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController$StatusCheck.invoke(FileMonitoringTask.java:390)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController$StatusCheck.invoke(FileMonitoringTask.java:384)
	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3578)
	at app//hudson.remoting.UserRequest.perform(UserRequest.java:211)
	at app//hudson.remoting.UserRequest.perform(UserRequest.java:54)
	at app//hudson.remoting.Request$2.run(Request.java:377)

You can try running Jenkins with -Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.USE_WATCHING which is currently off by default (except when pipeline-cloudwatch-logs and similar JEP-210 plugins are enabled: JENKINS-52165) but which may scale better. OTOH I see plenty of DurableTaskStep threads idle on the controller, so the problem is not one of thread exhaustion in this pool; the agent connection just may be slow for some reason.

There are a lot of PingThreads. At some point I need to try to fix https://issues.jenkins.io/browse/JENKINS-20217 (no reason to have threads in a pool sleeping).

The immediate problem might be

"Running CpsFlowExecution[Owner[Tools/bom/PR-1969/13:Tools/bom/PR-1969 #13]] / waiting for JNLP4-connect connection from ec2-3-143-84-83.us-east-2.compute.amazonaws.com/3.143.84.83:65271 id=244875" Id=9408 Group=main TIMED_WAITING on hudson.remoting.UserRequest@32d167fe
	at [email protected]/java.lang.Object.wait(Native Method)
	-  waiting on hudson.remoting.UserRequest@32d167fe
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:999)
	at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1124)
	at hudson.Launcher$ProcStarter.start(Launcher.java:509)
	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:176)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:132)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:326)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:322)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:196)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:124)

meaning the sh step is waiting just to launch the wrapper script which would then keep running in the background. Unfortunately the thread dump does not indicate the agent name and there seem to be a lot of agents running on this VM.

In 07-Thread dump [Jenkins].html I notice a bunch of JnlpSlaveRestarterInstaller$Install.install which jenkinsci/jenkins#7693 will remove in the next LTS.

"Running CpsFlowExecution[Owner[Tools/bom/PR-1969/9:Tools/bom/PR-1969 #9]]saving /var/jenkins_home/jobs/Tools/jobs/bom/branches/PR-1969/builds/9/program.dat" Id=13785 Group=main RUNNABLE

is busy writing out metadata. The more stuff is happening, especially in parallel, the more program there is to save, so these files can probably get big. A support bundle will include a pipeline timings file that indicates how much controller CPU time is spent on various sorts of build overhead including this.

@dduportal (Contributor, Author) commented:

Thanks for the details and explanation!

Still trying to digest it but

A support bundle will include a pipeline timings file that indicates how much controller CPU time is spent on various sorts of build overhead including this.

caught my attention, as we have support bundles so I can share one

@dduportal (Contributor, Author) commented:

Just a note: work on the bom is delayed due to jenkins-infra/helpdesk#3542

@dduportal (Contributor, Author) commented:

Now that jenkins-infra/helpdesk#3542 is fixed, I tried a build to verify the new agent cluster.

It works (from an infrastructure point of view), but the build failed due to a degraded agent pool, with error messages related to spot instances (gotta check later).

@jglick (Member) commented Jun 9, 2023

@dduportal are you still expecting to do something here? Not harming anything leaving it open if so, just wondering if this was forgotten.

@dduportal (Contributor, Author) commented:

@dduportal are you still expecting to do something here? Not harming anything leaving it open if so, just wondering if this was forgotten.

I did not forget, but I was waiting for the migration of ci.jenkins.io's VM to new hardware and network, and also for jenkins-infra/helpdesk#3573, in order to have full observability (including traces).

The last test with these new elements still shows the dreaded sh steps taking 2 to 11 minutes to run a curl command.

@jglick (Member) commented Jul 12, 2023

Interesting that you should mention this now, as I have been following some performance testing inside @cloudbees suggesting that inexplicably slow sh steps can be a side effect of excessive program.dat writes in heavily loaded controllers, as hinted at in #1969 (comment). Enabling push mode (-Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.USE_WATCHING=true pending jenkinsci/workflow-durable-task-step-plugin#323) helps considerably; so does changing the durability mode of builds, though this could also increase the chance of builds being left in an anomalous state after controller crashes (should not matter for graceful restarts).

@jglick (Member) commented Jul 12, 2023

@dduportal does https://ci.jenkins.io/job/Tools/configure have an option for Pipeline branch speed/durability override? If so, please set that to “performance optimized”. Like this:

[Screenshot: durabilityHint setting]

(It is possible to set this from Jenkinsfile using the properties step but that does not apply to build 1 of a branch, since the setting is considered when a build starts before the properties step is even run, making it not very helpful for PRs.)
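For reference, a minimal sketch of that Jenkinsfile approach (assuming the durabilityHint job property symbol is available; as noted above, it only takes effect from the second build of a branch):

// Sketch: declare the durability hint from the Jenkinsfile itself.
properties([
    durabilityHint('PERFORMANCE_OPTIMIZED')
])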

@MarkEWaite (Contributor) commented:

@dduportal does https://ci.jenkins.io/job/Tools/configure have an option for Pipeline branch speed/durability override? If so, please set that to “performance optimized”

I've found that option and enabled it. Thanks for the recommendation!

@MarkEWaite (Contributor) commented Jul 12, 2023

I've "borrowed" a test job that @jtnord had defined and am running it with the performance optimized setting on the job configuration. See https://ci.jenkins.io/job/Tools/job/bom/view/change-requests/job/PR-2257/3/ in a few hours to see the results.

I'm borrowing:

@jglick (Member) commented Jul 12, 2023

Hmm, seems to be not working at all: https://ci.jenkins.io/job/Tools/job/bom/job/PR-2257/3/threadDump/ has just created parallel branches but is not running anything inside of them.

@dduportal (Contributor, Author) commented:

Interesting that you should mention this now, as I have been following some performance testing inside @cloudbees suggesting that inexplicably slow sh steps can be a side effect of excessive program.dat writes in heavily loaded controllers, as hinted at in #1969 (comment). Enabling push mode (-Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.USE_WATCHING=true pending jenkinsci/workflow-durable-task-step-plugin#323) helps considerably; so does changing the durability mode of builds, though this could also increase the chance of builds being left in an anomalous state after controller crashes (should not matter for graceful restarts).

Interesting: is it worth testing the incrementally built plugins from this PR to check the difference on ci.jenkins.io? Would that help?

@MarkEWaite (Contributor) commented:

Hmm, seems to be not working at all: https://ci.jenkins.io/job/Tools/job/bom/job/PR-2257/3/threadDump/ has just created parallel branches but is not running anything inside of them.

The #3 build completed in less than 2 hours when its preceding build took 3.5 hours. I think that means the performance optimized setting that is currently configured on the Tools folder of ci.jenkins.io is worth retaining. Even if it required a new run across a Jenkins controller restart, I think the performance improvement is worth it.

@jglick (Member) commented Jul 13, 2023

is it worth testing the incrementally built plugins from [PR-323] to check the difference on ci.jenkins.io?

Or simply setting the system property: besides flipping the default, PR-323 so far is only fixing a handful of robustness bugs affecting tests and corner cases. I think it would be worth a try. (It would have a bigger effect without PERFORMANCE_OPTIMIZED mode, but I think still some effect in conjunction.)

@jglick (Member) commented Sep 8, 2023

besides flipping the default, PR-323 so far is only fixing a handful of robustness bugs

Update: I decided not to flip the default in that PR, but I did make a number of fixes there and in associated PRs which should make -Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.USE_WATCHING safer to use. It would make sense to try it on ci.jenkins.io given the large logs and heavy use of parallelism from certain projects, especially bom.

@dduportal (Contributor, Author) commented:

For info: jenkins-infra/helpdesk#3745 (comment)

This PR is still in draft (as the goal is to improve parallelization of agents for bom to decrease build time and billing)

@dduportal (Contributor, Author) commented:

Closing this PR as we're gonna try a new angle:

@dduportal dduportal closed this Nov 29, 2023
@dduportal dduportal deleted the chore/use-jenkins-infra-new-nodepool branch November 29, 2023 14:52