
chore(pipeline) switch to podTemplate instead of label to use the new node pool #1969

Conversation

@dduportal (Contributor) commented Apr 17, 2023

This PR is a work in progress to switch the bom builds to a new pool of agent resources, as per jenkins-infra/helpdesk#3521

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or fix the issue

@dduportal dduportal changed the title chore(pipeline) switch to podTemplate instead of label to use the new… chore(pipeline) switch to podTemplate instead of label to use the new node pool Apr 17, 2023
@dduportal (Contributor, Author) commented:

The first attempt (build #2) was unsuccessful but showed the following:

  • Pod allocation:
    • It takes ~5 min for the initial scale-up phase, after prep.sh.
    • There are a lot of pod scheduling requests during this initial phase (which includes scheduling + autoscaling + pulling the base image, which is ~1.8 GB). I guess tuning the Jenkins pod startup timeout could avoid hammering the scheduler.
    • The build seems to peak at 280 running agents.

[Screenshot: 2023-04-17 20:58:10]

  • CPU usage:
    • High usage during the first 30 min, then it was a complete mess (CPU throttling?)
    • We see pods going over the CPU limit => let's remove it

[Screenshot: 2023-04-17 20:58:33]

  • Memory Usage:
    • No pod used more than 7.5 GB (limit of 8 GB)
    • After the initial 30 min period, memory seems to stay at 4 GB allocated. Don't forget that memory usage includes /tmp in this case.

[Screenshot: 2023-04-17 20:58:50]

  • The flowGraphTable shows that a lot of steps, such as sh 'mvn -version', took minutes (2 to 4 minutes) across many stages. It feels like there is contention somewhere: CPU throttling? An issue with the ci.jenkins.io Kubernetes plugin? Something else?

@dduportal (Contributor, Author) commented Apr 17, 2023

Retrying (build #4) with the following changes:

  • podTemplate moved to a YAML file for clarity (a rough sketch of the idea is shown below)
  • Removed the CPU resource limit to ensure no throttling can happen
  • Grouped sh expressions when possible (only the obvious ones)
  • Removed the expensive deleteDir() step (which is not needed as agents are ephemeral)
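For illustration, a minimal sketch of what such a YAML-based pod template can look like with the Kubernetes plugin's podTemplate step. The image name is the one mentioned later in this thread, but the resource values and container structure are assumptions for this example, not the actual podTemplate.yaml from this PR:

// Sketch only: illustrative values, not the real podTemplate.yaml of this PR.
podTemplate(yaml: '''
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: jnlp
    image: jenkinsciinfra/jenkins-agent-ubuntu-22.04
    resources:
      requests:
        cpu: "4"
        memory: "8Gi"
      limits:
        # Memory limit kept as a safety net; no CPU limit, to avoid throttling.
        memory: "8Gi"
''') {
    node(POD_LABEL) {
        sh 'mvn -version'
    }
}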

@jetersen (Member) commented:

Could the base image be loaded into the AMI that is spun up, to save some build time? 🤔

@dduportal (Contributor, Author) commented:

Could the base image be loaded into the AMI that is spun up, to save some build time? 🤔

Technically yes. But it would not be sustainable (rebuilding the AMI on each Docker image rebuild, e.g. 3-7 times a week, then redeploying the node pool, which takes 4 to 20 min).
Not sure it would be worth it.

But it sounds like that's not really the core challenge (at least not yet): there is clearly contention on the Jenkins side: it takes minutes to execute the sh steps. The massive parallelization is clearly blocked at different levels.

  • The infra team has a first wave of proposals (ready to apply in the short term) to work on the controller setup (VM hardware including the system disk, removing the Job Configuration History plugin, using a proper network, cleaning up JCasC for the Azure setup)
  • But I wonder if building the megawar 280 times isn't creating more problems than it solves (don't get me wrong: @jglick's work was really helpful and proved that we either have to pay for compute or for bandwidth). Maybe implementing a "custom stash/unstash" to an S3 bucket specifically here + moving the megawar build back to prep.sh could be a nice thing to try.

@jetersen (Member) commented:

Could rebuild the AMI once a week or month. The Docker pull would then only download the bits that have changed since the last pull, so instead of 2 GB it could be a few MB.

@jetersen (Member) commented:

But I wonder if building the megawar 280 times isn't creating more problems than it solves (don't get me wrong: @jglick's work was really helpful and proved that we either have to pay for compute or for bandwidth). Maybe implementing a "custom stash/unstash" to an S3 bucket specifically here + moving the megawar build back to prep.sh could be a nice thing to try.

There is the possibility of using volume binding on the Kubernetes nodes to mount a volume for the megawar 🤔. There are some options at least:
https://kubernetes.io/blog/2021/09/13/read-write-once-pod-access-mode-alpha/
https://dev.to/otomato_io/mount-s3-objects-to-kubernetes-pods-12f5

@dduportal (Contributor, Author) commented:

Thanks @jetersen for the ideas and recommendations! Alas, we've already studied these elements:

Could rebuild the AMI once a week or month. The Docker pull would then only download the bits that have changed since the last pull, so instead of 2 GB it could be a few MB.

In a scenario with Docker images built from layers, for pure container runtime usage, that would be true. It's not the case for the Jenkins infra, as we generate the Docker image using Packer to ensure the same environment between VM and container agents (rationale: maintaining a multitude of tiny Docker images in CI environments, with all the possible combinations of tools, is an unsustainable task). It means that Packer generates a single-layered image.

=> the image pull is a minor optimization: at best we'll be able to avoid 5 min of waiting on 120+ minute jobs. If we have to solve this problem, a local Docker registry (e.g. on the same network) would clearly be more efficient.

There is the possibility of using volume binding on the Kubernetes nodes to mount a volume for the megawar 🤔. There are some options at least:
https://kubernetes.io/blog/2021/09/13/read-write-once-pod-access-mode-alpha/
https://dev.to/otomato_io/mount-s3-objects-to-kubernetes-pods-12f5

That is a good idea that could help!
Alas, our experience with Kubernetes, whether on AWS EKS, Azure AKS or DigitalOcean, shows that adding PVCs creates contention at the Kubernetes scheduler, even with S3 or Azure Files drivers.

Worth giving it a try though, if it does not require a heavy cluster change. Otherwise, aws s3 cp as a pipeline step (sketched below) could clearly be easy to implement (only constraint: a credential, which is easily solvable).
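For illustration, a rough sketch of that idea; the helper names, bucket and credentials ID are hypothetical, and the aws credentials binding assumes the AWS Credentials plugin is installed. As discussed further down in this thread, handing a credential to the pipeline like this has safety implications.

// Hypothetical sketch of a custom stash/unstash to S3; bucket and credentials ID are made up.
def s3Stash(String name, String includes) {
    withCredentials([aws(credentialsId: 'example-bom-stash-creds',
                         accessKeyVariable: 'AWS_ACCESS_KEY_ID',
                         secretKeyVariable: 'AWS_SECRET_ACCESS_KEY')]) {
        sh """
            tar -czf '${name}.tgz' ${includes}
            aws s3 cp '${name}.tgz' 's3://example-bom-stash/${env.BUILD_TAG}/${name}.tgz'
        """
    }
}

def s3Unstash(String name) {
    withCredentials([aws(credentialsId: 'example-bom-stash-creds',
                         accessKeyVariable: 'AWS_ACCESS_KEY_ID',
                         secretKeyVariable: 'AWS_SECRET_ACCESS_KEY')]) {
        sh """
            aws s3 cp 's3://example-bom-stash/${env.BUILD_TAG}/${name}.tgz' '${name}.tgz'
            tar -xzf '${name}.tgz'
        """
    }
}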

@jglick jglick added the chore Reduces future maintenance label Apr 19, 2023
@jglick (Member) commented Apr 19, 2023

implementing a "custom stash/unstash" to an S3 bucket

Why would we not simply revert #1955 and enable artifact-manager-s3?

Regarding base image discussions, that all sounds like overthinking it. If an image is published to an appropriate registry local to the cloud provider, its nodes ought to be able to pull that image in a reasonable amount of time, and cache it as well. I think your issue is that you are pulling from Docker Hub (jenkinsciinfra/jenkins-agent-ubuntu-22.04). You should rather pull from ECR if running on AWS, etc.

sh 'mvn -version' took minutes (2 to 4 minutes)

That does not sound right—mvn -version should not require downloading anything, so it should complete within a second or so. What specifically is going on at this time? Does the actual process in the agent container take that long, or does the controller just take a long time to launch it, or a long time to register that it has completed and process the output? Do pipeline and/or native thread dumps from the controller offer any information at this time?

Comment on podTemplate.yaml, lines +14 to +15:
- name: "ARTIFACT_CACHING_PROXY_PROVIDER"
value: "aws"
Member:
Does this get overridden?

Contributor (Author):
What do you mean?

Member:
IIUC

infra.withArtifactCachingProxy(env.ARTIFACT_CACHING_PROXY_PROVIDER != 'do') {
should set the same variable?

@lemeurherve (Member) commented Apr 20, 2023:

We're setting this ARTIFACT_CACHING_PROXY_PROVIDER variable value on every agent (except the Azure VM agents, as it's not possible to set an env var on them AFAIK) in our infrastructure-as-code definitions.
This variable is then used in the infra.withArtifactCachingProxy function to determine which ACP provider to use.
This variable is not set in the pipeline library.
Putting it in this podTemplate allows the correct selection of the AWS ACP provider; without it, it would default to the Azure one (cf. https://github.com/jenkins-infra/pipeline-library/blob/9e0f0a2f09a7c91b26824e706a8963598b685742/vars/infra.groovy#L101).
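Illustrative sketch only of the selection logic described above (not the actual pipeline-library code; see the linked infra.groovy for the real implementation):

// Sketch: the provider comes from the agent-level env var and falls back to Azure when unset.
String acpProvider = env.ARTIFACT_CACHING_PROXY_PROVIDER ?: 'azure'
echo "Using artifact caching proxy provider: ${acpProvider}"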

privileged: false
tty: false
volumeMounts:
- mountPath: "/home/jenkins/.m2/repository"
Member:
Should not be used, given

"MAVEN_ARGS=${env.MAVEN_ARGS != null ? MAVEN_ARGS : ''} -B -ntp -Dmaven.repo.local=${WORKSPACE_TMP}/m2repo"

Contributor (Author):
This is a copy-and-paste of the pod YAML from the existing config defined by the admins of ci.jenkins.io. I would like to stick to this reality as closely as possible.

@jglick (Member) commented Apr 19, 2023

credential, which is easily solvable

Directly granting access to an S3 bucket to prep.sh or pct.sh is a non-starter here, since anyone filing a PR could just edit those to do something else.

@dduportal (Contributor, Author) commented:

Why would we not simply revert #1955 and enable artifact-manager-s3?

@jglick if we enable artifact-manager-s3, can we state "only for bom"? Or is it controller-wide?
(and corollary: if we also enable the Azure Artifact Manager, what is the expected behavior?)

@dduportal (Contributor, Author) commented:

That does not sound right—mvn -version should not require downloading anything, so it should complete within a second or so.

That's also what we understand. The unexpected (slow) behavior is the same for any sh step (the curl to the ACP, for instance).

What specifically is going on at this time?

When build #2 ran, only 6 builds (on VMs only) happened on ci.jenkins.io during the whole build.

Does the actual process in the agent container take that long, or does the controller just take a long time to launch it, or a long time to register that it has completed and process the output?

  • Based on the build logs, the process on the agent is really fast to execute (mvn --version reports less than 1s most of the time, curl command reports ~2s).

  • I'm not sure if the whole pipeline step was slow, or if it is only the reporting. The flow graph table shows unexpectedly long timings, so I assume it's not only reporting but real contention.

Do pipeline and/or native thread dumps from the controller offer any information at this time?

Need to check again and export a thread dump + check whether we can see this on the main branch of the BOM builds.

@jglick (Member) commented Apr 19, 2023

if we enable artifact-manager-s3, can we state "only for bom"? Or is it controller-wide?

It is controller-wide. But are there other jobs using artifacts and/or stashes which you would want to use the built-in storage for?

@timja (Member) commented Apr 20, 2023

implementing a "custom stash/unstash" to an S3 bucket

There's also this plugin https://plugins.jenkins.io/s3/.
or just using shell steps.

@dduportal (Contributor, Author) commented Apr 20, 2023

I'm trying to summarize the possible options for the "stash/unstash" with their pros and cons. Let me know if it makes sense to you:

  • Use Azure Artifact Manager (Azure Blob Storage)

    • ✅ Close to the ci.jenkins.io controller
    • ✅ Outbound bandwidth costs less from blob storage than from a VM
    • ✅ Using Azure is less prone to sudden migration (due to an ended sponsorship) compared to AWS or DigitalOcean
    • No GC of stashed elements (see comment below)
    • ❌ BOM builds mostly run in AWS (and eventually in DO), so there is network latency AWS <-> Azure (but both are us-east-1)
    • ❌ The Azure Artifact Manager failed randomly after the first test, without enough debugging information, so there's a risk it could happen again
    • ❌ Unable to scope to only the bom
  • Use the S3 Artifact Manager with AWS

    • ✅ Perfect network performance level for bom builds (assuming they should stick to AWS)
    • ✅ Works out of the box with an AWS S3 bucket (of course)
    • ✅ Nice safety level (operations are checked by the controller, so no risk from malicious PRs and no credential in the pipeline)
    • ✅ Automatic GC of stashed artifacts (there is an option in its configuration UI)
    • 🚧 Costs to evaluate: it could impact the AWS bill, which we are trying to decrease, but if it helps drastically decrease the cost of bom builds then this cost can be compensated
    • ❌ Creates a dependency on AWS (which is only a sponsor that can stop at any moment)
    • ❌ Network latency for builds in Azure (ATH, jobs requiring Docker CE with a VM) and DO (plugins and a bit of BOM today)
    • ❌ Unable to scope to only the bom
  • Step doing the custom stash/unstash in the Jenkinsfile, using an AWS S3 bucket:

    • ✅ Scoped to only the bom
    • ✅ Perfect network performance level for bom builds (assuming they should stick to AWS)
    • 🚧 Costs to evaluate: it could impact the AWS bill, which we are trying to decrease, but if it helps drastically decrease the cost of bom builds then this cost can be compensated
    • ❌ No automatic GC of stashed artifacts
    • ❌ Safety level (needs a pipeline-level credential), but could rely on the "trusted" Jenkinsfile (only maintainers can edit it)
  • Step doing the custom stash/unstash in the jenkins-infra/pipeline-library:

    • ✅ Could be scoped to each job (allows selecting the proper cloud in most cases if we stick a build to a single cloud)
    • ✅ Perfect network performance level for bom builds (assuming they should stick to AWS)
    • ❌ No automatic GC of stashed artifacts
    • ❌ Safety level is better, but it still requires a credential accessed by the pipeline (top level)

Are there other solutions or things I missed?

@timja (Member) commented Apr 20, 2023

If it's done in the pipeline library you could also abstract away the cloud provider: depending on which provider the agent is running on, you could use that one, although it would assume that other agents in the build use the same provider...


The artifact managers do clean up stashes.
see https://github.com/jenkinsci/azure-artifact-manager-plugin/blob/3506da42a953e7631b4bbb272f43de097aa84c27/src/main/java/com/microsoft/jenkins/artifactmanager/AzureArtifactManager.java#L475

@jglick (Member) commented Apr 20, 2023

artifact-manager-s3 by default does not delete anything, by design: so you can give the controller lesser permissions. Automatic deletion of artifacts past a certain age can easily be configured in S3 itself, which would be more reliable anyway. (And stashes and artifacts are kept in separate top-level folders, so you can apply distinct policies.) FWIW

@jglick (Member) commented Apr 20, 2023

Safety level (needs a pipeline-level credential) but could rely on the "trusted" Jenkinsfile
Safety level is better but still requires a credential accessed by the pipeline (top level)

Since in either of these cases you are letting some sh step run on an agent with blob store credentials, you could not safely pass in bucket-scope credentials because the command could be intercepted or overridden with various tricks. You would have to do a bunch of IAM magic to pass in session credentials scoped to a per-PR folder, or something. This is the sort of concern that is taken care of for you by an artifact manager plugin.

@dduportal (Contributor, Author) commented:

artifact-manager-s3 by default does not delete anything, by design: so you can give the controller lesser permissions. Automatic deletion of artifacts past a certain age can easily be configured in S3 itself, which would be more reliable anyway. (And stashes and artifacts are kept in separate top-level folders, so you can apply distinct policies.) FWIW

I was only thinking about stash deletion driven by Jenkins (as per https://plugins.jenkins.io/artifact-manager-s3/#plugin-content-delete-stash).

But it's true that garbage collection could be done on the S3 side to avoid pressuring the Jenkins controller 👍

This is the sort of concern that is taken care of for you by an artifact manager plugin.

Absolutely, that is the main reason why I'm reluctant to use the "pipeline step" / "pipeline library" solution.

@jglick (Member) commented Apr 21, 2023

stash deletion driven by Jenkins

As noted there, this is deliberately not done by default, and must be enabled via system property.

@jglick jglick mentioned this pull request Apr 21, 2023
@dduportal dduportal force-pushed the chore/use-jenkins-infra-new-nodepool branch from 57116da to 23d6104 Compare April 22, 2023 12:11
@dduportal (Contributor, Author) commented:

The build ran in 2 hours and 09 minutes => still the same issue with "long steps".

The ci.jenkins.io machine seems to be "waiting", as do the agents; not sure where to look.

Got a thread dump when the phenomenon was observed:

Archive.zip

@jglick (Member) commented Apr 25, 2023

most of the shell steps are reported in Jenkins as way longer than the process-reported time in the build output

A few seconds of overhead is normal enough, but not this. It is difficult to pinpoint the issue from general thread dumps; need a thread dump of the controller and that agent taken at the moment that a sh step is taking too long to run.

From Thread dump [Jenkins].html I see nothing obviously amiss. There is

"org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep [#25]: checking /home/jenkins/agent/workspace/Tools_bom_PR-1969 on tools-bom-pr-1969-13-vdz4c-1g361-nmk5m / waiting for JNLP4-connect connection from ec2-3-18-53-88.us-east-2.compute.amazonaws.com/3.18.53.88:30580 id=244896" Id=620 Group=main TIMED_WAITING on hudson.remoting.UserRequest@5f1ba24e
	at [email protected]/java.lang.Object.wait(Native Method)
	-  waiting on hudson.remoting.UserRequest@5f1ba24e
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:999)
	at hudson.FilePath.act(FilePath.java:1192)
	at hudson.FilePath.act(FilePath.java:1181)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.writeLog(FileMonitoringTask.java:340)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:600)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:557)

so looking for output. There is also an agent thread dump from related code, busy doing remote class loading

"pool-1-thread-14 for JNLP4-connect connection to ci.jenkins.io/104.208.238.39:50000 id=244896 / waiting for JNLP4-connect connection to ci.jenkins.io/104.208.238.39:50000 id=281" Id=41 Group=main TIMED_WAITING on hudson.remoting.RemoteInvocationHandler$RPCRequest@bc1a0b
	at [email protected]/java.lang.Object.wait(Native Method)
	-  waiting on hudson.remoting.RemoteInvocationHandler$RPCRequest@bc1a0b
	at app//hudson.remoting.Request.call(Request.java:177)
	at app//hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:288)
	at app//com.sun.proxy.$Proxy7.fetch3(Unknown Source)
	at app//hudson.remoting.RemoteClassLoader.prefetchClassReference(RemoteClassLoader.java:354)
	at app//hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:259)
	at app//hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:229)
	at [email protected]/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
	-  locked hudson.remoting.RemoteClassLoader@514a1d94
	at [email protected]/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
	at com.google.common.io.Closer.<clinit>(Closer.java:99)
	at com.google.common.io.CharSource.readFirstLine(CharSource.java:314)
	at com.google.common.io.Files.readFirstLine(Files.java:535)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController$StatusCheck.invoke(FileMonitoringTask.java:390)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController$StatusCheck.invoke(FileMonitoringTask.java:384)
	at hudson.FilePath$FileCallableWrapper.call(FilePath.java:3578)
	at app//hudson.remoting.UserRequest.perform(UserRequest.java:211)
	at app//hudson.remoting.UserRequest.perform(UserRequest.java:54)
	at app//hudson.remoting.Request$2.run(Request.java:377)

You can try running Jenkins with -Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.USE_WATCHING which is currently off by default (except when pipeline-cloudwatch-logs and similar JEP-210 plugins are enabled: JENKINS-52165) but which may scale better. OTOH I see plenty of DurableTaskStep threads idle on the controller, so the problem is not one of thread exhaustion in this pool; the agent connection just may be slow for some reason.

There are a lot of PingThreads. At some point I need to try to fix https://issues.jenkins.io/browse/JENKINS-20217 (no reason to have threads in a pool sleeping).

The immediate problem might be

"Running CpsFlowExecution[Owner[Tools/bom/PR-1969/13:Tools/bom/PR-1969 #13]] / waiting for JNLP4-connect connection from ec2-3-143-84-83.us-east-2.compute.amazonaws.com/3.143.84.83:65271 id=244875" Id=9408 Group=main TIMED_WAITING on hudson.remoting.UserRequest@32d167fe
	at [email protected]/java.lang.Object.wait(Native Method)
	-  waiting on hudson.remoting.UserRequest@32d167fe
	at hudson.remoting.Request.call(Request.java:177)
	at hudson.remoting.Channel.call(Channel.java:999)
	at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1124)
	at hudson.Launcher$ProcStarter.start(Launcher.java:509)
	at org.jenkinsci.plugins.durabletask.BourneShellScript.launchWithCookie(BourneShellScript.java:176)
	at org.jenkinsci.plugins.durabletask.FileMonitoringTask.launch(FileMonitoringTask.java:132)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:326)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:322)
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:196)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:124)

meaning the sh step is waiting just to launch the wrapper script which would then keep running in the background. Unfortunately the thread dump does not indicate the agent name and there seem to be a lot of agents running on this VM.

In 07-Thread dump [Jenkins].html I notice a bunch of JnlpSlaveRestarterInstaller$Install.install which jenkinsci/jenkins#7693 will remove in the next LTS.

"Running CpsFlowExecution[Owner[Tools/bom/PR-1969/9:Tools/bom/PR-1969 #9]]saving /var/jenkins_home/jobs/Tools/jobs/bom/branches/PR-1969/builds/9/program.dat" Id=13785 Group=main RUNNABLE

is busy writing out metadata. The more stuff is happening, especially in parallel, the more program there is to save, so these files can probably get big. A support bundle will include a pipeline timings file that indicates how much controller CPU time is spent on various sorts of build overhead including this.

@dduportal (Contributor, Author) commented:

Thanks for the details and explanation!

Still trying to digest it but

A support bundle will include a pipeline timings file that indicates how much controller CPU time is spent on various sorts of build overhead including this.

caught my attention, as we have support bundles so I can share one

@dduportal (Contributor, Author) commented:

Just a note: work on the bom is delayed due to jenkins-infra/helpdesk#3542

@dduportal (Contributor, Author) commented:

Now that jenkins-infra/helpdesk#3542 is fixed, I tried a build to verify the new agent cluster.

It works (from an infrastructure point of view), but the build failed due to a degraded agent pool, with error messages related to spot instances (gotta check later).

@jglick (Member) commented Jun 9, 2023

@dduportal are you still expecting to do something here? Not harming anything leaving it open if so, just wondering if this was forgotten.

@dduportal (Contributor, Author) commented:

@dduportal are you still expecting to do something here? Not harming anything leaving it open if so, just wondering if this was forgotten.

I did not forget, but I was waiting for the migration of ci.jenkins.io's VM to new hardware and network, and also for jenkins-infra/helpdesk#3573, in order to have full observability (including traces).

The last test with these new elements still shows the dreaded sh steps taking 2 to 11 minutes to run a curl command.

@jglick (Member) commented Jul 12, 2023

Interesting that you should mention this now, as I have been following some performance testing inside @cloudbees suggesting that inexplicably slow sh steps can be a side effect of excessive program.dat writes in heavily loaded controllers, as hinted at in #1969 (comment). Enabling push mode (-Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.USE_WATCHING=true pending jenkinsci/workflow-durable-task-step-plugin#323) helps considerably; so does changing the durability mode of builds, though this could also increase the chance of builds being left in an anomalous state after controller crashes (should not matter for graceful restarts).

@jglick (Member) commented Jul 12, 2023

@dduportal does https://ci.jenkins.io/job/Tools/configure have an option for Pipeline branch speed/durability override? If so, please set that to “performance optimized”. Like this:

[Screenshot: durabilityHint setting]

(It is possible to set this from Jenkinsfile using the properties step but that does not apply to build 1 of a branch, since the setting is considered when a build starts before the properties step is even run, making it not very helpful for PRs.)
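For reference, a minimal sketch of that Jenkinsfile approach (assuming the durabilityHint job property symbol is available; as noted above, it only takes effect from the second build of a branch):

// Sketch: declare the durability hint from the Jenkinsfile itself.
properties([
    durabilityHint('PERFORMANCE_OPTIMIZED')
])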

@MarkEWaite (Contributor) commented:

@dduportal does https://ci.jenkins.io/job/Tools/configure have an option for Pipeline branch speed/durability override? If so, please set that to “performance optimized”

I've found that option and enabled it. Thanks for the recommendation!

@MarkEWaite (Contributor) commented Jul 12, 2023

I've "borrowed" a test job that @jtnord had defined and am running it with the performance optimized setting on the job configuration. See https://ci.jenkins.io/job/Tools/job/bom/view/change-requests/job/PR-2257/3/ in a few hours to see the results.

I'm borrowing:

@jglick (Member) commented Jul 12, 2023

Hmm, seems to be not working at all: https://ci.jenkins.io/job/Tools/job/bom/job/PR-2257/3/threadDump/ has just created parallel branches but is not running anything inside of them.

@dduportal (Contributor, Author) commented:

Interesting that you should mention this now, as I have been following some performance testing inside @cloudbees suggesting that inexplicably slow sh steps can be a side effect of excessive program.dat writes in heavily loaded controllers, as hinted at in #1969 (comment). Enabling push mode (-Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.USE_WATCHING=true pending jenkinsci/workflow-durable-task-step-plugin#323) helps considerably; so does changing the durability mode of builds, though this could also increase the chance of builds being left in an anomalous state after controller crashes (should not matter for graceful restarts).

Interesting: is it worth testing the incrementally built plugins from this PR to check the difference on ci.jenkins.io? Would that help?

@MarkEWaite (Contributor) commented:

Hmm, seems to be not working at all: https://ci.jenkins.io/job/Tools/job/bom/job/PR-2257/3/threadDump/ has just created parallel branches but is not running anything inside of them.

The #3 build completed in less than 2 hours when its preceding build took 3.5 hours. I think that means the performance optimized setting that is currently configured on the Tools folder of ci.jenkins.io is worth retaining. Even if it required a new run across a Jenkins controller restart, I think the performance improvement is worth it.

@jglick (Member) commented Jul 13, 2023

is it worth testing the incrementally built plugins from [PR-323] to check the difference on ci.jenkins.io?

Or simply setting the system property: besides flipping the default, PR-323 so far is only fixing a handful of robustness bugs affecting tests and corner cases. I think it would be worth a try. (It would have a bigger effect without PERFORMANCE_OPTIMIZED mode, but I think still some effect in conjunction.)

@jglick (Member) commented Sep 8, 2023

besides flipping the default, PR-323 so far is only fixing a handful of robustness bugs

Update: I decided not to flip the default in that PR, but I did make a number of fixes there and in associated PRs which should make -Dorg.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep.USE_WATCHING safer to use. It would make sense to try it on ci.jenkins.io given the large logs and heavy use of parallelism from certain projects, especially bom.

@dduportal (Contributor, Author) commented:

For info: jenkins-infra/helpdesk#3745 (comment)

This PR is still in draft (as the goal is to improve parallelization of agents for bom to decrease build time and billing)

@dduportal (Contributor, Author) commented:

Closing this PR as we're gonna try a new angle:

@dduportal dduportal closed this Nov 29, 2023
@dduportal dduportal deleted the chore/use-jenkins-infra-new-nodepool branch November 29, 2023 14:52