
[ci.jenkins.io / BOM] Diagnose slowness when >200 parallel pipeline steps are running #3839

Closed · 5 tasks
dduportal opened this issue Nov 29, 2023 · 3 comments

@dduportal (Contributor)

Service(s)

ci.jenkins.io

Summary

Follow-up to jenkinsci/bom#1969 and #3521.

The BOM builds take far more time (and money) than they should, given the parallelization and the infrastructure capacity.
Many steps run far slower than expected when hundreds of parallel jobs are running on ci.jenkins.io.

This issue is about identifying the most probable causes by ruling out the Kubernetes plugin and/or the AWS EKS cloud used for agents.

The following tasks are expected:

  • PR to rule out (or point at) the Kubernetes plugin by trying a BOM build using Azure VMs. Also a nice stress test for [Sponsorships] Setup the secondary Azure subscription to consume the sponsor credits #3818
    • Extend the quota on the new sponsorship subscription to support 200 parallel VM agents
    • Check whether the existing VM templates fit the BOM build (same CPU/memory size as the containers currently used) or create a new one
    • Open a PR with the new labels and check the behavior of the build (see the pipeline sketch after this list)
  • Analyze results and report
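
A minimal sketch of what switching the agent label could look like in a Declarative Pipeline, assuming a hypothetical `azure-vm-highmem` label exposed by the Azure VM agents cloud; the real BOM Jenkinsfile and its actual label/pod-template names are not reproduced here:

```groovy
// Minimal sketch, not the actual BOM Jenkinsfile. The label names
// 'maven-bom-pod' and 'azure-vm-highmem' and the script 'run-pct.sh'
// are placeholders for this experiment.
pipeline {
    // Before: a Kubernetes pod template provisioned by the kubernetes plugin
    // agent { label 'maven-bom-pod' }

    // Experiment: an Azure VM agent sized like the current containers (CPU/memory)
    agent { label 'azure-vm-highmem' }

    stages {
        stage('plugin tests') {
            steps {
                // Same steps as before; only the agent provisioning changes, so any
                // difference in wall-clock time points at the cloud/plugin layer.
                sh './run-pct.sh'
            }
        }
    }
}
```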

Reproduction steps

No response

@dduportal (Contributor, Author)

Update: quota increase requested in the new subscription to allow for 1000 vCPUs to sustain this experiment

@jglick commented Nov 29, 2023

> rule out (or point at) the Kubernetes plugin by trying a BOM build using Azure VMs

Fine enough if that is straightforward to do, but the most direct diagnosis would be to capture a controller thread dump at the moment that a particular (sh?) step is running without the agent actually running the corresponding external process, or that a step has supposedly completed yet the next step in the pipeline has not yet started. Being able to reproduce the problem and look at a thread dump could potentially mean immediately zeroing in on some 🤦 mistake in some plugin that is easily solved once we see it.
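
A minimal sketch of one way to capture such a dump, assuming an administrator runs it from the controller's Script Console (running `jstack` against the controller JVM would give the same information):

```groovy
// Minimal sketch (assumption: pasted into the controller's Script Console by an admin)
// while a step appears stuck between "launched on the agent" and "observed by the controller".
Thread.getAllStackTraces().each { thread, stack ->
    println "\"${thread.name}\" (state: ${thread.state})"
    stack.each { frame -> println "    at ${frame}" }
    println ''
}
```

Comparing the threads touching the suspect step (e.g. anything from the durable-task/remoting layers) across a few dumps taken while the build is stalled should show where it is waiting.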

@dduportal removed the triage (Incoming issues that need review) label Dec 12, 2023
@dduportal removed this from the infra-team-sync-2023-12-12 milestone Dec 12, 2023
@dduportal self-assigned this Dec 12, 2023
@dduportal (Contributor, Author)

The past months showed that moving the agent Kubernetes cluster closer to the controller (from AWS to Azure) solved the problem. As such, I'm closing the issue as not reproducible.
