
[ci.jenkins.io / BOM] Diagnose slowness when >200 parallel pipeline steps are running #3839

Closed · 5 tasks
dduportal opened this issue Nov 29, 2023 · 3 comments

@dduportal (Contributor)

Service(s)

ci.jenkins.io

Summary

Follow-up to jenkinsci/bom#1969 and #3521.

The BOM builds take far more time (and money) than they should, given the parallelization and the infrastructure capacity.
Many steps run far slower than expected when hundreds of parallel jobs are running on ci.jenkins.io.

This issue is about identifying the most probable causes by ruling out the Kubernetes plugin and/or the AWS EKS cloud used for agents.

The following tasks are expected:

  • PR to rule out (or point at) the Kubernetes plugin by trying a BOM build using Azure VMs. Also a nice stress test for [Sponsorships] Setup the secondary Azure subscription to consume the sponsor credits #3818
    • Extend the quota on the new sponsorship subscription to support 200 parallel VM agents
    • Check whether the existing VM templates fit the BOM build (same CPU/memory size as the containers currently used) or create a new one
    • Open a PR with the new labels and check the behavior of the build (see the pipeline sketch after this list)
  • Analyze results and report
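
A minimal sketch of what switching the agent label could look like in a Declarative Pipeline, assuming a hypothetical `azure-vm-highmem` label exposed by the Azure VM agents cloud; the real BOM Jenkinsfile and its actual label/pod-template names are not reproduced here:

```groovy
// Minimal sketch, not the actual BOM Jenkinsfile. The label names
// 'maven-bom-pod' and 'azure-vm-highmem' and the script 'run-pct.sh'
// are placeholders for this experiment.
pipeline {
    // Before: a Kubernetes pod template provisioned by the kubernetes plugin
    // agent { label 'maven-bom-pod' }

    // Experiment: an Azure VM agent sized like the current containers (CPU/memory)
    agent { label 'azure-vm-highmem' }

    stages {
        stage('plugin tests') {
            steps {
                // Same steps as before; only the agent provisioning changes, so any
                // difference in wall-clock time points at the cloud/plugin layer.
                sh './run-pct.sh'
            }
        }
    }
}
```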

Reproduction steps

No response

@dduportal (Contributor, Author)

Update: quota increase requested in the new subscription to allow for 1000 vCPUs to sustain this experiment

@jglick commented Nov 29, 2023

> rule out (or point at) the Kubernetes plugin by trying a BOM build using Azure VMs

Fine enough if that is straightforward to do, but the most direct diagnosis would be to capture a controller thread dump at the moment that a particular (sh?) step is running without the agent actually running the corresponding external process, or that a step has supposedly completed yet the next step in the pipeline has not yet started. Being able to reproduce the problem and look at a thread dump could potentially mean immediately zeroing in on some 🤦 mistake in some plugin that is easily solved once we see it.
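
A minimal sketch of one way to capture such a dump, assuming an administrator runs it from the controller's Script Console (running `jstack` against the controller JVM would give the same information):

```groovy
// Minimal sketch (assumption: pasted into the controller's Script Console by an admin)
// while a step appears stuck between "launched on the agent" and "observed by the controller".
Thread.getAllStackTraces().each { thread, stack ->
    println "\"${thread.name}\" (state: ${thread.state})"
    stack.each { frame -> println "    at ${frame}" }
    println ''
}
```

Comparing the threads touching the suspect step (e.g. anything from the durable-task/remoting layers) across a few dumps taken while the build is stalled should show where it is waiting.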

@dduportal removed the triage (Incoming issues that need review) label Dec 12, 2023
@dduportal removed this from the infra-team-sync-2023-12-12 milestone Dec 12, 2023
@dduportal self-assigned this Dec 12, 2023
@dduportal (Contributor, Author)

The past months showed that moving the agent Kubernetes cluster closer to the controller (from AWS to Azure) solved the problem. As such, I'm closing the issue as not reproducible.
