[ci.jenkins.io] separate container agent resources between `bom` and other builds #3521

dduportal · 2023-04-17T11:07:52Z

Why

The bom builds are challenging for the infrastructure:

They are slow (ref. Bump gitlab-api from 5.1.0-84.v491924123a_f7 to 5.2.0-86.v1ed41a_9cf486 in /bom-weekly jenkinsci/bom#1962 (comment)) and slowing down contributors
They have an impact on the cloud costs:
- Compute time: Spring 2023: Decrease AWS costs #3502 and Digital Ocean: Credits are almost depleted #3487
- Bandwidth: [ci.jenkins.io] Azure billing shows huge cloud cost due to outbound bandwidth #3485
They are quite often unreliable and need to be retried: Bump gitlab-api from 5.1.0-84.v491924123a_f7 to 5.2.0-86.v1ed41a_9cf486 in /bom-weekly jenkinsci/bom#1962 (comment)

What

This issue tracks the work related to using one or more dedicated node pools designed to run only the bom builds to:

Avoid having contributors of projects other than the bom wait 1-2 hours for an agent allocation because there is a peak of bom builds
Design the resource allocation specifically for the bom build usage, to decrease the cost impact

First round is to focus only on AWS: no more bom builds on DigitalOcean:

DigitalOcean credits should be preserved (Digital Ocean: Credits are almost depleted #3487)
If AWS costs do not decrease, then we can remove the "normal" node pool and split bom to AWS and others on DigitalOcean
Simpler baseline for metrics and cost analysis: only 1 kind of builds

The sizing for the new node pool follows these principles:

A bom build is estimated to run:
- Between 200 and 300 parallel steps: we'll count ~300 elements in the build queue as baseline
- Each step takes ~ 10-11 minutes (used to be ~4-5 before Avoid large stashes jenkinsci/bom#1955 but outbound bandwidth costs way more)
- Each step is retried up to 3 times when the underlying agent is killed (due to resources: OOM, CPU limit, eviction, spot instance reclaim), which adds way more elements to the build queue
We do not pay for "agent minutes", but rather for "VM minutes": the more time each build takes waiting with the node pool fully scaled, the more we pay
- Massively parallelizing to drain the build queue is a priority to decrease both costs and pressure on ci.jenkins.io
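
As a rough illustration of the "VM minutes" reasoning above, here is a minimal Python sketch. It assumes every step takes the same time, ignores retries and node scale-up/scale-down delays, and uses a hypothetical `pods_per_node` value that is not taken from this issue. Only the ~300 queue items and ~11 minutes per step come from the estimates above.

```python
import math


def estimate_vm_minutes(queue_items, step_minutes, max_parallel_pods, pods_per_node):
    """Rough estimate for draining a bom build queue.

    Returns (wall_clock_minutes, nodes, vm_minutes).
    pods_per_node is a hypothetical value used only for illustration.
    """
    # Number of "waves" needed when only max_parallel_pods steps run at once.
    waves = math.ceil(queue_items / max_parallel_pods)
    # Wall-clock time, assuming every step in a wave takes step_minutes.
    wall_clock_minutes = waves * step_minutes
    # Nodes needed to host the parallel pods once the pool is fully scaled.
    nodes = math.ceil(min(queue_items, max_parallel_pods) / pods_per_node)
    # The bill follows the nodes kept up for the whole wall-clock duration.
    return wall_clock_minutes, nodes, wall_clock_minutes * nodes


# The whole queue handled in "one shot" versus a pool capped at 100 pods.
one_shot = estimate_vm_minutes(300, 11, 300, 3)
throttled = estimate_vm_minutes(300, 11, 100, 3)
print(one_shot, throttled)
```

In this simplified model the VM minutes are about the same in both cases, but the capped pool keeps the build (and the queue on ci.jenkins.io) busy three times longer. Any idle time spent with the pool fully scaled would come on top of that.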

Costs report: https://docs.google.com/spreadsheets/d/1_C0I0jE-X0e0vDcdKOFIWcnwpOqWC8RQ4YOCgXNnplY/edit#gid=292621391

How

First attempt:

A new node pool with taints (to ensure the current workload cannot be run accidentally) to split physically
- Keep spot instances for now to avoid a cost explosion, but ensure that the spot instance price allows up to 4 retries before reaching the on-demand (non-spot) cost (the pipeline retries at most 3 times, which stays below that)
- Decrease the "waste" of resources in the instances: more pods per node, right sizing (room for improvement in a second attempt)
- Increase the maximum number of parallel pods to ensure a given bom build can be treated in "one shot"
A new namespace jenkins-agents-bom in cik8s to split logically
Use the podTemplate() pipeline method in jenkinsci/bom to improve visibility
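
The spot versus on-demand constraint above can be checked with simple arithmetic. The sketch below is only an illustration: the hourly prices are made-up numbers, and it counts full attempts of the same duration (whether "4 retries" means 4 or 5 attempts in total is left to the caller through the `attempts` parameter).

```python
def max_attempts_within_on_demand(spot_hourly, on_demand_hourly):
    """Number of full spot attempts that cost no more than one on-demand attempt."""
    if spot_hourly <= 0 or on_demand_hourly <= 0:
        raise ValueError("prices must be positive")
    return int(on_demand_hourly // spot_hourly)


def spot_is_worth_it(spot_hourly, on_demand_hourly, attempts=4):
    """True when `attempts` runs on spot still cost no more than one on-demand run."""
    return max_attempts_within_on_demand(spot_hourly, on_demand_hourly) >= attempts


# Hypothetical prices, not taken from any AWS price list.
print(max_attempts_within_on_demand(0.25, 1.0))
print(spot_is_worth_it(0.25, 1.0))
print(spot_is_worth_it(0.5, 1.0))
```

With these made-up prices, a spot instance at a quarter of the on-demand price leaves room for 4 attempts, while one at half the price does not.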


dduportal · 2023-04-17T11:10:11Z

Node pool initial implementation: feat(eks) split and optimize node pools for ci.jenkins.io usage (bom vs. plugins) aws#388

dduportal · 2023-04-17T11:18:02Z

Adding the new namespace: feat(cik8s) add a new 'jenkins-agents-bom' namespace ready for handling 'bom' builds kubernetes-management#3846

dduportal · 2023-04-17T14:30:08Z

feat(jenkins-kubernetes-agents) add an SVC account token (no longer generated since Kubernetes 1.23) helm-charts#488 along the way (update Service Accounts behavior after Kubernetes 1.23 to have an explicit API token)
- Deployed in Bump jenkins-kubernetes-agents helm chart version to 0.5.0 kubernetes-management#3847
feat(ci.jenkins.io) add a new Kubernetes Cloud for bom builds jenkins-infra#2770 to add a new Kubernetes Cloud on ci.jenkins.io, tested with success to spawn agents on the new node pool

jglick · 2023-04-17T14:30:41Z

BTW one change I am thinking about is to skip building master by default (only when a release is planned), so that the builds would at least be limited to PRs.

dduportal · 2023-04-17T15:08:27Z

Ok, let's try the new node pool: jenkinsci/bom#1969

dduportal · 2023-04-17T15:26:20Z

Configuration fixup to ensure datadog agent runs on the new nodepool (toleration): fix(datadog in cik8s) ensure datadog agent runs on all node pools kubernetes-management#3850

dduportal · 2023-04-25T09:31:54Z

Following the "experiments" in jenkinsci/bom#1969, it seems that node pool sizing alone offers no "easy & obvious" way to decrease the AWS costs. So let's deliver the "split bom and plugin resources" using the same sizing as today and we'll continue to diagnose.

Reminder of the expected benefits:

Non-bom builds (plugins, packaging, etc.) using Linux container agents on ci.jenkins.io won't have to wait 1-2 hours when there are bom builds waiting.
Observability of the bom builds will be easier (separate namespace, separate node pool, etc.).

Proposed implementation:

Ensure there are 2 different namespaces in cik8s:
- jenkins-agents-bom for the "production" bom builds
- jenkins-agents-experiments for the "tests" with bigger node pool (and future experiments)
Keep using the cik8s-bom Kubernetes cloud defined in ci.jenkins.io to provide the "admin defined" pod templates
- Expecting 2 pod templates, one per namespace, each with its own "max pod limit"
- Defaults to JDK11 agent and JDK17 for dev tools. Specific Java version for dev tools will be defined in the bom pipeline
Open a new PR in the bom project to "only" split resources and rename the current chore(pipeline) switch to podTemplate instead of label to use the new node pool jenkinsci/bom#1969

dduportal · 2023-04-25T09:32:49Z

New node pool: feat(cik8s) add a new 3x node pool dedicated to the bom builds aws#396
New namespace: feat(cik8s) add a new agent namespace to support experiments kubernetes-management#3888
Update ci.jenkins.io agent configuration (feat(ci.jenkins.io) helpdesk-3521: add cik8s experiments cloud and set up properly cik8s-bom jenkins-infra#2805)
- Insert new SA token for experiment namespace
- Add experiment kube cloud
- Update limits
Update updatecli manifest for namespace quotas in kubernetes management - chore(updatecli) track cik8s-experiments pods quota kubernetes-management#3889
Validate automated updatecli PR updating pods quota in kubernetes management - Update quotas.pods within cik8s cluster to 150, cik8s cluster (bom node pool) to 150 and/or doks cluster to 48 kubernetes-management#3890
Update jenkins-infra/jenkins-infra to define agent template scheduling data (nodeselector and tolerations) - feat(ci.jenkins.io) setup cik8s-bom with a single agent template using JDK17 by default, with nodeSelector and tolerations jenkins-infra#2808
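
To illustrate why the agent template needs both a nodeSelector and tolerations, here is a minimal Python sketch of the Kubernetes scheduling rules involved (simplified: it only handles `NoSchedule`/`NoExecute` taints and exact-match node selectors). The label and taint names are hypothetical and are not the ones used in cik8s.

```python
def tolerates(toleration, taint):
    """Simplified version of the Kubernetes toleration/taint matching rule."""
    # An empty effect on the toleration matches any taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # "Exists" with no key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))


def can_schedule(pod, node):
    """True when the node labels satisfy the pod's nodeSelector and the pod
    tolerates every hard (NoSchedule/NoExecute) taint of the node."""
    selector_ok = all(node["labels"].get(key) == value
                      for key, value in pod.get("nodeSelector", {}).items())
    taints_ok = all(any(tolerates(tol, taint) for tol in pod.get("tolerations", []))
                    for taint in node.get("taints", [])
                    if taint["effect"] in ("NoSchedule", "NoExecute"))
    return selector_ok and taints_ok


# Hypothetical label and taint names, for illustration only.
bom_node = {"labels": {"pool": "bom"},
            "taints": [{"key": "example/bom", "value": "true", "effect": "NoSchedule"}]}
default_node = {"labels": {"pool": "default"}, "taints": []}

bom_pod = {"nodeSelector": {"pool": "bom"},
           "tolerations": [{"key": "example/bom", "operator": "Equal",
                            "value": "true", "effect": "NoSchedule"}]}
plugin_pod = {}

print(can_schedule(bom_pod, bom_node), can_schedule(plugin_pod, bom_node))
print(can_schedule(bom_pod, default_node), can_schedule(plugin_pod, default_node))
```

The taint keeps other workloads off the bom node pool, and the nodeSelector keeps bom agents off the other pools: the toleration alone would only allow bom pods on the tainted nodes, not force them there.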

dduportal · 2023-04-29T13:02:41Z

This work surfaced a problem in the cik8s autoscaler setup: jenkins-infra/aws#405

dduportal · 2023-04-29T13:23:51Z

Let's try a real PR with jenkinsci/bom#2032

dduportal · 2023-05-02T08:34:35Z

Looks good: plugins were able to be built, as seen during the (long) weekend.

dduportal added this to the infra-team-sync-2023-04-18 milestone Apr 17, 2023

dduportal added ci.jenkins.io aws labels Apr 17, 2023

dduportal self-assigned this Apr 17, 2023

This was referenced Apr 17, 2023

Spring 2023: Decrease AWS costs #3502

Closed

feat(cik8s) add a new 'jenkins-agents-bom' namespace ready for handling 'bom' builds jenkins-infra/kubernetes-management#3846

Merged

This was referenced Apr 17, 2023

feat(jenkins-kubernetes-agents) add an SVC account token (no longer generated since Kubernetes 1.23) jenkins-infra/helm-charts#488

Merged

feat(ci.jenkins.io) add a new Kubernetes Cloud for bom builds jenkins-infra/jenkins-infra#2770

Merged

dduportal mentioned this issue Apr 17, 2023

chore(pipeline) switch to podTemplate instead of label to use the new node pool jenkinsci/bom#1969

Closed

6 tasks

dduportal mentioned this issue Apr 17, 2023

fix(datadog in cik8s) ensure datadog agent runs on all node pools jenkins-infra/kubernetes-management#3850

Merged

lemeurherve mentioned this issue Apr 17, 2023

chore(updatecli): add cik8s-bom to kubernetes-pods-quotas.yaml jenkins-infra/kubernetes-management#3851

Merged

dduportal modified the milestones: infra-team-sync-2023-04-18, infra-team-sync-2023-04-25 Apr 19, 2023

jglick mentioned this issue Apr 21, 2023

Suppressing builds on master branch jenkinsci/bom#1993

Merged

dduportal mentioned this issue Apr 25, 2023

feat(cik8s) add a new 3x node pool dedicated to the bom builds jenkins-infra/aws#396

Merged

dduportal mentioned this issue Apr 25, 2023

fix(cik8s) manage aws_auth configmap to add the new AWS SSO infra-admin group jenkins-infra/aws#397

Merged

dduportal modified the milestones: infra-team-sync-2023-04-25, infra-team-sync-2023-05-02 Apr 25, 2023

dduportal mentioned this issue Apr 29, 2023

chore(pipeline) switch to dedicated agent resources jenkinsci/bom#2032

Merged

5 tasks

dduportal closed this as completed May 2, 2023

dduportal mentioned this issue Nov 29, 2023

[ci.jenkins.io / BOM] Diagnose slowness when >200 parallel pipeline steps are running #3839

Closed

5 tasks
