[ci.jenkins.io] separate container agent resources between bom and other builds #3521

Closed
dduportal opened this issue Apr 17, 2023 · 11 comments

dduportal commented Apr 17, 2023

Why

The bom builds are challenging for the infrastructure:

What

This issue tracks the work related to using dedicated node pool(s), designed to run only the bom builds, in order to:

  • Avoid making contributors to projects other than the bom wait 1-2 hours for an agent allocation during a peak of bom builds
  • Design the resource allocation specifically for the bom build usage, to decrease the cost impact

First round is to focus only on AWS: no more bom builds on DigitalOcean:

  • DigitalOcean credits should be preserved (Digital Ocean: Credits are almost depleted #3487)
  • If AWS costs do not decrease, then we can remove the "normal" node pool and split: bom on AWS, other builds on DigitalOcean
  • Simpler baseline for metrics and cost analysis: only 1 kind of builds

The sizing for the new node pool follows these principles:

  • A bom build is estimated to run:

    • Between 200 and 300 parallel steps: we'll count ~300 elements in the build queue as baseline
    • Each step takes ~10-11 minutes (it used to be ~4-5 minutes before "Avoid large stashes" jenkinsci/bom#1955, but the outbound bandwidth cost way more back then)
    • Each step is retried up to 3 times when the underlying agent is killed (due to resources: OOM, CPU limit, eviction, spot instance reclaim), which adds way more elements to the build queue
  • We do not pay for "agent minutes", but rather for "VM minutes": the more time each build spends waiting with the node pool fully scaled, the more we pay

    • Massively parallelizing the processing of the build queue items is a priority to decrease both costs and pressure on ci.jenkins.io
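
To make the sizing baseline above easier to reason about, here is a minimal Python sketch of the arithmetic. Only the ~300 parallel steps, the ~11 minutes per step and the 3 retries come from the estimates above. The `pods_per_node` value is a hypothetical placeholder, and the sketch assumes that "retried up to 3 times" means up to 3 extra attempts per step.

```python
import math


def estimate_bom_build(steps=300, step_minutes=11, max_retries=3, pods_per_node=3):
    """Rough best/worst case load of one bom build.

    steps, step_minutes and max_retries mirror the estimates in this issue.
    pods_per_node is a HYPOTHETICAL value, not the real node pool setting.
    """
    # Worst case: every step is killed and retried the maximum number of times.
    worst_case_attempts = steps * (1 + max_retries)
    return {
        "queue_items_best": steps,
        "queue_items_worst": worst_case_attempts,
        "pod_minutes_best": steps * step_minutes,
        "pod_minutes_worst": worst_case_attempts * step_minutes,
        # Nodes required to take the whole queue in "one shot" (no retries).
        "nodes_for_one_shot": math.ceil(steps / pods_per_node),
    }


if __name__ == "__main__":
    for key, value in estimate_bom_build().items():
        print(f"{key}: {value}")
```

The gap between the best and worst case shows why agent kills (OOM, eviction, spot reclaim) weigh so much on both the build queue and the bill.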

Costs report: https://docs.google.com/spreadsheets/d/1_C0I0jE-X0e0vDcdKOFIWcnwpOqWC8RQ4YOCgXNnplY/edit#gid=292621391

How

First attempt:

  • A new node pool with taints (to ensure the current workload cannot be run accidentally) to split physically
    • Keeping spot instances for now, to avoid a cost explosion. But ensure that the spot instance cost allows up to 4 retries before reaching the on-demand (non-spot) cost, so that the 3 retries of the pipeline stay below that threshold
    • Decrease the "waste" of resources in the instances: more pods per node, right sizing (room for improvement for a second attempt)
    • Increase the maximum number of parallel pods to ensure a given bom build can be processed in "one shot"
  • A new namespace jenkins-agents-bom in cik8s to split logically
  • Use the podTemplate() pipeline method in jenkinsci/bom to improve visibility
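
The spot versus on-demand constraint above can be checked with a small Python sketch. The prices below are hypothetical placeholders (not actual AWS prices), and the sketch assumes that every retry re-runs the whole step on a spot instance of the same price.

```python
from decimal import Decimal


def max_affordable_retries(spot_price, on_demand_price):
    """Number of retries on spot that cost no more than one on-demand run.

    Prices are given per VM hour, as strings or numbers. Both values are
    HYPOTHETICAL inputs; look up the real prices for the instance type.
    """
    spot = Decimal(str(spot_price))
    on_demand = Decimal(str(on_demand_price))
    if spot <= 0 or on_demand <= 0:
        raise ValueError("prices must be positive")
    # Full spot runs (first attempt + retries) that fit in one on-demand run.
    affordable_runs = int(on_demand // spot)
    return max(affordable_runs - 1, 0)


if __name__ == "__main__":
    retries = max_affordable_retries("0.20", "1.00")
    # The target stated above is at least 4 retries.
    print(f"affordable retries: {retries}, target met: {retries >= 4}")
```

Under these assumptions, the spot price has to stay at or below one fifth of the on-demand price to leave room for 4 retries.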

jglick commented Apr 17, 2023

BTW one change I am thinking about is to skip building master by default (only when a release is planned), so that the builds would at least be limited to PRs.

@dduportal

Ok, let's try the new node pool: jenkinsci/bom#1969

@dduportal

Following the "experiments" in jenkinsci/bom#1969, it seems that node pool sizing alone offers no "easy & obvious" way to decrease the AWS costs. So let's deliver the "split bom and plugin resources" part using the same sizing as today, and we'll continue to diagnose.

Reminder of the expected benefits:

  • Non-bom builds (plugins, packaging, etc.) using Linux container agents on ci.jenkins.io won't have to wait 1-2 hours when there are bom builds waiting.
  • Observability of the bom builds will be easier (separate namespace, separate node pool, etc.).

Proposed implementation:

  • Ensure there are 2 different namespaces in cik8s:

    • jenkins-agents-bom for the "production" bom builds
    • jenkins-agents-experiments for the "tests" with bigger node pool (and future experiments)
  • Keep using the cik8s-bom Kubernetes cloud defined in ci.jenkins.io to provide the "admin defined" pod templates

    • Expecting 2 pod templates, one per namespace, each with its own "max pod limit"
    • Defaults to JDK11 agent and JDK17 for dev tools. Specific Java version for dev tools will be defined in the bom pipeline
  • Open a new PR in the bom project to "only" split resources, and rename the current PR "chore(pipeline) switch to podTemplate instead of label to use the new node pool" (jenkinsci/bom#1969)
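
To illustrate how the logical split (namespace) and the physical split (tainted node pool) combine for an agent pod, here is a minimal Python sketch that builds a Kubernetes pod manifest. The two namespace names come from the list above. The taint key and value, the node label and the container image are hypothetical placeholders, not the actual cik8s configuration.

```python
import json


def bom_agent_pod(namespace, taint_key="jenkins-bom", taint_value="true"):
    """Build a pod manifest scheduled only on the dedicated node pool.

    taint_key / taint_value are HYPOTHETICAL names: use the taint and label
    really set on the node pool.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"generateName": "bom-agent-", "namespace": namespace},
        "spec": {
            # Only pick nodes of the dedicated pool (assumed to carry this label).
            "nodeSelector": {taint_key: taint_value},
            # Tolerate the taint that keeps other workloads off these nodes.
            "tolerations": [
                {
                    "key": taint_key,
                    "operator": "Equal",
                    "value": taint_value,
                    "effect": "NoSchedule",
                }
            ],
            # Placeholder agent container, image chosen for illustration only.
            "containers": [{"name": "jnlp", "image": "jenkins/inbound-agent"}],
            "restartPolicy": "Never",
        },
    }


if __name__ == "__main__":
    for ns in ("jenkins-agents-bom", "jenkins-agents-experiments"):
        print(json.dumps(bom_agent_pod(ns), indent=2))
```

The toleration only allows the pod onto the tainted nodes; the node selector is what pins it there, which is why the sketch sets both.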

dduportal commented Apr 25, 2023

@dduportal

This issue surfaced an issue in the cik8s autoscaler setup: jenkins-infra/aws#405

@dduportal

Let's try a real PR with jenkinsci/bom#2032

@dduportal

Looks good: plugins were able to be built, as seen during the (long) weekend.