chore(pipeline) switch to podTemplate instead of label to use the new node pool #1969
Conversation
First attempt (build #2) was unsuccessful but showed the following:
Retrying (build #4) with the following changes:
Could the base image be loaded into the AMI that is spun up, to save some build time? 🤔
Technically yes. But it would not be sustainable (rebuilding the AMI on each Docker image rebuild, e.g. 3-7 times a week, then redeploying the node pool, which takes 4 to 20 min). But it sounds like it's not really the core challenge (at least not yet): there is clearly contention on the Jenkins side: it takes minutes to execute the
We could rebuild the AMI once a week or month. The Docker image would only download the bits that have changed since the last pull, so instead of 2 GB it could be a few MB.
There is the possibility to use volume binding on the Kubernetes nodes to mount the volume for megawar 🤔. There are some options at least:
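To make the suggestion concrete, here is a minimal sketch assuming the Kubernetes plugin's podTemplate step with a hostPath volume; the image, paths, and directory names are hypothetical placeholders, not the actual ci.jenkins.io configuration:

```groovy
// Hypothetical sketch: bind a directory that already exists on the Kubernetes node
// (e.g. pre-populated with the megawar or a warm dependency cache) into the agent pod.
podTemplate(yaml: '''
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: maven
    image: example.registry.local/jenkins-agent-maven:latest  # placeholder image
    command: ["sleep"]
    args: ["infinity"]
    volumeMounts:
    - name: prewarmed-cache
      mountPath: /prewarmed-cache
      readOnly: true
  volumes:
  - name: prewarmed-cache
    hostPath:
      path: /var/cache/jenkins-prewarmed  # assumed to be baked into the node image/AMI
      type: Directory
''') {
  node(POD_LABEL) {
    container('maven') {
      // The build can read the pre-populated content without downloading it again.
      sh 'ls /prewarmed-cache'
    }
  }
}
```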
Thanks @jetersen for the ideas and recommendations! Alas, we've already studied these elements:
In a scenario with Docker images shaped with layers, for pure container-runtime usage, that would be true. It's not the case for the Jenkins infra, as we generate the Docker image using Packer to ensure the same environment between VM and container agents (rationale: maintaining a multitude of tiny Docker images in CI environments with all the possible combinations of tools is an unsustainable task). It means that Packer generates a single-layered image. => the image pull is a minor optimization; at best we'll be able to avoid 5 min of waiting on a 120+ minute job. If we have to solve this problem, a local Docker registry (e.g. on the same network) would clearly be more efficient.
That is a good idea that could help! Worth giving it a try though, if it does not require a heavy cluster change; otherwise
Why would we not simply revert #1955 and enable
Regarding the base image discussions, that all sounds like overthinking it. If an image is published to an appropriate registry local to the cloud provider, its nodes ought to be able to pull that image in a reasonable amount of time, and cache it as well. I think your issue is that you are pulling from Docker Hub (
That does not sound right—
- name: "ARTIFACT_CACHING_PROXY_PROVIDER" | ||
value: "aws" |
Does this get overridden?
What do you mean?
IIUC
Line 18 in 57116da:
infra.withArtifactCachingProxy(env.ARTIFACT_CACHING_PROXY_PROVIDER != 'do') {
We're setting this ARTIFACT_CACHING_PROXY_PROVIDER variable value on every agent (except the Azure VM agents, as it's not possible to set an env var on them AFAIK) in our infrastructure-as-code definitions.
This variable is then used in the infra.withArtifactCachingProxy function to determine which ACP provider to use.
This variable is not set in the pipeline library.
Putting it in this podTemplate allows the correct selection of the AWS ACP provider; without it, it would default to the Azure one (cf. https://github.com/jenkins-infra/pipeline-library/blob/9e0f0a2f09a7c91b26824e706a8963598b685742/vars/infra.groovy#L101).
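A minimal sketch of how those pieces fit together, assuming the Kubernetes plugin's podTemplate step and the pipeline-library's infra.withArtifactCachingProxy helper being loaded; the pod spec below is illustrative, not the real infrastructure-as-code definition:

```groovy
// Illustrative only: inject the variable at the agent (pod) level so that any
// build scheduled on this pod selects the AWS artifact caching proxy.
podTemplate(yaml: '''
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: jnlp
    env:
    - name: "ARTIFACT_CACHING_PROXY_PROVIDER"
      value: "aws"
''') {
  node(POD_LABEL) {
    // The pipeline library reads the variable from the agent environment;
    // without it, the provider would default to the Azure one.
    infra.withArtifactCachingProxy(env.ARTIFACT_CACHING_PROXY_PROVIDER != 'do') {
      sh 'mvn -B -ntp verify'
    }
  }
}
```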
privileged: false
tty: false
volumeMounts:
- mountPath: "/home/jenkins/.m2/repository"
Should not be used, given
Line 20 in 57116da:
"MAVEN_ARGS=${env.MAVEN_ARGS != null ? MAVEN_ARGS : ''} -B -ntp -Dmaven.repo.local=${WORKSPACE_TMP}/m2repo"
This is a "copy-and-paste" of the pod yaml from the existing config defined by the admins of ci.jenkins.io. I would want to stick to this realit y as close as possible.
Directly granting access to an S3 bucket to
That's also what we understand. The unexpected (slow) behavior is the same for any
When the build
Need to check again and export a thread dump + check if we can see this on the
It is controller-wide. But are there other jobs using artifacts and/or stashes which you would want to use the built-in storage for?
There's also this plugin: https://plugins.jenkins.io/s3/.
I'm trying to summarize the possible options for the "stash/unstash" with their pros and cons. Let me know if it makes sense for you:
Are there other solutions or things I missed?
If it's done in the pipeline library, you could also abstract away the cloud provider: depending on which provider the agent is running on, you could use that, although it would assume that other agents in the build use the same provider... The artifact managers do clean up stashes.
Since in either of these cases you are letting some
I was only thinking about stash deletion driven by Jenkins (as per https://plugins.jenkins.io/artifact-manager-s3/#plugin-content-delete-stash). But it's true that garbage collection could be done on the S3 side to avoid pressuring the Jenkins controller 👍
Absolutely, that is the main reason why I'm reluctant to use the "pipeline step" / "pipeline library" solution.
As noted there, this is deliberately not done by default, and must be enabled via a system property.
Force-pushed from 57116da to 23d6104 ("… node pool", Signed-off-by: Damien Duportal <[email protected]>)
Build in 2 hours and 09 minutes => still the same error with "long steps". The ci.jenkins.io machine seems to be "waiting", as well as the agents; not sure where to look. Got a thread dump when the phenomenon was seen.
A few seconds of overhead is normal enough, but not this. It is difficult to pinpoint the issue from general thread dumps; need a thread dump of the controller and that agent taken at the moment that a
From
so looking for output. There is also an agent thread dump from related code, busy doing remote class loading.
You can try running Jenkins with
There are a lot of
The immediate problem might be
meaning the
In
is busy writing out metadata. The more stuff is happening, especially in
Thanks for the details and explanation! Still trying to digest it, but
kept my attention, as we have support bundles so I can share one
Just a note: work on the
Now that jenkins-infra/helpdesk#3542 is fixed, I tried a build to verify the new agent cluster. It works (from an infrastructure point of view), but the build failed due to a degraded agent pool, with error messages related to spot instances (gotta check later).
@dduportal are you still expecting to do something here? Not harming anything leaving it open if so, just wondering if this was forgotten.
I did not forget, but was waiting for the migration of ci.jenkins.io's VM to new hardware + networks, and also for jenkins-infra/helpdesk#3573 to have full observability (including traces). The last test with these new elements still shows the dreaded
Interesting that you should mention this now, as I have been following some performance testing inside @cloudbees suggesting that inexplicably slow
@dduportal does https://ci.jenkins.io/job/Tools/configure have an option for Pipeline branch speed/durability override? If so, please set that to “performance optimized”. Like this: (It is possible to set this from
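For reference, a minimal sketch of setting the same durability level per Pipeline via the standard Declarative options block (rather than the job-level UI override mentioned above); the stage and command are placeholders:

```groovy
// Hypothetical Jenkinsfile fragment: request the performance-optimized durability
// setting so the controller writes Pipeline state to disk far less often.
pipeline {
  agent any
  options {
    // Trade-off: an abruptly interrupted build may not be resumable.
    durabilityHint('PERFORMANCE_OPTIMIZED')
  }
  stages {
    stage('Build') {
      steps {
        sh 'mvn -B -ntp verify'
      }
    }
  }
}
```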
I've found that option and enabled it. Thanks for the recommendation.
I've "borrowed" a test job that @jtnord had defined and am running it with the performance optimized setting on the job configuration. See https://ci.jenkins.io/job/Tools/job/bom/view/change-requests/job/PR-2257/3/ in a few hours to see the results. I'm borrowing: |
Hmm, it seems not to be working at all: https://ci.jenkins.io/job/Tools/job/bom/job/PR-2257/3/threadDump/ has just created
Interesting: is it worth testing the incrementally built plugins from this PR to check the difference on ci.jenkins.io? Would that help?
The #3 build completed in less than 2 hours when its preceding build took 3.5 hours. I think that means the performance optimized setting that is currently configured on the
Or simply setting the system property: besides flipping the default, PR-323 so far is only fixing a handful of robustness bugs affecting tests and corner cases. I think it would be worth a try. (It would have a bigger effect without
Update: I decided not to flip the default in that PR, but I did make a number of fixes there and in associated PRs which should make
For info: jenkins-infra/helpdesk#3745 (comment). This PR is still in draft (as the goal is to improve parallelization of agents for bom to decrease build time and billing).
Closing this PR as we're gonna try a new angle:
This PR is a work in progress to switch the bom builds to a new pool of agent resources, as per jenkins-infra/helpdesk#3521.
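As a rough sketch of the kind of change the title describes (switching from a static label to an inline podTemplate), assuming the Kubernetes plugin; the label, image, and node selector below are hypothetical placeholders, not the actual node-pool configuration:

```groovy
// Before (hypothetical): the build relies on a label configured on the controller.
// node('maven-bom') { ... }

// After (sketch): the pod is declared inline, so the build lands on the new node pool.
podTemplate(yaml: '''
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    kubernetes.io/os: linux  # placeholder selector for the new node pool
  containers:
  - name: maven
    image: example.registry.local/jenkins-agent-maven:latest  # placeholder image
    command: ["sleep"]
    args: ["infinity"]
''') {
  node(POD_LABEL) {
    container('maven') {
      checkout scm
      sh 'mvn -B -ntp verify'
    }
  }
}
```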