Bug 2247748: A storage-client CronJob creates too many jobs and pods, causing the maxPod limit to be reached #39
Conversation
@bernerhat: This pull request references Bugzilla bug 2247748, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 2 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: nehaberry. Note that only red-hat-storage members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
@bernerhat could you please add
@@ -470,6 +470,7 @@ func (s *StorageClientReconciler) reconcileClientStatusReporterJob(instance *v1a
			},
		},
	},
	ConcurrencyPolicy: "Forbid",
- Move the fields that directly belong to a struct to the top, ie, move this line to after 439, ie, after `Schedule`, which generates a smaller diff for easy viewing.
- Even though the string `Forbid` is technically correct, since this is an enum-ish field you should be searching for any predefined constants as a value for this field, ie, `batchv1.ForbidConcurrent` should be used.
- Please comment on how you simulated the scenario as well.
ack. all addressed.
var jobDeadLineSeconds int64 = 155
var podDeadLineSeconds int64 = 120
var keepJobResourceSeconds int32 = 600
var reducedKeptSuccecsful int32 = 1
Had to create addressable variables to pass into the pointer fields required by the new spec fields; looking into other options as well.
Firstly, addressing the issue at hand: I simulated and tested the options. Secondly, the concerns brought up to me by Ohad were:
In order to address these concerns I've added a few new specs to the crafted CR:
var podDeadLineSeconds int64 = 120
var keepJobResourceSeconds int32 = 600
var reducedKeptSuccecsful int32 = 1

_, err := controllerutil.CreateOrUpdate(s.ctx, s.Client, cronJob, func() error {
	cronJob.Spec = batchv1.CronJobSpec{
Use `startingDeadlineSeconds` as well, just to safeguard against the 100-missed-runs job failure w/ Forbid. Can be b/n 30 to 60s.
I've verified locally that a new job starts right after the deadline is reached for the previous one, so I don't believe this is necessary.
It's only referring to job scheduling failure. With the deadline (to finish) in place, there won't be a skipped scheduling more than twice per iteration (at most): once the 155-second deadline is reached, a new iteration will spawn and the counter for how many missed jobs occurred will be reset. There is still the possibility that the CronJob controller happens to be down for a long time (more than 100 minutes), which would cause the issue you are referring to, but wouldn't that be considered a cluster issue at that point? I can add the field, but I'm unsure whether it will cause issues with the other fields I've added, and it would need additional testing.
It's not mandatory, the reasoning is good enough.
Did not miss; it's still failing for the first commit.. is there a way to fix it, or should I re-create the PR?
added the capability to keep resources on failed job execution with a timeout
Signed-off-by: Amit Berner <[email protected]>
Force-pushed from cf01fdd to 7fb2735
/approve
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: bernerhat, leelavg, nb-ohad. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@bernerhat: All pull requests linked via external trackers have merged: Bugzilla bug 2247748 has been moved to the MODIFIED state. In response to this:
Fixes the issue where consecutive CronJob iterations create pods unnecessarily, causing the maxPod limit to be reached and future pod scheduling to be stuck in a pending state.
The `Forbid` option waits for the current job to finish before starting a new one, while the `Replace` option starts a new job regardless of the current one's status (which can also create a new pod while the previous job's pod is still terminating, increasing the pod count unnecessarily). Since the pod created for this CronJob is a heartbeat to the provider server, I don't see the point in replacing the current job with another, so `Forbid` made more sense.