Bug 2247748: A storage-client CronJob create too many jobs and pods causing maxPod limit to be reached #39

Merged
merged 1 commit into from
Nov 20, 2023

Conversation

bernerhat
Contributor

Fixes the issue where consecutive CronJob iterations created pods unnecessarily until the maxPod limit was reached, leaving newly scheduled pods stuck in the Pending state.

  • Added a ConcurrencyPolicy field to the CronJob for reconcileClientStatusReporterJob in StorageClientReconciler
  • ConcurrencyPolicy chosen in this case to be Forbid

With the Forbid option, a new job is not started while the current one is still running. With the Replace option, a new job is started regardless of the current job's status, replacing it (which can also create a new pod while the previous job's pod is still terminating and increase the pod count unnecessarily).
Since the pod created by this CronJob is only a heartbeat to the provider server, I don't see the point in replacing the current job with another, so Forbid made more sense.

openshift-ci bot commented Nov 9, 2023

@bernerhat: This pull request references Bugzilla bug 2247748, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

2 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @nehaberry

In response to this:

Bug 2247748: A storage-client CronJob create too many jobs and pods causing maxPod limit to be reached

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot commented Nov 9, 2023

@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: nehaberry.

Note that only red-hat-storage members and repo collaborators can review this PR, and authors cannot review their own PRs.

@leelavg
Contributor

leelavg commented Nov 13, 2023

@bernerhat could you pls add controllers as the prefix to commit msg to pass github actions?

@@ -470,6 +470,7 @@ func (s *StorageClientReconciler) reconcileClientStatusReporterJob(instance *v1a
},
},
},
ConcurrencyPolicy: "Forbid",
Contributor

  1. Move fields that belong directly to the struct to the top, i.e., move this line to after line 439 (after Schedule), which also produces a diff that is easier to view.
  2. Even though the string Forbid is technically correct, this is an enum-ish field, so you should look for a predefined constant to use as its value, i.e., batchv1.ForbidConcurrent should be used.
  3. Please also comment on how you simulated the scenario.
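
As an illustration of point 2, here is a small sketch of why a predefined constant is preferable to a string literal for an enum-like field. The types below are local stand-ins that mirror the shape of the typed string constants in batchv1; `CronJobSpec` is trimmed down, and `isKnownPolicy` and the `"* * * * *"` schedule are hypothetical additions for this example only.

```go
package main

import "fmt"

// ConcurrencyPolicy is a local stand-in mirroring the typed string in batchv1.
type ConcurrencyPolicy string

const (
	AllowConcurrent   ConcurrencyPolicy = "Allow"
	ForbidConcurrent  ConcurrencyPolicy = "Forbid"
	ReplaceConcurrent ConcurrencyPolicy = "Replace"
)

// CronJobSpec is a trimmed-down, hypothetical spec with only two fields.
type CronJobSpec struct {
	Schedule          string
	ConcurrencyPolicy ConcurrencyPolicy
}

// isKnownPolicy reports whether p is one of the predefined values.
func isKnownPolicy(p ConcurrencyPolicy) bool {
	switch p {
	case AllowConcurrent, ForbidConcurrent, ReplaceConcurrent:
		return true
	}
	return false
}

func main() {
	// Using the constant: a misspelled identifier would fail to compile.
	good := CronJobSpec{Schedule: "* * * * *", ConcurrencyPolicy: ForbidConcurrent}
	// Using a literal: the typo still compiles and is only caught at runtime.
	typo := CronJobSpec{Schedule: "* * * * *", ConcurrencyPolicy: "Forbd"}

	fmt.Println(isKnownPolicy(good.ConcurrencyPolicy), isKnownPolicy(typo.ConcurrencyPolicy))
}
```

The constant and the literal `"Forbid"` are the same value at runtime; the benefit is that the compiler checks the spelling of the identifier.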

Contributor Author

ack. all addressed.

Comment on lines 436 to 439
var jobDeadLineSeconds int64 = 155
var podDeadLineSeconds int64 = 120
var keepJobResourceSeconds int32 = 600
var reducedKeptSuccecsful int32 = 1
Contributor Author

I had to create variables whose addresses can be passed to the pointer fields required by the new settings. I'm looking into other options as well.
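
One possible alternative to the named variables is a small generic helper that returns a pointer to any value. The sketch below is only an illustration: `ptrTo` and the trimmed-down `jobSettings` struct are hypothetical names, not code from this PR. The k8s.io/utils/ptr package offers a similar `ptr.To` helper, if adding that dependency is acceptable.

```go
package main

import "fmt"

// ptrTo returns a pointer to a copy of v, so literals can fill pointer fields.
func ptrTo[T any](v T) *T {
	return &v
}

// jobSettings is a hypothetical struct that mimics the optional pointer
// fields discussed in this thread.
type jobSettings struct {
	ActiveDeadlineSeconds      *int64
	TTLSecondsAfterFinished    *int32
	SuccessfulJobsHistoryLimit *int32
}

func main() {
	s := jobSettings{
		ActiveDeadlineSeconds:      ptrTo(int64(155)),
		TTLSecondsAfterFinished:    ptrTo(int32(600)),
		SuccessfulJobsHistoryLimit: ptrTo(int32(1)),
	}
	fmt.Println(*s.ActiveDeadlineSeconds, *s.TTLSecondsAfterFinished, *s.SuccessfulJobsHistoryLimit)
}
```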

@bernerhat
Contributor Author

First, addressing the issue at hand: I simulated and tested both the Forbid and Replace options for the currently running Job locally, using minikube and Kubernetes version 1.27.4.

Second, the concerns Ohad brought up to me were:

  1. Setting a proper deadline for the pod to finish its operation, to prevent an infinite-loop scenario.
  2. Keeping the last run's objects in order to investigate such a scenario.

To address these concerns I've added a few new fields to the spec of the crafted CR:

  • TTLSecondsAfterFinished - keeps the iteration's objects (job and pod). The TTL is set to 10 minutes and can be changed.
  • ActiveDeadlineSeconds - deadline before the objects are killed. Pod: 2 minutes (chosen after consultation); Job: 2 minutes 35 seconds.
  • SuccessfulJobsHistoryLimit - set to 1. The default value is 3; it was reduced to prevent unnecessary object clutter.

ActiveDeadlineSeconds at the job level was set to 2 minutes and 35 seconds in order to keep the pod. For some reason the pod was not kept after the deadline was reached. After a lot of testing I found that the pod was only kept once it had moved into the Error status, which happens 30 seconds after the deadline is reached and the kill order is issued (the extra 5 seconds are there to be sure).

@leelavg
Contributor

leelavg commented Nov 20, 2023

@bernerhat could you pls add controllers as the prefix to commit msg to pass github actions?

  • missed, GH action is still failing.

var podDeadLineSeconds int64 = 120
var keepJobResourceSeconds int32 = 600
var reducedKeptSuccecsful int32 = 1


_, err := controllerutil.CreateOrUpdate(s.ctx, s.Client, cronJob, func() error {
cronJob.Spec = batchv1.CronJobSpec{
Contributor

Use startingDeadlineSeconds as well, just to safeguard against the 100-times job failure case with Forbid. It can be between 30 and 60s.

Contributor Author

I've verified locally that a new job starts right after the deadline is reached for the previous one, so I don't believe this is necessary.

Contributor

No, I'm referring to a very rare edge case; search for 100 in this and another post.

Contributor Author

It only refers to job scheduling failures. With the deadline (to finish) in place, scheduling won't be skipped more than twice per iteration (at most). Once the 155-second deadline is reached, a new iteration will spawn and the count of missed jobs will be reset. There is still the possibility that the CronJob controller is down for a long time (more than 100 minutes), which would cause the issue you are referring to, but wouldn't that be considered a cluster issue at that point? I can add the field, but I'm unsure whether it will cause issues with the other fields I've added, and it would need additional testing.

Contributor

It's not mandatory, the reasoning is good enough.

@bernerhat
Contributor Author

@bernerhat could you pls add controllers as the prefix to commit msg to pass github actions?

  • missed, GH action is still failing.

I did not miss it; it's still failing for the first commit. Is there a way to fix that, or should I re-create the PR?

added the capability to keep resources on failed job execution with a timeout

Signed-off-by: Amit Berner <[email protected]>
@nb-ohad
Contributor

nb-ohad commented Nov 20, 2023

/approve

openshift-ci bot commented Nov 20, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bernerhat, leelavg, nb-ohad

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nb-ohad nb-ohad merged commit 3a8f949 into red-hat-storage:fusion-hci-4.14 Nov 20, 2023
12 checks passed

openshift-ci bot commented Nov 20, 2023

@bernerhat: All pull requests linked via external trackers have merged:

Bugzilla bug 2247748 has been moved to the MODIFIED state.
