Fix resuming Jobset after restoring PodTemplate (by deleting Jobs) #625
c8d7013 to 24448c5
  // This is needed for integration with Kueue/DWS.
- if ptr.Deref(oldJS.Spec.Suspend, false) {
+ if ptr.Deref(oldJS.Spec.Suspend, false) || ptr.Deref(js.Spec.Suspend, false) {
- if ptr.Deref(oldJS.Spec.Suspend, false) || ptr.Deref(js.Spec.Suspend, false) {
+ if ptr.Deref(oldJS.Spec.Suspend, false) || (ptr.Deref(js.Spec.Suspend, false) && js.Spec.ReplicatedJobs[*].status.startTime == nil) {
Wouldn't it be helpful to also validate that the jobs don't have a startTime?
If any job has a startTime, this operation should fail anyway due to batch/job validation.
But I'm not sure we should add validations that depend so deeply on the batch/job specifications.
@danielvegamyhre @mimowo WDYT?
I think the flow is that we first suspend the parent JobSet, and then it suspends the Jobs. So, if we added this condition, the startTime on running child jobs would prevent the update.
However, this is complex, and I think I should have added an e2e test in JobSet to verify that suspending works.
Yeah, I think integration and e2e tests that solidify the requirements JobSet needs for working with Kueue would be a great idea.
We obviously don't want to bring in Kueue as a dependency, so I think just verifying that the updates/patches work in integration/e2e tests would be welcome.
> I think the flow is that we first suspend the parent JobSet, then it suspends the Jobs. So, if we added this condition, the startTime on running child jobs would prevent the update.
Hmm, that makes sense.
But I guess users can fall into a rabbit hole when they accidentally modify the scheduling directives of a ReplicatedJob whose Jobs have a startTime, because the jobset-controller only outputs errors in the controller logs, which is not direct user feedback.
Maybe we can improve these implied specifications for JobSet users once we introduce kubernetes/kubernetes#113221.
So, I'm OK with the current implementation for now.
Actually, the e2e test would not work (see the PR, where I have already added it), because there is a deeper issue: JobSet does not delete the Jobs once suspended (#535). Because we keep the old Jobs, we never create new Jobs on resume, even if the PodTemplate is updated in the JobSet.
I'm trying to implement this by deleting the Jobs on suspend, and it looks promising. There are some integration tests for startupPolicy that I still need to look into and adjust.
Let me know if you have more context on any traps I may encounter with this approach (I will probably continue next week). Essentially, this is the only approach I see, and it would mimic the Job behavior, where a suspended Job deletes its pods.
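A toy, in-memory sketch of the idea discussed above, not the actual controller code: `Job`, `JobSet`, `reconcile`, and the `deleteOnSuspend` switch are all hypothetical names, and a string stands in for the PodTemplate. It only illustrates why Jobs kept across a suspend never pick up an updated template, while Jobs deleted on suspend are recreated from the current one.

```go
package main

import "fmt"

// Job and JobSet are simplified, hypothetical models; Template stands in
// for the PodTemplate.
type Job struct {
	Name     string
	Template string
}

type JobSet struct {
	Suspend        bool
	Template       string
	ReplicatedJobs []string
}

// reconcile returns the child Jobs that exist after one reconcile pass.
func reconcile(js JobSet, existing []Job, deleteOnSuspend bool) []Job {
	if js.Suspend {
		if deleteOnSuspend {
			return nil // proposed behavior: delete the Jobs on suspend
		}
		return existing // behavior described in #535: the Jobs are kept
	}
	byName := make(map[string]Job, len(existing))
	for _, j := range existing {
		byName[j.Name] = j
	}
	jobs := make([]Job, 0, len(js.ReplicatedJobs))
	for _, name := range js.ReplicatedJobs {
		if j, ok := byName[name]; ok {
			jobs = append(jobs, j) // an existing Job is never recreated
			continue
		}
		jobs = append(jobs, Job{Name: name, Template: js.Template})
	}
	return jobs
}

// suspendUpdateResume runs one suspend / template update / resume cycle.
func suspendUpdateResume(deleteOnSuspend bool) []Job {
	js := JobSet{Template: "v1", ReplicatedJobs: []string{"workers"}}
	jobs := reconcile(js, nil, deleteOnSuspend)

	js.Suspend = true
	jobs = reconcile(js, jobs, deleteOnSuspend)

	js.Template = "v2" // the PodTemplate is updated while suspended
	js.Suspend = false
	return reconcile(js, jobs, deleteOnSuspend)
}

func main() {
	fmt.Println(suspendUpdateResume(false)) // [{workers v1}]
	fmt.Println(suspendUpdateResume(true))  // [{workers v2}]
}
```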
Thanks for working on the fix for this @mimowo. Please take a look at my comment here #624 (comment) as well; I'd like to improve our coordination to prevent these kinds of issues from occurring in the future.
/hold
24c5d8b to 38e5520
85077d4 to 0974200
0974200 to f02353d
Let me know when this is ready for review and I'll take a look. Thanks!
b9678b7 to b40a6d2
Thank you @danielvegamyhre! I think it is ready now; let me know if any parts require additional explanation.
/hold cancel
@danielvegamyhre I'm wondering about introducing
pkg/webhooks/jobset_webhook.go
Outdated
rStatus := js.Status.ReplicatedJobsStatus
// Don't allow to mutate PodTemplate on unsuspending if there
// are still active or suspended jobs from the previous run
if len(rStatus) > index && (rStatus[index].Active > 0 || rStatus[index].Suspended > 0) {
Can we assign len(rStatus) > index to a well-named variable to make it more clear/readable why it matters if the number of replicatedJob statuses is greater than the current replicatedJob index?
Furthermore, could we abstract this block out into a helper function to make it clear at a glance what we are trying to check here, and make it more concise/readable?
Sure. Instead of using len(rStatus) > index, I basically want to check whether the ReplicatedJobsStatus is initialized. If not, then I assume the JobSet didn't run yet, so there are no Jobs. I have also extracted the code into a function and documented it. PTAL.
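One way such a helper could look is sketched below. The function name `hasJobsFromPreviousRun`, the variable `statusInitialized`, and the trimmed-down `ReplicatedJobStatus` type are assumptions made for illustration; the helper actually extracted in the PR may differ.

```go
package main

import "fmt"

// ReplicatedJobStatus is a reduced stand-in for the JobSet status entry,
// keeping only the counters used in the check (assumption).
type ReplicatedJobStatus struct {
	Name      string
	Active    int32
	Suspended int32
}

// hasJobsFromPreviousRun reports whether the replicated job at index may
// still have active or suspended Jobs left over from a previous run.
func hasJobsFromPreviousRun(statuses []ReplicatedJobStatus, index int) bool {
	// An empty status list is treated as "the JobSet has not run yet",
	// which implies there are no Jobs to conflict with.
	statusInitialized := len(statuses) > 0
	if !statusInitialized || index >= len(statuses) {
		return false
	}
	s := statuses[index]
	return s.Active > 0 || s.Suspended > 0
}

func main() {
	fmt.Println(hasJobsFromPreviousRun(nil, 0)) // false

	statuses := []ReplicatedJobStatus{{Name: "workers", Suspended: 2}}
	fmt.Println(hasJobsFromPreviousRun(statuses, 0)) // true
}
```

Naming the intermediate condition keeps the intent ("has the status been initialized?") visible at the call site instead of leaving a bare length comparison.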
b40a6d2 to eb3f230
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: mimowo
eb3f230 to 495b8ff
// controller to delete the Jobs from the previous run if they could
// conflict with the Job creation on resume.
rStatus := js.Status.ReplicatedJobsStatus
if rStatus[index].Active > 0 || rStatus[index].Suspended > 0 {
Do you think we should include terminating?
Indeed, currently we only count as "active" those Jobs which have at least one "active" pod. This means it is possible that some Jobs which have only terminating pods still exist.
One idea for a fix would be to count Jobs as active if they have at least one active or terminating pod. Another is to introduce a new counter for "terminating" Jobs, that is, Jobs which have all pods terminating. WDYT?
I guess this is a corner case for now, so maybe it could be done in a follow-up?
I think it is not too bad in this PR, because a terminating Job (all pods are terminating) is going to be deleted eventually. Yes, if we resume the JobSet while it exists, then it can block creating the replacement Job for a while, but IIUC the JobSet controller will create a replacement for such a Job as soon as it is fully gone, and the replacement Job will have a good PodTemplate.
In order to prevent resuming a JobSet while there is a terminating Job we could:
1. include such a Job as active, if job.Status.Active > 0 || ptr.Deref(job.Status.Terminating, 0) > 0 here
2. introduce a new counter for Terminating Jobs. Terminating would be a fallback counter (so job.Status.Active == 0), incremented if ptr.Deref(job.Status.Terminating, 0) > 0
I'm leaning towards option (2.). I could implement it either in this PR or in a follow-up.
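The two options can be sketched side by side as follows. `JobStatus` and `Counts` are reduced, hypothetical types, and `countOption1` / `countOption2` are illustrative names; the local `deref` helper mirrors `ptr.Deref` so that a nil `Terminating` field is read as 0.

```go
package main

import "fmt"

// JobStatus keeps only the two fields discussed in the thread (assumption).
type JobStatus struct {
	Active      int32
	Terminating *int32 // nil when the field is not set
}

// Counts is a hypothetical per-ReplicatedJob summary.
type Counts struct {
	Active      int
	Terminating int
}

// deref returns *p, or def when p is nil (same contract as ptr.Deref).
func deref[T any](p *T, def T) T {
	if p == nil {
		return def
	}
	return *p
}

// countOption1 counts a Job as active if it has active or terminating pods.
func countOption1(jobs []JobStatus) Counts {
	var c Counts
	for _, s := range jobs {
		if s.Active > 0 || deref(s.Terminating, 0) > 0 {
			c.Active++
		}
	}
	return c
}

// countOption2 keeps Active as is and uses Terminating as a fallback
// counter for Jobs with no active pods but some terminating pods.
func countOption2(jobs []JobStatus) Counts {
	var c Counts
	for _, s := range jobs {
		switch {
		case s.Active > 0:
			c.Active++
		case deref(s.Terminating, 0) > 0:
			c.Terminating++
		}
	}
	return c
}

func main() {
	two := int32(2)
	jobs := []JobStatus{
		{Active: 1},         // running Job
		{Terminating: &two}, // only terminating pods
		{},                  // nothing active, Terminating unset
	}
	fmt.Println(countOption1(jobs)) // {2 0}
	fmt.Println(countOption2(jobs)) // {1 1}
}
```

Option 2 leaves the meaning of the existing active counter unchanged, which may be why it is the preferred direction in the comment above.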
I think a follow up is fine. Terminating on Jobs is currently a beta feature so logic may be somewhat complex.
> Terminating on Jobs is currently a beta feature so logic may be somewhat complex.
I think we don't need any code in JobSet that depends on the feature gate in k8s. If the feature gate is disabled in k8s, the field is not set, so the code in JobSet would see nil. We could prevent panics with ptr.Deref(job.Status.Terminating, 0), and this should be enough, IIUC.
I don't think we need a feature gate for this.
@danielvegamyhre @kannon92 thanks for reviewing. Let me know what remains to be addressed in this PR.
/hold I agree this is worth exploring, as it is closer to how JobSet operates currently, so the impact is more predictable. I will try this approach and update the PR, or summarize if there are any blockers.
@danielvegamyhre @kannon92
/close
@mimowo: Closed this PR. In response to this:
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #624
Fixes #535
Special notes for your reviewer:
This approach relies on updating the Jobs on resume, rather than deleting the Jobs on suspend.
Alternative implementation was done in: #625
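As a rough illustration of the "update on resume" idea, the sketch below copies a scheduling directive from the JobSet onto each still-suspended Job before unsuspending it. Everything here is hypothetical: the `Job` and `JobSet` types, the `resumeJobs` function, and the use of a node selector with a `pool` key as the only directive. It also assumes the Job API accepts such an update for a suspended Job.

```go
package main

import "fmt"

// Job and JobSet are simplified, hypothetical models; NodeSelector stands
// in for the scheduling directives of the PodTemplate.
type Job struct {
	Name         string
	Suspend      bool
	NodeSelector map[string]string
}

type JobSet struct {
	NodeSelector map[string]string
}

// resumeJobs updates every suspended Job with the JobSet's current node
// selector and then unsuspends it, instead of deleting and recreating it.
func resumeJobs(js JobSet, jobs []Job) []Job {
	resumed := make([]Job, 0, len(jobs))
	for _, j := range jobs {
		if j.Suspend {
			// Copy the map so Jobs do not share state with the JobSet.
			selector := make(map[string]string, len(js.NodeSelector))
			for k, v := range js.NodeSelector {
				selector[k] = v
			}
			j.NodeSelector = selector
			j.Suspend = false
		}
		resumed = append(resumed, j)
	}
	return resumed
}

func main() {
	js := JobSet{NodeSelector: map[string]string{"pool": "b"}}
	jobs := []Job{{
		Name:         "workers-0",
		Suspend:      true,
		NodeSelector: map[string]string{"pool": "a"},
	}}
	for _, j := range resumeJobs(js, jobs) {
		fmt.Println(j.Name, j.Suspend, j.NodeSelector["pool"]) // workers-0 false b
	}
}
```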
Does this PR introduce a user-facing change?