Always update TrainJob status on errors #2352

Merged: 1 commit, Dec 13, 2024

astefanutti
Contributor

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #2351

Checklist:

  • Docs included if any changes are user facing

coveralls commented Dec 13, 2024

Pull Request Test Coverage Report for Build 12320339089

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 12320152919: 0.0%
Covered Lines: 85
Relevant Lines: 85

💛 - Coveralls

@@ -199,7 +200,7 @@ func setSuspendedCondition(trainJob *kubeflowv2.TrainJob) {

 func setTerminalCondition(ctx context.Context, runtime jobruntimes.Runtime, trainJob *kubeflowv2.TrainJob) error {
 	terminalCond, err := runtime.TerminalCondition(ctx, trainJob)
-	if err != nil {
+	if err != nil && !apierrors.IsNotFound(err) {
Member


Instead of this, I'm wondering if we should update the status regardless of errors here: https://github.com/kubeflow/training-operator/blob/3eb1a0c9e3616a7fed9864d8b1373d6d677a6c77/pkg/controller.v2/trainjob_controller.go#L99-

So, if we get any error at L100:

if terminalCondErr := setTerminalCondition(ctx, runtime, &trainJob); terminalCondErr != nil {
	return ctrl.Result{}, errors.Join(err, terminalCondErr)
}

we would just join terminalCondErr with err, but not return the error at that point.

In other words, we would update the status whenever the old and new statuses differ, even if any of the setCondition functions returns an error.

The reason is that we do not restrict the underlying Job to JobSet, so we cannot assume that a NotFound error can be safely ignored.

Contributor Author


You're right, that's much better!

I've just re-pushed. Let me know if that corresponds to what you had in mind.

Member


Great to see.

@astefanutti changed the title from "Handle not found error when getting JobSet terminal condition" to "Always update TrainJob status on errors" on Dec 13, 2024
Member

@tenzen-y left a comment


Thank you!
/lgtm
/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

@google-oss-prow bot merged commit de16432 into kubeflow:master on Dec 13, 2024
52 checks passed
@astefanutti deleted the pr-04 branch on December 13, 2024 18:10
Successfully merging this pull request may close these issues.

Conditions are never set on TrainJobs when the creation of PodSets fail