Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for the managedBy field #2193

Closed
mimowo opened this issue Aug 5, 2024 · 21 comments · Fixed by #2203
Closed

Add support for the managedBy field #2193

mimowo opened this issue Aug 5, 2024 · 21 comments · Fixed by #2203
Assignees

Comments

@mimowo
Copy link

mimowo commented Aug 5, 2024

What you would like to be added?

The support for the managedBy field which can delegate reconciliation from built-in controller, to a custom one.

The semantics of the field are:

  • whenever the value is set, and it does not point to the built-in operator, then skip reconciliation
  • the field is immutable
  • the field is not defaulted

Why is this needed?

For context, we have in Kueue the effort (see Support Kubeflow Jobs in MultiKueue) to support kubeflow-training (it will include MPIJob), but it will not be complete without the support for managedBy.

The complete support for the users of MultiKueue (multi-cluster Kueue) means:

  • simpler installation (just follow the standard installation path), otherwise only installation of kubeflow CRDs is required
  • support for mixed setup in the cluster - some Jobs could be run by MultiKueue and some by the default operator in the same cluster

For context, the efforts to support the field in:

Love this feature?

Give it a 👍 We prioritize the features with most 👍

@mimowo
Copy link
Author

mimowo commented Aug 5, 2024

/cc @andreyvelich @tenzen-y
Would you be willing to consider this feature still in v1?

@mimowo
Copy link
Author

mimowo commented Aug 5, 2024

For xreference of with support for MPIJob: kubeflow/mpi-operator#646

@andreyvelich
Copy link
Member

Thank you for creating this @mimowo!
Let's discuss this during our next AutoML and Training WG call this Wednesday at 5pm UTC: https://bit.ly/2PWVCkV

@mimowo
Copy link
Author

mimowo commented Aug 7, 2024

I believe the implementation could follow the JobSet example. These are the key bits:

@tenzen-y
Copy link
Member

tenzen-y commented Aug 7, 2024

In the JobSet managedBy implementations, the basic status transitions are guarded by the batch/v1 Job status validations.
But, since the kubeflow TrainJobs v1 is not guarded by any validations due to controlling naked Pod.

So, I would propose selecting either of the following 2 approaches:

  1. Implement the status transition validations similar to batch/v1 Job so that we can disallow violated status transitions from external controllers. Because the managedBy field allows external controllers to reconcile/manage the Kubeflow Jobs.
  2. Restrict the acceptable managedBy values to either kubflow.org/training-operator or kueue.x-k8s.io/multikueue. And then, after we find some use cases to specify the arbitrary values in the managedBy field, we can relax the validations.

TBH, I would propose selecting opt 2 since I can not find any use cases so that the arbitrary external controller needs to reconcile/control the kubeflow Jobs.

@mimowo
Copy link
Author

mimowo commented Aug 7, 2024

I'm also in favor of (2.) to simplify the code.

@andreyvelich
Copy link
Member

I agree on the 2nd approach given that we will migrate to TrainJob in the future versions.

@tenzen-y do we need to allow user to set kubflow.org/training-operator in managedBy field ?
I guess, if managedBy is empty the OSS Training Operator controller will reconcile the training job.

@tenzen-y
Copy link
Member

tenzen-y commented Aug 7, 2024

I assumed that the training operator defaulter mutate it to the kubeflow.org/training-operator when the Kubeflow Jobs don't have any managedBy values.

So, there is no impact on the existing users.

@andreyvelich
Copy link
Member

I see, do we have the same logic for batchv1/job ?

@mimowo
Copy link
Author

mimowo commented Aug 7, 2024

In JobSet and Jobwe don't default the value to the jobset-controller (we leave null, meaning default). However, to make it future compatible, and to allow some users to express the intent better, we reserve one value for the default controller.

If we want consistency with Job, then, we reserve kubflow.org/tf-operator, but we can simplify here and only support two values: null (default), or multi-kueue.

EDIT: I don't have a strong preference, but given the code is simple anyway, I would be leaning to follow the pattern of Job and reserve one value, it could be kubeflow.org/training-operator for all of them.

@tenzen-y
Copy link
Member

tenzen-y commented Aug 7, 2024

Uhm, that makes sense. If so, can we allow the training-operator to reconcile against the Jobs with an empty or kubeflow.org/training-operator managedBy field so that we can align with Job and JobSet?

@mszadkow
Copy link
Contributor

mszadkow commented Aug 8, 2024

/assign
I have consulted with @mimowo, and I will be responsible for the implementation and the support in Kueue.

@andreyvelich
Copy link
Member

andreyvelich commented Sep 23, 2024

I saw that the managedBy PR has been merged, thank you for doing this @mszadkow @mimowo 🎉

Please can you add a new user guide on how to use managedBy in the Kubeflow website https://www.kubeflow.org/docs/components/training/user-guides/ ?
If we have some docs under Kueue, we can also reference them.

@mimowo
Copy link
Author

mimowo commented Sep 23, 2024

Since we restrict for now the field only to multikueue or built-in controller, the documentation in Kubeflow should probably be short and focused on MultiKueue (maybe just briefly mention what the field does, and reference the MutliKueue page).

For the docs in Kueue the only challenge is that it would need to reference dev (from master) installation of the training-operator, until the official release, but I think it would be a nice starting point I would suggest.

@mimowo
Copy link
Author

mimowo commented Sep 23, 2024

I have opened the issue to capture the docs extension at the Kueue side: kubernetes-sigs/kueue#3121

@andreyvelich
Copy link
Member

Since we restrict for now the field only to multikueue or built-in controller, the documentation in Kubeflow should probably be short and focused on MultiKueue (maybe just briefly mention what the field does, and reference the MutliKueue page).

I agree with you, but even the reference would be useful.

For the docs in Kueue the only challenge is that it would need to reference dev (from master) installation of the training-operator, until the official release, but I think it would be a nice starting point I would suggest.

Let's create pending PR in the kubeflow/website and merge it when we release this feature.

@mimowo
Copy link
Author

mimowo commented Sep 24, 2024

Sounds like a plan:

  1. a docs PR in Kueue showing how to use training-operator (master for now) with ManagedBy (we need to note clearly this is only for master version, for released version there is another way - only install CRDs), [MultiKueue] Document how to use training-operator with managedBy kubernetes-sigs/kueue#3121
  2. a docs PR in kubeflow/website referencing the MultiKueue documentation
  3. once training-operator is released a small PR in Kueue to use the released version rather than master

I synced with @mszadkow who will follow up on the steps.

@mimowo
Copy link
Author

mimowo commented Oct 8, 2024

@mszadkow investigated this a bit more and it turns out that we cannot use master version of the training-operator with Kueue (point 1 of the plan), because Kueue requires to see the field to operate properly, but we don't want to compile Kueue against unreleased training-operator. So, for now, the docs would be just confusing to end users.

Since the spec.runPolicy.managedBy field is only supported for MultiKueue I think we can defer the documentation until we have:

  1. the next release of the training-operator,
  2. version of Kueue (not necessarily released yet, might be from the main branch) which allows to use it

I could open an issue in kubeflow to add the documentation page once the above conditions are met. WDYT @andreyvelich @tenzen-y?

@andreyvelich
Copy link
Member

Sure, @mimowo please open a new issue in Training Operator repo: https://github.com/kubeflow/training-operator/issues.

@mimowo
Copy link
Author

mimowo commented Oct 9, 2024

Sure, opened: #2279

@tenzen-y
Copy link
Member

tenzen-y commented Oct 9, 2024

@mszadkow investigated this a bit more and it turns out that we cannot use master version of the training-operator with Kueue (point 1 of the plan), because Kueue requires to see the field to operate properly, but we don't want to compile Kueue against unreleased training-operator. So, for now, the docs would be just confusing to end users.

Since the spec.runPolicy.managedBy field is only supported for MultiKueue I think we can defer the documentation until we have:

  1. the next release of the training-operator,
  2. version of Kueue (not necessarily released yet, might be from the main branch) which allows to use it

I could open an issue in kubeflow to add the documentation page once the above conditions are met. WDYT @andreyvelich @tenzen-y?

I discussed this with @mimowo offline, and we agreed with this decision based on Kueue policies. We may aim to contain the training-operator managedBy feature once the traininig-operator RC version is in the future. But that is not guaranteed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants