Add Kueue support for Kubeflow Notebook #126

Open
varshaprasad96 opened this issue Nov 13, 2024 · 0 comments

This issue is cross-referenced from kubernetes-sigs/kueue#3352. Its purpose is to clarify integration-related questions from the notebook controller side.

Brief Overview:

Notebooks, especially GPU-enabled ones, can demand substantial resources, much like other ML batch workloads. Managing them through Kueue allows Notebooks to be scheduled more efficiently within the cluster's resources. With this feature, Kueue would handle quota-based scheduling of Notebook (NB) resources, while the lifecycle of the NB resource itself would remain the responsibility of the NB controller. Ideally, the NB controller's current features and responsibilities should not change.

Additional References:
Because the Notebook v1beta APIs currently use StatefulSets underneath to manage pods, the easier path is to run Kueue with the StatefulSet and Pod integrations enabled so that it can manage NB workloads.
Documentation on Kueue integrations: https://kueue.sigs.k8s.io/docs/tasks/run/statefulset/
Enable scaling of pods belonging to a StatefulSet (SS): kubernetes-sigs/kueue#3487

Note: The intention is to make Kueue compatible with both the v1 and v2 Notebook APIs.

When testing Kueue with the above integrations enabled, NB pods can be queued in a LocalQueue based on resource quota.
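To make the test setup concrete, here is a minimal Python sketch of attaching Kueue's `kueue.x-k8s.io/queue-name` label to a Notebook manifest before it is applied. The label key is the one Kueue uses to select a LocalQueue; everything else (the `with_queue_label` helper, the notebook name, namespace, image, and the `user-queue` queue name) is hypothetical. Whether the label has to sit on the Notebook, the generated StatefulSet, or the pod template is not settled in this issue, so the sketch only sets `metadata.labels` on whatever manifest it is given.

```python
import copy

# Label key Kueue uses to associate a workload with a LocalQueue.
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def with_queue_label(manifest, queue_name):
    """Return a copy of `manifest` with the Kueue queue-name label set.

    Hypothetical helper: it only touches metadata.labels and leaves the
    original manifest unmodified.
    """
    labeled = copy.deepcopy(manifest)
    labels = labeled.setdefault("metadata", {}).setdefault("labels", {})
    labels[QUEUE_LABEL] = queue_name
    return labeled


# Illustrative Notebook manifest; all names and values are placeholders.
notebook = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "Notebook",
    "metadata": {"name": "example-notebook", "namespace": "example-ns"},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "example-notebook",
                        "image": "example.invalid/notebook:latest",
                        "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
                    }
                ]
            }
        }
    },
}

labeled = with_queue_label(notebook, "user-queue")
```

The resulting dictionary could be serialized to YAML or JSON and applied with the usual tooling. Returning a copy keeps the original manifest reusable for submitting to a different queue.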

Open Question:

  1. If a Notebook is preempted by Kueue, should the notebook-controller be modified to add a finalizer that performs backups? Is handling backups the NB controller's responsibility at all? Preemption need not come from Kueue; the underlying pod could also be preempted by the Kubernetes scheduler. Or is it reasonable to assume that NBs always use persistent volumes to store data?
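
One way to reason about the persistent-volume assumption is to inspect a Notebook's pod template and see which volumes would not survive preemption. The sketch below is only an illustration of that check, not a proposed controller change: the helper names and the sample manifest are hypothetical, and it assumes the pod spec lives under `spec.template.spec` as in the Notebook CRD. Data written to the container's own filesystem is ephemeral regardless of what this check reports.

```python
def pod_volumes(notebook):
    """Return the list of volumes declared in the Notebook's pod template."""
    pod_spec = notebook.get("spec", {}).get("template", {}).get("spec", {})
    return pod_spec.get("volumes", [])


def has_pvc(notebook):
    """True if at least one volume is backed by a PersistentVolumeClaim."""
    return any("persistentVolumeClaim" in v for v in pod_volumes(notebook))


def non_pvc_volumes(notebook):
    """Names of volumes that are not backed by a PersistentVolumeClaim."""
    return [
        v.get("name", "<unnamed>")
        for v in pod_volumes(notebook)
        if "persistentVolumeClaim" not in v
    ]


# Illustrative manifest: one PVC-backed workspace and one emptyDir scratch area.
sample_notebook = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "Notebook",
    "metadata": {"name": "example-notebook"},
    "spec": {
        "template": {
            "spec": {
                "volumes": [
                    {
                        "name": "workspace",
                        "persistentVolumeClaim": {"claimName": "example-pvc"},
                    },
                    {"name": "scratch", "emptyDir": {}},
                ]
            }
        }
    },
}

at_risk = non_pvc_volumes(sample_notebook)
persistent = has_pvc(sample_notebook)
```

In this sample, `scratch` is the volume whose contents would be lost if the pod were preempted, whether by Kueue or by the Kubernetes scheduler, while `workspace` would be reattached when the pod is recreated.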