KEP-2170: Kubeflow Training V2 API #2171
Conversation
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: kubeflow/wg-data-leads, nairbv, danielvegamyhre, sandipanpanda, bigsur0, kannon92, itayvallach, brannondorsey, kellyaa, ahg-g, shravan-achar, vsoch, akshaychitneni, zanetworker, kubeflow/kubeflow-steering-committee, mimowo. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Pull Request Test Coverage Report for Build 10268126274
💛 - Coveralls
This is really exciting! I added some comments on question 1 in my notes below, and for question 2:

> Do we want to keep each parameter for dataset and model providers in our APIs, or we should just introduce storageUri field and environment variable that user can modify:

My preference is for the storage URI - it's more flexible, it's a design that workflow tools in the HPC community (and even container technologies) have been using, and it's simple, straightforward, and seems to work!
I'll try to ping others on my team for feedback, and I hope we get to talk about this at the next batch wg meeting.
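For illustration, a `storageUri`-style TrainJob could look roughly like the following. This is only a sketch for discussion: the `datasetConfig`/`modelConfig` field names, the `hf://` scheme, and the runtime name are assumptions, not the final V2 API.

```yaml
# Hypothetical sketch of a storageUri-based TrainJob; field names and the
# hf:// scheme are assumptions for discussion, not the final V2 API.
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: fine-tune-example
spec:
  trainingRuntimeRef:
    name: torch-tune-llm
  datasetConfig:
    storageUri: hf://tatsu-lab/alpaca   # could equally be s3://..., gcs://..., or pvc://...
    env:
      - name: SPLIT                     # illustrative env override the user can modify
        value: train
  modelConfig:
    storageUri: hf://google/gemma-7b
```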
    spec:
      trainingRuntimeRef:
        name: pytorch-distributed
      trainerConfig:
This would be passed to the scheduler (eventually) - I'm a bit tired so the path there isn't totally clear - I'm guessing it will eventually be turned into units of pods with resource requests.
    numProcPerNode: 5
    numNodes: 5
    replicatedJobs:
      - name: Launcher
I'm a little worried about this design because MPI doesn't always need a node that is explicitly for a launcher - that's a design the MPI Operator has taken (and it makes sense; it's part of how MPI design is described), but, for example, the lead broker of the Flux Operator does the launch while that node handles the same work as a worker.
Follow-up question: could the Flux Operator be handled here? At a high level, it's an MPI launcher (but actually it's an entire HPC cluster that can simply be used to run one MPI job - it's an indexed job + headless service under the hood, with Flux orchestration and nodes connected via a tree-based overlay network with zeromq).
> I'm a little worried about this design because MPI doesn't always need a node that is explicitly for a launcher

Yes, I know we have an API to run the launcher as a worker in MPI-Operator: https://github.com/kubeflow/mpi-operator/blob/master/pkg/apis/kubeflow/v2beta1/types.go#L161.
I want to hear @tenzen-y's and @alculquicondor's opinions on this. We didn't get a chance to deeply discuss how the MPI Runtime should look.
> Follow-up question: could the Flux Operator be handled here? At a high level, it's an MPI launcher (but actually it's an entire HPC cluster that can simply be used to run one MPI job - it's an indexed job + headless service under the hood, with Flux orchestration and nodes connected via a tree-based overlay network with zeromq).

Yes, I think that is possible. Do you know if we can construct the Flux cluster for batch using the JobSet spec, or do we need to use the Flux Operator mini-cluster spec?
JobSetSpec *batchv1.JobSetSpec `json:",inline"`
The Flux Operator (under the hood) is an indexed job, so yes, I could very easily turn that into a JobSet. I actually did a design with JobSet early on, but didn't wind up converting entirely for simple reasons: JobSet wasn't part of Kubernetes core (and would require the user to install an additional thing), it makes the DNS names much longer, and (in my testing) that often led to errors depending on the name of the job I chose. Since I just needed the indexed Job and then the headless service, it was more straightforward to just implement that.
Please put me down to implement Flux Framework into this setup - the nested hierarchical scheduler can handle a lot of really cool stuff within a single job, and the zeromq bootstrap is an improvement over ssh.
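For readers less familiar with that pattern, a bare-bones indexed Job plus headless Service looks roughly like this. This is a generic Kubernetes sketch, not the Flux Operator's actual manifests; the names and image are placeholders.

```yaml
# Generic indexed-Job-plus-headless-Service pattern (placeholder names/image);
# not the Flux Operator's actual manifests.
apiVersion: v1
kind: Service
metadata:
  name: cluster-svc
spec:
  clusterIP: None              # headless: each pod gets a stable DNS entry
  selector:
    job-name: mini-cluster
---
apiVersion: batch/v1
kind: Job
metadata:
  name: mini-cluster
spec:
  completionMode: Indexed      # pods receive JOB_COMPLETION_INDEX 0..N-1
  completions: 4
  parallelism: 4
  template:
    spec:
      subdomain: cluster-svc   # pods resolve as mini-cluster-<index>.cluster-svc
      restartPolicy: Never
      containers:
        - name: node
          image: busybox:1.36
          command: ["sh", "-c", "echo rank $JOB_COMPLETION_INDEX"]
```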
It would be interesting to see benchmarks comparing zeromq vs. ssh for distributed model training. cc @tenzen-y
I can definitely help with that, if there is interest in running experiments. We compared an HPC proxy app (LAMMPS) to the MPI Operator; that was a long time ago (early 2023), and there was about a 5% difference in means, which you can imagine would compound with runtime (our runs were very short). That said, both have changed immensely since then!
I have a hard time getting the MPI Operator working, but if we wanted to decide on a workflow, I could offer to set it up and run it in the Flux Operator. We would need to choose a cloud that has a suitable network and resources for the work. If it's ML, then Google would probably work - I have credits currently.
I still think that both MPIJob modes (runLauncherAsWorker and a dedicated launcher) would be helpful for mixed GPU and CPU clusters. Additionally, the launcher allows us to easily migrate from MPIJob v2 to TrainJob.
Temporarily, I will add the `runLauncherAsNode` API. Let's discuss it again after we have MPI runtimes.
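To make that concrete, a runtime exposing the flag might look something like the sketch below. Only the `runLauncherAsNode` name comes from the discussion above; the surrounding field placement (`mlPolicy`, `mpi`) is an assumption.

```yaml
# Sketch only: field placement is an assumption; runLauncherAsNode is the
# flag proposed above.
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
  name: mpi-distributed
spec:
  mlPolicy:
    numNodes: 5
    mpi:
      numProcPerNode: 5
      runLauncherAsNode: true   # the launcher pod also runs its share of MPI ranks
```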
FWIW I am in favor of adding
I personally like the latter approach rather than an environment variable but just my opinion. 🤷
Those seem good. I would also recommend one of the good small ones (e.g., Mistral 7B - DPO).
👍
Left some small nits, just for readability.
@franciscojavierarceo Can you please check the latest API for model and dataset configs? I recently made some updates based on this 🧵 #2171 (comment)
I think after the initial implementation we can add blueprints for more LLMs: Mistral, BERT, Llama, etc.
        - torchrun train.py

Training Operator will create the `PodGroup` using the following spec:
This makes sense. Naive question - does this imply that any custom gang scheduler used is required to create a `PodGroup` to function? Or, if you have a plugin that doesn't create a `PodGroup` (and does its own thing), would the group be made anyway? And from the example above, how is `minMember` of 5 derived? You would probably need custom logic per scheduler backend for this API. For example, the PodGroup created by coscheduling vs. volcano.sh are different underlying objects (https://github.com/kubernetes-sigs/scheduler-plugins/blob/d67d2c15199595ccc6218574bd8e07b68b7146f4/apis/scheduling/v1alpha1/types.go#L128 and https://github.com/volcano-sh/apis/blob/78d912ce096c9fd9120488b5c7b999d4996f2cb5/pkg/apis/scheduling/types.go#L147).
If this Training API is going to take on the challenge of orchestrating the logic for creating the needed PodGroup (varying by plugin), then this probably works OK. Otherwise, I'd still place the responsibility on the user for now to create the pod group, but just allow the `schedulerName` and other labels to use it in whatever turns into the actual Job or PodSpec.
> does this imply that any custom gang scheduler used is required to create a PodGroup to function?

The PodGroup creation will be implemented only for supported gang schedulers (initially Volcano and Coscheduling), since different schedulers require different PodGroups to be created.

> And from the example above, how is minMember of 5 derived?

`minMember` is always equal to `numNodes`. For ML training, all members of the gang should be alive to execute training. In the future, if we find other use-cases (e.g. HPC), we can discuss it again.

> If this Training API is going to take on the challenge of orchestrating the logic for creating the needed PodGroup (varying by plugin), then this probably works OK

Yeah, that's correct.

> Otherwise, I'd still place the responsibility on the user for now to create the pod group, but just allow the schedulerName and other labels to use it in whatever turns into the actual Job or PodSpec.

Btw, this will also be supported, since users can just ignore the `PodGroupSpec` parameter and set the `.spec.schedulerName` field in the PodSpec.
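As a rough illustration of that second path, a runtime pod template could skip `PodGroupSpec` entirely and point at a user-managed PodGroup. In the sketch below, the label key is the one used by the coscheduling plugin; the group name, scheduler name, and image are placeholders.

```yaml
# Illustrative pod template that relies on a pre-created PodGroup instead of
# PodGroupSpec; group/scheduler names and image are placeholders.
template:
  metadata:
    labels:
      scheduling.x-k8s.io/pod-group: my-precreated-group   # coscheduling plugin label
  spec:
    schedulerName: scheduler-plugins-scheduler
    containers:
      - name: trainer
        image: registry.example.com/trainer:latest
```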
Thanks for the updates!
/lgtm
    spec:
      scheduleTimeoutSeconds: 100
      minMember: 5
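For reference, the excerpt above corresponds to a complete coscheduling `PodGroup` object roughly like the following (scheduler-plugins API group; the metadata name is illustrative).

```yaml
# Complete coscheduling PodGroup corresponding to the excerpt above;
# the metadata name is illustrative.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: torch-distributed-trainjob
spec:
  scheduleTimeoutSeconds: 100
  minMember: 5   # equals numNodes: all pods must be schedulable before any start
```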
That is a great unveiling of the PodGroup specification!
We should be ready to merge this and start the initial implementation. Thanks to everyone for the comprehensive review of this proposal!
/hold cancel
This is the Kubeflow Enhancement Proposal for Kubeflow Training V2 APIs: https://bit.ly/3WzjTlw
Related: #2170
We will collect the final community feedback by the middle of next week and start the implementation.
Open Questions:

1. Since we are planning to introduce `OutputArtifact` to the `ModelConfig`, how should we support it one-off for different model providers (e.g. HF, PVC, S3, etc.)? Another option could be to add `OutputArtifact` as part of `TrainerConfig`.
2. Do we want to keep each parameter for dataset and model providers in our APIs, or should we just introduce a `storageUri` field and an environment variable that the user can modify?
3. What LLM fine-tuning runtimes do we want to add initially? E.g. Llama-7b, Gemma-7b, Llama-70b?
4. `TrainJobStatus` needs to be updated with more info.

/cc @kubeflow/wg-training-leads @kubeflow/kubeflow-steering-committee @kubeflow/wg-data-leads @danielvegamyhre @kannon92 @mimowo @ahg-g @kuizhiqing @sandipanpanda @Electronic-Waste @helenxie-bit @franciscojavierarceo @diegolovison @alculquicondor @zw0610 @lowang-bh @deepak-muley @bigsur0 @Syulin7 @itayvallach @richardsliu @shravan-achar @akshaychitneni @StefanoFioravanzo @zanetworker @nairbv @vsoch @yuzisun @brannondorsey @kellyaa
/hold for review