
KEP-2170: Kubeflow Training V2 API #2171

Merged — 24 commits merged into kubeflow:master on Aug 6, 2024

Conversation

andreyvelich (Member)

This is the Kubeflow Enhancement Proposal for Kubeflow Training V2 APIs: https://bit.ly/3WzjTlw
Related: #2170

We will collect the final community feedback by the middle of next week and then start the implementation.

Open Questions:

  1. Since we are planning to introduce OutputArtifact to the ModelConfig, how should we support one-off for different model providers (e.g. HF, PVC, S3, etc.)? Another option could be to add OutputArtifact as part of TrainerConfig.

  2. Do we want to keep each parameter for dataset and model providers in our APIs, or should we just introduce a storageUri field and environment variables that users can modify:

    datasetConfig:
      storageUri: hf://yelp_review_full
      env:
        - name: SPLIT
          value: train[:4000]
    --- or ---
    datasetConfig:
      huggingface:
        repoId: yelp_review_full
        split: train[:4000]
  3. Which LLM fine-tuning runtimes do we want to add initially? E.g. Llama-7b, Gemma-7b, Llama-70b.

  4. TrainJobStatus needs to be updated with more info.

/cc @kubeflow/wg-training-leads @kubeflow/kubeflow-steering-committee @kubeflow/wg-data-leads @danielvegamyhre @kannon92 @mimowo @ahg-g @kuizhiqing @sandipanpanda @Electronic-Waste @helenxie-bit @franciscojavierarceo @diegolovison @alculquicondor @zw0610 @lowang-bh @deepak-muley @bigsur0 @Syulin7 @itayvallach @richardsliu @shravan-achar @akshaychitneni @StefanoFioravanzo @zanetworker @nairbv @vsoch @yuzisun @brannondorsey @kellyaa

/hold for review


@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: kubeflow/wg-data-leads, nairbv, danielvegamyhre, sandipanpanda, bigsur0, kannon92, itayvallach, brannondorsey, kellyaa, ahg-g, shravan-achar, vsoch, akshaychitneni, zanetworker, kubeflow/kubeflow-steering-committee, mimowo.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coveralls commented Jul 17, 2024

Pull Request Test Coverage Report for Build 10268126274

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.008%) to 33.477%

Totals Coverage Status
Change from base Build 10252995442: 0.008%
Covered Lines: 3949
Relevant Lines: 11796

💛 - Coveralls

@vsoch left a comment
This is really exciting! I added some comments on question 1 in my notes below, and for question 2:

Do we want to keep each parameter for dataset and model providers in our APIs, or we should just introduce storageUri field and environment variable that user can modify:

My preference is for the storage URI - it's more flexible, it's a design that workflow tools in the HPC community (and even container technologies) have been using, and it's simple, straightforward, and seems to work!

I'll try to ping others on my team for feedback, and I hope we get to talk about this at the next batch wg meeting.
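For concreteness, a storageUri-based `datasetConfig` could look roughly like this across providers (a sketch only - the `s3://` and `pvc://` forms below are assumptions for illustration, not part of the proposal):

```yaml
# Sketch of the storageUri approach reused across providers.
# Only the hf:// form appears in the proposal; the others are illustrative assumptions.
datasetConfig:
  storageUri: hf://yelp_review_full
  env:
    - name: SPLIT
      value: train[:4000]
---
datasetConfig:
  storageUri: s3://my-bucket/datasets/yelp_review_full   # assumed S3 form
---
datasetConfig:
  storageUri: pvc://my-datasets-pvc/yelp_review_full     # assumed PVC form
```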

```yaml
spec:
  trainingRuntimeRef:
    name: pytorch-distributed
  trainerConfig:
```

This would be passed to the scheduler (eventually) - I'm a bit tired so the path there isn't totally clear - I'm guessing it will eventually be turned into units of pods with resource requests.

```yaml
numProcPerNode: 5
numNodes: 5
replicatedJobs:
  - name: Launcher
```

I'm a little worried about this design because MPI doesn't always need a node that is explicitly for a launcher - that's a design the MPI Operator has taken (and it makes sense; it's part of how the MPI design is described), but, for example, the lead broker of the Flux Operator does the launch, with that node handling the same work as a worker.


Follow-up question - could the Flux Operator be handled here? At a high level, it's an MPI launcher (but it is actually an entire HPC cluster that can simply be used to run one MPI job - it's an indexed job + headless service under the hood, with Flux orchestration and nodes connected via a tree-based overlay network with zeromq).

andreyvelich (Member, Author)

I'm a little worried about this design because MPI doesn't always need a node that is explicitly for a launcher

Yes, I know we have an API to run the launcher as a worker in MPI-Operator: https://github.com/kubeflow/mpi-operator/blob/master/pkg/apis/kubeflow/v2beta1/types.go#L161.

I want to hear @tenzen-y's and @alculquicondor's opinions on this. We didn't get a chance to deeply discuss what the MPI Runtime should look like.

andreyvelich (Member, Author)

Follow up question - could the Flux Operator be handled here? High level, it's an MPI launcher (but actually is an entire HPC cluster that simply can be used to run one MPI job - it's an indexed job + headless service under the hood, with flux orchestration and nodes connected via a tree based overlay network with zeromq)

Yes, I think that is possible. Do you know if we can construct the Flux cluster for batch using the JobSet spec, or whether we need to use the Flux Operator mini-cluster spec?

```go
JobSetSpec *batchv1.JobSetSpec `json:",inline"`
```
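As a rough sketch, a runtime template that inlines JobSet fields could look something like this (the API group, kind, and field placement here are assumptions for discussion, not the final API):

```yaml
# Hypothetical runtime template that exposes JobSetSpec fields (replicatedJobs) directly.
apiVersion: kubeflow.org/v2alpha1        # assumed group/version
kind: ClusterTrainingRuntime             # assumed kind
metadata:
  name: flux-mpi
spec:
  template:
    spec:
      replicatedJobs:                    # inlined from JobSetSpec
        - name: node
          template:
            spec:
              completionMode: Indexed
              template:
                spec:
                  containers:
                    - name: flux
                      image: example.com/flux:latest   # placeholder image
```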


The Flux Operator (under the hood) is an indexed Job, so yes, I could very easily turn that into a JobSet. I actually did a design with JobSet early on, but didn't wind up converting entirely, for simple reasons: JobSet wasn't part of Kubernetes core (and would require the user to install an additional thing), JobSet makes the DNS names much longer, and (in my testing) it often led to errors depending on the name of the job that I chose. Since I just needed the indexed Job and then the headless service, it was more straightforward to just implement that.
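For reference, the underlying pattern described above - an indexed Job plus a headless Service - is roughly this (a minimal generic sketch; the names and image are illustrative, not the Flux Operator's actual manifests):

```yaml
# Minimal indexed Job + headless Service pattern.
apiVersion: v1
kind: Service
metadata:
  name: workers
spec:
  clusterIP: None                 # headless: per-pod DNS records instead of a virtual IP
  selector:
    job-name: mini-cluster        # matches the label the Job controller adds to its pods
---
apiVersion: batch/v1
kind: Job
metadata:
  name: mini-cluster
spec:
  completionMode: Indexed         # each pod gets a stable index and hostname
  completions: 4
  parallelism: 4
  template:
    spec:
      subdomain: workers          # pods resolve as mini-cluster-<index>.workers.<ns>.svc
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox          # placeholder image
          command: ["sh", "-c", "echo rank $JOB_COMPLETION_INDEX"]
```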

Please put me down to implement Flux Framework into this setup - the nested hierarchical scheduler can handle a lot of really cool stuff within a single job, and the zeromq bootstrap is an improvement over ssh.

andreyvelich (Member, Author)

It would be interesting to see benchmarks comparing zeromq and ssh for distributed model training. cc @tenzen-y


I can definitely help with that, if there is interest in running experiments. We compared an HPC proxy app (LAMMPS) to the MPI Operator - that was a long time ago (early 2023) - and there was about a 5% difference in means, and you can imagine that would compound with runtime (our runs were very short). That said, both have changed immensely since then!

I have a hard time getting the MPI Operator working, but if we wanted to decide on a workflow, I could offer to set it up and run it in the Flux Operator. We would need to choose a cloud that has a suitable network and resources for the work. If it's ML, then Google would probably work - I have credits currently.

Member

I still think that both MPIJob modes (runLauncherAsWorker and a specific launcher) would be helpful for mixed GPU and CPU clusters. Additionally, the launcher allows us to easily migrate from MPIJob v2 to TrainJob.

andreyvelich (Member, Author)

Temporarily, I will add the runLauncherAsNode API. Let's discuss it again after we have MPI runtimes.
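For illustration, the flag could surface in an MPI runtime roughly like this (a sketch only; the surrounding field names and placement are assumptions, not the final API):

```yaml
# Hypothetical placement of runLauncherAsNode in an MPI training runtime.
apiVersion: kubeflow.org/v2alpha1      # assumed group/version
kind: ClusterTrainingRuntime           # assumed kind
metadata:
  name: mpi-distributed
spec:
  mlPolicy:                            # assumed wrapper for framework-specific policy
    numNodes: 4
    mpi:
      mpiImplementation: OpenMPI
      runLauncherAsNode: true          # launcher counts as a node and runs training work,
                                       # analogous to MPIJob v2's runLauncherAsWorker
```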

@franciscojavierarceo (Contributor) commented Jul 25, 2024

This is the Kubeflow Enhancement Proposal for Kubeflow Training V2 APIs: https://bit.ly/3WzjTlw Related: #2170

We will collect the final community feedback by mid of next week and start the implementation.

Open Questions:

  1. Since we are planning to introduce OutputArtifact to the ModelConfig, how should we support one-off for different model providers (e.g. HF, PVC, S3, etc.) ? Another option could be to add OutputArtifact as part of TrainerConfig.

FWIW I am in favor of adding OutputArtifact to TrainerConfig.

  2. Do we want to keep each parameter for dataset and model providers in our APIs, or should we just introduce a storageUri field and environment variables that users can modify:
    datasetConfig:
      storageUri: hf://yelp_review_full
      env:
        - name: SPLIT
          value: train[:4000]
    --- or ---
    datasetConfig:
      huggingface:
        repoId: yelp_review_full
        split: train[:4000]

I personally like the latter approach rather than an environment variable, but that's just my opinion. 🤷

  3. Which LLM fine-tuning runtimes do we want to add initially? E.g. Llama-7b, Gemma-7b, Llama-70b.

Those seem good. I would also recommend one of the good small ones (e.g., Mistral 7B - DPO).

  4. TrainJobStatus needs to be updated with more info.

👍

@franciscojavierarceo (Contributor) left a comment

Left some small nits, just for readability.

andreyvelich (Member, Author) commented Jul 26, 2024

I personally like the latter approach rather than an environment variable, but that's just my opinion. 🤷

@franciscojavierarceo Can you please check the latest API for model and dataset configs? I recently made some updates based on this 🧵: #2171 (comment)

What LLM Fine-Tuning Runtimes do we want to add initially. E.g. Llama-7b, Gemma-7b, Llama-70b?

I think that, after the initial implementation, we can add blueprints for more LLMs: Mistral, BERT, Llama, etc.

Add runLauncherAsNode parameter

@andreyvelich force-pushed the training-v2-proposal branch from 643598f to 08fec42 on August 6, 2024 11:47
```yaml
  - torchrun train.py
```

Training Operator will create the `PodGroup` using the following spec:
This makes sense. Naive question - does this imply that any custom gang scheduler used is required to create a PodGroup to function? Or, if you have a plugin that doesn't create a PodGroup (and does its own thing), would the group be made anyway? And from the example above, how is the minMember of 5 derived? You would probably need custom logic per scheduler backend for this API. For example, the PodGroups created by coscheduling and volcano.sh are different underlying objects. (https://github.com/kubernetes-sigs/scheduler-plugins/blob/d67d2c15199595ccc6218574bd8e07b68b7146f4/apis/scheduling/v1alpha1/types.go#L128 and https://github.com/volcano-sh/apis/blob/78d912ce096c9fd9120488b5c7b999d4996f2cb5/pkg/apis/scheduling/types.go#L147)


If this Training API is going to take on the challenge of orchestrating the logic for creating the needed PodGroup (varying by plugin), then this probably works OK. Otherwise, I'd still place the responsibility for creating the PodGroup on the user for now, but just allow the schedulerName and other labels to use it in whatever turns into the actual Job or PodSpec.

andreyvelich (Member, Author)

does this imply that any custom gang scheduler used is required to make a PodGroup to function?

The PodGroup creation will be implemented only for supported gang schedulers (initially Volcano and coscheduling), since different schedulers require different PodGroup objects to be created.

And from the example above, how is minMember of 5 derived?

minMember is always equal to numNodes. For ML training, all members of the gang must be alive to execute the training. In the future, if we find other use-cases (e.g. HPC), we can discuss it again.

If this Training API is going to take on the challenge of orchestrating the logic for creating the needed PodGroup (varying by plugin), then this probably works OK

Yeah, that's correct.

Otherwise, I'd still place the responsibility on the user for now for creating the pod group, but just allow the schedulerName and other labels to use it in whatever turns into the actual job or podspec.

Btw, this will also be supported, since users can just omit the PodGroupSpec parameter and set the `.spec.schedulerName` field in the PodSpec.
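For example, for a 5-node TrainJob with the coscheduling plugin, the created object could look roughly like this (a sketch based on the discussion above; the naming is an assumption):

```yaml
# Sketch of a coscheduling PodGroup the operator might create for a 5-node TrainJob.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: my-train-job                  # assumed: derived from the TrainJob name
spec:
  minMember: 5                        # equal to numNodes, per the comment above
  scheduleTimeoutSeconds: 100
```

Pods would then reference the group via the `scheduling.x-k8s.io/pod-group` label, while a Volcano runtime would get a `scheduling.volcano.sh` PodGroup instead.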

@tenzen-y (Member) left a comment

Thanks for the updates!
/lgtm

```yaml
spec:
  scheduleTimeoutSeconds: 100
  minMember: 5
```

That is a great unveiling of the PodGroup specification!

@google-oss-prow bot added the lgtm label on Aug 6, 2024
andreyvelich (Member, Author)

We should be ready to merge this and start the initial implementation.
Feel free to submit follow-up PRs if we need any updates to the KEP.

Thanks to everyone for the comprehensive review of this proposal!
I am looking forward to seeing the adoption of these features 🎉

/hold cancel

@google-oss-prow bot merged commit 53341c9 into kubeflow:master on Aug 6, 2024
38 checks passed
@andreyvelich deleted the training-v2-proposal branch on August 6, 2024 16:57