Added test for create-pytorchjob.ipynb python notebook (#2274)
* Added test for create-pytorchjob.ipynb
* fix yaml syntax
* Fix uses path
* Add actions/checkout
* Add bash to action.yaml
* Install pip dependencies step
* Add quotes for args
* Add jupyter
* Add nbformat_minor: 5 to fix invalid format error
* Fix job name
* test papermill-args-yaml
* testing multi line args
* testing multi line args1
* testing multi line args2
* testing multi line args3
* Parameterize sdk install
* Remove unnecessary output
* nbformat normalize
* [SDK] Training Client Conditions related unit tests (#2253)
  * test: add unit test for get_job_conditions function of training client
  * test: add unit test for is_job_created function of training client
  * test: add unit test for is_job_running function of training client
  * test: add unit test for is_job_restarting function of training client
  * test: add unit test for is_job_failed function of training client
  * test: add unit test for is_job_succeeded function of training client
  * test: improve job condition unit tests efficiency
* [SDK] test: add unit test for list_jobs method of the training_client (#2267)
* KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273)
  Generate clientset, informers, listers and open api spec for v2alpha1 APIs.
* [SDK] Use torchrun to create PyTorchJob from function (#2276)
  * [SDK] Use torchrun to create PyTorchJob from function
  * Update PyTorchJob SDK example
  * Add consts for entrypoint
  * Add check for num procs per worker
* [SDK] test: add unit test for get_job_logs method of the training_client (#2275)
* [v2alpha] Move GV related codebase (#2281)
  Move GV related codebase in v2alpha
* KEP-2170: Implement runtime framework (#2248)
  * KEP-2170: Implement runtime framework interfaces
  * Remove grep dependency
  * KEP-2170: Implement ValidateObjects interface to the runtime framework
  * KEP-2170: Expose the TrainingRuntime and ClusterTrainingRuntime Kind
  * KEP-2170: Remove unneeded scheme field from the internal TrainingRuntime
  * Rephrase the error message
  * Distinguish TrainingRuntime and ClusterTrainingRuntime when creating indexes for the TrainJobs
  * Propagate the TrainJob labels and annotations to the JobSet
  * Remove PodAnnotations from the runtime info
  * Implement TrainingRuntime ReplicatedJob validation
  * Add TODO comments
  * Replace queueSuspendedTrainJob with queueSuspendedTrainJobs
* Add DeepSpeed Example with Pytorch Operator (#2235)
* KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283)
  * KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API
  * Rename RuntimeRef in runtime framework
* KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260)
* Upgrade Deepspeed demo dependencies (#2294)
* KEP-2170: Add manifests for Kubeflow Training V2 (#2289)
  * KEP-2170: Add manifests for Kubeflow Training V2
  * Fix invalid name for webhook config in cert
  * Fix integration tests
  * Move kubebuilder markers to runtime framework
  * Use Kubernetes recommended labels
* FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286)
  * FSDP Example with PyTorchJob and T5 Fine-Tuning
  * Modify text
* KEP-2170: Implement TrainJob Reconciler to manage objects (#2295)
  * KEP-2170: Implement TrainJob Reconciler to manage objects
  * Move dep-crds to manifests/external-crds
  * Rename run with runtime
* Remove Prometheus Monitoring doc (#2301)
* KEP-2170: Decouple JobSet from TrainJob (#2296)
* KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304)
* KEP-2170: Initialize runtimes before the manager starts (#2306)
* KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310)
  * Generate SDK models for the Training V2 APIs
  * Create pyproject.toml config
  * Remove comments
  * Fix pre-commit
* KEP-2170: Create model and dataset initializers (#2303)
  * KEP-2170: Create model and dataset initializers
  * Add abstract classes
  * Add storage URI to config
  * Update .gitignore
  * Fix the misspelling for initializer
  * Add .pt and .pth to ignore_patterns
* KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308)
  * KEP-2170: Implement JobSet and PlainML Plugins
  * Fix nil pointer exception for Trainer
  * Fix unit tests in runtime package
  * Fix unit tests
  * Fix integration tests
  * Fix lint
  * Implement Torch Plugin
  * Use list for the Info envs
  * Fix golang ci
  * Fix Torch plugin
  * Use K8s sets; update error return; use ptr.Deref() for nil values
  * Use client.Object for Build() call
  * Remove DeepCopy
  * Remove MLPolicy and PodGroupPolicy from the Info object
  * Inline error
  * Remove SDK jar file
  * Add integration test for Torch plugin
  * Add TODO to calculate PodGroup values in unit tests
  * Revert the change to add original Runtime Policies to Info
  * Create const for the DefaultJobReplicas
  * Check if PodLabels is empty
* KEP-2170: Implement Initializer builders in the JobSet plugin (#2316)
  * KEP-2170: Implement Initializer builder in the JobSet plugin
  * Update the SDK models
  * Remove Info from Initializer builder
  * Update manifests
  * Update pkg/constants/constants.go
  * Use var for envs
  * Remove check manifests from GitHub actions
  * Move consts to JobSet plugin
* KEP-2170: Add the TrainJob state transition design (#2298)
  * KEP-2170: Add the TrainJob state transition design
  * Replace actual jobs with TrainJob
  * Remove the JobSet conditions propagation and Add expanding runtime framework interfaces for each plugin
  * Expand the Creation Failed reasons
  * Rename Completed to Complete
* Update tf job examples to tf v2 (#2270)
  * mnist with summaries updated to TF v2
  * tf_sample updated to TF v2
  * Add mnist_utils and update dist-mnist
  * Add mnist_utils and update dist-mnist
  * Remove old example - estimator-API, this example has been replaced by distribution_strategy
  * Small fix
  * Remove unsupported powerPC dockerfiles
  * Fix typo in copyright
* KEP-2170: Add TrainJob conditions (#2322)
  * KEP-2170: Implement TrainJob conditions
  * Fix API comments
  * Make condition message constants
  * Stop connecting condition type and reason in JobSet plugin
* Pin Gloo repository in JAX Dockerfile to a specific commit (#2329)
  This commit pins the Gloo repository to a specific commit (43b7acbf) in the JAX Dockerfile to prevent build failures caused by a recent bug introduced in the Gloo codebase. By locking the version of Gloo to a known working commit, we ensure that the JAX build remains stable and functional until the issue is resolved upstream.
  The build failure occurs when compiling the gloo/transport/tcp/buffer.cc file due to an undefined __NR_gettid constant, which was introduced after the pinned commit. By using this commit, we bypass the issue and allow the build to complete successfully.
* [fix] Resolve v2alpha API exceptions (#2317)
  Resolve v2alpha API exceptions by adding necessary listType validations.
* Upgrade Kubernetes to v1.30.7 (#2332)
  * Upgrade Kubernetes to v1.30.7
  * Use typed event handlers and predicates in job controllers
  * Re-organize pkg/common/util/reconciler.go
  * Update installation instructions in README
* Ignore cache exporting errors in the image building workflows (#2336)
* KEP-2170: Add Torch Distributed Runtime (#2328)
  * KEP-2170: Add Torch Distributed Runtime
  * Add pip list
* Refine the server-side apply installation args (#2337)
* Add openapi-generator CLI option to skip SDK v2 test generation (#2338)
* Upgrade kustomization files to Kustomize v5 (#2326)
* Pin accelerate package version in trainer (#2340)
  * Pin accelerate package version in trainer
  * include new line to pass pre-commit hook
* Replace papermill command with bash script
* Typo fix
* Move Checkout step outside action.yaml file
* Add newline EOF in script
* Pass python dependencies as args and pin versions
* Update Usage
* Install dependencies in yaml
* fix ipynb
* set bash flags
* Update script args and add more kubernetes versions for tests
* add gang-scheduler-name to template
* move go setup to template
* remove -p parameter from script

---------

Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: Bobbins228 <[email protected]>
Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Syulin7 <[email protected]>
Signed-off-by: Akshay Chitneni <[email protected]>
Signed-off-by: Sophie <[email protected]>
Signed-off-by: yelias <[email protected]>
Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: oksanabaza <[email protected]>
Signed-off-by: Gavrish Prabhu <[email protected]>
Co-authored-by: Mark Campbell <[email protected]>
Co-authored-by: Wei-Cheng Lai <[email protected]>
Co-authored-by: Varsha <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
Co-authored-by: yu lin <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
Co-authored-by: Sophie Hsu <[email protected]>
Co-authored-by: Kevin Hannon <[email protected]>
Co-authored-by: YosiElias <[email protected]>
Co-authored-by: yelias <[email protected]>
Co-authored-by: Sandipan Panda <[email protected]>
Co-authored-by: Antonin Stefanutti <[email protected]>
Co-authored-by: Oksana Bazylieva <[email protected]>
Co-authored-by: Gavrish Prabhu <[email protected]>