Commit

Added test for create-pytorchjob.ipynb python notebook (#2274)
* Added test for create-pytorchjob.ipynb

Signed-off-by: sailesh duddupudi <[email protected]>

* fix yaml syntax

Signed-off-by: sailesh duddupudi <[email protected]>

* Fix uses path

Signed-off-by: sailesh duddupudi <[email protected]>

* Add actions/checkout

Signed-off-by: sailesh duddupudi <[email protected]>

* Add bash to action.yaml

Signed-off-by: sailesh duddupudi <[email protected]>

* Install pip dependencies step

Signed-off-by: sailesh duddupudi <[email protected]>

* Add quotes for args

Signed-off-by: sailesh duddupudi <[email protected]>

* Add jupyter

Signed-off-by: sailesh duddupudi <[email protected]>

* Add nbformat_minor: 5 to fix invalid format error

Signed-off-by: sailesh duddupudi <[email protected]>

* Fix job name

Signed-off-by: sailesh duddupudi <[email protected]>

* test papermill-args-yaml

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args1

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args2

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args3

Signed-off-by: sailesh duddupudi <[email protected]>

* Parameterize sdk install

Signed-off-by: sailesh duddupudi <[email protected]>

* Remove unnecessary output

Signed-off-by: sailesh duddupudi <[email protected]>

* nbformat normalize

Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] Training Client Conditions related unit tests (#2253)

* test: add unit test for get_job_conditions function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_created function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_running function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_restarting function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_failed function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_succeeded function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: improve job condition unit tests efficiency

Signed-off-by: Bobbins228 <[email protected]>

---------

Signed-off-by: Bobbins228 <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] test: add unit test for list_jobs method of the training_client (#2267)

Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273)

Generate clientset, informers, listers and open api spec
for v2alpha1 APIs.

Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] Use torchrun to create PyTorchJob from function (#2276)

* [SDK] Use torchrun to create PyTorchJob from function

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update PyTorchJob SDK example

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add consts for entrypoint

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add check for num procs per worker

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] test: add unit test for get_job_logs method of the training_client (#2275)

Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [v2alpha] Move GV related codebase (#2281)

Move GV related codebase in v2alpha

Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement runtime framework (#2248)

* KEP-2170: Implement runtime framework interfaces

Signed-off-by: Yuki Iwai <[email protected]>

* Remove grep dependency

Signed-off-by: Yuki Iwai <[email protected]>

* KEP-2170: Implement ValidateObjects interface to the runtime framework

Signed-off-by: Yuki Iwai <[email protected]>

* KEP-2170: Expose the TrainingRuntime and ClusterTrainingRuntime Kind

Signed-off-by: Yuki Iwai <[email protected]>

* KEP-2170: Remove unneeded scheme field from the internal TrainingRuntime

Signed-off-by: Yuki Iwai <[email protected]>

* Rephrase the error message

Signed-off-by: Yuki Iwai <[email protected]>

* Distinguish TrainingRuntime and ClusterTrainingRuntime when creating indexes for the TrainJobs

Signed-off-by: Yuki Iwai <[email protected]>

* Propagate the TrainJob labels and annotations to the JobSet

Signed-off-by: Yuki Iwai <[email protected]>

* Remove PodAnnotations from the runtime info

Signed-off-by: Yuki Iwai <[email protected]>

* Implement TrainingRuntime ReplicatedJob validation

Signed-off-by: Yuki Iwai <[email protected]>

* Add TODO comments

Signed-off-by: Yuki Iwai <[email protected]>

* Replace queueSuspendedTrainJob with queueSuspendedTrainJobs

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Add DeepSpeed Example with Pytorch Operator (#2235)

Signed-off-by: Syulin7 <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283)

* KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API

Signed-off-by: Andrey Velichkevich <[email protected]>

* Rename RuntimeRef in runtime framework

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260)

Signed-off-by: Akshay Chitneni <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Upgrade Deepspeed demo dependencies (#2294)

Signed-off-by: Syulin7 <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add manifests for Kubeflow Training V2 (#2289)

* KEP-2170: Add manifests for Kubeflow Training V2

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix invalid name for webhook config in cert

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix integration tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Move kubebuilder markers to runtime framework

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use Kubernetes recommended labels

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286)

* FSDP Example with PyTorchJob and T5 Fine-Tuning

Signed-off-by: Andrey Velichkevich <[email protected]>

* Modify text

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement TrainJob Reconciler to manage objects (#2295)

* KEP-2170: Implement TrainJob Reconciler to manage objects

Signed-off-by: Yuki Iwai <[email protected]>

* Move dep-crds to manifests/external-crds

Signed-off-by: Yuki Iwai <[email protected]>

* Rename run with runtime

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Remove Prometheus Monitoring doc (#2301)

Signed-off-by: Sophie <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Decouple JobSet from TrainJob (#2296)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Initialize runtimes before the manager starts (#2306)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310)

* Generate SDK models for the Training V2 APIs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Create pyproject.toml config

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove comments

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix pre-commit

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Create model and dataset initializers (#2303)

* KEP-2170: Create model and dataset initializers

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add abstract classes

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add storage URI to config

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update .gitignore

Co-authored-by: Kevin Hannon <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix the misspelling for initializer

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add .pt and .pth to ignore_patterns

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Kevin Hannon <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308)

* KEP-2170: Implement JobSet and PlainML Plugins

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix nil pointer exception for Trainer

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix unit tests in runtime package

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix unit tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix integration tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix lint

Signed-off-by: Andrey Velichkevich <[email protected]>

* Implement Torch Plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use list for the Info envs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix golang ci

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix Torch plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use K8s sets
Update error return
Use ptr.Deref() for nil values

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use client.Object for Build() call

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove DeepCopy

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove MLPolicy and PodGroupPolicy from the Info object

Signed-off-by: Andrey Velichkevich <[email protected]>

* Inline error

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove SDK jar file

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add integration test for Torch plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add TODO to calculate PodGroup values in unit tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Revert the change to add original Runtime Policies to Info

Signed-off-by: Andrey Velichkevich <[email protected]>

* Create const for the DefaultJobReplicas

Signed-off-by: Andrey Velichkevich <[email protected]>

* Check if PodLabels is empty

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement Initializer builders in the JobSet plugin  (#2316)

* KEP-2170: Implement Initializer builder in the JobSet plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update the SDK models

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove Info from Initializer builder

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update manifests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update pkg/constants/constants.go

Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

* Use var for envs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove check manifests from GitHub actions

Signed-off-by: Andrey Velichkevich <[email protected]>

* Move consts to JobSet plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add the TrainJob state transition design (#2298)

* KEP-2170: Add the TrainJob state transition design

Signed-off-by: Yuki Iwai <[email protected]>

* Replace actual jobs with TrainJob

Signed-off-by: Yuki Iwai <[email protected]>

* Remove the JobSet conditions propagation and Add expanding runtime framework interfaces for each plugin

Signed-off-by: Yuki Iwai <[email protected]>

* Expand the Creation Failed reasons

Signed-off-by: Yuki Iwai <[email protected]>

* Rename Completed to Complete

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Update tf job examples to tf v2 (#2270)

* mnist with summaries updated to TF v2

Signed-off-by: yelias <[email protected]>

* tf_sample updated to TF v2

Signed-off-by: yelias <[email protected]>

* Add mnist_utils and update dist-mnist

Signed-off-by: yelias <[email protected]>

* Add mnist_utils and update dist-mnist

Signed-off-by: yelias <[email protected]>

* Remove old example - estimator-API, this example has been replaced by distribution_strategy

Signed-off-by: yelias <[email protected]>

* Small fix

Signed-off-by: yelias <[email protected]>

* Remove unsupported powerPC dockerfiles

Signed-off-by: yelias <[email protected]>

* Fix typo in copyright

Signed-off-by: yelias <[email protected]>

---------

Signed-off-by: yelias <[email protected]>
Co-authored-by: yelias <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add TrainJob conditions (#2322)

* KEP-2170: Implement TrainJob conditions

Signed-off-by: Yuki Iwai <[email protected]>

* Fix API comments

Signed-off-by: Yuki Iwai <[email protected]>

* Make condition message constants

Signed-off-by: Yuki Iwai <[email protected]>

* Stop connecting condition type and reason in JobSet plugin

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Pin Gloo repository in JAX Dockerfile to a specific commit (#2329)

This commit pins the Gloo repository to a specific commit (43b7acbf) in
the JAX Dockerfile to prevent build failures caused by a recent bug
introduced in the Gloo codebase. By locking the version of Gloo to
a known working commit, we ensure that the JAX build remains stable and
functional until the issue is resolved upstream.

The build failure occurs when compiling the gloo/transport/tcp/buffer.cc
file due to an undefined __NR_gettid constant, which was introduced
after the pinned commit. By using this commit, we bypass the issue and
allow the build to complete successfully.

Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [fix] Resolve v2alpha API exceptions (#2317)

Resolve v2alpha API exceptions by adding necessary listType validations.

Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Upgrade Kubernetes to v1.30.7 (#2332)

* Upgrade Kubernetes to v1.30.7

Signed-off-by: Antonin Stefanutti <[email protected]>

* Use typed event handlers and predicates in job controllers

Signed-off-by: Antonin Stefanutti <[email protected]>

* Re-organize pkg/common/util/reconciler.go

Signed-off-by: Antonin Stefanutti <[email protected]>

* Update installation instructions in README

Signed-off-by: Antonin Stefanutti <[email protected]>

---------

Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Ignore cache exporting errors in the image building workflows (#2336)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add Torch Distributed Runtime (#2328)

* KEP-2170: Add Torch Distributed Runtime

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add pip list

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Refine the server-side apply installation args (#2337)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Add openapi-generator CLI option to skip SDK v2 test generation (#2338)

Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Upgrade kustomization files to Kustomize v5 (#2326)

Signed-off-by: oksanabaza <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Pin accelerate package version in trainer (#2340)

* Pin accelerate package version in trainer

Signed-off-by: Gavrish Prabhu <[email protected]>

* include new line to pass pre-commit hook

Signed-off-by: Gavrish Prabhu <[email protected]>

---------

Signed-off-by: Gavrish Prabhu <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Replace papermill command with bash script

Signed-off-by: sailesh duddupudi <[email protected]>

* Typo fix

Signed-off-by: sailesh duddupudi <[email protected]>

* Move Checkout step outside action.yaml file

Signed-off-by: sailesh duddupudi <[email protected]>

* Add newline EOF in script

Signed-off-by: sailesh duddupudi <[email protected]>

* Pass python dependencies as args and pin versions

Signed-off-by: sailesh duddupudi <[email protected]>

* Update Usage

Signed-off-by: sailesh duddupudi <[email protected]>

* Install dependencies in yaml

Signed-off-by: sailesh duddupudi <[email protected]>

* fix ipynb

Signed-off-by: sailesh duddupudi <[email protected]>

* set bash flags

Signed-off-by: sailesh duddupudi <[email protected]>

* Update script args and add more kubernetes versions for tests

Signed-off-by: sailesh duddupudi <[email protected]>

* add gang-scheduler-name to  template

Signed-off-by: sailesh duddupudi <[email protected]>

* move go setup to template

Signed-off-by: sailesh duddupudi <[email protected]>

* remove -p parameter from script

Signed-off-by: sailesh duddupudi <[email protected]>

---------

Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: Bobbins228 <[email protected]>
Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Syulin7 <[email protected]>
Signed-off-by: Akshay Chitneni <[email protected]>
Signed-off-by: Sophie <[email protected]>
Signed-off-by: yelias <[email protected]>
Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: oksanabaza <[email protected]>
Signed-off-by: Gavrish Prabhu <[email protected]>
Co-authored-by: Mark Campbell <[email protected]>
Co-authored-by: Wei-Cheng Lai <[email protected]>
Co-authored-by: Varsha <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
Co-authored-by: yu lin <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
Co-authored-by: Sophie Hsu <[email protected]>
Co-authored-by: Kevin Hannon <[email protected]>
Co-authored-by: YosiElias <[email protected]>
Co-authored-by: yelias <[email protected]>
Co-authored-by: Sandipan Panda <[email protected]>
Co-authored-by: Antonin Stefanutti <[email protected]>
Co-authored-by: Oksana Bazylieva <[email protected]>
Co-authored-by: Gavrish Prabhu <[email protected]>
17 people authored Dec 9, 2024
1 parent 2392c36 commit 56cbe60
Showing 5 changed files with 204 additions and 51 deletions.
36 changes: 4 additions & 32 deletions .github/workflows/integration-tests.yaml
@@ -58,40 +58,12 @@ jobs:
       - name: Checkout
         uses: actions/checkout@v4

-      - name: Free-Up Disk Space
-        uses: ./.github/workflows/free-up-disk-space
-
-      - name: Setup Python
-        uses: actions/setup-python@v5
+      - name: Setup E2E Tests
+        uses: ./.github/workflows/setup-e2e-test
         with:
+          kubernetes-version: ${{ matrix.kubernetes-version }}
           python-version: ${{ matrix.python-version }}
-
-      - name: Setup Go
-        uses: actions/setup-go@v5
-        with:
-          go-version-file: go.mod
-
-      - name: Create k8s Kind Cluster
-        uses: helm/kind-action@9fdad0686e6f19fcd572f62516f5e0436f562ee7
-        with:
-          node_image: kindest/node:${{ matrix.kubernetes-version }}
-          cluster_name: training-operator-cluster
-          kubectl_version: ${{ matrix.kubernetes-version }}
-
-      - name: Build training-operator
-        run: |
-          ./scripts/gha/build-image.sh
-        env:
-          TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
-
-      - name: Deploy training operator
-        run: |
-          ./scripts/gha/setup-training-operator.sh
-        env:
-          KIND_CLUSTER: training-operator-cluster
-          TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
-          GANG_SCHEDULER_NAME: ${{ matrix.gang-scheduler-name }}
-          KUBERNETES_VERSION: ${{ matrix.kubernetes-version }}
+          gang-scheduler-name: ${{ matrix.gang-scheduler-name }}

       - name: Run tests
         run: |
57 changes: 57 additions & 0 deletions .github/workflows/setup-e2e-test/action.yaml
@@ -0,0 +1,57 @@
name: Setup E2E test template
description: A composite action to setup e2e tests

inputs:
  kubernetes-version:
    required: true
    description: Kubernetes version
  python-version:
    required: true
    description: Python version
  gang-scheduler-name:
    required: false
    default: "none"
    description: Gang scheduler name

runs:
  using: composite
  steps:
    - name: Free-Up Disk Space
      uses: ./.github/workflows/free-up-disk-space

    - name: Setup Python
      uses: actions/setup-python@v5
      with:
        python-version: ${{ inputs.python-version }}

    - name: Setup Go
      uses: actions/setup-go@v5
      with:
        go-version-file: go.mod

    - name: Create k8s Kind Cluster
      uses: helm/kind-action@9fdad0686e6f19fcd572f62516f5e0436f562ee7
      with:
        node_image: kindest/node:${{ inputs.kubernetes-version }}
        cluster_name: training-operator-cluster
        kubectl_version: ${{ inputs.kubernetes-version }}

    - name: Build training-operator
      shell: bash
      run: |
        ./scripts/gha/build-image.sh
      env:
        TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test

    - name: Deploy training operator
      shell: bash
      run: |
        ./scripts/gha/setup-training-operator.sh
        docker system prune -a -f
        docker system df
        df -h
      env:
        KIND_CLUSTER: training-operator-cluster
        TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
        GANG_SCHEDULER_NAME: ${{ inputs.gang-scheduler-name }}
        KUBERNETES_VERSION: ${{ inputs.kubernetes-version }}
39 changes: 39 additions & 0 deletions .github/workflows/test-example-notebooks.yaml
@@ -0,0 +1,39 @@
name: Test example notebooks

on:
  - pull_request

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  create-pytorchjob-notebook-test:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.28.7", "v1.29.2", "v1.30.6"]
        python-version: ["3.9", "3.10", "3.11"]
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup E2E Tests
        uses: ./.github/workflows/setup-e2e-test
        with:
          kubernetes-version: ${{ matrix.kubernetes-version }}
          python-version: ${{ matrix.python-version }}

      - name: Install Python Dependencies
        run: |
          pip install papermill==2.6.0 jupyter==1.1.1 ipykernel==6.29.5
      - name: Run Jupyter Notebook with Papermill
        shell: bash
        run: |
          ./scripts/run-notebook.sh \
            -i ./examples/pytorch/image-classification/create-pytorchjob.ipynb \
            -n default \
            -k ./sdk/python
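The workflow's two-axis matrix expands into one job per combination of Kubernetes and Python version. As a rough illustration only (the helper name `expand_matrix` is hypothetical and not part of the commit), the expansion can be sketched as a Cartesian product:

```python
from itertools import product

# Values copied from the workflow's strategy.matrix.
KUBERNETES_VERSIONS = ["v1.28.7", "v1.29.2", "v1.30.6"]
PYTHON_VERSIONS = ["3.9", "3.10", "3.11"]


def expand_matrix(kubernetes_versions, python_versions):
    """Return one dict per job: every pairing of the two matrix axes.

    Hypothetical helper for illustration; GitHub Actions performs this
    expansion itself when it reads `strategy.matrix`.
    """
    return [
        {"kubernetes-version": k8s, "python-version": py}
        for k8s, py in product(kubernetes_versions, python_versions)
    ]


if __name__ == "__main__":
    jobs = expand_matrix(KUBERNETES_VERSIONS, PYTHON_VERSIONS)
    print(f"{len(jobs)} jobs")
    for job in jobs:
        print(job)
```

Because `fail-fast` is `false`, a failure in one combination does not cancel the others, so each pairing reports its own result.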
52 changes: 33 additions & 19 deletions examples/pytorch/image-classification/create-pytorchjob.ipynb
@@ -24,6 +24,20 @@
     "The notebook shows how to use Kubeflow Training SDK to create, get, wait, check and delete PyTorchJob."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": [
+     "parameters"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "training_python_sdk='kubeflow-training'\n",
+    "namespace='kubeflow-user-example-com'"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {
@@ -42,12 +56,13 @@
    "outputs": [],
    "source": [
     "# TODO (andreyvelich): Change to release version when SDK with the new APIs is published.\n",
-    "!pip install git+https://github.com/kubeflow/training-operator.git#subdirectory=sdk/python"
+    "# Install Kubeflow Python SDK\n",
+    "!pip install {training_python_sdk}"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -93,7 +108,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -102,12 +117,11 @@
    "outputs": [],
    "source": [
     "name = \"pytorch-dist-mnist-gloo\"\n",
-    "namespace = \"kubeflow-user-example-com\"\n",
     "container_name = \"pytorch\"\n",
     "\n",
     "container = V1Container(\n",
     " name=container_name,\n",
-    " image=\"gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0\",\n",
+    " image=\"kubeflow/pytorch-dist-mnist:latest\",\n",
     " args=[\"--backend\", \"gloo\"],\n",
     ")\n",
     "\n",
@@ -157,7 +171,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -176,8 +190,8 @@
     "# Namespace will be reused in every APIs.\n",
     "training_client = TrainingClient(namespace=namespace)\n",
     "\n",
-    "# If `job_kind` is not set in `TrainingClient`, we need to set it for each API.\n",
-    "training_client.create_job(pytorchjob, job_kind=constants.PYTORCHJOB_KIND)"
+    "# `job_kind` is set in `TrainingClient`\n",
+    "training_client.create_job(pytorchjob)"
    ]
   },
   {
@@ -195,7 +209,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -214,7 +228,7 @@
     }
    ],
    "source": [
-    "training_client.get_job(name, job_kind=constants.PYTORCHJOB_KIND).metadata.name"
+    "training_client.get_job(name).metadata.name"
    ]
   },
   {
@@ -230,7 +244,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -260,7 +274,7 @@
     }
    ],
    "source": [
-    "training_client.get_job_conditions(name=name, job_kind=constants.PYTORCHJOB_KIND)"
+    "training_client.get_job_conditions(name=name)"
    ]
   },
   {
@@ -276,7 +290,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -302,7 +316,7 @@
     }
    ],
    "source": [
-    "pytorchjob = training_client.wait_for_job_conditions(name=name, job_kind=constants.PYTORCHJOB_KIND)\n",
+    "pytorchjob = training_client.wait_for_job_conditions(name=name)\n",
     "\n",
     "print(f\"Succeeded number of replicas: {pytorchjob.status.replica_statuses['Master'].succeeded}\")"
    ]
@@ -320,7 +334,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -339,7 +353,7 @@
     }
    ],
    "source": [
-    "training_client.is_job_succeeded(name=name, job_kind=constants.PYTORCHJOB_KIND)"
+    "training_client.is_job_succeeded(name=name)"
    ]
   },
   {
@@ -355,7 +369,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
@@ -476,7 +490,7 @@
     }
    ],
    "source": [
-    "training_client.get_job_logs(name=name, job_kind=constants.PYTORCHJOB_KIND)"
+    "training_client.get_job_logs(name=name)"
    ]
   },
   {
@@ -492,7 +506,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": null,
    "metadata": {
     "pycharm": {
      "name": "#%%\n"
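The new cell tagged `parameters` is what lets papermill override `training_python_sdk` and `namespace` at run time: papermill looks for that tag and places a cell with the overriding values directly after it. The sketch below imitates that behaviour on a plain notebook dict using only the standard library. The function name `inject_parameters` is hypothetical, and this is a simplified illustration under those assumptions, not papermill's actual implementation.

```python
import copy


def inject_parameters(notebook: dict, parameters: dict) -> dict:
    """Return a copy of `notebook` with an override cell after the "parameters" cell.

    Hypothetical helper: a simplified imitation of papermill's parameter
    injection, not its real code.
    """
    nb = copy.deepcopy(notebook)
    cells = nb["cells"]
    # Index of the first cell tagged "parameters", or -1 when there is none
    # (in which case the override cell goes to the top of the notebook).
    index = next(
        (
            i
            for i, cell in enumerate(cells)
            if "parameters" in cell.get("metadata", {}).get("tags", [])
        ),
        -1,
    )
    injected = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": ["injected-parameters"]},
        "outputs": [],
        "source": [f"{name} = {value!r}\n" for name, value in parameters.items()],
    }
    cells.insert(index + 1, injected)
    return nb


if __name__ == "__main__":
    example = {
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {"tags": ["parameters"]},
                "outputs": [],
                "source": ["namespace='kubeflow-user-example-com'"],
            },
        ]
    }
    result = inject_parameters(example, {"namespace": "default"})
    print(result["cells"][2]["source"])
```

Since the injected cell runs after the defaults, the notebook still works unmodified when opened interactively, and the CI run simply rebinds the two variables.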
71 changes: 71 additions & 0 deletions scripts/run-notebook.sh
@@ -0,0 +1,71 @@
#!/bin/bash

# Copyright 2024 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This bash script is used to run the example notebooks

set -o errexit
set -o nounset
set -o pipefail

NOTEBOOK_INPUT=""
NOTEBOOK_OUTPUT="-" # outputs to console
NAMESPACE="default"
TRAINING_PYTHON_SDK="./sdk/python"

usage() {
  echo "Usage: $0 -i <input_notebook> [-o <output_notebook>] [-k <training_python_sdk>] [-n <namespace>]"
  echo "Options:"
  echo "  -i Input notebook (required)"
  echo "  -o Output notebook (optional, defaults to '-' which prints to the console)"
  echo "  -k Kubeflow Training Operator Python SDK (optional)"
  echo "  -n Kubernetes namespace used by tests (optional)"
  echo "  -h Show this help message"
  echo "NOTE: papermill, jupyter and ipykernel are required Python dependencies to run Notebooks"
  exit 1
}

# Only the options handled below are declared; -h takes no argument.
while getopts "i:o:k:n:h" opt; do
  case "$opt" in
    i) NOTEBOOK_INPUT="$OPTARG" ;;      # -i for notebook input path
    o) NOTEBOOK_OUTPUT="$OPTARG" ;;     # -o for notebook output path
    k) TRAINING_PYTHON_SDK="$OPTARG" ;; # -k for training operator python sdk
    n) NAMESPACE="$OPTARG" ;;           # -n for kubernetes namespace used by tests
    h) usage ;;                         # -h for help (usage)
    *) usage ;;                         # usage exits with status 1
  esac
done

if [ -z "$NOTEBOOK_INPUT" ]; then
  echo "Error: -i notebook input path is required." >&2
  exit 1
fi

if ! command -v papermill &> /dev/null; then
  echo "Error: papermill is not installed. Please install papermill to proceed." >&2
  exit 1
fi

# Build the command as an array so that arguments containing spaces stay intact.
papermill_cmd=(papermill "$NOTEBOOK_INPUT" "$NOTEBOOK_OUTPUT" -p training_python_sdk "$TRAINING_PYTHON_SDK" -p namespace "$NAMESPACE")

echo "Running command: ${papermill_cmd[*]}"

# errexit is set, so the command is tested directly; a later `$?` check would never run on failure.
if ! "${papermill_cmd[@]}"; then
  echo "Error: papermill execution failed." >&2
  exit 1
fi

echo "Notebook execution completed successfully"
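For readers who prefer to drive the same run from Python, the script's behaviour can be approximated as below. This is a sketch under assumptions: the function names `build_papermill_command` and `run_notebook` are hypothetical, the defaults mirror the variables at the top of the script, and only the `papermill <input> <output> -p <name> <value>` command-line form that the script itself uses is relied on.

```python
import shlex
import subprocess


def build_papermill_command(
    notebook_input,
    notebook_output="-",  # "-" sends the executed notebook to the console
    training_python_sdk="./sdk/python",
    namespace="default",
):
    """Return the papermill argument list that the shell script assembles.

    Hypothetical helper; defaults mirror the script's variables.
    """
    return [
        "papermill",
        notebook_input,
        notebook_output,
        "-p", "training_python_sdk", training_python_sdk,
        "-p", "namespace", namespace,
    ]


def run_notebook(notebook_input, **options):
    """Run the notebook and raise CalledProcessError if papermill fails."""
    command = build_papermill_command(notebook_input, **options)
    print("Running command:", shlex.join(command))
    subprocess.run(command, check=True)
    print("Notebook execution completed successfully")


if __name__ == "__main__":
    # Printing only; call run_notebook(...) where papermill is installed.
    print(shlex.join(build_papermill_command(
        "./examples/pytorch/image-classification/create-pytorchjob.ipynb"
    )))
```

Passing the arguments as a list avoids the word-splitting problems that arise when a command is kept in a single string.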