Added test for create-pytorchjob.ipynb python notebook #2274

Merged
Changes from 65 commits
Commits
66 commits
c90bbaf
Added test for create-pytorchjob.ipynb
saileshd1402 Sep 29, 2024
f8fd24c
fix yaml syntax
saileshd1402 Sep 29, 2024
89023ce
Fix uses path
saileshd1402 Sep 29, 2024
62be575
Add actions/checkout
saileshd1402 Sep 29, 2024
9ea7155
Add bash to action.yaml
saileshd1402 Sep 29, 2024
da99ec8
Install pip dependencies step
saileshd1402 Sep 29, 2024
4595f32
Add quotes for args
saileshd1402 Sep 29, 2024
8b744b1
Add jupyter
saileshd1402 Sep 29, 2024
c6d1925
Add nbformat_minor: 5 to fix invalid format error
saileshd1402 Sep 29, 2024
1124ee8
Fix job name
saileshd1402 Sep 29, 2024
f882cf3
test papermill-args-yaml
saileshd1402 Sep 29, 2024
5494fb1
testing multi line args
saileshd1402 Sep 29, 2024
eb7c4be
testing multi line args1
saileshd1402 Sep 29, 2024
93b6c66
testing multi line args2
saileshd1402 Sep 29, 2024
e5aca68
testing multi line args3
saileshd1402 Sep 29, 2024
c8b1aff
Parameterize sdk install
saileshd1402 Sep 29, 2024
9145412
Remove unnecessary output
saileshd1402 Sep 29, 2024
e704b7f
nbformat normailze
saileshd1402 Sep 29, 2024
dc6a517
[SDK] Training Client Conditions related unit tests (#2253)
Bobbins228 Sep 30, 2024
c0b64e0
[SDK] test: add unit test for list_jobs method of the training_client…
seanlaii Oct 3, 2024
2e7d3c2
KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273)
varshaprasad96 Oct 10, 2024
040ba8f
[SDK] Use torchrun to create PyTorchJob from function (#2276)
andreyvelich Oct 11, 2024
f20969b
[SDK] test: add unit test for get_job_logs method of the training_cli…
seanlaii Oct 12, 2024
4ff5052
[v2alpha] Move GV related codebase (#2281)
varshaprasad96 Oct 14, 2024
24cea1b
KEP-2170: Implement runtime framework (#2248)
tenzen-y Oct 17, 2024
936620d
Add DeepSpeed Example with Pytorch Operator (#2235)
Syulin7 Oct 17, 2024
cdbc22e
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283)
andreyvelich Oct 17, 2024
5692b53
KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260)
akshaychitneni Oct 19, 2024
e6954eb
Upgrade Deepspeed demo dependencies (#2294)
Syulin7 Oct 20, 2024
009f207
KEP-2170: Add manifests for Kubeflow Training V2 (#2289)
andreyvelich Oct 21, 2024
7793706
FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286)
andreyvelich Oct 22, 2024
7f61c50
KEP-2170: Implement TrainJob Reconciler to manage objects (#2295)
tenzen-y Oct 23, 2024
13dcb6b
Remove Prometheus Monitoring doc (#2301)
sophie0730 Oct 23, 2024
b4c0d40
KEP-2170: Decouple JobSet from TrainJob (#2296)
tenzen-y Oct 23, 2024
d315aa2
KEP-2170: Strictly verify the CRD marker validation and defaulting in…
tenzen-y Oct 24, 2024
4d4d2c8
KEP-2170: Initialize runtimes before the manager starts (#2306)
tenzen-y Oct 24, 2024
82d0535
KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310)
andreyvelich Oct 27, 2024
32854c0
KEP-2170: Create model and dataset initializers (#2303)
andreyvelich Oct 27, 2024
6df87f9
KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308)
andreyvelich Oct 31, 2024
ce2febf
KEP-2170: Implement Initializer builders in the JobSet plugin (#2316)
andreyvelich Nov 1, 2024
e1505ac
KEP-2170: Add the TrainJob state transition design (#2298)
tenzen-y Nov 2, 2024
ec176e3
Update tf job examples to tf v2 (#2270)
YosiElias Nov 4, 2024
cc0ef4d
KEP-2170: Add TrainJob conditions (#2322)
tenzen-y Nov 9, 2024
3f5c458
Pin Gloo repository in JAX Dockerfile to a specific commit (#2329)
sandipanpanda Nov 18, 2024
94b8414
[fix] Resolve v2alpha API exceptions (#2317)
varshaprasad96 Nov 22, 2024
ceb4369
Upgrade Kubernetes to v1.30.7 (#2332)
astefanutti Nov 27, 2024
0c4a8d2
Ignore cache exporting errors in the image building workflows (#2336)
tenzen-y Nov 27, 2024
83da2af
KEP-2170: Add Torch Distributed Runtime (#2328)
andreyvelich Nov 28, 2024
b5a8a72
Refine the server-side apply installation args (#2337)
tenzen-y Nov 28, 2024
05baf72
Add openapi-generator CLI option to skip SDK v2 test generation (#2338)
astefanutti Nov 28, 2024
618bf6e
Upgrade kustomization files to Kustomize v5 (#2326)
oksanabaza Nov 28, 2024
1bb35da
Pin accelerate package version in trainer (#2340)
gavrissh Nov 29, 2024
745c445
Replace papermill command with bash script
saileshd1402 Dec 2, 2024
0cd3791
Typo fix
saileshd1402 Dec 2, 2024
651672d
Move Checkout step outside action.yaml file
saileshd1402 Dec 2, 2024
e607e6d
Add newline EOF in script
saileshd1402 Dec 2, 2024
0540b90
Pass python dependencies as args and pin versions
saileshd1402 Dec 2, 2024
8c7f517
Update Usage
saileshd1402 Dec 2, 2024
caeffab
Install dependencies in yaml
saileshd1402 Dec 2, 2024
b545c80
merge conflit fix
saileshd1402 Dec 2, 2024
87999f1
fix ipynb
saileshd1402 Dec 2, 2024
0ee9ca5
set bash flags
saileshd1402 Dec 2, 2024
4ea4bde
Update script args and add more kubernetes versions for tests
saileshd1402 Dec 2, 2024
72dd617
add gang-scheduler-name to template
saileshd1402 Dec 3, 2024
d3e9031
move go setup to template
saileshd1402 Dec 3, 2024
21a6129
remove -p parameter from script
saileshd1402 Dec 9, 2024
36 changes: 4 additions & 32 deletions .github/workflows/integration-tests.yaml
@@ -58,40 +58,12 @@ jobs:
- name: Checkout
uses: actions/checkout@v4

- name: Free-Up Disk Space
uses: ./.github/workflows/free-up-disk-space

- name: Setup Python
uses: actions/setup-python@v5
- name: Setup E2E Tests
uses: ./.github/workflows/setup-e2e-test
with:
kubernetes-version: ${{ matrix.kubernetes-version }}
python-version: ${{ matrix.python-version }}

- name: Setup Go
uses: actions/setup-go@v5
with:
go-version-file: go.mod

- name: Create k8s Kind Cluster
uses: helm/kind-action@9fdad0686e6f19fcd572f62516f5e0436f562ee7
with:
node_image: kindest/node:${{ matrix.kubernetes-version }}
cluster_name: training-operator-cluster
kubectl_version: ${{ matrix.kubernetes-version }}

- name: Build training-operator
run: |
./scripts/gha/build-image.sh
env:
TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test

- name: Deploy training operator
run: |
./scripts/gha/setup-training-operator.sh
env:
KIND_CLUSTER: training-operator-cluster
TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
GANG_SCHEDULER_NAME: ${{ matrix.gang-scheduler-name }}
KUBERNETES_VERSION: ${{ matrix.kubernetes-version }}
gang-scheduler-name: ${{ matrix.gang-scheduler-name }}

- name: Run tests
run: |
57 changes: 57 additions & 0 deletions .github/workflows/setup-e2e-test/action.yaml
@@ -0,0 +1,57 @@
name: Setup E2E test template
description: A composite action to setup e2e tests

inputs:
kubernetes-version:
required: true
description: Kubernetes version
python-version:
required: true
description: Python version
gang-scheduler-name:
required: false
default: "none"
description: Gang scheduler name

runs:
using: composite
steps:
- name: Free-Up Disk Space
uses: ./.github/workflows/free-up-disk-space

- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: ${{ inputs.python-version }}

- name: Setup Go
uses: actions/setup-go@v5
with:
go-version-file: go.mod

- name: Create k8s Kind Cluster
uses: helm/kind-action@9fdad0686e6f19fcd572f62516f5e0436f562ee7
with:
node_image: kindest/node:${{ inputs.kubernetes-version }}
cluster_name: training-operator-cluster
kubectl_version: ${{ inputs.kubernetes-version }}

- name: Build training-operator
shell: bash
run: |
./scripts/gha/build-image.sh
env:
TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test

- name: Deploy training operator
shell: bash
run: |
./scripts/gha/setup-training-operator.sh
docker system prune -a -f
docker system df
df -h
env:
KIND_CLUSTER: training-operator-cluster
TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
GANG_SCHEDULER_NAME: ${{ inputs.gang-scheduler-name }}
KUBERNETES_VERSION: ${{ inputs.kubernetes-version }}
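Review note: the `gang-scheduler-name` input above is optional with a default of `"none"`, so a caller that omits it still gets a defined value. The sketch below is a hypothetical Python model of that fallback, not GitHub's implementation; `ACTION_INPUTS` and `resolve_inputs` are illustrative names, and the empty-string fallback for inputs without a default is an assumption about how unset inputs expand.

```python
# Hypothetical model of input resolution for the composite action above.
# Names here are illustrative; this is not GitHub Actions code.
ACTION_INPUTS = {
    "kubernetes-version": {"required": True},
    "python-version": {"required": True},
    "gang-scheduler-name": {"required": False, "default": "none"},
}


def resolve_inputs(with_block):
    """Merge a caller's `with:` mapping with the declared defaults."""
    resolved = {}
    for name, spec in ACTION_INPUTS.items():
        if name in with_block:
            # The caller supplied a value explicitly.
            resolved[name] = with_block[name]
        elif "default" in spec:
            # Fall back to the default declared in action.yaml.
            resolved[name] = spec["default"]
        else:
            # Assumption: an unset input without a default expands to "".
            resolved[name] = ""
    return resolved


# A caller that passes only the two required inputs.
notebook_inputs = resolve_inputs(
    {"kubernetes-version": "v1.30.6", "python-version": "3.11"}
)
print(notebook_inputs["gang-scheduler-name"])
```

Under these assumptions, a workflow that passes only `kubernetes-version` and `python-version` would deploy the operator with `GANG_SCHEDULER_NAME` set to `none`.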
39 changes: 39 additions & 0 deletions .github/workflows/test-example-notebooks.yaml
@@ -0,0 +1,39 @@
name: Test example notebooks

on:
- pull_request

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

jobs:
create-pytorchjob-notebook-test:
runs-on: ubuntu-latest
timeout-minutes: 30
strategy:
fail-fast: false
matrix:
kubernetes-version: ["v1.28.7", "v1.29.2", "v1.30.6"]
python-version: ["3.9", "3.10", "3.11"]
steps:
- name: Checkout
uses: actions/checkout@v4

- name: Setup E2E Tests
uses: ./.github/workflows/setup-e2e-test
with:
kubernetes-version: ${{ matrix.kubernetes-version }}
python-version: ${{ matrix.python-version }}

- name: Install Python Dependencies
run: |
pip install papermill==2.6.0 jupyter==1.1.1 ipykernel==6.29.5

- name: Run Jupyter Notebook with Papermill
shell: bash
run: |
./scripts/run-notebook.sh \
-i ./examples/pytorch/image-classification/create-pytorchjob.ipynb \
-n default \
-k ./sdk/python
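Review note: the `matrix` in this workflow has two axes with three values each, so every pull request would schedule one job per combination. A minimal Python sketch of that expansion follows; `expand_matrix` is an illustrative helper, not part of GitHub Actions.

```python
from itertools import product

# Matrix values copied from the workflow above.
matrix = {
    "kubernetes-version": ["v1.28.7", "v1.29.2", "v1.30.6"],
    "python-version": ["3.9", "3.10", "3.11"],
}


def expand_matrix(matrix):
    """Return one dict per combination of matrix values."""
    keys = list(matrix)
    combos = product(*(matrix[key] for key in keys))
    return [dict(zip(keys, combo)) for combo in combos]


jobs = expand_matrix(matrix)
print(len(jobs))  # 9
for job in jobs:
    print(job)
```

The Python versions are quoted strings in the YAML for a reason: an unquoted `3.10` would be parsed as the float `3.1`.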
52 changes: 33 additions & 19 deletions examples/pytorch/image-classification/create-pytorchjob.ipynb
@@ -24,6 +24,20 @@
"The notebook shows how to use Kubeflow Training SDK to create, get, wait, check and delete PyTorchJob."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"training_python_sdk='kubeflow-training'\n",
"namespace='kubeflow-user-example-com'"
]
},
{
"cell_type": "markdown",
"metadata": {
@@ -42,12 +56,13 @@
"outputs": [],
"source": [
"# TODO (andreyvelich): Change to release version when SDK with the new APIs is published.\n",
"!pip install git+https://github.com/kubeflow/training-operator.git#subdirectory=sdk/python"
"# Install Kubeflow Python SDK\n",
"!pip install {training_python_sdk}"
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -93,7 +108,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -102,12 +117,11 @@
"outputs": [],
"source": [
"name = \"pytorch-dist-mnist-gloo\"\n",
"namespace = \"kubeflow-user-example-com\"\n",
"container_name = \"pytorch\"\n",
"\n",
"container = V1Container(\n",
" name=container_name,\n",
" image=\"gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0\",\n",
" image=\"kubeflow/pytorch-dist-mnist:latest\",\n",
" args=[\"--backend\", \"gloo\"],\n",
")\n",
"\n",
@@ -157,7 +171,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -176,8 +190,8 @@
"# Namespace will be reused in every APIs.\n",
"training_client = TrainingClient(namespace=namespace)\n",
"\n",
"# If `job_kind` is not set in `TrainingClient`, we need to set it for each API.\n",
"training_client.create_job(pytorchjob, job_kind=constants.PYTORCHJOB_KIND)"
"# `job_kind` is set in `TrainingClient`\n",
"training_client.create_job(pytorchjob)"
]
},
{
@@ -195,7 +209,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -214,7 +228,7 @@
}
],
"source": [
"training_client.get_job(name, job_kind=constants.PYTORCHJOB_KIND).metadata.name"
"training_client.get_job(name).metadata.name"
]
},
{
@@ -230,7 +244,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -260,7 +274,7 @@
}
],
"source": [
"training_client.get_job_conditions(name=name, job_kind=constants.PYTORCHJOB_KIND)"
"training_client.get_job_conditions(name=name)"
]
},
{
@@ -276,7 +290,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -302,7 +316,7 @@
}
],
"source": [
"pytorchjob = training_client.wait_for_job_conditions(name=name, job_kind=constants.PYTORCHJOB_KIND)\n",
"pytorchjob = training_client.wait_for_job_conditions(name=name)\n",
"\n",
"print(f\"Succeeded number of replicas: {pytorchjob.status.replica_statuses['Master'].succeeded}\")"
]
@@ -320,7 +334,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -339,7 +353,7 @@
}
],
"source": [
"training_client.is_job_succeeded(name=name, job_kind=constants.PYTORCHJOB_KIND)"
"training_client.is_job_succeeded(name=name)"
]
},
{
@@ -355,7 +369,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -476,7 +490,7 @@
}
],
"source": [
"training_client.get_job_logs(name=name, job_kind=constants.PYTORCHJOB_KIND)"
"training_client.get_job_logs(name=name)"
]
},
{
@@ -492,7 +506,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
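Review note: the new notebook cell tagged `parameters` is what lets papermill override `training_python_sdk` and `namespace` at run time. Papermill does this by adding a cell tagged `injected-parameters` right after the tagged cell, so the later assignments win. The sketch below is a simplified, stdlib-only imitation of that behavior, not papermill's own code. `inject_parameters` is an illustrative name, and the mapping of the script's `-n` and `-k` flags onto `namespace` and `training_python_sdk` is an assumption, since `scripts/run-notebook.sh` is not shown in this diff.

```python
import copy


def inject_parameters(notebook, parameters):
    """Return a copy of `notebook` with an override cell placed after the
    cell tagged "parameters" (simplified imitation of papermill)."""
    nb = copy.deepcopy(notebook)
    injected = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": ["injected-parameters"]},
        "outputs": [],
        # One assignment per parameter; repr() quotes string values.
        "source": [f"{key} = {value!r}\n" for key, value in parameters.items()],
    }
    for index, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            nb["cells"].insert(index + 1, injected)
            return nb
    # No tagged cell found: put the overrides at the top of the notebook.
    nb["cells"].insert(0, injected)
    return nb


# A two-cell stand-in for create-pytorchjob.ipynb.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# PyTorchJob example"]},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {"tags": ["parameters"]},
            "outputs": [],
            "source": [
                "training_python_sdk='kubeflow-training'\n",
                "namespace='kubeflow-user-example-com'",
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

# Assumed mapping of the workflow's `-n default -k ./sdk/python` arguments.
result = inject_parameters(
    notebook, {"namespace": "default", "training_python_sdk": "./sdk/python"}
)
print("".join(result["cells"][2]["source"]))
```

Because the injected cell runs after the defaults, the notebook in CI would use the `default` namespace and install the SDK from the local checkout, while a user running the notebook by hand keeps the published package and the `kubeflow-user-example-com` namespace.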