Added test for create-pytorchjob.ipynb python notebook #2274

Merged
Changes from 63 commits

Commits (66)
c90bbaf
Added test for create-pytorchjob.ipynb
saileshd1402 Sep 29, 2024
f8fd24c
fix yaml syntax
saileshd1402 Sep 29, 2024
89023ce
Fix uses path
saileshd1402 Sep 29, 2024
62be575
Add actions/checkout
saileshd1402 Sep 29, 2024
9ea7155
Add bash to action.yaml
saileshd1402 Sep 29, 2024
da99ec8
Install pip dependencies step
saileshd1402 Sep 29, 2024
4595f32
Add quotes for args
saileshd1402 Sep 29, 2024
8b744b1
Add jupyter
saileshd1402 Sep 29, 2024
c6d1925
Add nbformat_minor: 5 to fix invalid format error
saileshd1402 Sep 29, 2024
1124ee8
Fix job name
saileshd1402 Sep 29, 2024
f882cf3
test papermill-args-yaml
saileshd1402 Sep 29, 2024
5494fb1
testing multi line args
saileshd1402 Sep 29, 2024
eb7c4be
testing multi line args1
saileshd1402 Sep 29, 2024
93b6c66
testing multi line args2
saileshd1402 Sep 29, 2024
e5aca68
testing multi line args3
saileshd1402 Sep 29, 2024
c8b1aff
Parameterize sdk install
saileshd1402 Sep 29, 2024
9145412
Remove unnecessary output
saileshd1402 Sep 29, 2024
e704b7f
nbformat normailze
saileshd1402 Sep 29, 2024
dc6a517
[SDK] Training Client Conditions related unit tests (#2253)
Bobbins228 Sep 30, 2024
c0b64e0
[SDK] test: add unit test for list_jobs method of the training_client…
seanlaii Oct 3, 2024
2e7d3c2
KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273)
varshaprasad96 Oct 10, 2024
040ba8f
[SDK] Use torchrun to create PyTorchJob from function (#2276)
andreyvelich Oct 11, 2024
f20969b
[SDK] test: add unit test for get_job_logs method of the training_cli…
seanlaii Oct 12, 2024
4ff5052
[v2alpha] Move GV related codebase (#2281)
varshaprasad96 Oct 14, 2024
24cea1b
KEP-2170: Implement runtime framework (#2248)
tenzen-y Oct 17, 2024
936620d
Add DeepSpeed Example with Pytorch Operator (#2235)
Syulin7 Oct 17, 2024
cdbc22e
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283)
andreyvelich Oct 17, 2024
5692b53
KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260)
akshaychitneni Oct 19, 2024
e6954eb
Upgrade Deepspeed demo dependencies (#2294)
Syulin7 Oct 20, 2024
009f207
KEP-2170: Add manifests for Kubeflow Training V2 (#2289)
andreyvelich Oct 21, 2024
7793706
FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286)
andreyvelich Oct 22, 2024
7f61c50
KEP-2170: Implement TrainJob Reconciler to manage objects (#2295)
tenzen-y Oct 23, 2024
13dcb6b
Remove Prometheus Monitoring doc (#2301)
sophie0730 Oct 23, 2024
b4c0d40
KEP-2170: Decouple JobSet from TrainJob (#2296)
tenzen-y Oct 23, 2024
d315aa2
KEP-2170: Strictly verify the CRD marker validation and defaulting in…
tenzen-y Oct 24, 2024
4d4d2c8
KEP-2170: Initialize runtimes before the manager starts (#2306)
tenzen-y Oct 24, 2024
82d0535
KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310)
andreyvelich Oct 27, 2024
32854c0
KEP-2170: Create model and dataset initializers (#2303)
andreyvelich Oct 27, 2024
6df87f9
KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308)
andreyvelich Oct 31, 2024
ce2febf
KEP-2170: Implement Initializer builders in the JobSet plugin (#2316)
andreyvelich Nov 1, 2024
e1505ac
KEP-2170: Add the TrainJob state transition design (#2298)
tenzen-y Nov 2, 2024
ec176e3
Update tf job examples to tf v2 (#2270)
YosiElias Nov 4, 2024
cc0ef4d
KEP-2170: Add TrainJob conditions (#2322)
tenzen-y Nov 9, 2024
3f5c458
Pin Gloo repository in JAX Dockerfile to a specific commit (#2329)
sandipanpanda Nov 18, 2024
94b8414
[fix] Resolve v2alpha API exceptions (#2317)
varshaprasad96 Nov 22, 2024
ceb4369
Upgrade Kubernetes to v1.30.7 (#2332)
astefanutti Nov 27, 2024
0c4a8d2
Ignore cache exporting errors in the image building workflows (#2336)
tenzen-y Nov 27, 2024
83da2af
KEP-2170: Add Torch Distributed Runtime (#2328)
andreyvelich Nov 28, 2024
b5a8a72
Refine the server-side apply installation args (#2337)
tenzen-y Nov 28, 2024
05baf72
Add openapi-generator CLI option to skip SDK v2 test generation (#2338)
astefanutti Nov 28, 2024
618bf6e
Upgrade kustomization files to Kustomize v5 (#2326)
oksanabaza Nov 28, 2024
1bb35da
Pin accelerate package version in trainer (#2340)
gavrissh Nov 29, 2024
745c445
Replace papermill command with bash script
saileshd1402 Dec 2, 2024
0cd3791
Typo fix
saileshd1402 Dec 2, 2024
651672d
Move Checkout step outside action.yaml file
saileshd1402 Dec 2, 2024
e607e6d
Add newline EOF in script
saileshd1402 Dec 2, 2024
0540b90
Pass python dependencies as args and pin versions
saileshd1402 Dec 2, 2024
8c7f517
Update Usage
saileshd1402 Dec 2, 2024
caeffab
Install dependencies in yaml
saileshd1402 Dec 2, 2024
b545c80
merge conflit fix
saileshd1402 Dec 2, 2024
87999f1
fix ipynb
saileshd1402 Dec 2, 2024
0ee9ca5
set bash flags
saileshd1402 Dec 2, 2024
4ea4bde
Update script args and add more kubernetes versions for tests
saileshd1402 Dec 2, 2024
72dd617
add gang-scheduler-name to template
saileshd1402 Dec 3, 2024
d3e9031
move go setup to template
saileshd1402 Dec 3, 2024
21a6129
remove -p parameter from script
saileshd1402 Dec 9, 2024
30 changes: 3 additions & 27 deletions .github/workflows/integration-tests.yaml
@@ -58,41 +58,17 @@ jobs:
- name: Checkout
uses: actions/checkout@v4

- name: Free-Up Disk Space
uses: ./.github/workflows/free-up-disk-space

- name: Setup Python
uses: actions/setup-python@v5
- name: Setup E2E Tests
uses: ./.github/workflows/setup-e2e-test
with:
kubernetes-version: ${{ matrix.kubernetes-version }}
python-version: ${{ matrix.python-version }}

- name: Setup Go
uses: actions/setup-go@v5
with:
go-version-file: go.mod

- name: Create k8s Kind Cluster
uses: helm/kind-action@9fdad0686e6f19fcd572f62516f5e0436f562ee7
with:
node_image: kindest/node:${{ matrix.kubernetes-version }}
cluster_name: training-operator-cluster
kubectl_version: ${{ matrix.kubernetes-version }}

- name: Build training-operator
run: |
./scripts/gha/build-image.sh
env:
TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test

- name: Deploy training operator
run: |
./scripts/gha/setup-training-operator.sh
env:
KIND_CLUSTER: training-operator-cluster
TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
GANG_SCHEDULER_NAME: ${{ matrix.gang-scheduler-name }}
KUBERNETES_VERSION: ${{ matrix.kubernetes-version }}

- name: Run tests
run: |
pip install pytest
48 changes: 48 additions & 0 deletions .github/workflows/setup-e2e-test/action.yaml
@@ -0,0 +1,48 @@
name: Setup E2E test template
description: A composite action to setup e2e tests

inputs:
kubernetes-version:
required: true
description: kubernetes version
python-version:
required: true
description: Python version
Member:

Should we set the matrix with Kubernetes and Python versions as part of our setup-e2e-test template ?
Right now, we set it in the integration-tests.yaml.
So we can keep it consistent for our E2Es + Notebooks tests.
WDYT @tenzen-y @Electronic-Waste @saileshd1402 ?

Contributor Author:

One small thing: if other steps need some of these versions (for example, gang-scheduler-name here), we will need to export them as environment variables via GITHUB_ENV so that subsequent steps can access them. Is there another way, or is this fine?
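For reference, a composite-action step hands a value to later steps of the calling job by appending to the file that `GITHUB_ENV` points at. The sketch below simulates that mechanism locally (the `volcano` value is purely illustrative, not from this PR; in Actions the runner provides the `GITHUB_ENV` path):

```shell
# Simulate GITHUB_ENV locally; on a real runner this path is pre-set.
GITHUB_ENV="$(mktemp)"

# A composite-action step would run exactly this kind of append:
echo "GANG_SCHEDULER_NAME=volcano" >> "$GITHUB_ENV"

# Subsequent steps in the same job then see GANG_SCHEDULER_NAME as an
# ordinary environment variable.
cat "$GITHUB_ENV"
```

This only propagates within the job that ran the composite action, which is why the discussion below settles on setting the gang scheduler outside the template.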

Member:

Oh, I see. I guess we only use scheduler plugins for integration tests.
@kubeflow/wg-training-leads @saileshd1402 do we want to test our Notebooks with various scheduling plugins as well?
Or do we want to limit the tests that we run with gang-scheduling?

Member:

> Should we set the matrix with Kubernetes and Python versions as part of our setup-e2e-test template ?
> Right now, we set it in the integration-tests.yaml.
> So we can keep it consistent for our E2Es + Notebooks tests.
> WDYT @tenzen-y @Electronic-Waste @saileshd1402 ?

I think so. It would be better if we could execute e2e tests with multiple Kubernetes and Python versions.

Member:

I guess gang-scheduling could be limited to integration-tests.yaml only, since it would be a bit redundant to test it again in the notebook tests.

Member:

@saileshd1402 Maybe to unblock this PR, we can just use GITHUB_ENV for now and set the gang-scheduler only for integration tests.
For the V2 tests, we can come back to this discussion.

saileshd1402 (Contributor Author), Dec 8, 2024:

I found out that we can't use matrix inside a single composite action; it can only be used in job/workflow files. This is because a composite action avoids duplication of steps but can't be used to create more jobs the way a workflow file can. There are also Reusable Workflows, but those can't be used in this case since they spawn a separate workflow to run the template, which means we can't use them to set up the environment of the current job. Related docs: matrix strategies and composite actions.

Member:

I see, thanks for checking!


runs:
using: composite
steps:
- name: Free-Up Disk Space
uses: ./.github/workflows/free-up-disk-space

- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: ${{ inputs.python-version }}

- name: Create k8s Kind Cluster
uses: helm/kind-action@9fdad0686e6f19fcd572f62516f5e0436f562ee7
with:
node_image: kindest/node:${{ inputs.kubernetes-version }}
cluster_name: training-operator-cluster
kubectl_version: ${{ inputs.kubernetes-version }}

- name: Build training-operator
shell: bash
run: |
./scripts/gha/build-image.sh
env:
TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test

- name: Deploy training operator
shell: bash
run: |
./scripts/gha/setup-training-operator.sh
docker system prune -a -f
docker system df
df -h
env:
KIND_CLUSTER: training-operator-cluster
TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
GANG_SCHEDULER_NAME: "none"
KUBERNETES_VERSION: ${{ inputs.kubernetes-version }}
39 changes: 39 additions & 0 deletions .github/workflows/test-example-notebooks.yaml
@@ -0,0 +1,39 @@
name: Test example notebooks

on:
- pull_request

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

jobs:
create-pytorchjob-notebook-test:
runs-on: ubuntu-latest
timeout-minutes: 30
strategy:
fail-fast: false
matrix:
kubernetes-version: ["v1.28.7", "v1.29.2", "v1.30.6"]
python-version: ["3.9", "3.10", "3.11"]
steps:
- name: Checkout
uses: actions/checkout@v4

- name: Setup E2E Tests
uses: ./.github/workflows/setup-e2e-test
with:
kubernetes-version: ${{ matrix.kubernetes-version }}
python-version: ${{ matrix.python-version }}

- name: Install Python Dependencies
run: |
pip install papermill==2.6.0 jupyter==1.1.1 ipykernel==6.29.5
- name: Run Jupyter Notebook with Papermill
shell: bash
run: |
./scripts/run-notebook.sh \
-i ./examples/pytorch/image-classification/create-pytorchjob.ipynb \
-n default \
-k ./sdk/python
52 changes: 33 additions & 19 deletions examples/pytorch/image-classification/create-pytorchjob.ipynb
@@ -24,6 +24,20 @@
"The notebook shows how to use Kubeflow Training SDK to create, get, wait, check and delete PyTorchJob."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"training_python_sdk='kubeflow-training'\n",
"namespace='kubeflow-user-example-com'"
]
},
{
"cell_type": "markdown",
"metadata": {
@@ -42,12 +56,13 @@
"outputs": [],
"source": [
"# TODO (andreyvelich): Change to release version when SDK with the new APIs is published.\n",
"!pip install git+https://github.com/kubeflow/training-operator.git#subdirectory=sdk/python"
"# Install Kubeflow Python SDK\n",
"!pip install {training_python_sdk}"
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -93,7 +108,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -102,12 +117,11 @@
"outputs": [],
"source": [
"name = \"pytorch-dist-mnist-gloo\"\n",
"namespace = \"kubeflow-user-example-com\"\n",
"container_name = \"pytorch\"\n",
"\n",
"container = V1Container(\n",
" name=container_name,\n",
" image=\"gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0\",\n",
" image=\"kubeflow/pytorch-dist-mnist:latest\",\n",
" args=[\"--backend\", \"gloo\"],\n",
")\n",
"\n",
@@ -157,7 +171,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -176,8 +190,8 @@
"# Namespace will be reused in every APIs.\n",
"training_client = TrainingClient(namespace=namespace)\n",
"\n",
"# If `job_kind` is not set in `TrainingClient`, we need to set it for each API.\n",
"training_client.create_job(pytorchjob, job_kind=constants.PYTORCHJOB_KIND)"
"# `job_kind` is set in `TrainingClient`\n",
"training_client.create_job(pytorchjob)"
]
},
{
@@ -195,7 +209,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -214,7 +228,7 @@
}
],
"source": [
"training_client.get_job(name, job_kind=constants.PYTORCHJOB_KIND).metadata.name"
"training_client.get_job(name).metadata.name"
]
},
{
@@ -230,7 +244,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -260,7 +274,7 @@
}
],
"source": [
"training_client.get_job_conditions(name=name, job_kind=constants.PYTORCHJOB_KIND)"
"training_client.get_job_conditions(name=name)"
]
},
{
@@ -276,7 +290,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -302,7 +316,7 @@
}
],
"source": [
"pytorchjob = training_client.wait_for_job_conditions(name=name, job_kind=constants.PYTORCHJOB_KIND)\n",
"pytorchjob = training_client.wait_for_job_conditions(name=name)\n",
"\n",
"print(f\"Succeeded number of replicas: {pytorchjob.status.replica_statuses['Master'].succeeded}\")"
]
@@ -320,7 +334,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -339,7 +353,7 @@
}
],
"source": [
"training_client.is_job_succeeded(name=name, job_kind=constants.PYTORCHJOB_KIND)"
"training_client.is_job_succeeded(name=name)"
]
},
{
@@ -355,7 +369,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
@@ -476,7 +490,7 @@
}
],
"source": [
"training_client.get_job_logs(name=name, job_kind=constants.PYTORCHJOB_KIND)"
"training_client.get_job_logs(name=name)"
]
},
{
@@ -492,7 +506,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
78 changes: 78 additions & 0 deletions scripts/run-notebook.sh
@@ -0,0 +1,78 @@
#!/bin/bash

# Copyright 2024 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This bash script is used to run the example notebooks

set -o errexit
set -o nounset
set -o pipefail

NOTEBOOK_INPUT=""
NOTEBOOK_OUTPUT="-" # outputs to console
PAPERMILL_PARAMS=()
NAMESPACE="default"
TRAINING_PYTHON_SDK="./sdk/python"

usage() {
echo "Usage: $0 -i <input_notebook> [-o <output_notebook>] [-p \"<param> <value>\"...] [-k <sdk_path>] [-n <namespace>]"
echo "Options:"
echo " -i Input notebook (required)"
echo "  -o  Output notebook (optional, defaults to '-' which outputs to console)"
echo " -p Papermill parameters (optional), pass param name and value pair (in quotes whitespace separated)"
echo " -k Kubeflow Training Operator Python SDK (optional)"
Member:

Do you want to name it as -sdk to make it clearer ?

Contributor Author:

The current implementation uses "getopts", which accepts only single-character option names. I used it so the argument parsing stays short and clean. I can do longer names as well, but should I then update the other args to have longer names too?

Member:

Oh, I see. I think, it's fine to keep it as -k in that case.
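For context, POSIX `getopts` does indeed handle only single-character flags, each declared in the optstring with a trailing `:` when it takes a value. A minimal sketch of the parsing pattern the script relies on (the flag names mirror the script's `-i` and `-n`; the function wrapper is just for illustration):

```shell
# Minimal getopts sketch mirroring the script's single-char flag parsing.
parse_args() {
  local input="" namespace="default"
  local OPTIND=1 opt   # reset OPTIND so the function is reusable
  while getopts "i:n:" opt; do
    case "$opt" in
      i) input="$OPTARG" ;;      # notebook input path
      n) namespace="$OPTARG" ;;  # kubernetes namespace
    esac
  done
  echo "input=$input namespace=$namespace"
}

parse_args -i nb.ipynb -n kubeflow   # → input=nb.ipynb namespace=kubeflow
```

Long options (e.g. `-sdk` or `--sdk`) would require hand-rolled parsing or non-POSIX `getopt`, which is why keeping `-k` is the pragmatic choice here.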

echo " -n Kubernetes namespace used by tests"
echo " -h Show this help message"
echo "NOTE: papermill, jupyter and ipykernel are required Python dependencies to run Notebooks"
exit 1
}

while getopts "i:o:p:k:n:h" opt; do
case "$opt" in
i) NOTEBOOK_INPUT="$OPTARG" ;; # -i for notebook input path
o) NOTEBOOK_OUTPUT="$OPTARG" ;; # -o for notebook output path
p) PAPERMILL_PARAMS+=("$OPTARG") ;; # -p for papermill parameters
Member:

Since you named the other papermill parameter -k, should we name the namespace parameter -n?

Member:

@saileshd1402 Please can you check it, so we can merge the PR ?

saileshd1402 (Contributor Author), Dec 9, 2024:

I added it already I think. Can you please check the latest commits once?

andreyvelich (Member), Dec 9, 2024:

I think, you should remove -p parameter from the flags of this script since it is no longer needed

Member:

E.g. I mean this part:

for param in "${PAPERMILL_PARAMS[@]}"; do
papermill_cmd="$papermill_cmd -p $param"
done

Contributor Author:

Oh, understood: you're saying we should remove the custom papermill parameters from this script. We may need them in the future, but I guess we can add them back if and when necessary. I'll remove them for this PR.

Member:

Yeah, let's add them in the future once we need them.

k) TRAINING_PYTHON_SDK="$OPTARG" ;; # -k for training operator python sdk
n) NAMESPACE="$OPTARG" ;; # -n for kubernetes namespace used by tests
h) usage ;; # -h for help (usage)
*) usage; exit 1 ;;
esac
done

if [ -z "$NOTEBOOK_INPUT" ]; then
echo "Error: -i notebook input path is required."
exit 1
fi

papermill_cmd="papermill $NOTEBOOK_INPUT $NOTEBOOK_OUTPUT -p training_python_sdk $TRAINING_PYTHON_SDK -p namespace $NAMESPACE"
# Add papermill parameters (param name and value)
for param in "${PAPERMILL_PARAMS[@]}"; do
papermill_cmd="$papermill_cmd -p $param"
done
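The loop above builds the command as a flat string and relies on word splitting when `$papermill_cmd` is later expanded unquoted, which breaks if a parameter value contains spaces. A bash-array variant (a sketch, not part of the PR; the notebook path and parameter values are illustrative) keeps each argument intact:

```shell
# Sketch: collect papermill arguments in a bash array instead of a string.
cmd=(papermill "in.ipynb" "-")
cmd+=(-p training_python_sdk "./sdk/python")
cmd+=(-p namespace "default")

# "${cmd[@]}" expands to one word per element, so values containing
# spaces survive; running would be: "${cmd[@]}"
echo "${cmd[*]}"
```

Since all current parameter values are simple tokens, the string-based approach in the script works; the array form only matters once arbitrary user-supplied values are passed through `-p`.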

if ! command -v papermill &> /dev/null; then
echo "Error: papermill is not installed. Please install papermill to proceed."
exit 1
fi

echo "Running command: $papermill_cmd"
$papermill_cmd

if [ $? -ne 0 ]; then
echo "Error: papermill execution failed." >&2
exit 1
fi

echo "Notebook execution completed successfully"