-
Notifications
You must be signed in to change notification settings - Fork 705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added test for create-pytorchjob.ipynb python notebook #2274
Changes from 62 commits
c90bbaf
f8fd24c
89023ce
62be575
9ea7155
da99ec8
4595f32
8b744b1
c6d1925
1124ee8
f882cf3
5494fb1
eb7c4be
93b6c66
e5aca68
c8b1aff
9145412
e704b7f
dc6a517
c0b64e0
2e7d3c2
040ba8f
f20969b
4ff5052
24cea1b
936620d
cdbc22e
5692b53
e6954eb
009f207
7793706
7f61c50
13dcb6b
b4c0d40
d315aa2
4d4d2c8
82d0535
32854c0
6df87f9
ce2febf
e1505ac
ec176e3
cc0ef4d
3f5c458
94b8414
ceb4369
0c4a8d2
83da2af
b5a8a72
05baf72
618bf6e
1bb35da
745c445
0cd3791
651672d
e607e6d
0540b90
8c7f517
caeffab
b545c80
87999f1
0ee9ca5
4ea4bde
72dd617
d3e9031
21a6129
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
name: Setup E2E test template | ||
description: A composite action to setup e2e tests | ||
|
||
inputs: | ||
kubernetes-version: | ||
required: true | ||
description: kubernetes version | ||
python-version: | ||
required: true | ||
description: Python version | ||
|
||
runs: | ||
using: composite | ||
steps: | ||
- name: Free-Up Disk Space | ||
uses: ./.github/workflows/free-up-disk-space | ||
|
||
- name: Setup Python | ||
uses: actions/setup-python@v5 | ||
with: | ||
python-version: ${{ inputs.python-version }} | ||
|
||
- name: Create k8s Kind Cluster | ||
uses: helm/kind-action@9fdad0686e6f19fcd572f62516f5e0436f562ee7 | ||
with: | ||
node_image: kindest/node:${{ inputs.kubernetes-version }} | ||
cluster_name: training-operator-cluster | ||
kubectl_version: ${{ inputs.kubernetes-version }} | ||
|
||
- name: Build training-operator | ||
shell: bash | ||
run: | | ||
./scripts/gha/build-image.sh | ||
env: | ||
TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test | ||
|
||
- name: Deploy training operator | ||
shell: bash | ||
run: | | ||
./scripts/gha/setup-training-operator.sh | ||
docker system prune -a -f | ||
docker system df | ||
df -h | ||
env: | ||
KIND_CLUSTER: training-operator-cluster | ||
TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test | ||
GANG_SCHEDULER_NAME: "none" | ||
KUBERNETES_VERSION: ${{ inputs.kubernetes-version }} |
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
@@ -0,0 +1,40 @@ | ||||
name: Test example notebooks | ||||
|
||||
on: | ||||
- pull_request | ||||
|
||||
concurrency: | ||||
group: ${{ github.workflow }}-${{ github.ref }} | ||||
cancel-in-progress: true | ||||
|
||||
jobs: | ||||
create-pytorchjob-notebook-test: | ||||
runs-on: ubuntu-latest | ||||
timeout-minutes: 30 | ||||
strategy: | ||||
fail-fast: false | ||||
matrix: | ||||
kubernetes-version: ["v1.28.7"] | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @kubeflow/wg-training-leads @Electronic-Waste @saileshd1402 Do we want to run our Notebooks on all supported Kubernetes version similar to E2E tests:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I agree with you. We should expand the tests to all supported Kubernetes versions. |
||||
python-version: ["3.9", "3.10", "3.11"] | ||||
steps: | ||||
- name: Checkout | ||||
uses: actions/checkout@v4 | ||||
|
||||
- name: Setup E2E Tests | ||||
uses: ./.github/workflows/setup-e2e-test | ||||
with: | ||||
kubernetes-version: ${{ matrix.kubernetes-version }} | ||||
python-version: ${{ matrix.python-version }} | ||||
|
||||
- name: Install Python Dependencies | ||||
run: | | ||||
pip install papermill==2.6.0 jupyter==1.1.1 ipykernel==6.29.5 | ||||
|
||||
- name: Run Jupyter Notebook with Papermill | ||||
shell: bash | ||||
run: | | ||||
./scripts/run-notebook.sh \ | ||||
-i ./examples/pytorch/image-classification/create-pytorchjob.ipynb \ | ||||
-o ./examples/pytorch/image-classification/create-pytorchjob-output.ipynb \ | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How are we going to use output in the tests ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, the output notebook is not being used, I'll not set it here |
||||
-p "namespace default" \ | ||||
-k ./sdk/python |
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,82 @@ | ||||||||
#!/bin/bash | ||||||||
|
||||||||
# Copyright 2021 The Kubernetes Authors. | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please change the licence |
||||||||
# | ||||||||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||||||||
# you may not use this file except in compliance with the License. | ||||||||
# You may obtain a copy of the License at | ||||||||
# | ||||||||
# http://www.apache.org/licenses/LICENSE-2.0 | ||||||||
# | ||||||||
# Unless required by applicable law or agreed to in writing, software | ||||||||
# distributed under the License is distributed on an "AS IS" BASIS, | ||||||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||||||
# See the License for the specific language governing permissions and | ||||||||
# limitations under the License. | ||||||||
|
||||||||
# This bash script is used to run the example notebooks | ||||||||
|
||||||||
set -o errexit | ||||||||
set -o nounset | ||||||||
set -o pipefail | ||||||||
|
||||||||
NOTEBOOK_INPUT="" | ||||||||
NOTEBOOK_OUTPUT="" | ||||||||
PAPERMILL_PARAMS=() | ||||||||
PAPERMILL_PARAM_YAML="" | ||||||||
TRAINING_PYTHON_SDK="git+https://github.com/kubeflow/training-operator.git#subdirectory=sdk/python" | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we install the Python SDK directly from the repo by default, so developers can quickly test their SDK changes on one of the Notebooks? |
||||||||
|
||||||||
usage() { | ||||||||
echo "Usage: $0 -i <input_notebook> -o <output_notebook> [-p \"<param> <value>\"...] [-y <params.yaml>]" | ||||||||
echo "Options:" | ||||||||
echo " -i Input notebook (required)" | ||||||||
echo " -o Output notebook (required)" | ||||||||
echo " -p Papermill parameters (optional), pass param name and value pair (in quotes whitespace separated)" | ||||||||
echo " -y Papermill parameters YAML file (optional)" | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we need this ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can pass the parameters to papermill through yaml file as well, we are not using right now, but might use in the future There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we introduce this parameter once we get feedback from users/developers that we need it ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure okay, I will remove it for now |
||||||||
echo " -k Kubeflow Training Operator Python SDK (optional)" | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you want to name it as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The current implementation using "getopts" accepts only one char argument names. I used this so the args parsing is short and clean. I can do longer names as well, but should I update the other args to have longer names? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, I see. I think, it's fine to keep it as -k in that case. |
||||||||
echo " -h Show this help message" | ||||||||
echo "NOTE: papermill, jupyter and ipykernel are required Python dependencies to run Notebooks" | ||||||||
exit 1 | ||||||||
} | ||||||||
|
||||||||
while getopts "i:o:y:p:k:r:d:h:" opt; do | ||||||||
case "$opt" in | ||||||||
i) NOTEBOOK_INPUT="$OPTARG" ;; # -i for notebook input path | ||||||||
o) NOTEBOOK_OUTPUT="$OPTARG" ;; # -o for notebook output path | ||||||||
p) PAPERMILL_PARAMS+=("$OPTARG") ;; # -p for papermill parameters | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since you name other papermill parameter as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @saileshd1402 Please can you check it, so we can merge the PR ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added it already I think. Can you please check the latest commits once? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think, you should remove There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. E.g. I mean this part: training-operator/scripts/run-notebook.sh Lines 61 to 63 in d3e9031
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh understood, you are saying let's remove custom parameters of papermill in this script. I think we may use it in the future but I guess we can add if and when necessary. I'll remove it for this PR There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, let's add them in the future once we need them. |
||||||||
y) PAPERMILL_PARAM_YAML="$OPTARG" ;; # -y for papermill parameter yaml path | ||||||||
k) TRAINING_PYTHON_SDK="$OPTARG" ;; # -k for training operator python sdk | ||||||||
h) usage ;; # -h for help (usage) | ||||||||
*) usage; exit 1 ;; | ||||||||
esac | ||||||||
done | ||||||||
|
||||||||
if [ -z "$NOTEBOOK_INPUT" ] || [ -z "$NOTEBOOK_OUTPUT" ]; then | ||||||||
echo "Error: -i notebook input path and -o notebook output path are required." | ||||||||
exit 1 | ||||||||
fi | ||||||||
|
||||||||
papermill_cmd="papermill $NOTEBOOK_INPUT $NOTEBOOK_OUTPUT -p training_python_sdk $TRAINING_PYTHON_SDK" | ||||||||
# Add papermill parameters (param name and value) | ||||||||
for param in "${PAPERMILL_PARAMS[@]}"; do | ||||||||
papermill_cmd="$papermill_cmd -p $param" | ||||||||
done | ||||||||
|
||||||||
if [ -n "$PAPERMILL_PARAM_YAML" ]; then | ||||||||
papermill_cmd="$papermill_cmd -y $PAPERMILL_PARAM_YAML" | ||||||||
fi | ||||||||
|
||||||||
if ! command -v papermill &> /dev/null; then | ||||||||
echo "Error: papermill is not installed. Please install papermill to proceed." | ||||||||
exit 1 | ||||||||
fi | ||||||||
|
||||||||
echo "Running command: $papermill_cmd" | ||||||||
$papermill_cmd | ||||||||
|
||||||||
if [ $? -ne 0 ]; then | ||||||||
echo "Error: papermill execution failed." >&2 | ||||||||
exit 1 | ||||||||
fi | ||||||||
|
||||||||
echo "Notebook execution completed successfully. Output saved to $NOTEBOOK_OUTPUT" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we set the matrix with Kubernetes and Python versions as part of our
setup-e2e-test
template ?Right now, we set it in the integration-tests.yaml.
So we can keep it consistent for our E2Es + Notebooks tests.
WDYT @tenzen-y @Electronic-Waste @saileshd1402 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small thing is that if there are other steps that need some of these versions, for example gang-scheduler-name here, we will need to export them as environments variables to access in subsequent steps using GITHUB_ENV. Is there any other way or is this fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. I guess, we only use scheduler plugins for integration tests.
@kubeflow/wg-training-leads @saileshd1402 do we want to tests our Notebooks with various scheduling plugins as well ?
Or we want to limit the tests that we run with gang-scheduling ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. It will be better if we could execute e2e tests with multiple Kubernetes and Python versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess, maybe gang-scheduling could be limited to only
intergration-test.yaml
since it is a bit redundant to test them again in notebook tests.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@saileshd1402 Maybe to unblock this PR, we can just use GITHUB_ENV for now and set the gang-scheduler only for integration tests.
For the V2 tests, we can come back to this discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found out that we can't use matrix inside a single composite action. It can only be used in jobs/workflows files. This is because composite action avoids duplication of steps but can't be used to create more jobs like a workflow file. There is also Reusable Workflows but that can't be used in this case since it's spawns separate workflow to run the template, which means that we can't use it to setup environment of current job. Related docs: matrix strategies and composite actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for checking!