Add e2e test for train API #2199

Open
wants to merge 87 commits into
base: master
15b6cb0
add e2e test for train API
helenxie-bit Aug 9, 2024
daa0054
fix peft import error
helenxie-bit Aug 9, 2024
8d4af90
update settings of the job
helenxie-bit Aug 9, 2024
86c31c8
fix format
helenxie-bit Aug 9, 2024
01870e2
fix format
helenxie-bit Aug 9, 2024
17f3c33
fix error detection
helenxie-bit Aug 9, 2024
0685dc7
resolve conflict
helenxie-bit Aug 9, 2024
83de64b
resolve conflict
helenxie-bit Aug 9, 2024
f954f2d
resolve conflict
helenxie-bit Aug 9, 2024
ff48154
fix format
helenxie-bit Aug 9, 2024
304db5d
fix NoneType error
helenxie-bit Aug 9, 2024
486154d
fix format
helenxie-bit Aug 9, 2024
016c41d
test bug
helenxie-bit Aug 9, 2024
1e7bd23
find bug
helenxie-bit Aug 11, 2024
1aced61
find bug
helenxie-bit Aug 11, 2024
3100aae
find bug
helenxie-bit Aug 11, 2024
e5b9061
add storage_config
helenxie-bit Aug 11, 2024
ffb0685
fix format
helenxie-bit Aug 11, 2024
dc1b48a
reduce pvc size
helenxie-bit Aug 12, 2024
8894517
set storage_config
helenxie-bit Aug 12, 2024
36872d7
set storage_config
helenxie-bit Aug 12, 2024
7dd8d40
set storage_config
helenxie-bit Aug 12, 2024
60c322d
set storage_config
helenxie-bit Aug 12, 2024
dd970ab
use gpu
helenxie-bit Aug 12, 2024
10bbfa0
use gpu
helenxie-bit Aug 12, 2024
d47d6a6
use gpu
helenxie-bit Aug 12, 2024
4ccd4a7
fix 'set_device' error
helenxie-bit Aug 12, 2024
0750322
add timeout error
helenxie-bit Aug 15, 2024
5ca0923
fix format
helenxie-bit Aug 15, 2024
387eb84
fix format
helenxie-bit Aug 15, 2024
9cc5429
fix format
helenxie-bit Aug 15, 2024
8a537ad
fix typo
helenxie-bit Aug 26, 2024
e508ef4
update e2e test for train api
helenxie-bit Aug 29, 2024
788359b
add num_labels
helenxie-bit Aug 29, 2024
9b4222e
update pip install
helenxie-bit Aug 29, 2024
d75938d
check disk space
helenxie-bit Aug 29, 2024
1148bc8
change sequence of e2e tests
helenxie-bit Aug 29, 2024
d29a85d
add clean-up after each e2e test of pytorchjob
helenxie-bit Aug 29, 2024
82ea9be
update cleanup function
helenxie-bit Aug 30, 2024
b45f9f7
update cleanup function
helenxie-bit Aug 30, 2024
a204746
update cleanup function-add check disk
helenxie-bit Aug 30, 2024
2d8f8b1
check docker volumes
helenxie-bit Aug 30, 2024
c748d0e
update cleanup function
helenxie-bit Aug 30, 2024
a68e182
update cleanup function
helenxie-bit Aug 30, 2024
227129e
check docker directory
helenxie-bit Aug 30, 2024
79e9e32
update pip install and 'num_workers'
helenxie-bit Aug 30, 2024
b7dbf5c
update pip install and 'num_workers'
helenxie-bit Aug 30, 2024
1f639a7
update pip install
helenxie-bit Aug 30, 2024
8322730
change the value of 'clean_pod_policy'
helenxie-bit Aug 30, 2024
ed10574
change the value of 'update cleanup function
helenxie-bit Aug 30, 2024
50ed9e8
update cleanup function
helenxie-bit Aug 30, 2024
b2cd27a
update cleanup function
helenxie-bit Aug 31, 2024
3af5d87
check docker volumes
helenxie-bit Aug 31, 2024
1a0eff3
check docker volumes
helenxie-bit Aug 31, 2024
604265a
stop the controller and restart it again to clean up
helenxie-bit Aug 31, 2024
a4f848f
update cleanup function
helenxie-bit Aug 31, 2024
3e86e90
update cleanup function
helenxie-bit Aug 31, 2024
558330b
update cleanup function
helenxie-bit Aug 31, 2024
d4ed2d8
separate e2e test for train api
helenxie-bit Sep 3, 2024
7a2ae05
fix format
helenxie-bit Sep 3, 2024
9efcce5
fix parameter of namespace
helenxie-bit Sep 3, 2024
a443ea2
fix format
helenxie-bit Sep 3, 2024
85fd8e6
reduce resources
helenxie-bit Sep 3, 2024
1a0c455
separate e2e test for train API
helenxie-bit Sep 3, 2024
afe4240
remove go setup
helenxie-bit Sep 3, 2024
250b830
adjust the version of k8s
helenxie-bit Sep 3, 2024
c5b39a4
move test file to new place
helenxie-bit Sep 3, 2024
fa99a92
fix typos
helenxie-bit Sep 4, 2024
f0d8cc4
rerun tests
helenxie-bit Sep 4, 2024
d2c3cac
update install packages
helenxie-bit Sep 21, 2024
c3f04c3
Merge remote-tracking branch 'upstream/master' into add-e2e-test-for-…
helenxie-bit Sep 21, 2024
9f42449
build and verify images of storage-intializer and trainer
helenxie-bit Sep 21, 2024
bb406ce
fix image build error
helenxie-bit Sep 21, 2024
f0b6b38
fix image build error
helenxie-bit Sep 21, 2024
45eb7e0
check disk space
helenxie-bit Sep 21, 2024
f217794
make 'setup-storage-initializer-and-trainer' executable
helenxie-bit Sep 21, 2024
083e155
separate step of loading images
helenxie-bit Sep 21, 2024
dc74844
check disk space after loading image
helenxie-bit Sep 21, 2024
de18ef0
clean up and check disk space
helenxie-bit Sep 21, 2024
ef8742c
prune docker build cache
helenxie-bit Sep 21, 2024
1eb3ef1
prune docker build cache
helenxie-bit Sep 21, 2024
1e407a5
adjust sequence of building and loading images
helenxie-bit Sep 21, 2024
7519559
move working directory
helenxie-bit Sep 21, 2024
f5d63c4
delete moving working directory
helenxie-bit Sep 22, 2024
08c8562
fix format
helenxie-bit Sep 22, 2024
d2ae542
use 'docker system prune'
helenxie-bit Sep 24, 2024
09fc8a9
make the format of the commands to be consistent
helenxie-bit Sep 24, 2024
100 changes: 100 additions & 0 deletions .github/workflows/e2e-test-train-api.yaml
@@ -0,0 +1,100 @@
name: E2E Test with train API
on:
  - pull_request

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  e2e-test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["v1.28.7"]
        python-version: ["3.9", "3.10", "3.11"]
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Free-Up Disk Space
        uses: ./.github/workflows/free-up-disk-space

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Create k8s Kind Cluster
        uses: helm/[email protected]
        with:
          node_image: kindest/node:${{ matrix.kubernetes-version }}
          cluster_name: training-operator-cluster
          kubectl_version: ${{ matrix.kubernetes-version }}

      - name: Build training-operator
        run: |
          ./scripts/gha/build-image.sh
        env:
          TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test

      - name: Deploy training operator
        run: |
          ./scripts/gha/setup-training-operator.sh
          docker system prune -a -f
          docker system df
          df -h
        env:
          KIND_CLUSTER: training-operator-cluster
          TRAINING_CI_IMAGE: kubeflowtraining/training-operator:test
          GANG_SCHEDULER_NAME: "none"
          KUBERNETES_VERSION: ${{ matrix.kubernetes-version }}

      - name: Build trainer
        run: |
          ./scripts/gha/build-trainer.sh
          docker builder prune -a -f
          docker system df
          df -h
        env:
          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test

      - name: Load trainer
        run: |
          kind load docker-image ${{ env.TRAINER_CI_IMAGE }} --name ${{ env.KIND_CLUSTER }}
          docker system prune -a -f
          docker system df
          df -h
        env:
          KIND_CLUSTER: training-operator-cluster
          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test

      - name: Build storage initializer
        run: |
          ./scripts/gha/build-storage-initializer.sh
          docker builder prune -a -f
          docker system df
          df -h
        env:
          STORAGE_INITIALIZER_CI_IMAGE: kubeflowtraining/storage-initializer:test
          TRAINER_CI_IMAGE: kubeflowtraining/trainer:test

      - name: Load storage initializer
        run: |
          kind load docker-image ${{ env.STORAGE_INITIALIZER_CI_IMAGE }} --name ${{ env.KIND_CLUSTER }}
          docker system prune -a -f
          docker system df
          df -h
        env:
          KIND_CLUSTER: training-operator-cluster
          STORAGE_INITIALIZER_CI_IMAGE: kubeflowtraining/storage-initializer:test

      - name: Run tests
        run: |
          pip install pytest
          python3 -m pip install -e sdk/python[huggingface]
          pytest -s sdk/python/test/e2e-train-api/test_e2e_train_api.py --log-cli-level=debug
        env:
          STORAGE_INITIALIZER_IMAGE: kubeflowtraining/storage-initializer:test
          TRAINER_TRANSFORMER_IMAGE_DEFAULT: kubeflowtraining/trainer:test
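The workflow above runs `docker system df` and `df -h` after each build and load step to watch disk usage on the runner. A minimal sketch of the same check as a fail-fast guard is shown below, in case a hard threshold is ever preferred over reading logs. The function names `free_gib` and `require_free_space` and the threshold value are hypothetical and are not part of this PR; only the standard-library call `shutil.disk_usage` is real.

```python
import shutil


def free_gib(path: str = ".") -> float:
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024**3)


def require_free_space(min_gib: float, path: str = ".") -> float:
    """Raise if less than `min_gib` GiB is free; otherwise return the free space.

    Hypothetical helper: the PR itself only prints disk usage and never
    enforces a minimum.
    """
    available = free_gib(path)
    if available < min_gib:
        raise RuntimeError(
            f"Only {available:.1f} GiB free at {path!r}; need at least {min_gib} GiB"
        )
    return available


if __name__ == "__main__":
    # The 10 GiB figure mirrors the PVC size requested by the test; it is an
    # assumption about what the runner needs, not a measured requirement.
    print(f"{require_free_space(0.0):.1f} GiB free")
```

Such a guard could run as an extra workflow step before the tests, so that an out-of-space runner fails with a clear message instead of a pod stuck in `Pending`.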
2 changes: 1 addition & 1 deletion .github/workflows/integration-tests.yaml
@@ -96,7 +96,7 @@ jobs:
       - name: Run tests
         run: |
           pip install pytest
-          python3 -m pip install -e sdk/python; pytest -s sdk/python/test --log-cli-level=debug --namespace=default
+          python3 -m pip install -e sdk/python; pytest -s sdk/python/test/e2e --log-cli-level=debug --namespace=default
         env:
           GANG_SCHEDULER_NAME: ${{ matrix.gang-scheduler-name }}

24 changes: 24 additions & 0 deletions scripts/gha/build-storage-initializer.sh
@@ -0,0 +1,24 @@
#!/bin/bash

# Copyright 2024 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This script builds the Kubeflow storage initializer image.


set -o errexit
set -o nounset
set -o pipefail

docker build sdk/python/kubeflow/storage_initializer -t "${STORAGE_INITIALIZER_CI_IMAGE}" -f sdk/python/kubeflow/storage_initializer/Dockerfile
24 changes: 24 additions & 0 deletions scripts/gha/build-trainer.sh
@@ -0,0 +1,24 @@
#!/bin/bash

# Copyright 2024 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This script builds the Kubeflow trainer image.


set -o errexit
set -o nounset
set -o pipefail

docker build sdk/python/kubeflow/trainer -t "${TRAINER_CI_IMAGE}" -f sdk/python/kubeflow/trainer/Dockerfile
95 changes: 95 additions & 0 deletions sdk/python/test/e2e-train-api/test_e2e_train_api.py
@@ -0,0 +1,95 @@
# Copyright 2024 kubeflow.org.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
import test.e2e.utils as utils

import transformers
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)
from kubeflow.training import TrainingClient, constants
from peft import LoraConfig

logging.basicConfig(format="%(message)s")
logging.getLogger("kubeflow.training.api.training_client").setLevel(logging.DEBUG)

TRAINING_CLIENT = TrainingClient(job_kind=constants.PYTORCHJOB_KIND)


def test_sdk_e2e_create_from_train_api(job_namespace="default"):
    JOB_NAME = "pytorchjob-from-train-api"

    # Use test case from fine-tuning API tutorial.
    # https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/
    TRAINING_CLIENT.train(
        name=JOB_NAME,
        namespace=job_namespace,
        # BERT model URI and type of Transformer to train it.
        model_provider_parameters=HuggingFaceModelParams(
            model_uri="hf://google-bert/bert-base-cased",
            transformer_type=transformers.AutoModelForSequenceClassification,
            num_labels=5,
        ),
        # In order to save test time, use 8 samples from Yelp dataset.
        dataset_provider_parameters=HuggingFaceDatasetParams(
            repo_id="yelp_review_full",
            split="train[:8]",
        ),
        # Specify HuggingFace Trainer parameters.
        trainer_parameters=HuggingFaceTrainerParams(
            training_parameters=transformers.TrainingArguments(
                output_dir="test_trainer",
                save_strategy="no",
                evaluation_strategy="no",
                do_eval=False,
                disable_tqdm=True,
                log_level="info",
                num_train_epochs=1,
            ),
            # Set LoRA config to reduce number of trainable model parameters.
            lora_config=LoraConfig(
                r=8,
                lora_alpha=8,
                lora_dropout=0.1,
                bias="none",
            ),
        ),
        num_workers=1,
        num_procs_per_worker=1,
        resources_per_worker={
            "gpu": 0,
            "cpu": 2,
            "memory": "10G",
        },
        storage_config={
            "size": "10Gi",
            "access_modes": ["ReadWriteOnce"],
        },
    )

    logging.info(f"List of created {TRAINING_CLIENT.job_kind}s")
    logging.info(TRAINING_CLIENT.list_jobs(job_namespace))

    try:
        utils.verify_job_e2e(TRAINING_CLIENT, JOB_NAME, job_namespace, wait_timeout=900)
    except Exception as e:
        utils.print_job_results(TRAINING_CLIENT, JOB_NAME, job_namespace)
        TRAINING_CLIENT.delete_job(JOB_NAME, job_namespace)
        raise Exception(f"PyTorchJob create from API E2E fails. Exception: {e}") from e

    utils.print_job_results(TRAINING_CLIENT, JOB_NAME, job_namespace)
    TRAINING_CLIENT.delete_job(JOB_NAME, job_namespace)
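
The test above repeats the print-results and delete-job calls on both the failure and the success path. One possible alternative, sketched here under assumptions, is a `try`/`finally` wrapper that always reports and cleans up exactly once. `FakeTrainingClient` and `run_with_cleanup` are hypothetical names introduced only for illustration; the fake client merely records calls and does not reflect the real `TrainingClient` implementation.

```python
from typing import Callable, List


class FakeTrainingClient:
    """Hypothetical stand-in that records calls instead of contacting a cluster."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def delete_job(self, name: str, namespace: str) -> None:
        self.calls.append(f"delete:{namespace}/{name}")


def run_with_cleanup(
    client,
    name: str,
    namespace: str,
    verify: Callable[[], None],
    report: Callable[[], None],
) -> None:
    """Run `verify`, then always call `report` and delete the job.

    On failure the original exception is chained to a RuntimeError, so the
    root cause stays visible in the traceback.
    """
    try:
        verify()
    except Exception as e:
        raise RuntimeError(f"E2E job {name} failed: {e}") from e
    finally:
        # Runs on both success and failure, before any exception propagates.
        report()
        client.delete_job(name, namespace)


if __name__ == "__main__":
    client = FakeTrainingClient()
    run_with_cleanup(client, "demo-job", "default", lambda: None, lambda: None)
    print(client.calls)
```

In the real test, `verify` would wrap `utils.verify_job_e2e(...)` and `report` would wrap `utils.print_job_results(...)`; whether that refactor suits this repository's test conventions is left to the reviewers.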