[WIP] Add e2e test for tune api with LLM hyperparameter optimization #2411

Closed

57 commits
2a882d7
update tune api for llm hyperparameters optimization
helenxie-bit Jul 21, 2024
0c3e067
resolve conflict
helenxie-bit Jul 21, 2024
158c8f3
resolve conflict
helenxie-bit Jul 21, 2024
f4a0d4e
fix the problem of dependency
helenxie-bit Jul 21, 2024
7e7dd56
fix the format of import statement
helenxie-bit Jul 21, 2024
62ad385
adjust the blank lines
helenxie-bit Jul 21, 2024
3f36740
delete the trainer to reuse it in Training Operator
helenxie-bit Jul 22, 2024
9d20253
update constants
helenxie-bit Jul 22, 2024
dfbe793
update metrics format
helenxie-bit Jul 25, 2024
290a249
update the type of and
helenxie-bit Jul 29, 2024
aba2606
update the message of 'ImportError'
helenxie-bit Jul 29, 2024
eaf0193
add TODO of PVC creation
helenxie-bit Jul 29, 2024
62355a2
update the name of pvc
helenxie-bit Jul 29, 2024
7b2b40e
reuse constants from Training Operator
helenxie-bit Jul 29, 2024
acd1dcf
keep 'parameters' and update validation
helenxie-bit Jul 30, 2024
10b057d
update for test
helenxie-bit Jul 31, 2024
5a87eb0
reuse 'get_container_spec' and 'get_pod_template_spec' from Training …
helenxie-bit Aug 7, 2024
8387e67
resolve conflicts
helenxie-bit Aug 7, 2024
71605b4
format with black
helenxie-bit Aug 7, 2024
35acedb
fix Lint error
helenxie-bit Aug 7, 2024
af534b3
fix Lint errors
helenxie-bit Aug 7, 2024
c7f6e10
delete types
helenxie-bit Aug 7, 2024
9fdbdb7
fix format
helenxie-bit Aug 7, 2024
ddd5153
update format
helenxie-bit Aug 7, 2024
b31e820
update format
helenxie-bit Aug 7, 2024
dad3831
fix e2e test error
helenxie-bit Aug 7, 2024
1afe56d
add TODO
helenxie-bit Aug 8, 2024
ad7bce8
format with max line length
helenxie-bit Aug 8, 2024
7e58c94
format docstring
helenxie-bit Aug 8, 2024
61dc8ca
update format
helenxie-bit Aug 8, 2024
ba0d7d1
add helper functions
helenxie-bit Aug 8, 2024
2a1b008
update format
helenxie-bit Aug 8, 2024
b368521
update format
helenxie-bit Aug 8, 2024
3ccbdf9
run test again
helenxie-bit Aug 12, 2024
64e34e0
run test again
helenxie-bit Aug 12, 2024
dde724c
run test again
helenxie-bit Aug 12, 2024
1cccd4a
fix dict substitution in training_parameters
helenxie-bit Aug 14, 2024
510661d
fix typo
helenxie-bit Aug 17, 2024
f03c5ba
Merge remote-tracking branch 'origin/master' into helenxie/update_tun…
helenxie-bit Aug 18, 2024
f6b15a2
resolve conflicts and add check for case of no parameters
helenxie-bit Aug 18, 2024
6a3e046
fix format
helenxie-bit Aug 18, 2024
25541b9
fix format
helenxie-bit Aug 18, 2024
99e74d1
fix format
helenxie-bit Aug 18, 2024
96cf99c
fix flake8 error
helenxie-bit Aug 18, 2024
c568806
fix format
helenxie-bit Aug 18, 2024
6f65253
fix format
helenxie-bit Aug 18, 2024
ad17ac9
fix format
helenxie-bit Aug 18, 2024
9a1e2df
fix format
helenxie-bit Aug 18, 2024
421aaa6
add pytorchjob for tune api
helenxie-bit Aug 19, 2024
bab4d92
fix format
helenxie-bit Aug 19, 2024
f11051d
add 'types' module
helenxie-bit Aug 19, 2024
96768bc
add unit test for tune api
helenxie-bit Aug 19, 2024
3edfb49
fix format
helenxie-bit Aug 19, 2024
ccdc612
fix format
helenxie-bit Aug 19, 2024
fa6c9d7
add e2e test for tune api with llm hyperparameters optimization
helenxie-bit Aug 19, 2024
752c712
fix format
helenxie-bit Aug 19, 2024
0e41aab
update e2e test for tune api
helenxie-bit Sep 3, 2024
5 changes: 5 additions & 0 deletions .github/workflows/e2e-test-tune-api.yaml
@@ -22,10 +22,15 @@ jobs:
        with:
          kubernetes-version: ${{ matrix.kubernetes-version }}

      - name: Install Training Operator SDK
        shell: bash
        run: pip install kubeflow-training[huggingface]

      - name: Run e2e test with tune API
        uses: ./.github/workflows/template-e2e-test
        with:
          tune-api: true
          training-operator: true

    strategy:
      fail-fast: false
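The new step installs the Training Operator SDK with its HuggingFace extra, because the LLM path of the tune API imports parameter classes from `kubeflow.storage_initializer` at runtime. A minimal sketch of the dependency guard this implies, assuming the package layout shown in the diff below (the error text is illustrative, not the SDK's exact message):

```python
# Sketch: fail fast with a helpful message when the optional extra is missing.
try:
    from kubeflow.storage_initializer.hugging_face import (
        HuggingFaceModelParams,
        HuggingFaceTrainerParams,
    )
except ImportError:
    raise ImportError(
        "LLM hyperparameter optimization requires extra packages. "
        "Install them with: pip install kubeflow-training[huggingface]"
    )
```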
10 changes: 9 additions & 1 deletion sdk/python/v1beta1/kubeflow/katib/api/katib_client.py
@@ -16,12 +16,13 @@
import logging
import multiprocessing
import time
from typing import Any, Callable, Dict, List, Optional, TYPE_CHECKING, Union

import grpc
import kubeflow.katib.katib_api_pb2 as katib_api_pb2
import kubeflow.katib.katib_api_pb2_grpc as katib_api_pb2_grpc
from kubeflow.katib import models
from kubeflow.katib import types
from kubeflow.katib.api_client import ApiClient
from kubeflow.katib.constants import constants
from kubeflow.katib.types.trainer_resources import TrainerResources
@@ -30,6 +31,12 @@

logger = logging.getLogger(__name__)

if TYPE_CHECKING:
    from kubeflow.storage_initializer.hugging_face import HuggingFaceDatasetParams
    from kubeflow.storage_initializer.hugging_face import HuggingFaceModelParams
    from kubeflow.storage_initializer.hugging_face import HuggingFaceTrainerParams
    from kubeflow.storage_initializer.s3 import S3DatasetParams


class KatibClient(object):
    def __init__(
@@ -338,6 +345,7 @@ class name in this argument.
            to the base image packages. These packages are installed before
            executing the objective function.
        pip_index_url: The PyPI url from which to install Python packages.
        metrics_collector_config: Specify the config of metrics collector,
            for example, `metrics_collector_config = {"kind": "Push"}`.
            Currently, we only support `StdOut` and `Push` metrics collector.
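The imports sit behind `TYPE_CHECKING` so the HuggingFace classes stay available for type annotations without becoming a hard runtime dependency of the Katib SDK. A minimal sketch of the pattern; `tune_sketch` is a hypothetical stand-in for the real `tune` method:

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Evaluated only by static type checkers (mypy, pyright), never at runtime.
    from kubeflow.storage_initializer.hugging_face import HuggingFaceModelParams


def tune_sketch(model_provider_parameters: Optional["HuggingFaceModelParams"] = None):
    if model_provider_parameters is not None:
        # Defer the real import to call time so a plain Katib install keeps
        # working until the LLM-specific code path is actually used.
        from kubeflow.storage_initializer.hugging_face import HuggingFaceModelParams

        assert isinstance(model_provider_parameters, HuggingFaceModelParams)
```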
1 change: 1 addition & 0 deletions sdk/python/v1beta1/kubeflow/katib/constants/constants.py
@@ -33,6 +33,7 @@


DEFAULT_PRIMARY_CONTAINER_NAME = "training-container"
PYTORCHJOB_PRIMARY_CONTAINER_NAME = "pytorch"

# Label to identify Experiment's resources.
EXPERIMENT_LABEL = "katib.kubeflow.org/experiment"
7 changes: 7 additions & 0 deletions sdk/python/v1beta1/kubeflow/katib/types/__init__.py
@@ -0,0 +1,7 @@
from __future__ import absolute_import

# Import types into type package.
from kubeflow.katib.types.trainer_resources import TrainerResources

# Import Kubernetes models.
from kubernetes.client import *
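This re-export lets callers reach `TrainerResources` (and the Kubernetes models) through a single `types` namespace, which the e2e script below relies on. A quick usage sketch; the values mirror the e2e test:

```python
from kubeflow.katib import types

# Per-Trial distributed training shape: one worker, one process, fixed resources.
resources = types.TrainerResources(
    num_workers=1,
    num_procs_per_worker=1,
    resources_per_worker={"cpu": "2", "memory": "10G"},
)
```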
98 changes: 92 additions & 6 deletions test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py
@@ -1,8 +1,15 @@
import argparse
import logging

import transformers
from kubeflow.katib import KatibClient, search, types
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceDatasetParams,
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
)
from kubernetes import client
from peft import LoraConfig
from verify import verify_experiment_results

# Experiment timeout is 40 min.
@@ -12,7 +19,8 @@
logging.basicConfig(level=logging.INFO)


# Test for Experiment created with custom objective.
def run_e2e_experiment_create_by_tune_with_custom_objective(
    katib_client: KatibClient,
    exp_name: str,
    exp_namespace: str,
@@ -57,6 +65,70 @@ def objective(parameters):
    logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
    logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))


# Test for Experiment created with external models and datasets.
def run_e2e_experiment_create_by_tune_with_external_model(
    katib_client: KatibClient,
    exp_name: str,
    exp_namespace: str,
):
    # Create Katib Experiment and wait until it is finished.
    logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))

    # Use the test case from the fine-tuning API tutorial:
    # https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/
    # Create the Katib Experiment and wait until it reaches the Succeeded condition.
    katib_client.tune(
        name=exp_name,
        namespace=exp_namespace,
        # BERT model URI and type of Transformer to train it.
        model_provider_parameters=HuggingFaceModelParams(
            model_uri="hf://google-bert/bert-base-cased",
            transformer_type=transformers.AutoModelForSequenceClassification,
            num_labels=5,
        ),
        # To save test time, use 8 samples from the Yelp dataset.
        dataset_provider_parameters=HuggingFaceDatasetParams(
            repo_id="yelp_review_full",
            split="train[:8]",
        ),
        # Specify HuggingFace Trainer parameters.
        trainer_parameters=HuggingFaceTrainerParams(
            training_parameters=transformers.TrainingArguments(
                output_dir="test_tune_api",
                save_strategy="no",
                learning_rate=search.double(min=1e-05, max=5e-05),
                num_train_epochs=1,
            ),
            # Set LoRA config to reduce the number of trainable model parameters.
            lora_config=LoraConfig(
                r=search.int(min=8, max=32),
                lora_alpha=8,
                lora_dropout=0.1,
                bias="none",
            ),
        ),
        objective_metric_name="train_loss",
        objective_type="minimize",
        algorithm_name="random",
        max_trial_count=1,
        parallel_trial_count=1,
        resources_per_trial=types.TrainerResources(
            num_workers=1,
            num_procs_per_worker=1,
            resources_per_worker={"cpu": "2", "memory": "10G"},
        ),
    )
    experiment = katib_client.wait_for_experiment_condition(
        exp_name, exp_namespace, timeout=EXPERIMENT_TIMEOUT
    )

    # Verify the Experiment results.
    verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)

    # Print the Experiment and Suggestion.
    logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
    logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
@@ -82,15 +154,29 @@ def objective(parameters):
    exp_name = "tune-example"
    exp_namespace = args.namespace
    try:
        run_e2e_experiment_create_by_tune_with_custom_objective(
            katib_client, exp_name, exp_namespace
        )
        logging.info("---------------------------------------------------------------")
        logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name}")
    except Exception as e:
        logging.info("---------------------------------------------------------------")
        logging.info(f"E2E is failed for Experiment created by tune: {exp_namespace}/{exp_name}")
        raise e
    finally:
        # Delete the Experiment.
        logging.info("---------------------------------------------------------------")
        katib_client.delete_experiment(exp_name, exp_namespace)

    try:
        run_e2e_experiment_create_by_tune_with_external_model(
            katib_client, exp_name, exp_namespace
        )
        logging.info("---------------------------------------------------------------")
        logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name}")
    except Exception as e:
        logging.info("---------------------------------------------------------------")
        logging.info(f"E2E is failed for Experiment created by tune: {exp_namespace}/{exp_name}")
        raise e
    finally:
        # Delete the Experiment.
        logging.info("---------------------------------------------------------------")
        katib_client.delete_experiment(exp_name, exp_namespace)
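The search-space helpers used in the trainer parameters above come from Katib's `search` module; each returns a placeholder that Katib substitutes into the trainer's arguments when it generates Trials (the commit "fix dict substitution in training_parameters" refers to this substitution). A short, illustrative sketch; the ranges and the `categorical` choices are examples, not values from this PR:

```python
from kubeflow.katib import search

# Continuous range, sampled per Trial (e.g. learning rate).
lr = search.double(min=1e-05, max=5e-05)

# Integer range (e.g. LoRA rank).
rank = search.int(min=8, max=32)

# Discrete choices from a list.
optimizer = search.categorical(["adamw_torch", "sgd"])
```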