Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SDK] Consolidate Naming for CRUD APIs #1907

Merged

Conversation

andreyvelich
Copy link
Member

@andreyvelich andreyvelich commented Sep 7, 2023

Fixes: #1877
Related: #1878

I unified CRUD APIs in our Training Operator SDK, so users can submit different Job much easier with job_kind parameter.
Users can configure namespace and job_kind in TrainingClient() once, so they can re-use it during API execution.

Here are the list of public APIs that we expose to the user:

create_job()
get_job()
get_job_conditions()
get_job_logs()
get_job_pod_names()
is_job_created()
is_job_failed()
is_job_restarting()
is_job_running()
is_job_succeeded()
wait_for_job_conditions()
list_jobs()
update_job()
delete_job()

Create Job API now supports:

  • Create Job from object - job
  • Create Job from Docker image - base_image (Currently, only for TFJob and PyTorchJob).
  • Create Job from function - train_func (Currently, only for TFJob and PyTorchJob).

I removed yapf and pylint configs, since we can use black + flake8 combination (KFP also uses flake8 for lint checks)
It would be nice to add more unit test/lint checks for our SDK.

TODO: I still need to update examples with the new APIs in this PR.
/hold

It would be nice if you could start review the SDK changes.
/assign @kubeflow/wg-training-leads @tenzen-y @kuizhiqing @yaobaiwei @zw0610

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@andreyvelich
Copy link
Member Author

I updated SDK examples and I moved them to /examples/sdk, so users can find them easily.
Please take a look at the changes once you have time.

@andreyvelich andreyvelich changed the title [WIP] [SDK] Consolidate Naming for CRUD APIs [SDK] Consolidate Naming for CRUD APIs Sep 8, 2023
@andreyvelich
Copy link
Member Author

/hold for review

@andreyvelich andreyvelich force-pushed the issue-1877-consolidate-sdk-apis branch from 4adf8e4 to 19bb327 Compare September 8, 2023 22:13
@coveralls
Copy link

coveralls commented Sep 8, 2023

Pull Request Test Coverage Report for Build 6206887875

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 42.777%

Totals Coverage Status
Change from base Build 6204968357: 0.0%
Covered Lines: 3737
Relevant Lines: 8736

💛 - Coveralls

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, LGTM

.github/workflows/test-python.yaml Outdated Show resolved Hide resolved
.github/workflows/test-python.yaml Show resolved Hide resolved
.pylintrc Outdated Show resolved Hide resolved
docs/development/developer_guide.md Outdated Show resolved Hide resolved
docs/development/developer_guide.md Outdated Show resolved Hide resolved
sdk/python/kubeflow/training/api/training_client.py Outdated Show resolved Hide resolved
sdk/python/kubeflow/training/api/training_client.py Outdated Show resolved Hide resolved
sdk/python/kubeflow/training/utils/utils.py Outdated Show resolved Hide resolved
sdk/python/test/e2e/test_e2e_mpijob.py Outdated Show resolved Hide resolved
Comment on lines 151 to 161
def test_pytorchjob_from_func(job_namespace):
# Test Training function.
def train_func(parameters):
import pandas as pd
import time

print(f"Package pandas=={pd.__version__} is installed")
print(f"Input function parameters are: {parameters}")

print("Stat Training ....")
for i in range(10):
print(f"Epoch: {i} finished")
time.sleep(1)

print("Training is complete")

TRAINING_CLIENT.create_job(
name=JOB_NAME,
namespace=job_namespace,
parameters={"lr": "0.01"},
train_func=train_func,
num_worker_replicas=1,
packages_to_install=["pandas==1.3.5"],
)

TRAINING_CLIENT.delete_pytorchjob(JOB_NAME, job_namespace)
logging.info("Get created PyTorchJob from function")
logging.info(TRAINING_CLIENT.get_job(JOB_NAME, job_namespace))

verify_job_e2e(TRAINING_CLIENT, JOB_NAME, job_namespace, timeout=900)
TRAINING_CLIENT.delete_job(JOB_NAME, job_namespace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhm, I think this test is for SDK, not the operator. So ideally, we should put this test on unit tests.
However, I'm not sure if we can put this test on unit tests.
@andreyvelich WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to test this functionality in E2Es as well, unless our unit tests are going to run Kubernetes cluster.
I agree, that we need to have unit test for our SDK, but it is separate discussion.

For this I want to verify that:

  1. Kubernetes can properly create containers when train function is embedded to the container arg.
  2. Packages can be downloaded after container is started.

I am not sure, how we can verify this in unit tests, unless we start Kubernetes cluster during unit tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that it might be better to run this test case as unit tests if there is a fake client for Python like client-go.

However, I could not find such a fake client for Python. So we should keep having this case in e2e.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's explore such options when we have time to add unit tests for our SDK client.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

@andreyvelich andreyvelich force-pushed the issue-1877-consolidate-sdk-apis branch 2 times, most recently from df3d177 to fea6efb Compare September 14, 2023 23:20
@andreyvelich
Copy link
Member Author

andreyvelich commented Sep 14, 2023

@tenzen-y I removed test for Create PyTorchJob from func from our E2E for now.
I am not sure why tests failed, since locally everything is working fine. So we can unblock this PR.
We can work on adding tests for such use-cases later.

For tests, I print job info when verify_e2e failed, so we can see some details on failed tests.

@andreyvelich andreyvelich force-pushed the issue-1877-consolidate-sdk-apis branch from fea6efb to 2f49954 Compare September 16, 2023 08:30
@andreyvelich andreyvelich force-pushed the issue-1877-consolidate-sdk-apis branch from 60cf3e6 to ca02651 Compare September 16, 2023 10:43
@andreyvelich
Copy link
Member Author

@tenzen-y Alright, test passed. I think, it was temporarily issue with MXnet dataset downloading.

@andreyvelich
Copy link
Member Author

@tenzen-y Please let me know if there are any other comments that you want me to address.
/hold cancel

@tenzen-y
Copy link
Member

@tenzen-y Please let me know if there are any other comments that you want me to address. /hold cancel

It will take a bit of time to review this PR again since I'm backlogged on PRs in multiple repositories.
Thanks for your patience.

@andreyvelich
Copy link
Member Author

@tenzen-y Please let me know if there are any other comments that you want me to address. /hold cancel

It will take a bit of time to review this PR again since I'm backlogged on PRs in multiple repositories. Thanks for your patience.

Sure, no problem. Thank you for your time!

@johnugeorge
Copy link
Member

/lgtm

@johnugeorge
Copy link
Member

/assign @tenzen-y

@andreyvelich
Copy link
Member Author

@tenzen-y if you had a few minutes to check final changes that would be great!

@tenzen-y
Copy link
Member

@tenzen-y if you had a few minutes to check final changes that would be great!

@andreyvelich Sorry for the late. Implementation looks good to me. However, I’m confirming by using SDK on my local why the following e2e fail. Since I faced the similar error on my local.

@tenzen-y I removed test for Create PyTorchJob from func from our E2E for now.
I am not sure why tests failed, since locally everything is working fine. So we can unblock this PR.
We can work on adding tests for such use-cases later.

@andreyvelich
Copy link
Member Author

However, I’m confirming by using SDK on my local why the following e2e fail. Since I faced the similar error on my local.

@tenzen-y Please can you show what error did you get ?

@tenzen-y
Copy link
Member

However, I’m confirming by using SDK on my local why the following e2e fail. Since I faced the similar error on my local.

@tenzen-y Please can you show what error did you get ?

Once re-created my local cluster, the error went away. Sorry for the confusion.

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!
/lgtm
/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit bb2b58a into kubeflow:master Sep 22, 2023
57 checks passed
@andreyvelich
Copy link
Member Author

Once re-created my local cluster, the error went away. Sorry for the confusion.

@tenzen-y During our E2Es testing I saw that sometime kind just can't pulled image. Maybe we should consider to use minikube, similar to Katib E2Es.

@andreyvelich andreyvelich deleted the issue-1877-consolidate-sdk-apis branch September 22, 2023 14:42
@tenzen-y
Copy link
Member

Once re-created my local cluster, the error went away. Sorry for the confusion.

@tenzen-y During our E2Es testing I saw that sometime kind just can't pulled image. Maybe we should consider to use minikube, similar to Katib E2Es.

Or, we maybe want to load images to kind cluster with kind load docker-image XXX before starting test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SDK] Consolidate Naming for CRUD APIs
4 participants