Upgrade Kubernetes to v1.30.7 #2332

Merged on Nov 27, 2024 (4 commits)

@astefanutti (Contributor) commented Nov 22, 2024

What this PR does / why we need it:

This PR includes:

  • Kubernetes upgrade to v1.30.7
  • controller-runtime upgrade to v0.18.5
  • Adapt controllers to use controller-runtime generics API
  • Adapt code generation to new kube_codegen.sh script
  • Upgrade helm/kind-action to the latest version to include the fix "Use new mirror for downloading kubectl" (helm/kind-action#127)
  • Configure openapi-generator to skip generating the Python SDK unit tests, as these are broken and unused anyway

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

This is the first PR towards fixing #2291: it upgrades to 1.30, and #2330 will follow to upgrade to 1.31 once this one is merged.

This supersedes #2299.

Checklist:

  • Docs included if any changes are user facing

@astefanutti
Contributor Author

@kannon92 @tenzen-y I think this is ready for review. Please have a look. Thanks!

@astefanutti astefanutti mentioned this pull request Nov 22, 2024
1 task
@astefanutti astefanutti force-pushed the pr-k8s-1.30 branch 2 times, most recently from dee7021 to 85e3863 Compare November 22, 2024 13:01
@kannon92 kannon92 mentioned this pull request Nov 22, 2024
1 task
Member

@andreyvelich left a comment

/ok-to-test
/rerun-all

@@ -74,6 +74,7 @@ jobs:
- name: Create k8s Kind Cluster
uses: helm/[email protected]
with:
version: v0.25.0
Member

It seems that this version specification caused the CI errors. Do we need it?

Contributor Author

Actually, the error is also present with the current version. I had speculatively upgraded it, but it turns out that the kind-action still downloads the kubectl binary from storage.googleapis.com, while the latest versions are now hosted on dl.k8s.io.

I've changed it to reference helm/kind-action#127 by SHA until a new version of the action gets released.
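
As a rough illustration of the mirror change described above, the Python sketch below builds the kubectl download URL for both hosts. The URL layouts and the helper name kubectl_url are assumptions made for illustration; they are not taken from the kind-action source, which should be checked for the actual logic.

```python
# Assumed URL layouts for the two hosts mentioned above (illustration only).
LEGACY_BASE = "https://storage.googleapis.com/kubernetes-release/release"
CURRENT_BASE = "https://dl.k8s.io/release"


def kubectl_url(version: str, base: str = CURRENT_BASE,
                os_name: str = "linux", arch: str = "amd64") -> str:
    """Return the assumed download URL of the kubectl binary for a release."""
    # Accept both "1.30.7" and "v1.30.7" as input.
    if not version.startswith("v"):
        version = f"v{version}"
    return f"{base}/{version}/bin/{os_name}/{arch}/kubectl"


if __name__ == "__main__":
    print(kubectl_url("v1.30.7"))
    print(kubectl_url("v1.30.7", base=LEGACY_BASE))
```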

Member

Oh, I see. Thanks.
In that case, could you open an issue so that we can switch to a dedicated release tag once the new version is released?

Contributor Author

Yes, I had already created helm/kind-action#128 earlier :)

Member

Thank you for opening that!

@tenzen-y
Member

@astefanutti It seems that we specify a nonexistent K8s version: https://github.com/kubeflow/training-operator/actions/runs/12006038657/job/33464332223?pr=2332

Could you replace those based on https://hub.docker.com/r/kindest/node/tags?

@astefanutti
Contributor Author

@astefanutti It seems that we specify a nonexistent K8s version: https://github.com/kubeflow/training-operator/actions/runs/12006038657/job/33464332223?pr=2332

Could you replace those based on https://hub.docker.com/r/kindest/node/tags?

@tenzen-y I've just re-pushed with the downgrade to kindest image v1.30.6 as the image hasn't been published for v1.30.7.
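
The fallback described above (use the closest published patch release when the exact node image does not exist) could be sketched as follows. This is only an illustration: pick_node_tag and parse_tag are hypothetical helpers, and the tag list in the usage example is made up apart from v1.30.6; the real list is at https://hub.docker.com/r/kindest/node/tags.

```python
import re
from typing import Iterable, Optional, Tuple

_TAG = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_tag(tag: str) -> Optional[Tuple[int, int, int]]:
    """Parse 'vMAJOR.MINOR.PATCH'; return None for any other tag shape."""
    match = _TAG.fullmatch(tag)
    if match is None:
        return None
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def pick_node_tag(wanted: str, published: Iterable[str]) -> str:
    """Return the highest published tag in the same minor with patch <= wanted."""
    target = parse_tag(wanted)
    if target is None:
        raise ValueError(f"unsupported version format: {wanted}")
    candidates = [
        version for version in map(parse_tag, published)
        if version is not None
        and version[:2] == target[:2]
        and version[2] <= target[2]
    ]
    if not candidates:
        raise LookupError(f"no published image for the {wanted} minor release")
    return "v{}.{}.{}".format(*max(candidates))


if __name__ == "__main__":
    # Hypothetical list of published tags, for illustration only.
    print(pick_node_tag("v1.30.7", ["v1.29.10", "v1.30.4", "v1.30.6", "v1.31.2"]))
```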

@astefanutti
Contributor Author

astefanutti commented Nov 25, 2024

There is still one issue with the SDK: some methods have continue as a parameter in their signature, which does not get prefixed as _continue and breaks Python, as it's a reserved keyword. I'm looking into it.

@astefanutti
Contributor Author

The issue with the SDK generation seems similar to OpenAPITools/openapi-generator#10236. It's been reported for the legacy Python generator, but a quick look at the code shows the unit test generation template hasn't changed.

I've added a replacement into the post-generation script that replaces continue = with the correct _continue = statement.
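
For illustration, a replacement of that kind could look like the sketch below. It is not the actual post-generation script: the function name fix_reserved_keyword and the regular expression are assumptions, chosen so that only assignments to a bare continue are rewritten.

```python
import re

# Match a bare `continue` that is being assigned to, but leave alone
# `_continue`, `continue == x`, and the plain `continue` statement.
_CONTINUE_ASSIGNMENT = re.compile(r"\bcontinue(?=\s*=(?!=))")


def fix_reserved_keyword(source: str) -> str:
    """Prefix `continue` with an underscore where it is used as a variable."""
    return _CONTINUE_ASSIGNMENT.sub("_continue", source)


if __name__ == "__main__":
    generated = "continue = 'continue_example'"
    print(fix_reserved_keyword(generated))
```

Because the pattern does not match an already prefixed _continue, running the replacement twice leaves the file unchanged.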

Note the issue is only with the generated unit tests. I'm not sure how useful they are. They are currently deleted for the SDK v2.

@astefanutti
Contributor Author

The SDK e2e tests now pass. There are some SDK unit tests that fail because openapi-generator, for some reason, does not generate types that are declared as "aliases" to the object type, like runtime.RawExtension or v1.Fields.

@andreyvelich
Member

Note the issue is only with the generated unit tests. I'm not sure how useful they are. They are currently deleted for the SDK v2.

I think we can remove those generated unit tests from the V1 SDK since we don't use them.
Similarly to the V2 SDK, you can just git clean those files.

Member

@andreyvelich left a comment

Thank you for doing this @astefanutti!
I left a few comments.

sigs.k8s.io/jobset v0.5.2
sigs.k8s.io/kueue v0.6.3
sigs.k8s.io/scheduler-plugins v0.28.9
sigs.k8s.io/structured-merge-diff/v4 v4.4.1
Member

Why do we need this package ?

Contributor Author

@astefanutti, Nov 26, 2024

It's coming from one of the SSA apply configuration files in the generated Golang client, specifically pkg/client/applyconfiguration/internal/internal.go.

I don't think this is actually used, but it's needed to compile the generated client.

Member

I see, that makes sense.

@@ -26,7 +26,9 @@
("import kubeflow.training", "from kubeflow.training.models import *"),
("kubeflow.training.models.v1\/.*.v1.", "V1"),
("kubeflow.training.models.kubeflow/org/v1/", "kubeflow_org_v1_"),
("kubeflow.training.models.runtime/raw_extension.runtime\.", "Runtime"),
Member

Where do we use it ?

Member

And why do we have runtime model in our APIs ?

Contributor Author

I don't think it's used in the APIs, but it's being added by the OpenAPI specification generation script here:
https://github.com/kubernetes/code-generator/blob/b15df6411b47bf6e80bfc63947af6b436b2e05c6/kube_codegen.sh#L365-L367

It's hard-coded and it doesn't seem like there is an easy way to get rid of it. That being said, I don't think it actually impacts anything.

Comment on lines +22 to +24
kube::codegen::gen_helpers \
--boilerplate "${TRAINING_OPERATOR_ROOT}/hack/boilerplate/boilerplate.go.txt" \
"${TRAINING_OPERATOR_ROOT}/pkg/apis"
Member

@astefanutti @tenzen-y Is this a new recommended way to use kube codegen ?

Member

Yes, the previous command has been completely removed.

Contributor Author

Right, the "old" way has recently been EOL'ed: kubernetes/code-generator@2e5be31.

@@ -27,7 +27,7 @@ cd manifests/overlays/standalone
kustomize edit set image kubeflow/training-operator=${TRAINING_CI_IMAGE}

echo "Installing training operator manifests"
- kustomize build . | kubectl apply -f -
+ kustomize build . | kubectl apply --server-side=true -f -
Member

Do we want to use server side apply to deploy operator ?

Contributor Author

Client-side apply now fails because the size of the CRDs has increased beyond the (default) maximum "last-applied" annotation size. It's a recurring issue (kubernetes/kubectl#712) that's often hit with CRDs.
Note that server-side apply may eventually become the default for the kubectl apply command: kubernetes/enhancements#3805.
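
To make the size limit concrete: Kubernetes caps the total size of an object's annotations at 256 KiB (262144 bytes), and client-side apply stores a full JSON copy of the manifest in the last-applied-configuration annotation. The sketch below approximates that check. The helper name fits_last_applied_annotation is hypothetical, and the compact JSON encoding only approximates what kubectl actually stores, so treat the result as an estimate.

```python
import json

# Kubernetes rejects objects whose annotations exceed 256 KiB in total.
TOTAL_ANNOTATION_SIZE_LIMIT = 256 * 1024


def fits_last_applied_annotation(manifest: dict) -> bool:
    """Estimate whether a manifest still fits in the last-applied annotation."""
    # Compact JSON is an approximation of the stored annotation value;
    # other annotations on the object count towards the same limit.
    serialized = json.dumps(manifest, separators=(",", ":"))
    return len(serialized.encode("utf-8")) <= TOTAL_ANNOTATION_SIZE_LIMIT


if __name__ == "__main__":
    small = {"apiVersion": "v1", "kind": "ConfigMap", "metadata": {"name": "demo"}}
    # Stand-in for a CRD with a very large OpenAPI schema.
    large = {"spec": {"description": "x" * 300_000}}
    print(fits_last_applied_annotation(small), fits_last_applied_annotation(large))
```

Server-side apply avoids the problem because it tracks field ownership in managedFields instead of writing this annotation.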

Member

I see, thanks for sharing

Signed-off-by: Antonin Stefanutti <[email protected]>
@astefanutti
Contributor Author

Note the issue is only with the generated unit tests. I'm not sure how useful they are. They are currently deleted for the SDK v2.

I think we can remove those generated unit tests from the V1 SDK since we don't use them. Similarly to the V2 SDK, you can just git clean those files.

@andreyvelich thanks, that was my assumption as well.

I've added the --global-property apiTests=false,modelTests=false options to the openapi-generator CLI, which deactivate the generation of these unit tests.

@@ -45,7 +45,7 @@ go run "${repo_root}"/hack/swagger/main.go ${VERSION} >"${SWAGGER_CODEGEN_FILE}"
echo "Removing previously generated files ..."
rm -rf "${SDK_OUTPUT_PATH}"/docs/KubeflowOrgV1*.md "${SDK_OUTPUT_PATH}"/kubeflow/training/models "${SDK_OUTPUT_PATH}"/kubeflow/training/*.py "${SDK_OUTPUT_PATH}"/test/test_*.py
echo "Generating Python SDK for Training Operator ..."
- java -jar "${SWAGGER_CODEGEN_JAR}" generate -i "${repo_root}"/hack/python-sdk/swagger.json -g python -o "${SDK_OUTPUT_PATH}" -c "${SWAGGER_CODEGEN_CONF}"
+ java -jar "${SWAGGER_CODEGEN_JAR}" generate -i "${repo_root}"/hack/python-sdk/swagger.json -g python --global-property apiTests=false,modelTests=false -o "${SDK_OUTPUT_PATH}" -c "${SWAGGER_CODEGEN_CONF}"
Member

I didn't know that swagger has this flag to disable test generation.
Could we add the same flag for the V2 SDK, so we can get rid of git clean command ?

# TODO (andreyvelich): Discuss if we should use these test files.
git clean -f ${SDK_OUTPUT_PATH}/test

Contributor Author

Absolutely. Should I do that in a follow-up PR, or would you prefer to include it here?

Member

It's fine, we can do it as a followup PR.

@andreyvelich
Member

Thanks for this effort @astefanutti!
Overall, lgtm. @kubeflow/wg-training-leads please check this PR.
/assign @kubeflow/wg-training-leads

@andreyvelich
Member

/lgtm

Member

@tenzen-y left a comment

Thanks.
Basically, lgtm

@@ -198,26 +197,26 @@ func (r *JAXJobReconciler) SetupWithManager(mgr ctrl.Manager, controllerThreads
DeleteFunc: util.OnDependentDeleteFuncGeneric(r.Expectations),
}
// inject watching for job related pod
- if err = c.Watch(source.Kind(mgr.GetCache(), &corev1.Pod{}), eventHandler, predicates); err != nil {
+ if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &corev1.Pod{}, eventHandler, predicates)); err != nil {
Member

Suggested change
- if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &corev1.Pod{}, eventHandler, predicates)); err != nil {
+ if err = c.Watch(source.Kind[*corev1.Pod](mgr.GetCache(), &corev1.Pod{}, eventHandler, predicates)); err != nil {

Could we use the exact type as the type parameter?
The same question applies to all of the Job controllers.

Contributor Author

I think it'd be possible, but one instance of eventHandler and genericPredicates would have to be created per type, as they would not be reusable for different types.

I also think predicates could then be removed, so that the generic version is always used.

Happy to do it if having one event handler and predicates instance per type is OK for you. WDYT?

Contributor Author

@astefanutti, Nov 26, 2024

I’ve pushed f59f171 that should cover it so we can see what that looks like.

Member

It looks like tests are failing, did you try to run them locally @astefanutti ?

Contributor Author

@andreyvelich sorry for the noise, I should have pushed that extra commit somewhere else.
I've fixed it and the tests should pass now.

Member

@astefanutti I'm happy with accepting better refactoring :)

return err
}
// inject watching for job related service
- if err = c.Watch(source.Kind(mgr.GetCache(), &corev1.Service{}), eventHandler, predicates); err != nil {
+ if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &corev1.Service{}, eventHandler, predicates)); err != nil {
Member

Suggested change
- if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &corev1.Service{}, eventHandler, predicates)); err != nil {
+ if err = c.Watch(source.Kind[*corev1.Service](mgr.GetCache(), &corev1.Service{}, eventHandler, predicates)); err != nil {

return err
}
// skip watching volcano PodGroup if volcano PodGroup is not installed
if _, err = mgr.GetRESTMapper().RESTMapping(schema.GroupKind{Group: v1beta1.GroupName, Kind: "PodGroup"},
v1beta1.SchemeGroupVersion.Version); err == nil {
// inject watching for job related volcano PodGroup
- if err = c.Watch(source.Kind(mgr.GetCache(), &v1beta1.PodGroup{}), eventHandler, genericPredicates); err != nil {
+ if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &v1beta1.PodGroup{}, eventHandler, genericPredicates)); err != nil {
Member

Suggested change
- if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &v1beta1.PodGroup{}, eventHandler, genericPredicates)); err != nil {
+ if err = c.Watch(source.Kind[*v1beta1.PodGroup](mgr.GetCache(), &v1beta1.PodGroup{}, eventHandler, genericPredicates)); err != nil {

return err
}
}
// skip watching scheduler-plugins PodGroup if scheduler-plugins PodGroup is not installed
if _, err = mgr.GetRESTMapper().RESTMapping(schema.GroupKind{Group: schedulerpluginsv1alpha1.SchemeGroupVersion.Group, Kind: "PodGroup"},
schedulerpluginsv1alpha1.SchemeGroupVersion.Version); err == nil {
// inject watching for job related scheduler-plugins PodGroup
- if err = c.Watch(source.Kind(mgr.GetCache(), &schedulerpluginsv1alpha1.PodGroup{}), eventHandler, genericPredicates); err != nil {
+ if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &schedulerpluginsv1alpha1.PodGroup{}, eventHandler, genericPredicates)); err != nil {
Member

Suggested change
- if err = c.Watch(source.Kind[client.Object](mgr.GetCache(), &schedulerpluginsv1alpha1.PodGroup{}, eventHandler, genericPredicates)); err != nil {
+ if err = c.Watch(source.Kind[*schedulerpluginsv1alpha1.PodGroup](mgr.GetCache(), &schedulerpluginsv1alpha1.PodGroup{}, eventHandler, genericPredicates)); err != nil {

@andreyvelich
Member

/rerun-all

@astefanutti astefanutti force-pushed the pr-k8s-1.30 branch 2 times, most recently from bc31553 to ff7f60f Compare November 27, 2024 09:21
Member

@tenzen-y left a comment

Thank you for the update!
Mostly lgtm!

Comment on lines 17 to 21
import (
"fmt"
"reflect"

corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"sigs.k8s.io/controller-runtime/pkg/event"

kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
"github.com/kubeflow/training-operator/pkg/controller.v1/common"
"github.com/kubeflow/training-operator/pkg/controller.v1/expectation"
commonutil "github.com/kubeflow/training-operator/pkg/util"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
Member

Could you group the dependencies like this:

<Go std libs>

<Third party libs>

<Our own libs>

Contributor Author

I've updated it. I've also folded the two reconciler.go and reconciler_generic.go files now that it's generics all along.

Member

Thanks!

Comment on lines 22 to 31
kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
"github.com/kubeflow/training-operator/pkg/controller.v1/common"
"github.com/kubeflow/training-operator/pkg/controller.v1/expectation"
log "github.com/sirupsen/logrus"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/runtime/schema"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/event"
"sigs.k8s.io/controller-runtime/pkg/predicate"
Member

ditto

@@ -27,7 +27,7 @@ cd manifests/overlays/standalone
kustomize edit set image kubeflow/training-operator=${TRAINING_CI_IMAGE}

echo "Installing training operator manifests"
- kustomize build . | kubectl apply -f -
+ kustomize build . | kubectl apply --server-side=true -f -
Member

Contributor Author

I've updated the README. I'll raise a PR in the website repository promptly.

Member

Thanks!

Member

@tenzen-y left a comment

Thank you for this great contribution!
/lgtm
/approve


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit adc972e into kubeflow:master Nov 27, 2024
42 checks passed
saileshd1402 pushed a commit to saileshd1402/training-operator that referenced this pull request Dec 2, 2024
* Upgrade Kubernetes to v1.30.7

Signed-off-by: Antonin Stefanutti <[email protected]>

* Use typed event handlers and predicates in job controllers

Signed-off-by: Antonin Stefanutti <[email protected]>

* Re-organize pkg/common/util/reconciler.go

Signed-off-by: Antonin Stefanutti <[email protected]>

* Update installation instructions in README

Signed-off-by: Antonin Stefanutti <[email protected]>

---------

Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>
google-oss-prow bot pushed a commit that referenced this pull request Dec 9, 2024
* Added test for create-pytorchjob.ipynb

Signed-off-by: sailesh duddupudi <[email protected]>

* fix yaml syntax

Signed-off-by: sailesh duddupudi <[email protected]>

* Fix uses path

Signed-off-by: sailesh duddupudi <[email protected]>

* Add actions/checkout

Signed-off-by: sailesh duddupudi <[email protected]>

* Add bash to action.yaml

Signed-off-by: sailesh duddupudi <[email protected]>

* Install pip dependencies step

Signed-off-by: sailesh duddupudi <[email protected]>

* Add quotes for args

Signed-off-by: sailesh duddupudi <[email protected]>

* Add jupyter

Signed-off-by: sailesh duddupudi <[email protected]>

* Add nbformat_minor: 5 to fix invalid format error

Signed-off-by: sailesh duddupudi <[email protected]>

* Fix job name

Signed-off-by: sailesh duddupudi <[email protected]>

* test papermill-args-yaml

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args1

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args2

Signed-off-by: sailesh duddupudi <[email protected]>

* testing multi line args3

Signed-off-by: sailesh duddupudi <[email protected]>

* Parameterize sdk install

Signed-off-by: sailesh duddupudi <[email protected]>

* Remove unnecessary output

Signed-off-by: sailesh duddupudi <[email protected]>

* nbformat normailze

Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] Training Client Conditions related unit tests (#2253)

* test: add unit test for get_job_conditions function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_created function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_running function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_restarting function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_failed function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: add unit test for is_job_succeded function of training client

Signed-off-by: Bobbins228 <[email protected]>

* test: improve job condition unit tests efficiency

Signed-off-by: Bobbins228 <[email protected]>

---------

Signed-off-by: Bobbins228 <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] test: add unit test for list_jobs method of the training_client (#2267)

Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273)

Generate clientset, informers, listers and open api spec
for v2alpha1 APIs.

Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] Use torchrun to create PyTorchJob from function (#2276)

* [SDK] Use torchrun to create PyTorchJob from function

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update PyTorchJob SDK example

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add consts for entrypoint

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add check for num procs per worker

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [SDK] test: add unit test for get_job_logs method of the training_client (#2275)

Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [v2alpha] Move GV related codebase (#2281)

Move GV related codebase in v2alpha

Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement runtime framework (#2248)

* KEP-2170: Implement runtime framework interfaces

Signed-off-by: Yuki Iwai <[email protected]>

* Remove grep dependency

Signed-off-by: Yuki Iwai <[email protected]>

* KEP-2170: Implement ValidateObjects interface to the runtime framework

Signed-off-by: Yuki Iwai <[email protected]>

* KEP-2170: Expose the TrainingRuntime and ClusterTrainingRuntime Kind

Signed-off-by: Yuki Iwai <[email protected]>

* KEP-2170: Remove unneeded scheme field from the internal TrainingRuntime

Signed-off-by: Yuki Iwai <[email protected]>

* Rephrase the error message

Signed-off-by: Yuki Iwai <[email protected]>

* Distinguish TrainingRuntime and ClusterTrainingRuntime when creating indexes for the TrainJobs

Signed-off-by: Yuki Iwai <[email protected]>

* Propagate the TrainJob labels and annotations to the JobSet

Signed-off-by: Yuki Iwai <[email protected]>

* Remove PodAnnotations from the runtime info

Signed-off-by: Yuki Iwai <[email protected]>

* Implement TrainingRuntime ReplicatedJob validation

Signed-off-by: Yuki Iwai <[email protected]>

* Add TODO comments

Signed-off-by: Yuki Iwai <[email protected]>

* Replace queueSuspendedTrainJob with queueSuspendedTrainJobs

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Add DeepSpeed Example with Pytorch Operator (#2235)

Signed-off-by: Syulin7 <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283)

* KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API

Signed-off-by: Andrey Velichkevich <[email protected]>

* Rename RuntimeRef in runtime framework

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260)

Signed-off-by: Akshay Chitneni <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Upgrade Deepspeed demo dependencies (#2294)

Signed-off-by: Syulin7 <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add manifests for Kubeflow Training V2 (#2289)

* KEP-2170: Add manifests for Kubeflow Training V2

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix invalid name for webhook config in cert

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix integration tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Move kubebuilder markers to runtime framework

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use Kubernetes recommended labels

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286)

* FSDP Example with PyTorchJob and T5 Fine-Tuning

Signed-off-by: Andrey Velichkevich <[email protected]>

* Modify text

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement TrainJob Reconciler to manage objects (#2295)

* KEP-2170: Implement TrainJob Reconciler to manage objects

Signed-off-by: Yuki Iwai <[email protected]>

* Mode dep-crds to manifests/external-crds

Signed-off-by: Yuki Iwai <[email protected]>

* Rename run with runtime

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Remove Prometheus Monitoring doc (#2301)

Signed-off-by: Sophie <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Decouple JobSet from TrainJob (#2296)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Initialize runtimes before the manager starts (#2306)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310)

* Generate SDK models for the Training V2 APIs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Create pyproject.toml config

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove comments

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix pre-commit

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Create model and dataset initializers (#2303)

* KEP-2170: Create model and dataset initializers

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add abstract classes

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add storage URI to config

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update .gitignore

Co-authored-by: Kevin Hannon <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix the misspelling for initializer

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add .pt and .pth to ignore_patterns

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Kevin Hannon <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308)

* KEP-2170: Implement JobSet and PlainML Plugins

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix nil pointer exception for Trainer

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix unit tests in runtime package

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix unit tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix integration tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix lint

Signed-off-by: Andrey Velichkevich <[email protected]>

* Implement Torch Plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use list for the Info envs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix golang ci

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix Torch plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use K8s sets
Update error return
Use ptr.Deref() for nil values

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use client.Object for Build() call

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove DeepCopy

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove MLPolicy and PodGroupPolicy from the Info object

Signed-off-by: Andrey Velichkevich <[email protected]>

* Inline error

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove SDK jar file

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add integration test for Torch plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add TODO to calculate PodGroup values in unit tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Revert the change to add original Runtime Policies to Info

Signed-off-by: Andrey Velichkevich <[email protected]>

* Create const for the DefaultJobReplicas

Signed-off-by: Andrey Velichkevich <[email protected]>

* Check if PodLabels is empty

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Implement Initializer builders in the JobSet plugin (#2316)

* KEP-2170: Implement Initializer builder in the JobSet plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update the SDK models

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove Info from Initializer builder

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update manifests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update pkg/constants/constants.go

Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

* Use var for envs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove check manifests from GitHub actions

Signed-off-by: Andrey Velichkevich <[email protected]>

* Move consts to JobSet plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add the TrainJob state transition design (#2298)

* KEP-2170: Add the TrainJob state transition design

Signed-off-by: Yuki Iwai <[email protected]>

* Replace actual jobs with TrainJob

Signed-off-by: Yuki Iwai <[email protected]>

* Remove the JobSet conditions propagation and Add expanding runtime framework interfaces for each plugin

Signed-off-by: Yuki Iwai <[email protected]>

* Expand the Creation Failed reasons

Signed-off-by: Yuki Iwai <[email protected]>

* Rename Completed to Complete

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Update tf job examples to tf v2 (#2270)

* mnist with summaries updated to TF v2

Signed-off-by: yelias <[email protected]>

* tf_sample updated to TF v2

Signed-off-by: yelias <[email protected]>

* Add mnist_utils and update dist-mnist

Signed-off-by: yelias <[email protected]>

* Add mnist_utils and update dist-mnist

Signed-off-by: yelias <[email protected]>

* Remove old example - estimator-API, this example has been replaced by distribution_strategy

Signed-off-by: yelias <[email protected]>

* Small fix

Signed-off-by: yelias <[email protected]>

* Remove unsupported powerPC dockerfiles

Signed-off-by: yelias <[email protected]>

* Fix typo in copyright

Signed-off-by: yelias <[email protected]>

---------

Signed-off-by: yelias <[email protected]>
Co-authored-by: yelias <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add TrainJob conditions (#2322)

* KEP-2170: Implement TrainJob conditions

Signed-off-by: Yuki Iwai <[email protected]>

* Fix API comments

Signed-off-by: Yuki Iwai <[email protected]>

* Make condition message constants

Signed-off-by: Yuki Iwai <[email protected]>

* Stop connecting condition type and reason in JobSet plugin

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Pin Gloo repository in JAX Dockerfile to a specific commit (#2329)

This commit pins the Gloo repository to a specific commit (43b7acbf) in
the JAX Dockerfile to prevent build failures caused by a recent bug
introduced in the Gloo codebase. By locking the version of Gloo to
a known working commit, we ensure that the JAX build remains stable and
functional until the issue is resolved upstream.

The build failure occurs when compiling the gloo/transport/tcp/buffer.cc
file due to an undefined __NR_gettid constant, which was introduced
after the pinned commit. By using this commit, we bypass the issue and
allow the build to complete successfully.

Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* [fix] Resolve v2alpha API exceptions (#2317)

Resolve v2alpha API exceptions by adding necessary listType validations.

Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Upgrade Kubernetes to v1.30.7 (#2332)

* Upgrade Kubernetes to v1.30.7

Signed-off-by: Antonin Stefanutti <[email protected]>

* Use typed event handlers and predicates in job controllers

Signed-off-by: Antonin Stefanutti <[email protected]>

* Re-organize pkg/common/util/reconciler.go

Signed-off-by: Antonin Stefanutti <[email protected]>

* Update installation instructions in README

Signed-off-by: Antonin Stefanutti <[email protected]>

---------

Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Ignore cache exporting errors in the image building workflows (#2336)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* KEP-2170: Add Torch Distributed Runtime (#2328)

* KEP-2170: Add Torch Distributed Runtime

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add pip list

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Refine the server-side apply installation args (#2337)

Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Add openapi-generator CLI option to skip SDK v2 test generation (#2338)

Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Upgrade kustomization files to Kustomize v5 (#2326)

Signed-off-by: oksanabaza <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Pin accelerate package version in trainer (#2340)

* Pin accelerate package version in trainer

Signed-off-by: Gavrish Prabhu <[email protected]>

* include new line to pass pre-commit hook

Signed-off-by: Gavrish Prabhu <[email protected]>

---------

Signed-off-by: Gavrish Prabhu <[email protected]>
Signed-off-by: sailesh duddupudi <[email protected]>

* Replace papermill command with bash script

Signed-off-by: sailesh duddupudi <[email protected]>

* Typo fix

Signed-off-by: sailesh duddupudi <[email protected]>

* Move Checkout step outside action.yaml file

Signed-off-by: sailesh duddupudi <[email protected]>

* Add newline EOF in script

Signed-off-by: sailesh duddupudi <[email protected]>

* Pass python dependencies as args and pin versions

Signed-off-by: sailesh duddupudi <[email protected]>

* Update Usage

Signed-off-by: sailesh duddupudi <[email protected]>

* Install dependencies in yaml

Signed-off-by: sailesh duddupudi <[email protected]>

* fix ipynb

Signed-off-by: sailesh duddupudi <[email protected]>

* set bash flags

Signed-off-by: sailesh duddupudi <[email protected]>

* Update script args and add more kubernetes versions for tests

Signed-off-by: sailesh duddupudi <[email protected]>

* add gang-scheduler-name to template

Signed-off-by: sailesh duddupudi <[email protected]>

* move go setup to template

Signed-off-by: sailesh duddupudi <[email protected]>

* remove -p parameter from script

Signed-off-by: sailesh duddupudi <[email protected]>

---------

Signed-off-by: sailesh duddupudi <[email protected]>
Signed-off-by: Bobbins228 <[email protected]>
Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: Varsha Prasad Narsing <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Syulin7 <[email protected]>
Signed-off-by: Akshay Chitneni <[email protected]>
Signed-off-by: Sophie <[email protected]>
Signed-off-by: yelias <[email protected]>
Signed-off-by: Sandipan Panda <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: oksanabaza <[email protected]>
Signed-off-by: Gavrish Prabhu <[email protected]>
Co-authored-by: Mark Campbell <[email protected]>
Co-authored-by: Wei-Cheng Lai <[email protected]>
Co-authored-by: Varsha <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
Co-authored-by: yu lin <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
Co-authored-by: Sophie Hsu <[email protected]>
Co-authored-by: Kevin Hannon <[email protected]>
Co-authored-by: YosiElias <[email protected]>
Co-authored-by: yelias <[email protected]>
Co-authored-by: Sandipan Panda <[email protected]>
Co-authored-by: Antonin Stefanutti <[email protected]>
Co-authored-by: Oksana Bazylieva <[email protected]>
Co-authored-by: Gavrish Prabhu <[email protected]>
Development

Successfully merging this pull request may close these issues.

Support Kubernetes v1.29 - v1.31 or v1.28 - v1.31