Commit

KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283)
* KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API

Signed-off-by: Andrey Velichkevich <[email protected]>

* Rename RuntimeRef in runtime framework

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
andreyvelich authored Oct 17, 2024
1 parent 2d58b49 commit 6965c1a
Showing 15 changed files with 134 additions and 134 deletions.
20 changes: 10 additions & 10 deletions docs/proposals/2170-kubeflow-training-v2/README.md
@@ -281,7 +281,7 @@ type TrainJob struct {

type TrainJobSpec struct {
// Reference to the training runtime.
TrainingRuntimeRef TrainingRuntimeRef `json:"trainingRuntimeRef"`
RuntimeRef RuntimeRef `json:"runtimeRef"`

// Configuration of the desired trainer.
Trainer *Trainer `json:"trainer,omitempty"`
@@ -317,7 +317,7 @@ type TrainJobSpec struct {
ManagedBy *string `json:"managedBy,omitempty"`
}

type TrainingRuntimeRef struct {
type RuntimeRef struct {
// Name of the runtime being referenced.
// When namespaced-scoped TrainingRuntime is used, the TrainJob must have
// the same namespace as the deployed runtime.
@@ -375,7 +375,7 @@ This table explains the rationale for each `TrainJob` parameter:
</td>
</tr>
<tr>
<td><code>TrainingRuntimeRef</code>
<td><code>RuntimeRef</code>
</td>
<td>Reference to the existing <code>TrainingRuntime</code> that is pre-deployed by platform engineers
</td>
@@ -430,7 +430,7 @@ metadata:
name: torch-ddp
namespace: tenant-alpha
spec:
trainingRuntimeRef:
runtimeRef:
name: torch-distributed-multi-node
trainer:
image: docker.io/custom-training
@@ -488,7 +488,7 @@ metadata:
name: tune-llama-with-yelp
namespace: tenant-alpha
spec:
trainingRuntimeRef:
runtimeRef:
name: torch-tune-llama-7b
datasetConfig:
storageUri: s3://dataset/custom-dataset/yelp-review
@@ -890,7 +890,7 @@ metadata:
name: pytorch-distributed
namespace: tenant-alpha
spec:
trainingRuntimeRef:
runtimeRef:
name: pytorch-distributed-gpu
trainer:
image: docker.io/custom-training
@@ -939,7 +939,7 @@ to control versions of `TrainingRuntime` and enable rolling updates.

We are going to create two CRDs: `TrainingRuntime` and `ClusterTrainingRuntime`. These runtimes have
exactly the same APIs, but the first one is the namespace-scoped, the second is the cluster-scoped.
User can set the `kind` and `apiGroup` parameters in the `trainingRuntimeRef` to use
User can set the `kind` and `apiGroup` parameters in the `runtimeRef` to use
the `TrainingRuntime` from the `TrainJob's` namespace, otherwise the `ClusterTrainingRuntime` will
be used.

@@ -1228,7 +1228,7 @@ metadata:
name: torch-test
namespace: tenant-alpha
spec:
trainingRuntimeRef:
runtimeRef:
name: torch-distributed-multi-node
trainer:
resourcesPerNode:
@@ -1698,15 +1698,15 @@ Note that we should implement the status transitions validations to once we supp

### Support Multiple API Versions of TrainingRuntime

We can consider to introduce the `version` field for runtime API version to the `.spec.trainingRuntimeRef`
We can consider to introduce the `version` field for runtime API version to the `.spec.runtimeRef`
so that we can support multiple API versions of TrainingRuntime.

It could mitigate the pain points when users upgrade the older API Version to newer API Version like alpha to beta.
But, we do not aim to support both Alpha and Beta versions or both first Alpha and second Alpha versions in the specific training-operator release.
Hence, the `version` field was not introduced.

```go
type TrainingRuntimeRef struct {
type RuntimeRef struct {
[...]

// APIVersion is the apiVersion for the runtime.
@@ -19,7 +19,7 @@ spec:
openAPIV3Schema:
description: |-
ClusterTrainingRuntime represents a training runtime which can be referenced as part of
`trainingRuntimeRef` API in TrainJob. This resource is a cluster-scoped and can be referenced
`runtimeRef` API in TrainJob. This resource is a cluster-scoped and can be referenced
by TrainJob that created in *any* namespace.
properties:
apiVersion:
2 changes: 1 addition & 1 deletion manifests/v2/base/crds/kubeflow.org_trainingruntimes.yaml
@@ -19,7 +19,7 @@ spec:
openAPIV3Schema:
description: |-
TrainingRuntime represents a training runtime which can be referenced as part of
`trainingRuntimeRef` API in TrainJob. This resource is a namespaced-scoped and can be referenced
`runtimeRef` API in TrainJob. This resource is a namespaced-scoped and can be referenced
by TrainJob that created in the *same* namespace as the TrainingRuntime.
properties:
apiVersion:
48 changes: 24 additions & 24 deletions manifests/v2/base/crds/kubeflow.org_trainjobs.yaml
@@ -2732,6 +2732,29 @@ spec:
- targetReplicatedJobs
type: object
type: array
runtimeRef:
description: Reference to the training runtime.
properties:
apiGroup:
description: |-
APIGroup of the runtime being referenced.
Defaults to `kubeflow.org`.
type: string
kind:
description: |-
Kind of the runtime being referenced.
It must be one of TrainingRuntime or ClusterTrainingRuntime.
Defaults to ClusterTrainingRuntime.
type: string
name:
description: |-
Name of the runtime being referenced.
When namespaced-scoped TrainingRuntime is used, the TrainJob must have
the same namespace as the deployed runtime.
type: string
required:
- name
type: object
suspend:
description: |-
Whether the controller should suspend the running TrainJob.
@@ -2937,31 +2960,8 @@ spec:
type: object
type: object
type: object
trainingRuntimeRef:
description: Reference to the training runtime.
properties:
apiGroup:
description: |-
APIGroup of the runtime being referenced.
Defaults to `kubeflow.org`.
type: string
kind:
description: |-
Kind of the runtime being referenced.
It must be one of TrainingRuntime or ClusterTrainingRuntime.
Defaults to ClusterTrainingRuntime.
type: string
name:
description: |-
Name of the runtime being referenced.
When namespaced-scoped TrainingRuntime is used, the TrainJob must have
the same namespace as the deployed runtime.
type: string
required:
- name
type: object
required:
- trainingRuntimeRef
- runtimeRef
type: object
status:
description: Current status of TrainJob.
86 changes: 43 additions & 43 deletions pkg/apis/kubeflow.org/v2alpha1/openapi_generated.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions pkg/apis/kubeflow.org/v2alpha1/trainingruntime_types.go
@@ -38,7 +38,7 @@ const (
// +kubebuilder:resource:scope=Cluster

// ClusterTrainingRuntime represents a training runtime which can be referenced as part of
// `trainingRuntimeRef` API in TrainJob. This resource is a cluster-scoped and can be referenced
// `runtimeRef` API in TrainJob. This resource is a cluster-scoped and can be referenced
// by TrainJob that created in *any* namespace.
type ClusterTrainingRuntime struct {
metav1.TypeMeta `json:",inline"`
@@ -72,7 +72,7 @@ type ClusterTrainingRuntimeList struct {
// +kubebuilder:storageversion

// TrainingRuntime represents a training runtime which can be referenced as part of
// `trainingRuntimeRef` API in TrainJob. This resource is a namespaced-scoped and can be referenced
// `runtimeRef` API in TrainJob. This resource is a namespaced-scoped and can be referenced
// by TrainJob that created in the *same* namespace as the TrainingRuntime.
type TrainingRuntime struct {
metav1.TypeMeta `json:",inline"`
6 changes: 3 additions & 3 deletions pkg/apis/kubeflow.org/v2alpha1/trainjob_types.go
@@ -63,7 +63,7 @@ type TrainJobList struct {
// TrainJobSpec represents specification of the desired TrainJob.
type TrainJobSpec struct {
// Reference to the training runtime.
TrainingRuntimeRef TrainingRuntimeRef `json:"trainingRuntimeRef"`
RuntimeRef RuntimeRef `json:"runtimeRef"`

// Configuration of the desired trainer.
Trainer *Trainer `json:"trainer,omitempty"`
@@ -99,8 +99,8 @@ type TrainJobSpec struct {
ManagedBy *string `json:"managedBy,omitempty"`
}

// TrainingRuntimeRef represents the reference to the existing training runtime.
type TrainingRuntimeRef struct {
// RuntimeRef represents the reference to the existing training runtime.
type RuntimeRef struct {
// Name of the runtime being referenced.
// When namespaced-scoped TrainingRuntime is used, the TrainJob must have
// the same namespace as the deployed runtime.
