Mix training and inference infra and manifests #1487

TarasRudko · 2024-10-11T18:25:15Z

Description

This PR adds samples for a tutorial related to mixed training and inference in a single cluster.

Tasks

The contributing guide has been read and followed.
The samples added / modified have been fully tested.
Workflow files have been added / modified, if applicable.
Region tags have been properly added, if new samples.
All dependencies are set to up-to-date versions, as applicable.
Merge this pull-request for me once it is approved.

kenthua

can we remove the src folder?

TarasRudko · 2024-11-20T11:25:31Z

can we remove the src folder?

Done

kenthua · 2024-11-20T15:37:05Z

ai-ml/mix-train-and-inference/deploy.sh

+
+
+gcloud artifacts repositories add-iam-policy-binding fine-tuning \
+    --role=roles/artifactregistry.reader \


we don't need this anymore, correct?

Yes, it is no longer needed. Corrected.

NimJay · 2024-11-20T20:47:49Z

ai-ml/mix-train-and-inference/kueue/patch.yaml

+    #internalCertManagement:
+    #  enable: false
+    #  webhookServiceName: ""
+    #  webhookSecretName: ""


Question: There are some lines commented out throughout this file. Is this intentional?

Yes, the team felt it was important to keep the context of the comments.

NimJay · 2024-11-20T20:51:03Z

ai-ml/mix-train-and-inference/workloads/fine-tune-l4.yaml

+      containers:
+      - name: gpu-job
+        imagePullPolicy: Always 
+        image: us-docker.pkg.dev/google-samples/containers/gke/gemma-fine-tuning:v1.0.0


Just a note for future developers (any passerby): The source code for this image exists at https://github.com/GoogleCloudPlatform/accelerated-platforms/tree/974f2eff748d00d2566024d6ec4dd7f309f641c5/use-cases/model-fine-tuning-pipeline/fine-tuning/pytorch/src.

NimJay · 2024-11-20T20:56:28Z

ai-ml/mix-train-and-inference/remove.sh

+
+
+cd gke-platform
+sed -ie 's/"deletion_protection": true/"deletion_protection": false/g' terraform.tfstate


Nitpick (feel free to ignore): Consider setting deletion_protection to false from the very start (in the google_container_cluster resources), since all Terraform resources in this repo are meant to be cleaned up at the end of the tutorial.
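
For context, here is a minimal sketch of what the workaround in remove.sh does, run against a throwaway file rather than a real Terraform state file. The temporary directory and the sample JSON are assumptions made for illustration. It uses -i.bak instead of -ie because sed reads -ie as "edit in place with backup suffix e", which leaves a terraform.tfstatee file behind.

```shell
#!/bin/sh
set -eu

# Hypothetical throwaway state file; a real terraform.tfstate holds far more.
workdir="$(mktemp -d)"
state_file="${workdir}/terraform.tfstate"
printf '{\n  "deletion_protection": true\n}\n' > "${state_file}"

# Flip the recorded flag in place. "-i.bak" keeps the original next to the
# edited file as terraform.tfstate.bak and works with GNU and BSD sed.
sed -i.bak -e 's/"deletion_protection": true/"deletion_protection": false/g' "${state_file}"

# Count the lines that now carry the disabled flag.
grep -c '"deletion_protection": false' "${state_file}"
```

Setting deletion_protection to false in the google_container_cluster resources from the start, as suggested above, would make this step unnecessary.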

NimJay · 2024-11-20T21:04:24Z

ai-ml/mix-train-and-inference/gke-platform/modules/gke_standard/main.tf

Just a thought (no immediate action needed): this is out of scope for this pull request, but we should consider modularizing the Terraform in this git repo. These gke_standard and gke_autopilot folders look similar to the existing gke_standard and gke_autopilot folders. I'll add a comment in #861.

NimJay

Awesome work on these samples! 👏
Looks good to me.
I trust you've tested these samples appropriately for functionality.

Left a few comments, but nothing major.
Judging from internal discussions, I'm guessing this is ready for merge (even though this PR is still in draft mode). Merging...

TarasRudko and others added 15 commits October 11, 2024 21:22

Mix training and inference infra and manifests

ec17b52

Fixes, cleanup

a013b12

Add copyrights info, re-arrange files

6a795f3

Delete unused files

6ece8cb

Remove manifest.yaml. Update deploy.sh

fe06eb4

Include training data

ae3a9dd

Update dataset. Add kustomize patch used for Kueue deployment. Update…

fbe1586

… gemma deployment manifest

Add env variable CHECKPOINT_SAVE_STEPS

fe4d6f6

Delete Kueue manifest

d47cf2f

Add licence headers

a2cc51e

Merge branch 'main' into feature/mix_training_inference

5a4b0d9

Remove unused file

66e9d16

Change file name

8b3a2c5

Update training job manifest to use pulic image

dcaf29e

fix file names

50aa623

kenthua reviewed Nov 20, 2024

Delete src folder

4c7649c

kenthua reviewed Nov 20, 2024

Remove unneded role binding and empty lines

b5b77f3

NimJay reviewed Nov 20, 2024

NimJay approved these changes Nov 20, 2024

NimJay marked this pull request as ready for review November 20, 2024 21:12

NimJay requested review from alizaidis, yoshi-approver and a team as code owners November 20, 2024 21:12

NimJay merged commit 2960beb into GoogleCloudPlatform:main Nov 20, 2024
4 checks passed
