Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mix training and inference infra and manifests #1487

Merged
merged 17 commits into from
Nov 20, 2024

Conversation

TarasRudko
Copy link
Contributor

@TarasRudko TarasRudko commented Oct 11, 2024

Description

This PR adds samples for a tutorial related to mixed training and inference in a single cluster.

Tasks

  • The contributing guide has been read and followed.
  • The samples added / modified have been fully tested.
  • Workflow files have been added / modified, if applicable.
  • Region tags have been properly added, if new samples.
  • All dependencies are set to up-to-date versions, as applicable.
  • Merge this pull-request for me once it is approved.

Copy link
Contributor

@kenthua kenthua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove the src folder?

@TarasRudko
Copy link
Contributor Author

can we remove the src folder?

Done



gcloud artifacts repositories add-iam-policy-binding fine-tuning \
--role=roles/artifactregistry.reader \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need this anymore, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, not needed anymore. Corrected

#internalCertManagement:
# enable: false
# webhookServiceName: ""
# webhookSecretName: ""
Copy link
Collaborator

@NimJay NimJay Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: There are some lines commented out through out this file. Is this intentional?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the team felt it was important to keep the context of the comments.

containers:
- name: gpu-job
imagePullPolicy: Always
image: us-docker.pkg.dev/google-samples/containers/gke/gemma-fine-tuning:v1.0.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.



cd gke-platform
sed -ie 's/"deletion_protection": true/"deletion_protection": false/g' terraform.tfstate
Copy link
Collaborator

@NimJay NimJay Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick (feel free to ignore): Consider setting deletion_protection to false from the very start (in the google_container_cluster resources), since all Terraform in this repo are to be cleaned up at the end of the tutorial.

Copy link
Collaborator

@NimJay NimJay Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought (no immediate action needed). This is out-of-scope for this pull-request, but we should consider modularizing the Terraform in this git repo. These gke_standard and gke_autopilot folders look similar to existing gke_standard and existing gke_autopilot. I'll add a comment in #861.

Copy link
Collaborator

@NimJay NimJay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work on these samples! 👏
Looks good to me.
I trust you've tested these samples appropriately for functionality.

Left a few comments, but nothing major.
Judging from internal discussions, I'm guessing this is ready for merge (even though this PR is still in draft mode). Merging...

@NimJay NimJay marked this pull request as ready for review November 20, 2024 21:12
@NimJay NimJay merged commit 2960beb into GoogleCloudPlatform:main Nov 20, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants