
Quick GPU Training on Google Cloud

Spin up GPU instances for temporary PyTorch training jobs on Google Cloud Platform

This guide helps you quickly set up and run GPU-accelerated PyTorch training jobs on Google Cloud. It's perfect for experiments that need a few hours of GPU time without maintaining permanent VM infrastructure.

TODO: the goal of this repo is to have the Vertex AI Docker container pull down a specified Git repo / feature branch plus training data, start training, log to Weights & Biases (W&B) to validate new features, and then shut the machine down; see start.sh. This depends on the job / Docker image ID, e.g. imageUri: 'gcr.io/kommunityproject/pytorch-train:v1.0.18'.
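
A minimal sketch of the kind of entrypoint start.sh implies is shown below. The file layout, train.py name, and requirements install are assumptions for illustration, not the repo's actual script; the environment variables come from the containerSpec env block described under Configuration.

#!/bin/bash
# Sketch only: assumes GITHUB_REPO, BRANCH_NAME, GCS_BUCKET_NAME (and WANDB_API_KEY)
# are injected via the containerSpec env block.
set -euo pipefail

# Pull the feature branch under test
git clone --branch "$BRANCH_NAME" --depth 1 "$GITHUB_REPO" /workspace/repo
cd /workspace/repo

# Copy training data down from GCS
gsutil -m cp -r "$GCS_BUCKET_NAME/training_data" ./data

# Train and log to Weights & Biases (train.py is a placeholder entrypoint)
pip install -r requirements.txt wandb
python train.py --data ./data

# Vertex AI tears the container down when this process exits,
# so no explicit shutdown is needed here.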

🎥 Demo: Submitting a GPU Training Job

(asciicast recording)

🚀 Quick Start

  1. Set Environment Variables

    export GCP_PROJECT=your-project-id
    export GOOGLE_CLOUD_BUCKET_NAME=your-bucket-name
  2. Submit Training Job (a sketch of what push-job.sh runs follows these steps)

    ./push-job.sh
  3. Monitor Progress

    # Your job URL will appear after submission
    🔗 View job at: https://console.cloud.google.com/vertex-ai/training/custom-jobs?project=$GCP_PROJECT
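
push-job.sh itself isn't reproduced in this README, but the submission it wraps should look roughly like the following Vertex AI command (display name and region are assumptions):

# Sketch of the submission performed by push-job.sh
gcloud ai custom-jobs create \
  --project="$GCP_PROJECT" \
  --region=us-central1 \
  --display-name="pytorch-train-$(date +%Y%m%d-%H%M%S)" \
  --config=job_config_gpu.yaml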
🔧 Prerequisites
  • Google Cloud Platform (GCP) account
  • Google Cloud SDK installed
  • Docker installed locally
  • Weights & Biases account (optional)

See setup instructions

⚙️ Configuration

The job_config_gpu.yaml file controls your GPU and environment settings:

workerPoolSpecs:
  machineSpec:
    machineType: a2-highgpu-1g  # 1x A100 (40 GB)
    acceleratorType: NVIDIA_TESLA_A100
    acceleratorCount: 1
  replicaCount: 1
  containerSpec:
    imageUri: 'us-docker.pkg.dev/deeplearning-platform-release/gcr.io/pytorch-cu121.2-2.py310'
    env:
      - name: GCS_BUCKET_NAME
        value: gs://your-bucket
      - name: BRANCH_NAME
        value: your-branch
      - name: GITHUB_REPO
        value: your-repo
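
If you prefer not to hard-code the bucket, branch, and repo in the YAML, one option is to template it before submission. This is a hypothetical workflow (job_config_gpu.yaml.tpl is not part of the repo); envsubst ships with the gettext package:

# Hypothetical templating step: the .tpl file would contain
# ${GCS_BUCKET_NAME}, ${BRANCH_NAME} and ${GITHUB_REPO} placeholders.
export GCS_BUCKET_NAME=gs://your-bucket
export BRANCH_NAME=your-branch
export GITHUB_REPO=https://github.com/you/your-repo.git
envsubst < job_config_gpu.yaml.tpl > job_config_gpu.yaml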
📦 Storage Setup
  1. Create a GCS bucket (create_bucket.sh is sketched after this list):

    ./create_bucket.sh
  2. Upload training data:

    gsutil -m cp -r ./training_data gs://your-bucket/
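
create_bucket.sh isn't shown here; a minimal version would look something like this (the region is an assumption):

#!/bin/bash
# Sketch of create_bucket.sh: create a regional bucket for training data
gsutil mb -p "$GCP_PROJECT" -l us-central1 "gs://$GOOGLE_CLOUD_BUCKET_NAME"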

Ubuntu/Debian

Map the Google Cloud Storage bucket to a local directory:

export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse

# Create mount point
mkdir ~/cloud-storage

# Mount bucket (replace with your bucket name)
gcsfuse $GOOGLE_CLOUD_BUCKET_NAME ~/cloud-storage
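
To confirm the mount worked, list the bucket through the mount point; unmount with fusermount when you're done (standard FUSE tooling, not specific to this repo):

# Verify the mount, then clean up
ls ~/cloud-storage
fusermount -u ~/cloud-storage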
🔑 Required Permissions

Minimum IAM roles needed:

  • Vertex AI Administrator (roles/aiplatform.admin)
  • Storage Object Admin (roles/storage.objectAdmin)
  • Container Registry Service Agent
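
These roles can be granted to the service account that runs the training job with gcloud; the service-account address below is a placeholder:

# Grant a role to the training service account (placeholder address)
gcloud projects add-iam-policy-binding "$GCP_PROJECT" \
  --member="serviceAccount:YOUR_SERVICE_ACCOUNT@$GCP_PROJECT.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"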
🐛 Troubleshooting
  1. Job Won't Start (see the gcloud commands after this list)

    • Check IAM permissions
    • Verify GPU quota in your region
  2. Storage Access Issues

    • Test bucket access: gsutil ls gs://your-bucket
    • Verify service account permissions
  3. Local Testing

    # Mount cloud storage locally
    gcsfuse --anonymous-access your-bucket /mount/point
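
To dig into the first two items, the job's state and error message can be pulled straight from gcloud (region is an assumption; JOB_ID comes from the list output):

# List recent custom jobs, then inspect the failing one
gcloud ai custom-jobs list --region=us-central1 --project="$GCP_PROJECT"
gcloud ai custom-jobs describe JOB_ID --region=us-central1 --project="$GCP_PROJECT"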
📚 Detailed Setup Guide

1. Enable Required APIs

(screenshot: enabled APIs)

PRO TIP: toggle on just these services in the console navigation so they're easier to find

2. Shell Configuration

# Install oh-my-zsh for better CLI experience
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

# Add to .zshrc
plugins=(git)
export GCP_PROJECT=your-project-id
export GOOGLE_CLOUD_BUCKET_NAME=your-bucket-name

3. Build Process

Your builds will appear in the container registry with the version tag bumped. (screenshot: build artifacts)
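
build.sh isn't reproduced here; the usual shape of such a script is a docker build followed by a push of the versioned tag (the version value is a placeholder; the real script bumps it automatically):

# Sketch of the build-and-push step build.sh performs
export VERSION=v1.0.19   # placeholder
docker build -t "gcr.io/$GCP_PROJECT/$IMAGE_NAME:$VERSION" .
docker push "gcr.io/$GCP_PROJECT/$IMAGE_NAME:$VERSION"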

4. Job Management

Monitor your training jobs in the console. (screenshot: custom jobs list)

5. Job Logs

View detailed logs and metrics. (screenshot: job logs)

6. Storage Access

For public buckets, consider granting access to allUsers. (screenshot: bucket permissions)
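
The same grant can be made from the CLI. Note that this makes every object in the bucket publicly readable, so use it only for genuinely public data:

# Allow anyone to read objects in the bucket
gsutil iam ch allUsers:objectViewer "gs://$GOOGLE_CLOUD_BUCKET_NAME"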

7. Resource Configuration

Available machine types (A100 GPUs require a2-* machine types; V100s pair with n1-* machines):

workerPoolSpecs:
  machineSpec:
    # Choose one:
    machineType: n1-standard-8
    # machineType: n1-standard-32
    # machineType: a2-ultragpu-1g # For A100 80GB
    
    # GPU options:
    # acceleratorType: NVIDIA_TESLA_V100
    # acceleratorType: NVIDIA_A100_80GB
    # acceleratorCount: 1

8. Docker Configuration

For local testing, set these environment variables:

export GCP_PROJECT=kommunityproject
export IMAGE_NAME="pytorch-training"
export GCS_BUCKET_NAME="gs://jp-ai-experiments"
export BRANCH_NAME="feat/ada-fixed4"
export GITHUB_REPO="https://github.com/johndpope/imf.git"
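
With those set, the image can be exercised locally before submitting to Vertex AI. This is a sketch; --gpus all assumes the NVIDIA Container Toolkit is installed, and the image tag is a placeholder:

# Run the training container locally with the same env the Vertex job would see
docker run --rm --gpus all \
  -e GCS_BUCKET_NAME="$GCS_BUCKET_NAME" \
  -e BRANCH_NAME="$BRANCH_NAME" \
  -e GITHUB_REPO="$GITHUB_REPO" \
  "gcr.io/$GCP_PROJECT/$IMAGE_NAME:latest"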

9. File Structure

  • Dockerfile: Defines training environment
  • build.sh: Builds and pushes Docker image
  • job_config.yaml / job_config_gpu.yaml: Training job configuration
  • push-job.sh: Submits training job

📈 Monitoring

Track your training job:

  • Real-time logs
  • GPU utilization
  • Training metrics
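
Logs can also be streamed from the terminal rather than the console (region is an assumption; JOB_ID is the custom job's numeric ID):

# Stream a running job's logs to the terminal
gcloud ai custom-jobs stream-logs JOB_ID --region=us-central1 --project="$GCP_PROJECT"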

🛟 Need Help?