
Quick GPU Training on Google Cloud

Spin up GPU instances for temporary PyTorch training jobs on Google Cloud Platform

This guide helps you quickly set up and run GPU-accelerated PyTorch training jobs on Google Cloud. It's perfect for experiments that need a few hours of GPU time without maintaining permanent VM infrastructure.

TODO: the goal of this repo is to have the Vertex AI Docker container pull down a specified Git repo / feature branch plus training data, start training, log to Weights & Biases (W&B) to validate new features, and then shut the machine down; see start.sh. This depends on the job / Docker image ID, e.g. imageUri: 'gcr.io/kommunityproject/pytorch-train:v1.0.18'.
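
A minimal sketch of the kind of entrypoint start.sh implies is shown below. The file layout, train.py name, and requirements install are assumptions for illustration, not the repo's actual script; the environment variables come from the containerSpec env block described under Configuration.

#!/bin/bash
# Sketch only: assumes GITHUB_REPO, BRANCH_NAME, GCS_BUCKET_NAME (and WANDB_API_KEY)
# are injected via the containerSpec env block.
set -euo pipefail

# Pull the feature branch under test
git clone --branch "$BRANCH_NAME" --depth 1 "$GITHUB_REPO" /workspace/repo
cd /workspace/repo

# Copy training data down from GCS
gsutil -m cp -r "$GCS_BUCKET_NAME/training_data" ./data

# Train and log to Weights & Biases (train.py is a placeholder entrypoint)
pip install -r requirements.txt wandb
python train.py --data ./data

# Vertex AI tears the container down when this process exits,
# so no explicit shutdown is needed here.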

🎥 Demo: Submitting a GPU Training Job

(asciicast recording)

🚀 Quick Start

  1. Set Environment Variables

    export GCP_PROJECT=your-project-id
    export GOOGLE_CLOUD_BUCKET_NAME=your-bucket-name
  2. Submit Training Job (a sketch of what push-job.sh runs follows these steps)

    ./push-job.sh
  3. Monitor Progress

    # Your job URL will appear after submission
    🔗 View job at: https://console.cloud.google.com/vertex-ai/training/custom-jobs?project=$GCP_PROJECT
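
push-job.sh itself isn't reproduced in this README, but the submission it wraps should look roughly like the following Vertex AI command (display name and region are assumptions):

# Sketch of the submission performed by push-job.sh
gcloud ai custom-jobs create \
  --project="$GCP_PROJECT" \
  --region=us-central1 \
  --display-name="pytorch-train-$(date +%Y%m%d-%H%M%S)" \
  --config=job_config_gpu.yaml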
🔧 Prerequisites
  • Google Cloud Platform (GCP) account
  • Google Cloud SDK installed
  • Docker installed locally
  • Weights & Biases account (optional)

See setup instructions

⚙️ Configuration

The job_config_gpu.yaml file controls your GPU and environment settings:

workerPoolSpecs:
  machineSpec:
    machineType: a2-highgpu-1g  # 1x A100 (40 GB)
    acceleratorType: NVIDIA_TESLA_A100
    acceleratorCount: 1
  replicaCount: 1
  containerSpec:
    imageUri: 'us-docker.pkg.dev/deeplearning-platform-release/gcr.io/pytorch-cu121.2-2.py310'
    env:
      - name: GCS_BUCKET_NAME
        value: gs://your-bucket
      - name: BRANCH_NAME
        value: your-branch
      - name: GITHUB_REPO
        value: your-repo
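
If you prefer not to hard-code the bucket, branch, and repo in the YAML, one option is to template it before submission. This is a hypothetical workflow (job_config_gpu.yaml.tpl is not part of the repo); envsubst ships with the gettext package:

# Hypothetical templating step: the .tpl file would contain
# ${GCS_BUCKET_NAME}, ${BRANCH_NAME} and ${GITHUB_REPO} placeholders.
export GCS_BUCKET_NAME=gs://your-bucket
export BRANCH_NAME=your-branch
export GITHUB_REPO=https://github.com/you/your-repo.git
envsubst < job_config_gpu.yaml.tpl > job_config_gpu.yaml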
📦 Storage Setup
  1. Create a GCS bucket (create_bucket.sh is sketched after this list):

    ./create_bucket.sh
  2. Upload training data:

    gsutil -m cp -r ./training_data gs://your-bucket/
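
create_bucket.sh isn't shown here; a minimal version would look something like this (the region is an assumption):

#!/bin/bash
# Sketch of create_bucket.sh: create a regional bucket for training data
gsutil mb -p "$GCP_PROJECT" -l us-central1 "gs://$GOOGLE_CLOUD_BUCKET_NAME"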

Ubuntu/Debian

Map the Google Cloud Storage bucket to a local directory:

export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse

# Create mount point
mkdir ~/cloud-storage

# Mount bucket (replace with your bucket name)
gcsfuse $GOOGLE_CLOUD_BUCKET_NAME ~/cloud-storage
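
To confirm the mount worked, list the bucket through the mount point; unmount with fusermount when you're done (standard FUSE tooling, not specific to this repo):

# Verify the mount, then clean up
ls ~/cloud-storage
fusermount -u ~/cloud-storage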
🔑 Required Permissions

Minimum IAM roles needed:

  • Vertex AI Administrator (roles/aiplatform.admin)
  • Storage Object Admin (roles/storage.objectAdmin)
  • Container Registry Service Agent
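
These roles can be granted to the service account that runs the training job with gcloud; the service-account address below is a placeholder:

# Grant a role to the training service account (placeholder address)
gcloud projects add-iam-policy-binding "$GCP_PROJECT" \
  --member="serviceAccount:YOUR_SERVICE_ACCOUNT@$GCP_PROJECT.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"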
🐛 Troubleshooting
  1. Job Won't Start (see the gcloud commands after this list)

    • Check IAM permissions
    • Verify GPU quota in your region
  2. Storage Access Issues

    • Test bucket access: gsutil ls gs://your-bucket
    • Verify service account permissions
  3. Local Testing

    # Mount cloud storage locally
    gcsfuse --anonymous-access your-bucket /mount/point
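
To dig into the first two items, the job's state and error message can be pulled straight from gcloud (region is an assumption; JOB_ID comes from the list output):

# List recent custom jobs, then inspect the failing one
gcloud ai custom-jobs list --region=us-central1 --project="$GCP_PROJECT"
gcloud ai custom-jobs describe JOB_ID --region=us-central1 --project="$GCP_PROJECT"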
📚 Detailed Setup Guide

1. Enable Required APIs

(screenshot: enabled APIs)

PRO TIP: toggle on just these services in the console navigation so they're easier to find

2. Shell Configuration

# Install oh-my-zsh for better CLI experience
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

# Add to .zshrc
plugins=(git)
export GCP_PROJECT=your-project-id
export GOOGLE_CLOUD_BUCKET_NAME=your-bucket-name

3. Build Process

Your builds will appear in the container registry with the version tag bumped. (screenshot: build artifacts)
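
build.sh isn't reproduced here; the usual shape of such a script is a docker build followed by a push of the versioned tag (the version value is a placeholder; the real script bumps it automatically):

# Sketch of the build-and-push step build.sh performs
export VERSION=v1.0.19   # placeholder
docker build -t "gcr.io/$GCP_PROJECT/$IMAGE_NAME:$VERSION" .
docker push "gcr.io/$GCP_PROJECT/$IMAGE_NAME:$VERSION"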

4. Job Management

Monitor your training jobs in the console. (screenshot: custom jobs list)

5. Job Logs

View detailed logs and metrics. (screenshot: job logs)

6. Storage Access

For public buckets, consider granting access to allUsers. (screenshot: bucket permissions)
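
The same grant can be made from the CLI. Note that this makes every object in the bucket publicly readable, so use it only for genuinely public data:

# Allow anyone to read objects in the bucket
gsutil iam ch allUsers:objectViewer "gs://$GOOGLE_CLOUD_BUCKET_NAME"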

7. Resource Configuration

Available machine types (A100 GPUs require a2-* machine types; V100s pair with n1-* machines):

workerPoolSpecs:
  machineSpec:
    # Choose one:
    machineType: n1-standard-8
    # machineType: n1-standard-32
    # machineType: a2-ultragpu-1g # For A100 80GB
    
    # GPU options:
    # acceleratorType: NVIDIA_TESLA_V100
    # acceleratorType: NVIDIA_A100_80GB
    # acceleratorCount: 1

8. Docker Configuration

For local testing, set these environment variables:

export GCP_PROJECT=kommunityproject
export IMAGE_NAME="pytorch-training"
export GCS_BUCKET_NAME="gs://jp-ai-experiments"
export BRANCH_NAME="feat/ada-fixed4"
export GITHUB_REPO="https://github.com/johndpope/imf.git"
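
With those set, the image can be exercised locally before submitting to Vertex AI. This is a sketch; --gpus all assumes the NVIDIA Container Toolkit is installed, and the image tag is a placeholder:

# Run the training container locally with the same env the Vertex job would see
docker run --rm --gpus all \
  -e GCS_BUCKET_NAME="$GCS_BUCKET_NAME" \
  -e BRANCH_NAME="$BRANCH_NAME" \
  -e GITHUB_REPO="$GITHUB_REPO" \
  "gcr.io/$GCP_PROJECT/$IMAGE_NAME:latest"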

9. File Structure

  • Dockerfile: Defines training environment
  • build.sh: Builds and pushes Docker image
  • job_config.yaml / job_config_gpu.yaml: Training job configuration
  • push-job.sh: Submits training job

📈 Monitoring

Track your training job:

  • Real-time logs
  • GPU utilization
  • Training metrics
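
Logs can also be streamed from the terminal rather than the console (region is an assumption; JOB_ID is the custom job's numeric ID):

# Stream a running job's logs to the terminal
gcloud ai custom-jobs stream-logs JOB_ID --region=us-central1 --project="$GCP_PROJECT"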

🛟 Need Help?