add DLIO test examples
songjiaxun committed Feb 17, 2024
1 parent 57dc6c7 commit df1a489
Showing 10 changed files with 650 additions and 0 deletions.
271 changes: 271 additions & 0 deletions examples/dlio/README.md
@@ -0,0 +1,271 @@
# DLIO Unet3D Loading Tests

## Prerequisites

### Build DLIO docker container image

```bash
# Replace the docker registry.
git clone https://github.com/argonne-lcf/dlio_benchmark.git
cd dlio_benchmark/
docker build -t jiaxun/dlio:v1.0.0 .
docker image push jiaxun/dlio:v1.0.0
```
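
The image name above is tied to one registry. As a minimal sketch, the reference can be built from a variable instead; `REGISTRY`, `DLIO_TAG`, `dlio_image`, and `print_build_commands` are hypothetical names introduced here for illustration, and the helper only prints the commands so they can be reviewed before running.

```bash
# REGISTRY is a hypothetical placeholder; replace it with your own registry path.
REGISTRY="${REGISTRY:-my-registry.example.com/my-project}"
DLIO_TAG="v1.0.0"  # Tag used throughout this README.

# Print the full image reference: <registry>/dlio:<tag>.
dlio_image() {
  printf '%s/dlio:%s\n' "$REGISTRY" "$DLIO_TAG"
}

# Print (do not run) the build and push commands for review.
print_build_commands() {
  echo "docker build -t $(dlio_image) ."
  echo "docker image push $(dlio_image)"
}

print_build_commands
```

The same reference can then be passed to the Helm charts with `--set image="$(dlio_image)"`.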

### Create a new node pool

For an existing GKE cluster, use the following command to create a new node pool. Make sure the cluster has the [Workload Identity feature enabled](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#enable).

> In this early-stage test, the managed GCS FUSE CSI driver feature is disabled and the driver is installed manually.

```bash
# Replace the cluster name and zone.
gcloud container node-pools create large-pool \
--cluster gcsfuse-csi-test-cluster \
--ephemeral-storage-local-ssd count=16 \
--machine-type n2-standard-96 \
--zone us-central1-a \
--num-nodes 3
```

### Set up GCS bucket

Create a GCS bucket with `Location type` set to `Region`, and select the region where your cluster runs. Follow the [GKE documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver#authentication) to configure access. This example uses the default Kubernetes service account in the default Kubernetes namespace.

### Install Helm

The example uses Helm charts to manage the applications. Follow the [Helm documentation](https://helm.sh/docs/intro/install/#from-script) to install Helm.

## DLIO Unet3D Datasets Loading

Run the following commands to generate the Unet3D datasets with DLIO and upload them to the buckets. To use your own registry and bucket, pass `--set image=<your-registry>/dlio:v1.0.0` and `--set bucketName=<your-bucket-name>`.

```bash
cd ./examples/dlio

helm install dlio-unet3d-100kb-500k-data-loader data-loader \
--set bucketName=gke-dlio-unet3d-100kb-500k \
--set dlio.numFilesTrain=500000 \
--set dlio.recordLength=102400

helm install dlio-unet3d-500kb-1m-data-loader data-loader \
--set bucketName=gke-dlio-unet3d-500kb-1m \
--set dlio.numFilesTrain=1000000 \
--set dlio.recordLength=512000

helm install dlio-unet3d-3mb-100k-data-loader data-loader \
--set bucketName=gke-dlio-unet3d-3mb-100k \
--set dlio.numFilesTrain=100000 \
--set dlio.recordLength=3145728

helm install dlio-unet3d-150mb-5k-data-loader data-loader \
--set bucketName=gke-dlio-unet3d-150mb-5k \
--set dlio.numFilesTrain=5000 \
--set dlio.recordLength=157286400

# Clean up
helm uninstall \
dlio-unet3d-100kb-500k-data-loader \
dlio-unet3d-500kb-1m-data-loader \
dlio-unet3d-3mb-100k-data-loader \
dlio-unet3d-150mb-5k-data-loader
```
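
Each dataset is generated on the Pod's local volume before it is uploaded, so it helps to know its size in advance. The sketch below multiplies `dlio.numFilesTrain` by `dlio.recordLength` for the four datasets above, assuming every file is exactly `recordLength` bytes. `dataset_size_gib` is a hypothetical helper, not part of the repository. For comparison, the node pool above attaches 16 local SSDs, which is roughly 6 TB of raw capacity if each disk is 375 GB (an assumption about the machine type, not something this README states).

```bash
# Print the approximate dataset size in whole GiB (integer division rounds down).
dataset_size_gib() {
  local num_files="$1" record_length="$2"
  echo $(( num_files * record_length / 1024 / 1024 / 1024 ))
}

dataset_size_gib 500000 102400     # gke-dlio-unet3d-100kb-500k, prints 47
dataset_size_gib 1000000 512000    # gke-dlio-unet3d-500kb-1m, prints 476
dataset_size_gib 100000 3145728    # gke-dlio-unet3d-3mb-100k, prints 292
dataset_size_gib 5000 157286400    # gke-dlio-unet3d-150mb-5k, prints 732
```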

## DLIO Unet3D Loading Tests

Change to the `./examples/dlio` directory and run the following commands to start the loading tests. Each `helm install` command deploys a Pod that runs the test and uploads its logs to the bucket. To use your own registry and bucket, pass `--set image=<your-registry>/dlio:v1.0.0` and `--set bucketName=<your-bucket-name>`.
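
Every test section below follows the same pattern: one dataset and one batch size, run under the three scenarios `local-ssd`, `gcsfuse-file-cache`, and `gcsfuse-no-file-cache`. As a rough sketch, that pattern can be written as a loop. `print_loading_tests` is a hypothetical helper, not part of the repository, and it only prints the `helm install` commands rather than running them.

```bash
# Print (do not run) the helm install commands for one dataset and batch size
# across the three scenarios used in this README.
print_loading_tests() {
  local dataset="$1" num_files="$2" record_length="$3" batch_size="$4"
  local scenario
  for scenario in local-ssd gcsfuse-file-cache gcsfuse-no-file-cache; do
    echo "helm install dlio-unet3d-${dataset}-${batch_size}-${scenario} unet3d-loading-test" \
      "--set bucketName=gke-dlio-unet3d-${dataset}" \
      "--set dlio.numFilesTrain=${num_files}" \
      "--set dlio.recordLength=${record_length}" \
      "--set dlio.batchSize=${batch_size}" \
      "--set scenario=${scenario}"
  done
}

# Example: the 100 KB x 500k dataset with batch size 800.
print_loading_tests 100kb-500k 500000 102400 800
```

Printing first makes it easy to check release names and values against the sections below before piping the output to a shell.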

### dlio-unet3d-100kb-500k dlio.batchSize=800

```bash
helm install dlio-unet3d-100kb-500k-800-local-ssd unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-100kb-500k \
--set dlio.numFilesTrain=500000 \
--set dlio.recordLength=102400 \
--set dlio.batchSize=800 \
--set scenario=local-ssd

helm install dlio-unet3d-100kb-500k-800-gcsfuse-file-cache unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-100kb-500k \
--set dlio.numFilesTrain=500000 \
--set dlio.recordLength=102400 \
--set dlio.batchSize=800 \
--set scenario=gcsfuse-file-cache

helm install dlio-unet3d-100kb-500k-800-gcsfuse-no-file-cache unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-100kb-500k \
--set dlio.numFilesTrain=500000 \
--set dlio.recordLength=102400 \
--set dlio.batchSize=800 \
--set scenario=gcsfuse-no-file-cache

# Clean up
helm uninstall \
dlio-unet3d-100kb-500k-800-local-ssd \
dlio-unet3d-100kb-500k-800-gcsfuse-file-cache \
dlio-unet3d-100kb-500k-800-gcsfuse-no-file-cache
```

### dlio-unet3d-100kb-500k dlio.batchSize=128

```bash
helm install dlio-unet3d-100kb-500k-128-local-ssd unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-100kb-500k \
--set dlio.numFilesTrain=500000 \
--set dlio.recordLength=102400 \
--set dlio.batchSize=128 \
--set scenario=local-ssd

helm install dlio-unet3d-100kb-500k-128-gcsfuse-file-cache unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-100kb-500k \
--set dlio.numFilesTrain=500000 \
--set dlio.recordLength=102400 \
--set dlio.batchSize=128 \
--set scenario=gcsfuse-file-cache

helm install dlio-unet3d-100kb-500k-128-gcsfuse-no-file-cache unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-100kb-500k \
--set dlio.numFilesTrain=500000 \
--set dlio.recordLength=102400 \
--set dlio.batchSize=128 \
--set scenario=gcsfuse-no-file-cache

# Clean up
helm uninstall \
dlio-unet3d-100kb-500k-128-local-ssd \
dlio-unet3d-100kb-500k-128-gcsfuse-file-cache \
dlio-unet3d-100kb-500k-128-gcsfuse-no-file-cache
```

### dlio-unet3d-500kb-1m dlio.batchSize=800

```bash
helm install dlio-unet3d-500kb-1m-800-local-ssd unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-500kb-1m \
--set dlio.numFilesTrain=1000000 \
--set dlio.recordLength=512000 \
--set dlio.batchSize=800 \
--set scenario=local-ssd

helm install dlio-unet3d-500kb-1m-800-gcsfuse-file-cache unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-500kb-1m \
--set dlio.numFilesTrain=1000000 \
--set dlio.recordLength=512000 \
--set dlio.batchSize=800 \
--set scenario=gcsfuse-file-cache

helm install dlio-unet3d-500kb-1m-800-gcsfuse-no-file-cache unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-500kb-1m \
--set dlio.numFilesTrain=1000000 \
--set dlio.recordLength=512000 \
--set dlio.batchSize=800 \
--set scenario=gcsfuse-no-file-cache

# Clean up
helm uninstall \
dlio-unet3d-500kb-1m-800-local-ssd \
dlio-unet3d-500kb-1m-800-gcsfuse-file-cache \
dlio-unet3d-500kb-1m-800-gcsfuse-no-file-cache
```

### dlio-unet3d-500kb-1m dlio.batchSize=128

```bash
helm install dlio-unet3d-500kb-1m-128-local-ssd unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-500kb-1m \
--set dlio.numFilesTrain=1000000 \
--set dlio.recordLength=512000 \
--set dlio.batchSize=128 \
--set scenario=local-ssd

helm install dlio-unet3d-500kb-1m-128-gcsfuse-file-cache unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-500kb-1m \
--set dlio.numFilesTrain=1000000 \
--set dlio.recordLength=512000 \
--set dlio.batchSize=128 \
--set scenario=gcsfuse-file-cache

helm install dlio-unet3d-500kb-1m-128-gcsfuse-no-file-cache unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-500kb-1m \
--set dlio.numFilesTrain=1000000 \
--set dlio.recordLength=512000 \
--set dlio.batchSize=128 \
--set scenario=gcsfuse-no-file-cache

# Clean up
helm uninstall \
dlio-unet3d-500kb-1m-128-local-ssd \
dlio-unet3d-500kb-1m-128-gcsfuse-file-cache \
dlio-unet3d-500kb-1m-128-gcsfuse-no-file-cache
```

### dlio-unet3d-3mb-100k dlio.batchSize=200

```bash
helm install dlio-unet3d-3mb-100k-200-local-ssd unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-3mb-100k \
--set dlio.numFilesTrain=100000 \
--set dlio.recordLength=3145728 \
--set dlio.batchSize=200 \
--set scenario=local-ssd

helm install dlio-unet3d-3mb-100k-200-gcsfuse-file-cache unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-3mb-100k \
--set dlio.numFilesTrain=100000 \
--set dlio.recordLength=3145728 \
--set dlio.batchSize=200 \
--set scenario=gcsfuse-file-cache

helm install dlio-unet3d-3mb-100k-200-gcsfuse-no-file-cache unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-3mb-100k \
--set dlio.numFilesTrain=100000 \
--set dlio.recordLength=3145728 \
--set dlio.batchSize=200 \
--set scenario=gcsfuse-no-file-cache

# Clean up
helm uninstall \
dlio-unet3d-3mb-100k-200-local-ssd \
dlio-unet3d-3mb-100k-200-gcsfuse-file-cache \
dlio-unet3d-3mb-100k-200-gcsfuse-no-file-cache
```

### dlio-unet3d-150mb-5k dlio.batchSize=4

```bash
helm install dlio-unet3d-150mb-5k-4-local-ssd unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-150mb-5k \
--set dlio.numFilesTrain=5000 \
--set dlio.recordLength=157286400 \
--set dlio.batchSize=4 \
--set scenario=local-ssd

helm install dlio-unet3d-150mb-5k-4-gcsfuse-file-cache unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-150mb-5k \
--set dlio.numFilesTrain=5000 \
--set dlio.recordLength=157286400 \
--set dlio.batchSize=4 \
--set scenario=gcsfuse-file-cache

helm install dlio-unet3d-150mb-5k-4-gcsfuse-no-file-cache unet3d-loading-test \
--set bucketName=gke-dlio-unet3d-150mb-5k \
--set dlio.numFilesTrain=5000 \
--set dlio.recordLength=157286400 \
--set dlio.batchSize=4 \
--set scenario=gcsfuse-no-file-cache

# Clean up
helm uninstall \
dlio-unet3d-150mb-5k-4-local-ssd \
dlio-unet3d-150mb-5k-4-gcsfuse-file-cache \
dlio-unet3d-150mb-5k-4-gcsfuse-no-file-cache
```

## Parsing the test results

Run the following Python script to parse the logs. The results are saved to `./examples/dlio/output.csv`.

```bash
cd ./examples/dlio
python ./parse_logs.py
```
23 changes: 23 additions & 0 deletions examples/dlio/data-loader/.helmignore
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
5 changes: 5 additions & 0 deletions examples/dlio/data-loader/Chart.yaml
@@ -0,0 +1,5 @@
apiVersion: v2
name: data-loader
description: A Helm chart for DLIO data loading to GCS buckets
type: application
version: 0.1.0
59 changes: 59 additions & 0 deletions examples/dlio/data-loader/templates/dlio-data-loader.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
apiVersion: v1
kind: Pod
metadata:
  name: dlio-data-loader-{{ .Values.dlio.numFilesTrain }}-{{ .Values.dlio.recordLength }}
  annotations:
    gke-gcsfuse/volumes: "true"
    gke-gcsfuse/cpu-limit: "0"
    gke-gcsfuse/memory-limit: "0"
    gke-gcsfuse/ephemeral-storage-limit: "0"
spec:
  restartPolicy: Never
  nodeSelector:
    cloud.google.com/gke-ephemeral-storage-local-ssd: "true"
  containers:
  - name: dlio-data-loader
    image: {{ .Values.image }}
    resources:
      limits:
        cpu: "100"
        memory: 400Gi
      requests:
        cpu: "30"
        memory: 300Gi
    command:
      - "/bin/sh"
      - "-c"
      - |
        echo "Installing gsutil..."
        apt-get install -y apt-transport-https ca-certificates gnupg curl
        curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
        echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
        apt-get update && apt-get install google-cloud-cli
        echo "Generating data for file number: {{ .Values.dlio.numFilesTrain }}, file size: {{ .Values.dlio.recordLength }}..."
        mpirun -np 20 dlio_benchmark workload=unet3d \
        ++workload.workflow.generate_data=True \
        ++workload.workflow.train=False \
        ++workload.dataset.data_folder=/data \
        ++workload.dataset.num_files_train={{ .Values.dlio.numFilesTrain }} \
        ++workload.dataset.record_length={{ .Values.dlio.recordLength }} \
        ++workload.dataset.record_length_stdev=0 \
        ++workload.dataset.record_length_resize=0
        gsutil -m cp -R /data/train gs://{{ .Values.bucketName }}
        mkdir -p /bucket/valid
    volumeMounts:
    - name: local-dir
      mountPath: /data
    - name: gcs-fuse-csi-ephemeral
      mountPath: /bucket
  volumes:
  - name: local-dir
    emptyDir: {}
  - name: gcs-fuse-csi-ephemeral
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: {{ .Values.bucketName }}
10 changes: 10 additions & 0 deletions examples/dlio/data-loader/values.yaml
@@ -0,0 +1,10 @@
# Default values for data-loader.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

image: jiaxun/dlio:v1.0.0
bucketName: gke-dlio-unet3d-100kb-500k

dlio:
  numFilesTrain: 500000
  recordLength: 102400