This repository contains the scripts and source code for running the MLPerf v0.6 training benchmarks written by NVIDIA in RHEL7.6 and UBI8 containers. The Dockerfiles for creating the containers are in the top level directory.
The four directories `gnmt`, `maskrcnn`, `ssd`, and `transformer` contain the source code and scripts for running each of these benchmarks. The `resnet` directory is for an MXNet benchmark using the ImageNet dataset, which we have not yet run. The four benchmarks above are implemented in PyTorch, and the PyTorch source code that is built and installed in the containers is in the `./pytorch` directory.
For more detailed information about the benchmarks, read the original documentation for ssd, maskrcnn, gnmt, and transformer.
The host system needs to be configured so that containers run with podman can access the GPUs:

- Install `podman` on the host system.
- Install the `nvidia-container-toolkit`, so that the NVIDIA container runtime hook can be used.
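As a quick sanity check of the two steps above, a small script like the following can verify both pieces are in place. The hook directory path is an assumption (the usual default where `nvidia-container-toolkit` installs its OCI hook JSON); it may differ on your distribution.

```shell
#!/bin/sh
# Hedged sanity check for the host setup described above.
# HOOKS_DIR is the usual default location for the nvidia OCI hook
# definition, but it may differ on your distribution.
HOOKS_DIR="${HOOKS_DIR:-/usr/share/containers/oci/hooks.d}"

have_podman() { command -v podman >/dev/null 2>&1; }

have_nvidia_hook() {
  # Look for the hook JSON dropped by nvidia-container-toolkit.
  ls "$HOOKS_DIR"/*nvidia* >/dev/null 2>&1
}

if have_podman && have_nvidia_hook; then
  echo "host looks ready for GPU containers"
else
  echo "host setup incomplete (see the steps above)"
fi
```

If both checks pass, `podman run` should invoke the NVIDIA hook and expose the host GPUs inside the container.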
This may take ~1 hour, depending on package download speeds.

```
podman build -f ssd_dockerfile_ubi8 -t mlperf_v06_ssd_ubi8
```
The data download script is in the `ssd` directory; further documentation can be found in the original results.

```
bash download_dataset.sh
```
Put the data in whatever directory you prefer. When you run the benchmark, you will specify that directory in `DATADIR` so that the run script can mount it as a volume for the container to access. The directory structure of `DATADIR` should be:
```
<DATADIR>
├── coco2017/
│   ├── annotations/
│   ├── models/
│   ├── train2017/
│   └── val2017/
...
```
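Before launching a run, a small helper (hypothetical, not part of this repository) can confirm that `DATADIR` has the layout shown above:

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): verify that the given
# DATADIR contains the coco2017 layout shown above.
check_datadir() {
  d="$1"
  for sub in annotations models train2017 val2017; do
    if [ ! -d "$d/coco2017/$sub" ]; then
      echo "missing: $d/coco2017/$sub"
      return 1
    fi
  done
  echo "layout ok"
}

# Example: check_datadir /data/mlperf
```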
From the `ssd` directory, the benchmark can be started with:

```
CONT=mlperf_v06_ssd_ubi8 DATADIR=<coco2017> LOGDIR=/data/mlperf/logs DGXSYSTEM=DGX1 NEXP=1 PULL=0 ./podman_run.sub
```
On a system other than an NVIDIA DGX1 or DGX2, you will likely need to create a custom `config_*.sh` file to specify the number of GPUs and to tune other parameters for performance.
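The shipped `config_DGX1.sh`/`config_DGX2.sh` files are the best templates to copy from. As a rough, hedged sketch, a custom config might set system-topology variables like the following (the variable names follow the convention used in the NVIDIA MLPerf v0.6 configs, but check the shipped files for the exact set your benchmark reads):

```shell
# config_custom.sh -- hedged sketch of a custom system config.
# Copy one of the shipped config_DGX*.sh files and adjust; the values
# below are illustrative for a hypothetical 4-GPU, 2-socket server.
DGXNGPU=4            # number of GPUs on this system
DGXNSOCKET=2         # CPU sockets
DGXSOCKETCORES=16    # physical cores per socket
DGXHT=2              # hyperthreads per core
```

You would then launch with `DGXSYSTEM=CUSTOM` (or whatever suffix you used in the filename) so the run script sources your file.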
This may take ~1 hour, depending on package download speeds.

```
podman build -f maskrcnn_dockerfile_ubi8 -t mlperf_v06_maskrcnn_ubi8
```
The data is the same as that used for the ssd benchmark; see the download instructions above.
Mask R-CNN uses a trained ResNet-50 model as a backbone. Run the `download_weights.sh` script in the `maskrcnn/` directory, then place the resulting file `R-50.pkl` in the `coco2017/models` directory, where `coco2017` contains the unzipped `train2017` and `val2017` directories.
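For example, a hypothetical helper (paths follow the `DATADIR` layout above) to put the weights in place:

```shell
#!/bin/sh
# Hypothetical helper: move the downloaded backbone weights into the
# coco2017/models directory expected by the Mask R-CNN benchmark.
place_weights() {
  datadir="$1"    # the directory you will pass as DATADIR
  weights="$2"    # path to the downloaded R-50.pkl
  mkdir -p "$datadir/coco2017/models"
  mv "$weights" "$datadir/coco2017/models/"
}

# Example: place_weights /data/mlperf ./R-50.pkl
```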
From the `maskrcnn` directory, the benchmark can be started with:

```
CONT=mlperf_v06_maskrcnn_ubi8 DATADIR=<coco2017> LOGDIR=/data/mlperf/logs DGXSYSTEM=DGX1 NEXP=1 PULL=0 ./podman_run.sub
```
On a system other than an NVIDIA DGX1 or DGX2, you will likely need to create a custom `config_*.sh` file to specify the number of GPUs and to tune other parameters for performance.
This may take ~1 hour, depending on package download speeds.

```
podman build -f gnmt_dockerfile_ubi8 -t mlperf_v06_gnmt_ubi8
```
Follow the instructions in the original documentation.
From the `gnmt` directory, the benchmark can be started with:

```
CONT=mlperf_v06_gnmt_ubi8 DATADIR=<gnmt_data> PREPROC_DATADIR=<host directory for preproc data> LOGDIR=/data/mlperf/logs DGXSYSTEM=DGX1 NEXP=1 PULL=0 ./podman_run.sub
```
On a system other than an NVIDIA DGX1 or DGX2, you will likely need to create a custom `config_*.sh` file to specify the number of GPUs and to tune other parameters for performance.
This may take ~1 hour, depending on package download speeds.

```
podman build -f transformer_dockerfile_ubi8 -t mlperf_v06_transformer_ubi8
```
Unfortunately, the data scripts in the submission files for the transformer benchmark do not work. To run the benchmark, we had to obtain the data directly from the benchmark's creators; we are still working to identify the exact source of the correct data.
From the `transformer` directory, the benchmark can be started with:

```
CONT=mlperf_v06_transformer_ubi8 DATADIR=<transformer_data> LOGDIR=/data/mlperf/logs DGXSYSTEM=DGX1 NEXP=1 PULL=0 ./podman_run.sub
```
On a system other than an NVIDIA DGX1 or DGX2, you will likely need to create a custom `config_*.sh` file to specify the number of GPUs and to tune other parameters for performance.