Llm large reference #1915

**language/llama3-405b/Dockerfile**

# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
SHELL ["/bin/bash", "-c"]

ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8

ENV TZ=US/Pacific
ENV DEBIAN_FRONTEND=noninteractive

RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN rm -rf /var/lib/apt/lists/* && rm /etc/apt/sources.list.d/* \
&& apt update \
&& apt install -y --no-install-recommends build-essential autoconf \
libtool git ccache curl wget pkg-config sudo ca-certificates \
automake libssl-dev bc python3-dev python3-pip google-perftools \
gdb libglib2.0-dev clang sshfs libre2-dev libboost-dev \
libnuma-dev numactl sysstat sshpass ntpdate less iputils-ping \
&& apt -y autoremove \
&& apt remove -y cmake \
&& apt install -y --no-install-recommends pkg-config zip g++ zlib1g-dev \
unzip libarchive-dev
RUN apt install -y --no-install-recommends rsync

# Install setuptools
RUN python3 -m pip install --upgrade pip \
&& python3 -m pip install --upgrade setuptools wheel virtualenv

# Install conda
WORKDIR /tmp
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh \
&& bash Miniconda3-* -b -p /opt/miniconda3
ENV PATH="$PATH:/opt/miniconda3/bin"
RUN conda create -n llama3-405b python=3.10
RUN chmod -R 777 /opt/miniconda3
**language/llama3-405b/README.md**

# Reference Implementation for llama3-405b

**Basic implementation for llama3-405b. A few noteworthy items:**

+ The streamer used for communicating with loadgen has considerable overhead. This implementation is only meant to be functional, not performant.
+ Custom/optimized implementations of this benchmark must include the following (see the sketch after this list):
 - For the Server scenario, it is necessary to call `lg.FirstTokenComplete(response)` for each query. This way the first token is reported and its latency is measured.
 - For all scenarios, when calling `lg.QuerySamplesComplete(response)`, each element of `response` must be a `lg.QuerySampleResponse` that contains the number of tokens (it can be created this way: `lg.QuerySampleResponse(qitem.id, bi[0], bi[1], n_tokens)`). The number of tokens reported should match the number of tokens in your answer; this is checked in [TEST06](../../compliance/nvidia/TEST06/).
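
A minimal sketch of how these two calls fit together, assuming a hypothetical `qitem` carrying the loadgen sample `id` and token-id lists produced by your backend (illustrative only, not the reference SUT code):

```
import array

import numpy as np
import mlperf_loadgen as lg


def report_first_token(qitem, first_token_ids):
    # Server scenario: report the first generated token as soon as it is
    # available so loadgen can measure first-token latency.
    buf = array.array("B", np.array(first_token_ids, np.int32).tobytes())
    bi = buf.buffer_info()
    lg.FirstTokenComplete([lg.QuerySampleResponse(qitem.id, bi[0], bi[1])])


def report_sample_complete(qitem, output_token_ids):
    # All scenarios: the response carries the token count, which TEST06
    # checks against the number of tokens in the decoded answer.
    buf = array.array("B", np.array(output_token_ids, np.int32).tobytes())
    bi = buf.buffer_info()
    lg.QuerySamplesComplete(
        [lg.QuerySampleResponse(qitem.id, bi[0], bi[1], len(output_token_ids))]
    )
```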

Please see the [new docs site](https://docs.mlcommons.org/inference/benchmarks/language/llama3-405b) for an automated way to run this benchmark across the different available implementations and to do an end-to-end submission with or without Docker.


## Prepare environment

Copy the mlperf.conf file to this folder.
```
cp ../../mlperf.conf .
```

For a CPU-only run:

```
conda create -n llama3-405b python=3.9
conda activate llama3-405b

# Install packages
pip install -r requirements.txt

export CUR_DIR=${PWD}
cd <inference-repo-root>/loadgen
python -m pip install .
```

For a GPU-based run:

A Dockerfile is provided, along with scripts to help launch it. First, add any Docker volume mounts you want in
`launch.sh`. There is a section at the top of the file that looks like this:
```
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
$MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
)
```

For example, if you have a RAID space located at `/raid/data` on your local machine, you can mount it at the same path in the container like so:
```
# Add any volume mounts here with the following syntax
# /path/to/src:/path/to/dir/in/container
MOUNTS=(
$MLCOMMONS_REPO_PATH:$MLCOMMONS_REPO_PATH
/raid/data:/raid/data
)
```
Once you have added all your mounts, launch the container with `bash launch.sh`.

Inside the container, set up the environment with `bash build.sh`. This will install all the dependencies from the
CPU-only setup, as well as any GPU versions for applicable libraries like PyTorch.


## Get Model
### MLCommons Members Download

TODO: Host model and grant access to submitters


### External Download
+ First, go to the [llama3 request link](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and make a request. Then sign in to Hugging Face (if you don't have an account, you'll need to create one). **Please note your authentication credentials**, as you may be required to provide them when cloning below.
+ Git Large File Storage (LFS) is required:
```
export CHECKPOINT_PATH=${PWD}/Llama-3.1-405B-Instruct
git lfs install
git clone https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct ${CHECKPOINT_PATH}
```

## Get Dataset

### Preprocessed

You can use Rclone to download the preprocessed dataset from a Cloudflare R2 bucket.

To run Rclone on Windows, you can download the executable [here](https://rclone.org/install/#windows).
To install Rclone on Linux/macOS/BSD systems, run:
```
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```
Once Rclone is installed, run the following command to authenticate with the bucket:
```
rclone config create mlc-inference s3 provider=Cloudflare access_key_id=f65ba5eef400db161ea49967de89f47b secret_access_key=fbea333914c292b854f14d3fe232bad6c5407bf0ab1bebf78833c2b359bdfd2b endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com
```
You can then navigate in the terminal to your desired download directory and run the following command to download the dataset:

```
rclone copy mlc-inference:mlcommons-inference-wg-public/llama3_405b/mlperf_llama3.1_405b_dataset_8313_processed_fp16_eval.pkl ./ -P
```

You can also download the calibration dataset from the Cloudflare R2 bucket by running the following command:

```
rclone copy mlc-inference:mlcommons-inference-wg-public/llama3_405b/mlperf_llama3.1_405b_calibration_dataset_512_processed_fp16_eval.pkl ./ -P
```

## Run Performance Benchmarks

### Offline
```
python -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8312 \
--dataset-path ${DATASET_PATH} \
--output-log-dir output \
--tensor-parallel-size ${GPU_COUNT} \
		--vllm
```

### Server
```
python -u main.py --scenario Server \
--model-path ${CHECKPOINT_PATH} \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8312 \
--dataset-path ${DATASET_PATH} \
--output-log-dir output \
--tensor-parallel-size ${GPU_COUNT} \
--vllm
```

The ServerSUT was not tested for GPU runs.


## Run Accuracy Benchmarks

### Offline
```
OUTPUT_LOG_DIR=offline-accuracy-logs

mkdir -p "run_outputs" # The script will dump all the outputs to 'run_outputs'.

python -u main.py --scenario Offline \
--model-path ${CHECKPOINT_PATH} \
--accuracy \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8312 \
--dataset-path ${DATASET_PATH} \
		--output-log-dir ${OUTPUT_LOG_DIR} \
--tensor-parallel-size ${GPU_COUNT} \
--vllm


ACCURACY_LOG_FILE=${OUTPUT_LOG_DIR}/mlperf_log_accuracy.json
if [ -e ${ACCURACY_LOG_FILE} ]; then
python evaluate-accuracy.py --checkpoint-path ${CHECKPOINT_PATH} \
--mlperf-accuracy-file ${ACCURACY_LOG_FILE} --dataset-file ${DATASET_PATH} --dtype int32
fi
```

For the GPU run, the above steps have been automated in `run_accuracy.sh`. You can also modify this script to use
`--device cpu` to adapt it to a CPU-only run, as shown in the sketch below.
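
For instance, a CPU-only variant of the Offline accuracy command might look like the following (a sketch; it assumes `main.py` accepts `--device cpu` in place of the vLLM/tensor-parallel flags, as suggested above):

```
python -u main.py --scenario Offline \
		--model-path ${CHECKPOINT_PATH} \
		--accuracy \
		--dtype float16 \
		--device cpu \
		--user-conf user.conf \
		--total-sample-count 8312 \
		--dataset-path ${DATASET_PATH} \
		--output-log-dir ${OUTPUT_LOG_DIR}
```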


### Server
```
OUTPUT_LOG_DIR=server-accuracy-logs

python -u main.py --scenario Server \
--model-path ${CHECKPOINT_PATH} \
--accuracy \
--dtype float16 \
--user-conf user.conf \
--total-sample-count 8312 \
--dataset-path ${DATASET_PATH} \
		--output-log-dir ${OUTPUT_LOG_DIR} \
--tensor-parallel-size ${GPU_COUNT} \
--vllm


ACCURACY_LOG_FILE=${OUTPUT_LOG_DIR}/mlperf_log_accuracy.json
if [ -e ${ACCURACY_LOG_FILE} ]; then
python evaluate-accuracy.py --checkpoint-path ${CHECKPOINT_PATH} \
--mlperf-accuracy-file ${ACCURACY_LOG_FILE} --dataset-file ${DATASET_PATH} --dtype int32
fi
```

The ServerSUT was not tested for GPU runs.


## Accuracy Target
Running the GPU implementation in FP16 precision resulted in the following accuracy targets (normalized from a
0.0-1.0 scale to a 0-100 scale):