diff --git a/docs/source/getting_started/amd-installation.rst b/docs/source/getting_started/amd-installation.rst
index 3d736bf7120ec..61fcd45a26347 100644
--- a/docs/source/getting_started/amd-installation.rst
+++ b/docs/source/getting_started/amd-installation.rst
@@ -3,9 +3,7 @@
 Installation with ROCm
 ======================
 
-vLLM 0.2.4 onwards supports model inferencing and serving on AMD GPUs with ROCm.
-At the moment AWQ quantization is not supported in ROCm, but SqueezeLLM quantization has been ported.
-Data types currently supported in ROCm are FP16 and BF16.
+vLLM supports AMD GPUs with ROCm 5.7 and 6.0.
 
 Requirements
 ------------
@@ -13,114 +11,57 @@ Requirements
 * OS: Linux
 * Python: 3.8 -- 3.11
 * GPU: MI200s (gfx90a), MI300 (gfx942), Radeon RX 7900 series (gfx1100)
-* Pytorch 2.0.1/2.1.1/2.2
-* ROCm 5.7 (Verified on python 3.10) or ROCm 6.0 (Verified on python 3.9)
+* ROCm 6.0 and ROCm 5.7
 
 Installation options:
 
-#. :ref:`(Recommended) Quick start with vLLM pre-installed in Docker Image <quick_start_docker_rocm>`
-#. :ref:`Build from source <build_from_source_rocm>`
 #. :ref:`Build from source with docker <build_from_source_docker_rocm>`
+#. :ref:`Build from source <build_from_source_rocm>`
 
-.. _quick_start_docker_rocm:
-
-(Recommended) Option 1: Quick start with vLLM pre-installed in Docker Image
----------------------------------------------------------------------------
-
-This option is for ROCm 5.7 only:
-
-.. code-block:: console
-
-    $ docker pull embeddedllminfo/vllm-rocm:vllm-v0.2.4
-    $ docker run -it \
-       --network=host \
-       --group-add=video \
-       --ipc=host \
-       --cap-add=SYS_PTRACE \
-       --security-opt seccomp=unconfined \
-       --device /dev/kfd \
-       --device /dev/dri \
-       -v <path/to/model>:/app/model \
-       embeddedllminfo/vllm-rocm \
-       bash
-
-
-.. _build_from_source_rocm:
-
-Option 2: Build from source
----------------------------
-
-You can build and install vLLM from source:
-
-Below instruction is for ROCm 5.7 only.
-At the time of this documentation update, PyTorch on ROCm 6.0 wheel is not yet available on the PyTorch website.
-
-0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
-
-- `ROCm `_
-- `Pytorch `_
-
-   .. code-block:: console
-
-       $ pip install torch==2.2.0.dev20231206+rocm5.7 --index-url https://download.pytorch.org/whl/nightly/rocm5.7 # tested version
-
-
-1. Install `flash attention for ROCm `_
-
-   Install ROCm's flash attention (v2.0.4) following the instructions from `ROCmSoftwarePlatform/flash-attention `_
-
-.. note::
-    - If you are using rocm5.7 with pytorch 2.1.0 onwards, you don't need to apply the `hipify_python.patch`. You can build the ROCm flash attention directly.
-    - If you fail to install `ROCmSoftwarePlatform/flash-attention`, try cloning from the commit `6fd2f8e572805681cd67ef8596c7e2ce521ed3c6`.
-    - ROCm's Flash-attention-2 (v2.0.4) does not support sliding windows attention.
-    - You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
-
-2. Setup `xformers==0.0.23` without dependencies, and apply patches to adapt for ROCm flash attention
+.. _build_from_source_docker_rocm:
 
-   .. code-block:: console
+Option 1: Build from source with docker (recommended)
+-----------------------------------------------------
 
-       $ pip install xformers==0.0.23 --no-deps
-       $ bash patch_xformers.rocm.sh
+You can build and install vLLM from source.
 
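+Before you start, it can help to confirm that the host actually sees your AMD GPUs and which gfx target they report. This is only an optional sanity check and assumes the standard ROCm command-line tools are available on the host:
+
+.. code-block:: console
+
+    $ rocm-smi              # lists the detected AMD GPUs
+    $ rocminfo | grep gfx   # shows the gfx target (e.g. gfx90a, gfx942, gfx1100)
+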
-3. Build vLLM.
+First, build a docker image from `Dockerfile.rocm `_ and launch a docker container from the image.
 
-   .. code-block:: console
+`Dockerfile.rocm `_ uses ROCm 6.0 by default, but also supports ROCm 5.7.
+It provides flexibility to customize the build of the docker image using the following arguments:
 
-       $ cd vllm
-       $ pip install -U -r requirements-rocm.txt
-       $ python setup.py install # This may take 5-10 minutes. Currently, `pip install .`` does not work for ROCm installation
+* `BASE_IMAGE`: specifies the base image used when running ``docker build``, specifically the PyTorch on ROCm base image. We have tested ROCm 5.7 and ROCm 6.0. The default is `rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1`.
+* `BUILD_FA`: specifies whether to build CK flash-attention. The default is 1. For `Radeon RX 7900 series (gfx1100) `_, this should be set to 0 until flash-attention supports this target.
+* `FX_GFX_ARCHS`: specifies the GFX architecture that is used to build CK flash-attention, for example, `gfx90a;gfx942` for MI200 and MI300. The default is `gfx90a;gfx942`.
+* `FA_BRANCH`: specifies the branch used to build CK flash-attention in `ROCm's flash-attention repo `_. The default is `ae7928c`.
+* `BUILD_TRITON`: specifies whether to build Triton flash-attention. The default value is 1.
+
+Their values can be passed in when running ``docker build`` with ``--build-arg`` options.
 
-.. _build_from_source_docker_rocm:
-
-Option 3: Build from source with docker
------------------------------------------------------
+To build vLLM on ROCm 6.0 for the MI200 and MI300 series, you can use the default:
 
-You can build and install vLLM from source:
 
 .. code-block:: console
 
-Build a docker image from `Dockerfile.rocm`, and launch a docker container.
+    $ docker build -f Dockerfile.rocm -t vllm-rocm .
 
-The `Dockerfile.rocm` is designed to support both ROCm 5.7 and ROCm 6.0 and later versions. It provides flexibility to customize the build of docker image using the following arguments:
+To build vLLM on ROCm 6.0 for the Radeon RX 7900 series (gfx1100), you should specify ``BUILD_FA`` as below:
 
-* `BASE_IMAGE`: specifies the base image used when running ``docker build``, specifically the PyTorch on ROCm base image. We have tested ROCm 5.7 and ROCm 6.0. The default is `rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1`
-* `FX_GFX_ARCHS`: specifies the GFX architecture that is used to build flash-attention, for example, `gfx90a;gfx942` for MI200 and MI300. The default is `gfx90a;gfx942`
-* `FA_BRANCH`: specifies the branch used to build the flash-attention in `ROCmSoftwarePlatform's flash-attention repo `_. The default is `3d2b6f5`
-* `BUILD_FA`: specifies whether to build flash-attention. For `Radeon RX 7900 series (gfx1100) `_, this should be set to 0 before flash-attention supports this target.
+.. code-block:: console
 
-Their values can be passed in when running ``docker build`` with ``--build-arg`` options.
+    $ docker build --build-arg BUILD_FA="0" -f Dockerfile.rocm -t vllm-rocm .
 
-For example, to build docker image for vllm on ROCm 5.7, you can run:
+To build the docker image for vLLM on ROCm 5.7, you can specify ``BASE_IMAGE`` as below:
 
 .. code-block:: console
 
     $ docker build --build-arg BASE_IMAGE="rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1" \
        -f Dockerfile.rocm -t vllm-rocm .
 
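+The build arguments above can also be combined in a single ``docker build`` invocation. As an illustrative sketch only (the values below are examples, not recommendations; `ae7928c` is simply the documented default of `FA_BRANCH`), a build that compiles CK flash-attention just for MI200 (gfx90a) could look like:
+
+.. code-block:: console
+
+    $ docker build --build-arg FX_GFX_ARCHS="gfx90a" \
+        --build-arg FA_BRANCH="ae7928c" \
+        -f Dockerfile.rocm -t vllm-rocm .
+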
-To build vllm on ROCm 6.0, you can use the default:
+To run the above docker image ``vllm-rocm``, use the command below:
 
 .. code-block:: console
 
-    $ docker build -f Dockerfile.rocm -t vllm-rocm .
     $ docker run -it \
        --network=host \
        --group-add=video \
@@ -133,7 +74,13 @@ To build vllm on ROCm 6.0, you can use the default:
        vllm-rocm \
        bash
 
-Alternatively, if you plan to install vLLM-ROCm on a local machine or start from a fresh docker image (e.g. rocm/pytorch), you can follow the steps below:
+Where `<path/to/model>` is the location where the model is stored, for example, the weights for Llama 2 or Llama 3 models.
+
+
+.. _build_from_source_rocm:
+
+Option 2: Build from source
+---------------------------
 
 0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
 
@@ -141,32 +88,50 @@ Alternatively, if you plan to install vLLM-ROCm on a local machine or start from
 - `Pytorch `_
 - `hipBLAS `_
 
-1. Install `flash attention for ROCm `_
+For installing PyTorch, you can start from a fresh docker image, e.g., `rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2`, `rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1`, `rocm/pytorch-nightly`.
 
-   Install ROCm's flash attention (v2.0.4) following the instructions from `ROCmSoftwarePlatform/flash-attention `_
+Alternatively, you can install PyTorch using PyTorch wheels. You can check the PyTorch installation guide in the PyTorch `Getting Started `_ page.
+
+For ROCm 6.0:
+
+.. code-block:: console
+
+    $ pip3 install torch --index-url https://download.pytorch.org/whl/rocm6.0
+
+
+For ROCm 5.7:
+
+.. code-block:: console
+
+    $ pip install torch --index-url https://download.pytorch.org/whl/rocm5.7
+
+
+1. Install `Triton flash attention for ROCm `_
+
+Install ROCm's Triton flash attention (the default triton-mlir branch) following the instructions from `ROCm/triton `_
+
+2. Optionally, if you choose to use CK flash attention, you can install `flash attention for ROCm `_
+
+Install ROCm's flash attention (v2.0.4) following the instructions from `ROCm/flash-attention `_
 
 .. note::
     - If you are using rocm5.7 with pytorch 2.1.0 onwards, you don't need to apply the `hipify_python.patch`. You can build the ROCm flash attention directly.
-    - If you fail to install `ROCmSoftwarePlatform/flash-attention`, try cloning from the commit `6fd2f8e572805681cd67ef8596c7e2ce521ed3c6`.
+    - If you fail to install `ROCm/flash-attention`, try cloning from the commit `6fd2f8e572805681cd67ef8596c7e2ce521ed3c6`.
     - ROCm's Flash-attention-2 (v2.0.4) does not support sliding windows attention.
    - You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
 
-2. Setup `xformers==0.0.23` without dependencies, and apply patches to adapt for ROCm flash attention
-
-   .. code-block:: console
-
-       $ pip install xformers==0.0.23 --no-deps
-       $ bash patch_xformers.rocm.sh
-
 3. Build vLLM.
 
-   .. code-block:: console
+.. code-block:: console
 
-    $ cd vllm
-    $ pip install -U -r requirements-rocm.txt
-    $ python setup.py install # This may take 5-10 minutes.
+    $ cd vllm
+    $ pip install -U -r requirements-rocm.txt
+    $ python setup.py install # This may take 5-10 minutes. Currently, ``pip install .`` does not work for ROCm installation.
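+
+Once the build finishes, a quick way to sanity-check the installation is to run a tiny generation. The snippet below is only a sketch: ``facebook/opt-125m`` is an arbitrary small model used for illustration, and ``enforce_eager=True`` mirrors the ``--enforce-eager`` tip below.
+
+.. code-block:: console
+
+    $ python -c "from vllm import LLM; llm = LLM(model='facebook/opt-125m', enforce_eager=True); print(llm.generate('Hello, my name is'))"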
 
-.. note::
-    - You may need to turn on the ``--enforce-eager`` flag if you experience process hang when running the `benchmark_thoughput.py` script to test your installation.
+.. tip::
+    - You may need to turn on the ``--enforce-eager`` flag if you experience a process hang when running the `benchmark_throughput.py` script to test your installation.
+    - Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm-up step before collecting perf numbers.
+    - To use CK flash-attention, set ``export VLLM_USE_FLASH_ATTN_TRITON=0`` to turn off Triton flash attention (see the sketch below).
+    - Ideally, the ROCm version of PyTorch should match the ROCm driver version.
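+
+Putting the tips above together, a benchmarking run might look roughly like the following. This is a sketch only: `<path/to/model>` is a placeholder, the input/output length values are arbitrary, and the exact arguments accepted by `benchmark_throughput.py` should be checked with ``--help``.
+
+.. code-block:: console
+
+    $ export VLLM_USE_FLASH_ATTN_TRITON=0   # optional: switch from Triton to CK flash-attention
+    $ python benchmarks/benchmark_throughput.py --model <path/to/model> \
+        --input-len 128 --output-len 128 --enforce-eager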