Document vLLM integration as a first class citizen in Triton (#88)
Co-authored-by: Neelay Shah <[email protected]>
Co-authored-by: Ryan McCormick <[email protected]>
Showing 3 changed files with 147 additions and 5 deletions.
<!--
# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Python-based Backends

A Python-based backend is a special type of Triton backend that does not
require any C++ code. However, backends of this type depend on the
[Python backend](https://github.com/triton-inference-server/python_backend)
and require the following artifacts to be present:
`libtriton_python.so`, `triton_python_backend_stub`,
and `triton_python_backend_utils.py`.

## Usage

To implement and use a Python-based backend, follow these steps.
* Implement the
[`TritonPythonModel` interface](https://github.com/triton-inference-server/python_backend#usage),
which can be re-used as a backend by multiple models. This script must be
named `model.py`; a minimal sketch is shown after the configuration example
below.
* Create a folder for your custom backend under the backends directory
(e.g. `/opt/tritonserver/backends`) with the corresponding backend name,
containing the `model.py`. For example, for a backend named
`my_python_based_backend`, Triton would expect to find the full path
`/opt/tritonserver/backends/my_python_based_backend/model.py`.
* Make sure that `libtriton_python.so`, `triton_python_backend_stub`,
and `triton_python_backend_utils.py` are present either under
`/opt/tritonserver/backends/my_python_based_backend/` or
`/opt/tritonserver/backends/python/`. When both locations contain these
artifacts, the custom backend's artifacts take priority over the Python
backend's artifacts. This way, if a custom backend needs a different Python
version than the one shipped by default, that can easily be arranged. Please
refer to the [customization](#customization) section for more details.
* Specify `my_python_based_backend` as the backend in `config.pbtxt` for any
model that should use this backend.

```
...
backend: "my_python_based_backend"
...
```
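
For illustration, here is a minimal sketch of what such a `model.py` could look
like. It simply echoes its input back as its output; the tensor names `INPUT0`
and `OUTPUT0` are placeholders and must match whatever inputs and outputs your
models actually declare.

```
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # Triton passes a batch of requests; exactly one response must be
        # returned for each request, in the same order.
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            # Copy the input tensor straight into the output tensor.
            output_tensor = pb_utils.Tensor("OUTPUT0", input_tensor.as_numpy())
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[output_tensor])
            )
        return responses
```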

Since Triton uses the Python backend under the hood, you should expect to see a
`python` backend entry in the server logs, even when the Python backend is not
explicitly used.

```
I1013 21:52:45.756456 18668 server.cc:619]
+-------------------------+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
| Backend                 | Path                                                         | Config                                                                                                               |
+-------------------------+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
| python                  | /opt/tritonserver/backends/python/libtriton_python.so       | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability" |
|                         |                                                              | :"6.000000","default-max-batch-size":"4"}}                                                                           |
| my_python_based_backend | /opt/tritonserver/backends/my_python_based_backend/model.py | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability" |
|                         |                                                              | :"6.000000","default-max-batch-size":"4"}}                                                                           |
+-------------------------+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+
```

## Customization

The Python backend shipped in the NVIDIA GPU Cloud containers uses Python 3.10.
The Python backend is able to use the libraries that exist in the current
Python environment. These libraries can be installed in a virtualenv, conda
environment, or the global system Python, and will only be used if the Python
version matches the Python version of the Python backend's stub executable
(`triton_python_backend_stub`). For example, if you install a set of libraries
in a Python 3.9 environment and your Python backend stub is compiled with
Python 3.10, these libraries will *NOT* be available. You would need to
[compile](https://github.com/triton-inference-server/python_backend#building-custom-python-backend-stub)
the stub executable with Python 3.9.
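
When in doubt about which interpreter the stub actually embeds, one quick,
informal check is to print the version from `initialize`; the output shows up
in the Triton server log when the model loads. A minimal sketch:

```
import sys


class TritonPythonModel:
    def initialize(self, args):
        # Printed to the server log at model-load time; confirms which
        # interpreter the stub (triton_python_backend_stub) embeds.
        print(f"Backend stub is running Python {sys.version}", flush=True)

    # The required execute(self, requests) method is omitted from this sketch.
```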

If you want to create a tar file that contains all of your Python dependencies,
or you want to use different Python environments for each Python model, you
need to create a
[Custom Execution Environment](https://github.com/triton-inference-server/python_backend#creating-custom-execution-environments)
in the Python backend.

## Background

In some use cases, it is sufficient to implement the
[`TritonPythonModel` interface](https://github.com/triton-inference-server/python_backend#usage)
only once and re-use it across multiple models. As an example, please refer
to the [vLLM backend](https://github.com/triton-inference-server/vllm_backend),
which provides a common Python script to serve models supported by vLLM.

Triton Inference Server handles this special case by treating the common
`model.py` script as a Python-based backend. When a model relies on a custom
Python-based backend, Triton loads `libtriton_python.so` first; this ensures
that Triton knows how to send requests to the backend for execution and that
the backend knows how to communicate with Triton. Triton then uses the common
`model.py` from the backend's directory and does not look for it in the model
repository.
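
Assuming the hypothetical backend name used above and a hypothetical model
named `my_model`, the resulting layout might look like this, with the shared
`model.py` living only in the backend directory and the model repository entry
carrying only its configuration:

```
/opt/tritonserver/backends/my_python_based_backend/
    model.py        <-- common script, shared by every model using this backend

model_repository/
    my_model/
        config.pbtxt    <-- contains: backend: "my_python_based_backend"
        1/              <-- version directory; no model.py needed here
```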

While the only required function is `execute`, it is typically helpful to
enhance your implementation by adding `initialize`, `finalize`, and any other
helper functions. Users are also encouraged to make use of the
[`auto_complete_config`](https://github.com/triton-inference-server/python_backend#auto_complete_config)
function to define standardized input and output properties upfront.
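
A sketch of how these optional hooks can fit together is shown below. As
before, the tensor names, data types, and dims are placeholders rather than
part of any particular backend's contract:

```
import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        # Declare standardized inputs and outputs upfront so models using this
        # backend do not have to spell them out in every config.pbtxt.
        auto_complete_model_config.add_input(
            {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [-1]}
        )
        auto_complete_model_config.add_output(
            {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [-1]}
        )
        auto_complete_model_config.set_max_batch_size(0)
        return auto_complete_model_config

    def initialize(self, args):
        # 'args' carries the (possibly auto-completed) model configuration as a
        # JSON string, along with details such as the model name and version.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        # Required; same request/response handling as in the minimal sketch in
        # the Usage section above.
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            output_tensor = pb_utils.Tensor("OUTPUT0", input_tensor.as_numpy())
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[output_tensor])
            )
        return responses

    def finalize(self):
        # Called once when the model is unloaded; release any held resources.
        pass
```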