Document vLLM integration as a first class citizen in Triton (#88) #89

Merged
merged 1 commit into from
Oct 20, 2023
Merged
24 changes: 22 additions & 2 deletions README.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2021-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -66,7 +66,7 @@ Triton release.

**TensorRT**: The TensorRT backend is used to execute TensorRT
models. The
[server](https://github.com/triton-inference-server/tensorrt_backend)
[tensorrt_backend](https://github.com/triton-inference-server/tensorrt_backend)
repo contains the source for the backend.

**ONNX Runtime**: The ONNX Runtime backend is used to execute ONNX
@@ -115,6 +115,14 @@ random forest models. The
[fil_backend](https://github.com/triton-inference-server/fil_backend) repo
contains the documentation and source for the backend.

**vLLM**: The vLLM backend is designed to run
[supported models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
on a [vLLM engine](https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py).
This backend depends on [python_backend](https://github.com/triton-inference-server/python_backend)
to load and serve models. The
[vllm_backend](https://github.com/triton-inference-server/vllm_backend) repo
contains the documentation and source for the backend.

**Important Note!** Not all the above backends are supported on every platform
supported by Triton. Look at the
[Backend-Platform Support Matrix](docs/backend_platform_support_matrix.md)
@@ -567,3 +575,15 @@ but the listed CMake argument can be used to override.
* triton-inference-server/core: -DTRITON_CORE_REPO_TAG=[tag]

See the [CMakeLists.txt](CMakeLists.txt) file for other build options.

## Python-based Backends

Triton also provides an option to create [Python-based backends](docs/python_based_backends.md).
These backends implement the
[`TritonPythonModel` interface](https://github.com/triton-inference-server/python_backend#usage),
and a single implementation can be re-used as a backend by multiple models.
While the only required function is `execute`,
you may find it helpful to enhance your implementation by adding `initialize`,
`finalize`, and other helper functions. For an example, please refer to
the [vLLM backend](https://github.com/triton-inference-server/vllm_backend),
which provides a common Python script to serve models supported by vLLM.
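
As a rough, minimal sketch (not taken from the vLLM backend), a Python-based
backend's `model.py` might look like the following; the tensor names `INPUT0`
and `OUTPUT0` are placeholders for whatever your models actually use:

```python
import numpy as np
import triton_python_backend_utils as pb_utils  # available inside the Triton Python backend


class TritonPythonModel:
    def execute(self, requests):
        """`execute` is the only required function; it receives a batch of requests."""
        responses = []
        for request in requests:
            # Read a (placeholder) input tensor and echo it back as the output.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            output_tensor = pb_utils.Tensor(
                "OUTPUT0", input_tensor.as_numpy().astype(np.float32)
            )
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[output_tensor])
            )
        return responses
```
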
6 changes: 3 additions & 3 deletions docs/backend_platform_support_matrix.md
@@ -1,5 +1,5 @@
<!--
# Copyright 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# Copyright 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -38,7 +38,7 @@ GPU in this document refers to Nvidia GPU. See
[GPU, Driver, and CUDA Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)
to learn more about supported GPUs.

## Ubuntu 20.04
## Ubuntu 22.04

The table below describes target device(s) supported for inference by
each backend on different platforms.
@@ -53,7 +53,7 @@ each backend on different platforms.
| Python[^1] | :heavy_check_mark: GPU <br/> :heavy_check_mark: CPU | :heavy_check_mark: GPU <br/> :heavy_check_mark: CPU |
| DALI | :heavy_check_mark: GPU <br/> :heavy_check_mark: CPU | :heavy_check_mark: GPU[^2] <br/> :heavy_check_mark: CPU[^2] |
| FIL | :heavy_check_mark: GPU <br/> :heavy_check_mark: CPU | Unsupported |
| vLLM | :heavy_check_mark: GPU <br/> :heavy_check_mark: CPU | Unsupported |


## Windows 10
122 changes: 122 additions & 0 deletions docs/python_based_backends.md
@@ -0,0 +1,122 @@
<!--
# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Python-based Backends

A Python-based backend is a special type of Triton backend that does not
require any C++ code. However, this type of backend depends on the
[Python backend](https://github.com/triton-inference-server/python_backend)
and requires the following artifacts to be present:
`libtriton_python.so`, `triton_python_backend_stub`,
and `triton_python_backend_utils.py`.

## Usage
To implement and use a Python-based backend, follow these steps
(an example directory layout is sketched after them).
* Implement the
[`TritonPythonModel` interface](https://github.com/triton-inference-server/python_backend#usage)
in a script named `model.py`; a single implementation can be re-used
as a backend by multiple models.
* Create a folder for your custom backend under the backends directory
(e.g., `/opt/tritonserver/backends`) with the corresponding backend name,
containing `model.py`. For example, for a backend named
`my_python_based_backend`, Triton would expect to find the full path
`/opt/tritonserver/backends/my_python_based_backend/model.py`.
* Make sure that `libtriton_python.so`, `triton_python_backend_stub`,
and `triton_python_backend_utils.py` are present either under
`/opt/tritonserver/backends/my_python_based_backend/` or
`/opt/tritonserver/backends/python/`. When both locations contain
these artifacts, the custom backend's artifacts take priority over the
Python backend's artifacts. This allows a custom backend to use a different
Python version than the one shipped by default. Please refer to the
[customization](#customization) section for more details.
* Specify `my_python_based_backend` as the backend in `config.pbtxt`
for any model that should use this backend.

```
...
backend: "my_python_based_backend"
...
```
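
For illustration only, the resulting layout might look like the sketch below;
the model name `my_model` is a placeholder, and the contents of the version
directory depend on what your backend expects.

```
/opt/tritonserver/backends/
└── my_python_based_backend/
    ├── model.py                        # the common TritonPythonModel implementation
    ├── libtriton_python.so             # Python backend artifacts; may instead live
    ├── triton_python_backend_stub      # under /opt/tritonserver/backends/python/
    └── triton_python_backend_utils.py

<model_repository>/
└── my_model/
    ├── config.pbtxt                    # contains: backend: "my_python_based_backend"
    └── 1/
```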

Since Triton uses the Python backend under the hood, you should expect to see
a `python` backend entry in the server logs, even when the Python backend
is not explicitly used.

```
I1013 21:52:45.756456 18668 server.cc:619]
+-------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability" |
| | | :"6.000000","default-max-batch-size":"4"}} |
| my_python_based_backend | /opt/tritonserver/backends/my_python_based_backend/model.py | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability" |
| | | :"6.000000","default-max-batch-size":"4"}} |
+-------------------------+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
```

## Customization

The Python backend shipped in the NVIDIA GPU Cloud containers uses Python 3.10.
The Python backend can use the libraries that exist in the
current Python environment. These libraries can be installed in a virtualenv,
a conda environment, or the global system Python, and
will only be used if the Python version matches the Python version
of the Python backend's stub executable (`triton_python_backend_stub`).
For example, if you install a set of libraries in a Python 3.9 environment
and your Python backend stub is compiled with Python 3.10, these libraries
will *NOT* be available. You would need to
[compile](https://github.com/triton-inference-server/python_backend#building-custom-python-backend-stub)
the stub executable with Python 3.9.

If you want to create a tar file that contains all your Python dependencies,
or you want to use a different Python environment for each Python model,
you need to create a
[Custom Execution Environment](https://github.com/triton-inference-server/python_backend#creating-custom-execution-environments)
in the Python backend.
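
For reference, the Python backend documentation describes pointing a model at
such an environment through the `EXECUTION_ENV_PATH` parameter in its
`config.pbtxt`. A sketch, assuming an archive named `my_env.tar.gz`
(for example, created with `conda-pack`) placed in the model directory:

```
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/my_env.tar.gz"}
}
```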

## Background

In some use cases, it is sufficient to implement the
[`TritonPythonModel` interface](https://github.com/triton-inference-server/python_backend#usage)
only once and re-use it across multiple models. As an example, please refer
to the [vLLM backend](https://github.com/triton-inference-server/vllm_backend),
which provides a common Python script to serve models supported by vLLM.

Triton Inference Server handles this special case by treating the common
`model.py` script as a Python-based backend. When a model relies on a custom
Python-based backend, Triton loads `libtriton_python.so` first; this ensures
that Triton knows how to send requests to the backend for execution and that
the backend knows how to communicate with Triton. Triton then uses the common
`model.py` from the backend's directory, rather than looking for it in the
model repository.

While the only required function is `execute`, it is typically helpful
to enhance your implementation by adding `initialize`, `finalize`,
and any other helper functions. Users are also encouraged to make use of the
[`auto_complete_config`](https://github.com/triton-inference-server/python_backend#auto_complete_config)
function to define standardized input and output properties upfront.
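
A minimal sketch of an `auto_complete_config` implementation is shown below;
the tensor names, data types, shapes, and batch size are placeholders, not
tied to any particular backend.

```python
class TritonPythonModel:
    @staticmethod
    def auto_complete_config(auto_complete_model_config):
        # Declare standardized inputs/outputs so models using this backend
        # do not have to spell them out in every config.pbtxt.
        auto_complete_model_config.set_max_batch_size(4)
        auto_complete_model_config.add_input(
            {"name": "INPUT0", "data_type": "TYPE_FP32", "dims": [-1]}
        )
        auto_complete_model_config.add_output(
            {"name": "OUTPUT0", "data_type": "TYPE_FP32", "dims": [-1]}
        )
        return auto_complete_model_config
```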