The conceptual guides are designed as an onboarding experience for Triton Inference Server. They cover:
- Part 1: Model Deployment: This guide covers deploying and managing multiple models.
- Part 2: Improving Resource Utilization: This guide discusses two popular techniques, dynamic batching and concurrent model execution, used to maximize a GPU's utilization when deploying models (see the configuration sketch after this list).
- Part 3: Optimizing Triton Configuration: Every deployment has requirements specific to its use case. This guide walks users through tailoring the deployment configuration to meet their SLAs.
- Part 4: Accelerating Models: Another path to higher throughput is to accelerate the underlying models. This guide covers the SDKs and tools that can be used to accelerate models.
- Part 5: Building Model Ensembles: Models are rarely used standalone. This guide covers how to build a deep learning inference pipeline as a model ensemble (an ensemble configuration sketch also follows the list).
- Part 6: Using the BLS API to build complex pipelines: There are often scenarios where a pipeline requires control flow. Learn how to work with complex pipelines that use models deployed on different backends.
- Part 7: Iterative Scheduling Tutorial: Shows how to use the Triton Iterative Scheduler with a GPT-2 model from HuggingFace Transformers.
- Part 8: Semantic Caching: Shows the benefits of adding semantic caching to your LLM-based workflow.
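
For a flavor of what these guides configure, below is a minimal sketch of a Triton model configuration (`config.pbtxt`) enabling the two Part 2 techniques: `instance_group` runs multiple copies of a model concurrently on a GPU, and `dynamic_batching` lets Triton batch individual requests together on the server. The model name, backend, and tensor shapes are hypothetical placeholders, not taken from any specific guide.

```protobuf
# Hypothetical model configuration: model_repository/text_recognition/config.pbtxt
# Model name, backend, and tensor shapes are placeholders for illustration only.
name: "text_recognition"
backend: "onnxruntime"
max_batch_size: 8

input [
  {
    name: "INPUT"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# Part 2 techniques: run two instances of the model on the GPU (concurrent
# model execution) and let Triton group incoming requests into batches of up
# to max_batch_size (dynamic batching).
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```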
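
And a minimal sketch of the kind of ensemble configuration Part 5 builds toward: an ensemble model declares `platform: "ensemble"` and an `ensemble_scheduling` block that wires the output of one model into the input of the next. The step names and tensor names here are assumptions for illustration, reusing the hypothetical `text_recognition` model from the previous sketch.

```protobuf
# Hypothetical ensemble configuration (config.pbtxt) for a two-step
# preprocessing + recognition pipeline; model and tensor names are placeholders.
name: "inference_pipeline"
platform: "ensemble"
max_batch_size: 8

input [
  { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [ -1 ] }
]
output [
  { name: "SCORES", data_type: TYPE_FP32, dims: [ 1000 ] }
]

ensemble_scheduling {
  step [
    {
      # Step 1: a hypothetical preprocessing model decodes and resizes the image.
      model_name: "preprocess"
      model_version: -1
      input_map { key: "raw_image", value: "RAW_IMAGE" }
      output_map { key: "image_tensor", value: "preprocessed_image" }
    },
    {
      # Step 2: the recognition model consumes the intermediate tensor and
      # produces the final scores returned by the ensemble.
      model_name: "text_recognition"
      model_version: -1
      input_map { key: "INPUT", value: "preprocessed_image" }
      output_map { key: "OUTPUT", value: "SCORES" }
    }
  ]
}
```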