Clarify decoupled models documentation (#90)
Co-authored-by: Neelay Shah <[email protected]>
dyastremsky and nnshah1 authored Nov 14, 2023
1 parent 4ec906b commit 9ff8e19
Showing 1 changed file with 25 additions and 22 deletions.
47 changes: 25 additions & 22 deletions README.md
@@ -479,52 +479,55 @@ performance.

 ##### Decoupled Responses

-It is also possible for a backend to send multiple responses for a
-request or not send any responses for a request. A backend may also
+It is also possible for a backend to send multiple responses
+for a request. A backend may also
 send responses out-of-order relative to the order that the request
 batches are executed. Such backends are called *decoupled* backends.
-The decoupled backends use one `ResponseFactory` object per request to keep
-creating and sending any number of responses for the request. For this
-kind of backend, executing a single inference request typically requires
+
+The decoupled backends use one `ResponseFactory` object per request to
+create and send any number of responses for the request. They must send at
+least one final response per request (even if it is a flags-only response).
+You can send a flags-only response with TRITONBACKEND_ResponseFactorySendFlags.
+For this kind of backend, executing a single inference request typically requires
 the following steps:

-* For each request input tensor use TRITONBACKEND_InputProperties to
+1. For each request input tensor, use TRITONBACKEND_InputProperties to
 get shape and datatype of the input as well as the buffer(s)
 containing the tensor contents.

-* Create a `ResponseFactory` object for the request using
+2. Create a `ResponseFactory` object for the request using
 TRITONBACKEND_ResponseFactoryNew.

-1. Create a response from the `ResponseFactory` object using
-TRITONBACKEND_ResponseNewFromFactory. As long as you have
-`ResponseFactory` object you can continue creating responses.
+3. Create a response from the `ResponseFactory` object using
+TRITONBACKEND_ResponseNewFromFactory. As long as you have the
+`ResponseFactory` object, you can continue creating responses.

-2. For each output tensor which the request expects to be returned, use
+4. For each output tensor which the request expects to be returned, use
 TRITONBACKEND_ResponseOutput to create the output tensor of the
 required datatype and shape. Use TRITONBACKEND_OutputBuffer to get a
 pointer to the buffer where the tensor's contents should be written.

-3. Use the inputs to perform the inference computation that produces
+5. Use the inputs to perform the inference computation that produces
 the requested output tensor contents into the appropriate output
 buffers.

-4. Optionally set parameters in the response.
+6. Optionally set parameters in the response.

-5. Send the response using TRITONBACKEND_ResponseSend. If this is the
-last request then use TRITONSERVER_ResponseCompleteFlag with
-TRITONBACKEND_ResponseSend. Otherwise continue with Step 1 for
-sending next request
+7. Send the response using TRITONBACKEND_ResponseSend.

-* Release the request using TRITONBACKEND_RequestRelease.
+8. Repeat steps 3-7 until there are no more responses.
+
+9. Send the last response for a request using either TRITONBACKEND_ResponseSend
+with a TRITONSERVER_ResponseCompleteFlag or after all responses have been
+sent for a request using TRITONBACKEND_ResponseFactorySendFlags.
+This is required for every request.
+
+10. Release the request using TRITONBACKEND_RequestRelease.

 ###### Special Cases

 The decoupled API is powerful and supports various special cases:

-* If the backend should not send any more responses for the request,
-TRITONBACKEND_ResponseFactorySendFlags can be used to send
-TRITONSERVER_RESPONSE_COMPLETE_FINAL using the `ResponseFactory`.
-
 * The model can also send responses out of the order in which it received
 requests.
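
The finalization rule in the updated steps (every request gets a final flag, either on its last response or as a flags-only response) can be illustrated with a small toy model. This is a sketch in Python, not the Triton C API: `ResponseFactoryModel`, `run_decoupled`, and the constant `RESPONSE_COMPLETE_FINAL` are hypothetical names chosen for illustration, and raising an error when something is sent after the final flag is an assumption of this sketch rather than documented Triton behavior.

```python
# Toy model only: these names are hypothetical and are NOT the Triton C API.
# They mirror the call order described in the steps above.
RESPONSE_COMPLETE_FINAL = 1  # stands in for TRITONSERVER_RESPONSE_COMPLETE_FINAL


class ResponseFactoryModel:
    """Tracks the responses and the final flag for one request."""

    def __init__(self):
        self.sent = []           # payloads delivered so far
        self.final_sent = False  # True once the final flag has been sent

    def send_response(self, payload, flags=0):
        # Models TRITONBACKEND_ResponseSend on a response made from the factory.
        self._check_open()
        self.sent.append(payload)
        self._apply_flags(flags)

    def send_flags(self, flags):
        # Models TRITONBACKEND_ResponseFactorySendFlags: a flags-only response.
        self._check_open()
        self._apply_flags(flags)

    def _check_open(self):
        # Assumption of this sketch: nothing may follow the final flag.
        if self.final_sent:
            raise RuntimeError("final flag already sent for this request")

    def _apply_flags(self, flags):
        if flags & RESPONSE_COMPLETE_FINAL:
            self.final_sent = True


def run_decoupled(chunks, final_on_last_response):
    """Send one response per chunk, then make sure the request is finalized."""
    factory = ResponseFactoryModel()
    for i, chunk in enumerate(chunks):
        is_last = i == len(chunks) - 1
        flags = RESPONSE_COMPLETE_FINAL if (is_last and final_on_last_response) else 0
        factory.send_response(chunk, flags)
    if not factory.final_sent:
        # Covers both zero responses and "flags after all responses".
        factory.send_flags(RESPONSE_COMPLETE_FINAL)
    return factory
```

Calling `run_decoupled([], True)` models a request that produces no data responses: the only thing sent is the flags-only final response, which corresponds to the "even if it is a flags-only response" case in the text.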
