
Clarify decoupled models documentation #90

Merged · 15 commits · Nov 14, 2023

README.md: 46 changes (25 additions & 21 deletions)

@@ -479,52 +479,56 @@ performance.

##### Decoupled Responses

It is also possible for a backend to send multiple responses for a
request. A backend may also send responses out-of-order relative to the
order that the request batches are executed. Such backends are called
*decoupled* backends. The decoupled backends use one `ResponseFactory`
object per request to create and send any number of responses for the
request. They must send at least one final response per request (even
if it contains no output tensors). For this kind of backend, executing
a single inference request typically requires the following steps (a
code sketch of the full loop appears after the list):

1. For each request input tensor, use TRITONBACKEND_InputProperties to
get shape and datatype of the input as well as the buffer(s)
containing the tensor contents.

2. Create a `ResponseFactory` object for the request using
TRITONBACKEND_ResponseFactoryNew.

3. Create a response from the `ResponseFactory` object using
TRITONBACKEND_ResponseNewFromFactory. As long as you have the
`ResponseFactory` object, you can continue creating responses.

4. For each output tensor which the request expects to be returned, use
TRITONBACKEND_ResponseOutput to create the output tensor of the
required datatype and shape. Use TRITONBACKEND_OutputBuffer to get a
pointer to the buffer where the tensor's contents should be written.

5. Use the inputs to perform the inference computation that produces
the requested output tensor contents into the appropriate output
buffers.

6. Optionally set parameters in the response.

7. Send the response using TRITONBACKEND_ResponseSend.

8. Repeat steps 3-7 until there are no more responses.

9. Send the last response for a request using TRITONBACKEND_ResponseSend
with the TRITONSERVER_RESPONSE_COMPLETE_FINAL flag, or, after all
responses have been sent for a request, send the
TRITONSERVER_RESPONSE_COMPLETE_FINAL flag using
TRITONBACKEND_ResponseFactorySendFlags on the request's
`ResponseFactory`. This is required for every request.

10. Release the request using TRITONBACKEND_RequestRelease.
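
The following is a minimal C sketch of these steps for a single
request. It is illustrative only: error handling is omitted, it assumes
one input tensor and one output tensor named `OUTPUT0` with the same
shape and datatype as the input, and `HasMoreResponses` and
`ComputeNextOutput` are hypothetical model-specific helpers, not part
of the Triton API.

```c
#include <stddef.h>

#include "triton/core/tritonbackend.h"

/* Hypothetical model-specific helpers -- not part of the Triton API. */
extern int HasMoreResponses(void);
extern void ComputeNextOutput(void* buffer, uint64_t byte_size);

void
ProcessDecoupledRequest(TRITONBACKEND_Request* request)
{
  /* Step 1: get properties of the (assumed single) input tensor. */
  TRITONBACKEND_Input* input;
  TRITONBACKEND_RequestInput(request, 0 /* index */, &input);

  const char* name;
  TRITONSERVER_DataType dtype;
  const int64_t* shape;
  uint32_t dims_count;
  uint64_t byte_size;
  uint32_t buffer_count;
  TRITONBACKEND_InputProperties(
      input, &name, &dtype, &shape, &dims_count, &byte_size,
      &buffer_count);

  /* Step 2: one ResponseFactory per request. */
  TRITONBACKEND_ResponseFactory* factory;
  TRITONBACKEND_ResponseFactoryNew(&factory, request);

  /* Steps 3-8: keep creating and sending responses. */
  while (HasMoreResponses()) {
    TRITONBACKEND_Response* response;
    TRITONBACKEND_ResponseNewFromFactory(&response, factory);

    /* Step 4: create the output tensor and get its buffer. */
    TRITONBACKEND_Output* output;
    TRITONBACKEND_ResponseOutput(
        response, &output, "OUTPUT0", dtype, shape, dims_count);

    void* buffer;
    TRITONSERVER_MemoryType memory_type = TRITONSERVER_MEMORY_CPU;
    int64_t memory_type_id = 0;
    TRITONBACKEND_OutputBuffer(
        output, &buffer, byte_size, &memory_type, &memory_type_id);

    /* Step 5: model-specific computation fills the output buffer.
     * Step 6 (optional response parameters) is omitted here. */
    ComputeNextOutput(buffer, byte_size);

    /* Step 7: send the response; no flags, since more may follow. */
    TRITONBACKEND_ResponseSend(response, 0 /* send_flags */, NULL);
  }

  /* Step 9: mark the response stream complete. This is required for
   * every request, even if no responses carried output tensors. */
  TRITONBACKEND_ResponseFactorySendFlags(
      factory, TRITONSERVER_RESPONSE_COMPLETE_FINAL);
  TRITONBACKEND_ResponseFactoryDelete(factory);

  /* Step 10: release the request. */
  TRITONBACKEND_RequestRelease(request, TRITONSERVER_REQUEST_RELEASE_ALL);
}
```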

###### Special Cases

The decoupled API is powerful and supports various special cases:

* If the backend should not send any more responses for the request,
TRITONBACKEND_ResponseFactorySendFlags can be used to send
TRITONSERVER_RESPONSE_COMPLETE_FINAL using the `ResponseFactory` (see
the sketch after this list).

* The model can also send responses out of the order in which it
received the requests.
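
As a minimal sketch of the first case, assuming a `factory` created for
the request as in the sketch above (error handling omitted), the
backend can end the response stream without sending any responses:

```c
/* End the response stream for this request without sending any
 * responses, then release the request. */
TRITONBACKEND_ResponseFactorySendFlags(
    factory, TRITONSERVER_RESPONSE_COMPLETE_FINAL);
TRITONBACKEND_RequestRelease(request, TRITONSERVER_REQUEST_RELEASE_ALL);
```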
