Back to Triton Inference Server

Copyright (c) 2021-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

docs/user_guide/ragged_batching.md

2.68.05.8 KB
Original Source
<!-- # Copyright (c) 2021-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions # are met: # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # * Redistributions in binary form must reproduce the above copyright # notice, this list of conditions and the following disclaimer in the # documentation and/or other materials provided with the distribution. # * Neither the name of NVIDIA CORPORATION nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -->

Ragged Batching

Triton provides dynamic batching feature, which combines multiple requests for the same model execution to provide larger throughput. By default, the requests can be dynamically batched only if each input has the same shape across the requests. In order to exploit dynamic batching for cases where input shapes often vary, the client would need to pad the input tensors in the requests to the same shape.

Ragged batching is a feature to avoid explicit padding by allowing user to specify which of the inputs doesn't require the shape check. User can specify such input (ragged input) by setting allow_ragged_batch field in the model config:

...
input [
  {
    name: "input0"
    data_type: TYPE_FP32
    dims: [ 16 ]
    allow_ragged_batch: true
  }
]
...

How ragged input are processed in a batch of requests depends on the backend implementation. The backends, such as ONNX Runtime backend, TensorFlow backend, PyTorch backend, and TensorRT backend, require models to accept ragged inputs as 1-dimensional tensors. These backends concatenates the request inputs into the 1-dimensional tensor.

Because the concatenated input doesn't track the start and end index for each request, the backends often require the model to have additional input(s), batch input, that describe various information about the batch formed.

Batch Input

Batch input is often used in combination with ragged input to provide information about each batch element, such as the element count of an input for each request in the batch. A batch input is generated by Triton instead of being provided in the request, because the information can only be finalized after the dynamic batch is formed.

Besides element count, there are other batch input kinds that the user can specify, see the protobuf documentation for details.

Example on Ragged Input and Batch Input

If you have a model that accepts 1 variable length input tensor, INPUT, with shape [ -1, -1 ]. The first dimension is the batch dimension, and the second dimension is the variable-length content. When the client sends 3 requests of shapes [ 1, 3 ], [ 1, 4 ], [ 1, 5 ]. To exploit dynamic batching, the straight-forward way to implement this model would expect INPUT shape [ -1, -1 ] and assume that all inputs were padded to same length so that all requests become shape [ 1, 5 ] and thus Triton can batch and send them to the model as a single [ 3, 5 ] tensor. In this case, there will be overhead on padding the tensor and on extra model computation on the padded content. Below is the input config:

max_batch_size: 16
input [
  {
    name: "INPUT"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]

With triton ragged batching, the model will be implemented to expect INPUT shape [ -1 ] and an additional batch input, INDEX, shape [ -1 ] which the model should use to interpret the batch elements in INPUT. For such model, the client requests don't need to be padded and they can be sent as they are (with shapes [ 1, 3 ], [ 1, 4 ], [ 1, 5 ]). The backends discussed above will batch the input into a tensor of shape [ 12 ] which contains the 3 + 4 + 5 concatenation of the requests. Triton also creates the batch input tensor of shape [ 3 ] with value [ 3, 7, 12 ] which gives the offset into the input tensor where each batch element ends. Below is the input config:

max_batch_size: 16
input [
  {
    name: "INPUT"
    data_type: TYPE_FP32
    dims: [ -1 ]
    allow_ragged_batch: true
  }
]
batch_input [
  {
    kind: BATCH_ACCUMULATED_ELEMENT_COUNT
    target_name: "INDEX"
    data_type: TYPE_FP32
    source_input: "INPUT"
  }
]

The above example uses BATCH_ACCUMULATED_ELEMENT_COUNT type of ragged batching. Other types described in protobuf documentation operate similarly.