Ragged Batching

Triton provides a dynamic batching feature, which combines multiple requests for the same model into a single execution to increase throughput. By default, requests can be dynamically batched only if each input has the same shape across the requests. To exploit dynamic batching when input shapes often vary, the client would need to pad the input tensors in the requests to the same shape.

Ragged batching avoids explicit padding by letting the user specify which inputs do not require the shape check. The user marks such an input (a ragged input) by setting the allow_ragged_batch field in the model config:

...
input [
  {
    name: "input0"
    data_type: TYPE_FP32
    dims: [ 16 ]
    allow_ragged_batch: true
  }
]
...

How ragged inputs are processed in a batch of requests depends on the backend implementation. Backends such as the ONNX Runtime, TensorFlow, PyTorch, and TensorRT backends require models to accept ragged inputs as 1-dimensional tensors. These backends concatenate the request inputs into a single 1-dimensional tensor.

Because the concatenated input doesn't track the start and end index of each request, these backends often require the model to have one or more additional inputs, called batch inputs, that describe the batch that was formed.
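The loss of request boundaries can be seen in a minimal Python sketch. This is an illustration only, not Triton's implementation; the function name concatenate_ragged and the sample values are hypothetical.

```python
from typing import List


def concatenate_ragged(request_inputs: List[List[float]]) -> List[float]:
    """Flatten per-request inputs into one 1-dimensional list, in request order."""
    flat: List[float] = []
    for values in request_inputs:
        flat.extend(values)
    return flat


# Two hypothetical requests with 3 and 2 elements.
requests = [[1.0, 2.0, 3.0], [4.0, 5.0]]
batched = concatenate_ragged(requests)
print(batched)  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Nothing in the flattened list records that the first request ended after the third element, which is the information a batch input supplies.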

Batch Input

Batch input is often used in combination with ragged input to provide information about each batch element, such as the element count of an input for each request in the batch. A batch input is generated by Triton instead of being provided in the request, because the information can only be finalized after the dynamic batch is formed.

Besides element count, there are other batch input kinds that the user can specify; see the protobuf documentation for details.
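As a rough illustration of what such batch inputs contain, the following Python sketch computes a per-request element count and its running total for a formed batch. The helper names are hypothetical and the code only mimics the values; in practice Triton generates these tensors itself once the batch is formed.

```python
from itertools import accumulate
from typing import List


def element_counts(request_inputs: List[List[float]]) -> List[int]:
    """Number of elements contributed by each request in the batch."""
    return [len(values) for values in request_inputs]


def accumulated_element_counts(request_inputs: List[List[float]]) -> List[int]:
    """Running total of element counts, i.e. the end offset of each request."""
    return list(accumulate(element_counts(request_inputs)))


# Hypothetical batch of three requests with 3, 4, and 5 elements.
batch = [[0.0] * 3, [0.0] * 4, [0.0] * 5]
print(element_counts(batch))              # [3, 4, 5]
print(accumulated_element_counts(batch))  # [3, 7, 12]
```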

Example on Ragged Input and Batch Input

Suppose you have a model that accepts one variable-length input tensor, INPUT, with shape [ -1, -1 ]. The first dimension is the batch dimension, and the second dimension is the variable-length content. A client sends 3 requests of shapes [ 1, 3 ], [ 1, 4 ], and [ 1, 5 ]. To exploit dynamic batching, the straightforward implementation of this model expects INPUT shape [ -1, -1 ] and assumes that all inputs were padded to the same length, so that all requests become shape [ 1, 5 ] and Triton can batch them and send them to the model as a single [ 3, 5 ] tensor. In this case, there is overhead both in padding the tensors and in the extra model computation on the padded content. Below is the input config:

max_batch_size: 16
input [
  {
    name: "INPUT"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
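The padding overhead for the three requests in this example can be made concrete with a short Python sketch. It assumes zero is an acceptable pad value, which depends on the model; pad_to_longest is a hypothetical client-side helper, not part of any Triton client library.

```python
from typing import List


def pad_to_longest(request_inputs: List[List[float]], pad_value: float = 0.0) -> List[List[float]]:
    """Pad every request to the length of the longest one so shapes match."""
    longest = max(len(values) for values in request_inputs)
    return [list(values) + [pad_value] * (longest - len(values)) for values in request_inputs]


# Requests with 3, 4, and 5 elements, as in the example above.
requests = [[1.0] * 3, [2.0] * 4, [3.0] * 5]
padded = pad_to_longest(requests)

real_elements = sum(len(values) for values in requests)
padded_elements = sum(len(row) for row in padded)
print(len(padded), len(padded[0]))      # 3 5
print(padded_elements - real_elements)  # 3
```

Here 3 of the 15 elements in the batched tensor are padding; the gap grows as request lengths diverge.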

With Triton ragged batching, the model is implemented to expect INPUT shape [ -1 ] and an additional batch input, INDEX, of shape [ -1 ], which the model uses to interpret the batch elements in INPUT. For such a model, the client requests don't need to be padded and can be sent as they are (with shapes [ 1, 3 ], [ 1, 4 ], [ 1, 5 ]). The backends discussed above batch the input into a tensor of shape [ 12 ], which contains the concatenation of the 3 + 4 + 5 elements from the requests. Triton also creates the batch input tensor of shape [ 3 ] with value [ 3, 7, 12 ], which gives the offset into the input tensor where each batch element ends. Below is the input config:

max_batch_size: 16
input [
  {
    name: "INPUT"
    data_type: TYPE_FP32
    dims: [ -1 ]
    allow_ragged_batch: true
  }
]
batch_input [
  {
    kind: BATCH_ACCUMULATED_ELEMENT_COUNT
    target_name: "INDEX"
    data_type: TYPE_FP32
    source_input: "INPUT"
  }
]

The above example uses the BATCH_ACCUMULATED_ELEMENT_COUNT kind of batch input. Other kinds described in the protobuf documentation operate similarly.
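On the model side, the end offsets in INDEX are enough to recover each request's slice of the concatenated INPUT. The following is a minimal Python sketch of that logic, not code from any backend; split_by_end_offsets is a hypothetical helper. Because the config above declares INDEX as TYPE_FP32, the sketch assumes the offsets arrive as floats and casts them to integers before slicing.

```python
from typing import List


def split_by_end_offsets(flat_input: List[float], end_offsets: List[float]) -> List[List[float]]:
    """Recover per-request segments from a concatenated input and its end offsets."""
    segments: List[List[float]] = []
    start = 0
    for end in end_offsets:
        stop = int(end)  # offsets are assumed to be floats (TYPE_FP32)
        segments.append(flat_input[start:stop])
        start = stop
    return segments


# Concatenated INPUT of shape [ 12 ] and INDEX value [ 3, 7, 12 ] from the example.
flat = [float(i) for i in range(12)]
index = [3.0, 7.0, 12.0]
segments = split_by_end_offsets(flat, index)
print([len(segment) for segment in segments])  # [3, 4, 5]
```

A real model would perform the equivalent slicing with tensor operations inside its framework; the arithmetic on the offsets is the same.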