Runtimes: What are my options? How do I choose?

quickstart/IntroNotebooks/3. Understanding TensorRT Runtimes.ipynb


Remember that TensorRT consists of two main components - 1. a series of parsers and integrations to convert your model into an optimized engine, and 2. a set of TensorRT runtime APIs, with several associated tools, for deployment.

In this notebook, we will focus on the latter - the various runtime options for TensorRT engines and the use cases each is best suited for.

Considerations when picking a runtime:

Generally speaking, there are a few major considerations when picking a runtime:

  • Framework - Some options, like Torch-TRT, are only relevant to PyTorch
  • Time-to-solution - Torch-TRT is much more likely to work 'out-of-the-box' if a quick solution is required and ONNX fails
  • Serving needs - Torch-TRT can use TorchServe as a simple solution for serving models in the cloud. For other frameworks (or for more advanced features), TRITON is framework agnostic, allows concurrent model execution or multiple copies of a model within a GPU to reduce latency, and can accept engines created through both the ONNX and Torch-TRT paths
  • Performance - Different TensorRT runtimes offer varying levels of performance. For example, Torch-TRT is generally going to be slower than using ONNX or the C++ API directly.

Python API:

Use this when:

  • You can accept some performance overhead, and
  • You are most familiar with Python, or
  • You are performing initial debugging and testing with TRT

More info:

The TensorRT Python API gives you fine-grained control over the execution of your engine using a Python interface. It makes memory allocation, kernel execution, and copies to and from the GPU explicit - which can make integration into high performance applications easier. It is also great for testing models in a Python environment - such as in a Jupyter notebook.

The ONNX notebook for PyTorch is a good example of using TensorRT to get great performance while staying in Python.
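As a rough illustration of the explicit control described above, a minimal inference pass with the TensorRT Python API might look like the following. This is a sketch, not a definitive implementation: it assumes a TensorRT 8.x-style API with pycuda for memory management, and the engine filename (`model.engine`) and tensor shapes are placeholder values.

```python
# Sketch: running a serialized engine with the TensorRT Python API.
# Assumes TensorRT 8.x and pycuda; "model.engine" and all shapes are
# illustrative placeholders, and a CUDA-capable GPU is required.
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda

logger = trt.Logger(trt.Logger.WARNING)

# Deserialize a previously built engine from disk.
with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()

# Memory allocation is explicit: host buffers plus device buffers.
h_input = np.random.random((1, 3, 224, 224)).astype(np.float32)
h_output = np.empty((1, 1000), dtype=np.float32)
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)

# Copies to/from the GPU and kernel execution are also explicit steps.
cuda.memcpy_htod(d_input, h_input)
context.execute_v2(bindings=[int(d_input), int(d_output)])
cuda.memcpy_dtoh(h_output, d_output)
```

Every step - allocation, host-to-device copy, execution, device-to-host copy - is visible in the code, which is exactly what makes this API convenient for debugging and for integrating into a larger pipeline.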

C++ API:

Use this when:

  • You want the least amount of overhead possible to maximize the performance of your models and achieve better latency
  • You are not using Torch-TRT (though Torch-TRT graph conversions that only generate a single engine can still be exported to C++)
  • You are most familiar with C++
  • You want to optimize your inference pipeline as much as possible

More info:

The TensorRT C++ API gives you fine-grained control over the execution of your engine using a C++ interface. It makes memory allocation, kernel execution, and copies to and from the GPU explicit - which can make integration into high performance C++ applications easier. The C++ API is generally the most performant option for running TensorRT engines, with the least overhead.

This NVIDIA Developer blog is a good example of taking an ONNX model and running it with dynamic batch size support using the C++ API.

Torch-TRT Runtime:

Use this when:

  • You are using Torch-TRT, and
  • Your model converts to more than one TensorRT engine

More info:

Torch-TRT is the standard runtime used with models that were converted in Torch-TRT. It works by taking groups of nodes at once in the PyTorch graph, and replacing them with a singular optimized engine that calls the TensorRT Python API behind the scenes. This optimized engine is in the form of a PyTorch operation - which means that your graph is still in PyTorch and will essentially function like any other PyTorch model.

If your graph entirely converts to a single Torch-TRT engine, it can be more efficient to export the engine node and run it using one of the other APIs. You can find instructions to do this in the Torch-TRT documentation.

As an example, the Torch-TRT notebooks included with this guide use the Torch-TRT runtime.
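To give a concrete flavor of the workflow, compiling a model with Torch-TRT and running it through its runtime might look like the sketch below. This assumes the `torch_tensorrt` package and a CUDA-capable GPU; the model choice and input shape are illustrative placeholders, not part of this guide.

```python
# Sketch of the Torch-TRT path: compile a PyTorch model, then call it
# like any other PyTorch module. Assumes torch_tensorrt is installed and
# a GPU is available; model and shapes are placeholders.
import torch
import torch_tensorrt
import torchvision.models as models

model = models.resnet18(pretrained=True).eval().cuda()

# Supported subgraphs are replaced with embedded TensorRT engines.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.float32},
)

# The result is still a PyTorch module, so the usual calling
# conventions apply.
x = torch.randn(1, 3, 224, 224).cuda()
with torch.no_grad():
    y = trt_model(x)
```

Note that the compiled graph behaves like ordinary PyTorch, which is the point made above: the TensorRT engines are hidden behind PyTorch operations.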

TRITON Inference Server:

Use this when:

  • You want to serve your models over HTTP or gRPC
  • You want to load balance across multiple models or copies of models across GPUs to minimize latency and make better use of the GPU
  • You want to have multiple models running efficiently on a single GPU at the same time
  • You want to serve a variety of models converted using a variety of converters and frameworks (including Torch-TRT and ONNX) through a uniform interface
  • You need serving support but are using PyTorch, another framework, or the ONNX path in general

More info:

TRITON is an open source inference serving software that lets teams deploy trained AI models from any framework (TensorFlow, TensorRT, PyTorch, ONNX Runtime, or a custom framework) and from local storage, Google Cloud Platform, or AWS S3, on any GPU- or CPU-based infrastructure (cloud, data center, or edge). It is a flexible project with several unique features - such as concurrent execution of both heterogeneous models and multiple copies of the same model (multiple model copies can further reduce latency), as well as load balancing and model analysis. It is a good option if you need to serve your models over HTTP - such as in a cloud inferencing solution.

You can find the TRITON home page here, and the documentation here.
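To make the serving setup concrete: TRITON serves a TensorRT engine from a model repository containing the serialized engine plus a small configuration file. The sketch below shows an illustrative layout and `config.pbtxt`; the model name, tensor names, shapes, and instance count are placeholder values, and the `instance_group` entry demonstrates the multiple-model-copies feature mentioned above.

```
model_repository/
└── my_trt_model/
    ├── config.pbtxt
    └── 1/
        └── model.plan        # the serialized TensorRT engine

# config.pbtxt - illustrative values only
name: "my_trt_model"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  { name: "input", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] }
]
output [
  { name: "output", data_type: TYPE_FP32, dims: [ 1000 ] }
]
instance_group [
  { count: 2, kind: KIND_GPU }   # two concurrent copies on one GPU
]
```

With this repository in place, pointing the TRITON server at it makes the model available over HTTP and gRPC without any model-specific serving code.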