quickstart/IntroNotebooks/1. Introduction.ipynb
TensorRT is an SDK for optimizing trained deep learning models to enable high-performance inference. It contains a deep learning inference optimizer and an optimized runtime for execution. After you have trained your model in a framework of your choice, TensorRT enables you to run it with higher throughput and lower latency.
The TensorRT ecosystem breaks down broadly into two parts:
Essentially, if you have a model in PyTorch and want to run inference as efficiently as possible - with low latency, high throughput, and lower memory consumption - this guide will help you achieve just that!
TensorRT is a large and flexible project. It can handle a variety of workflows, and which workflow is best for you will depend on your specific use case and problem setting. Abstractly, the process for deploying a model from a deep learning framework to TensorRT looks like this:
To get there, this guide will help you answer five key questions:
This guide will walk you broadly through all of these decision points while giving you an overview of your options at each step.
We could talk about these points in isolation, but they are best understood in the context of an actual end-to-end workflow. Let's get started on a simple one here, using a TensorRT API wrapper written for this guide. Once you understand the basic workflow, you can dive into the more in-depth notebooks on the Torch-TRT and ONNX converters!
There are several ways of approaching TensorRT conversion and deployment. Here, we will take a pretrained ResNet50 model, convert it to an optimized TensorRT engine, and run it in the TensorRT runtime.
For this simple demonstration, we will focus on the ONNX path - one of the two main automatic approaches for TensorRT conversion. We will then run the model in the TensorRT Python API using a simplified wrapper written for this guide. Essentially, we will follow this path to convert and deploy our model:
We will follow the five questions above. For a more in-depth discussion, the section following this demonstration covers the options available at each of these steps in more detail.
IMPORTANT NOTE: Please shut down all other notebooks and PyTorch processes before running these steps. TensorRT and PyTorch cannot be loaded into the same Python process at the same time.
The two main automatic conversion paths for TensorRT require different model formats to successfully convert a model: Torch-TRT consumes saved PyTorch models, while the ONNX path requires models to be saved in ONNX format. Here, we will use ONNX.
We are going to use ResNet50 - a basic backbone vision model that can be used for a variety of purposes. For the sake of demonstration, here we will perform classification using a pretrained ResNet50 ONNX model included with the ONNX model zoo.
We can download a pretrained ResNet50 from the ONNX model zoo and untar it by doing the following:
!wget https://download.onnxruntime.ai/onnx/models/resnet50.tar.gz -O resnet50.tar.gz
!tar xzf resnet50.tar.gz
See how to export ONNX models that will work with this same trtexec command in the PyTorch through ONNX notebook.
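If you are starting from your own PyTorch model instead, the export typically looks something like the following sketch. Per the note above, run it in a separate process from TensorRT; the output filename here is just an illustrative choice.

# Reference sketch only - run in a separate process from TensorRT.
import torch
import torchvision.models as models

# torchvision >= 0.13 weights API; older versions use pretrained=True.
model = models.resnet50(weights="IMAGENET1K_V1").eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Export an ONNX file that the trtexec command below can consume.
torch.onnx.export(model, dummy_input, "resnet50_exported.onnx")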
Inference typically requires less numeric precision than training. With some care, lower precision can give you faster computation and lower memory consumption without sacrificing any meaningful accuracy. TensorRT supports TF32, FP32, FP16, FP8, and INT8 precisions.
FP32 is the default training precision of most frameworks, so we will start by using FP32 for inference here. Let's create a "dummy" batch to work with in order to test our model. TensorRT will use the precision of the input batch throughout the rest of the network by default.
import numpy as np
PRECISION = np.float32
# The input tensor shape of the ONNX model.
input_shape = (1, 3, 224, 224)
dummy_input_batch = np.zeros(input_shape, dtype=PRECISION)
The ONNX conversion path is one of the most universal and performant paths for automatic TensorRT conversion. It works for PyTorch and many other frameworks. There are several tools to help users convert models from ONNX to a TensorRT engine.
One common approach is to use trtexec - a command line tool included with TensorRT that can, among other things, convert ONNX models to TensorRT engines and profile them.
!trtexec --onnx=resnet50/model.onnx --saveEngine=resnet_engine_intro.engine
Notes on the flags above:

- `--onnx=resnet50/model.onnx` tells trtexec where to find our ONNX model.
- `--saveEngine=resnet_engine_intro.engine` tells trtexec where to save our optimized TensorRT engine.
After we have our TensorRT engine created successfully, we need to decide how to run it with TensorRT.
There are two types of TensorRT runtimes: a standalone runtime which has C++ and Python bindings, and a native integration into PyTorch. In this section, we will use a simplified wrapper (ONNXClassifierWrapper) which calls the standalone runtime.
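To give a sense of what the wrapper does under the hood, here is a minimal sketch of loading an engine with the standalone Python runtime. These are standard tensorrt API calls; allocating GPU buffers and copying data back and forth - which the wrapper handles for us - is omitted.

import tensorrt as trt

# Deserialize the engine file built by trtexec.
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("resnet_engine_intro.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# An execution context holds per-inference state (bindings, scratch memory).
context = engine.create_execution_context()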
!python3 -m pip install "numpy<2.0"
# If you get an error in this cell, restart your notebook (possibly your whole machine) and do not run anything that imports/uses PyTorch
from onnx_helper import ONNXClassifierWrapper
trt_model = ONNXClassifierWrapper("resnet_engine_intro.engine", target_dtype = PRECISION)
Note: If this conversion fails, please restart your Jupyter notebook kernel (in the menu bar, Kernel->Restart Kernel) and run steps 3 to 5 again. If you get an error like 'TypeError: pybind11::init(): factory function returned nullptr', there is likely some dangling process on the GPU - restart your machine and try again.
We will feed our batch of randomized dummy data into our ONNXClassifierWrapper to run inference on that batch:
# Warm up:
trt_model.predict(dummy_input_batch)[:10] # softmax probability predictions for the first 10 classes of the first sample
We can get a rough sense of performance using %%timeit:
%%timeit
trt_model.predict(dummy_input_batch)[:10]
This is a simple example applied to a single model, but how should you go about answering these questions for your workload?
First and foremost, it is a good idea to get an understanding of what your options are, and where you can learn more about them!
TensorRT is compatible with models built from its supported layers. Using only supported layers ensures optimal performance without having to write any custom plugin code.
In terms of framework, TensorRT is integrated directly with PyTorch - and most other major deep learning frameworks are supported by first converting to ONNX format.
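For example, a TensorFlow SavedModel can typically be brought onto the ONNX path with the separate tf2onnx package. This is a sketch: my_tf_model/ is a hypothetical SavedModel directory.

!python3 -m pip install tf2onnx
!python3 -m tf2onnx.convert --saved-model my_tf_model/ --output model.onnx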
The ONNX path is the most performant and framework-agnostic automatic way of converting models. It works for PyTorch and many other frameworks. Its main disadvantage is that it must convert networks in their entirety: if a network contains an unsupported layer, the ONNX path cannot convert it unless you write a custom plugin.
You can see an example of how to use TensorRT with ONNX:
Finally, there is the TensorRT API itself. The TensorRT ONNX path converts models to TensorRT engines for you automatically. Sometimes, however, we want to convert something complex, or we want the maximum amount of control over how our TensorRT engine is created. Using the API directly lets us do things like create custom plugins for layers that TensorRT doesn't support.
When using this approach, we create the TensorRT engine manually, operation by operation, using the TensorRT APIs available in Python and C++. This process involves building a network identical in structure to your target network using the TensorRT network definition API, and then loading the weights directly in the proper format. You can find more details on this in the TensorRT documentation.
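To make that concrete, here is a minimal sketch of the network definition API in Python. It builds a trivial one-layer network rather than a real model; the API calls shown are standard TensorRT Python APIs, though exact flags vary slightly between TensorRT versions.

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# On TensorRT 10+, networks are always explicit-batch and the flag can be dropped.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Define the network structure operation by operation.
x = network.add_input(name="input", dtype=trt.float32, shape=(1, 3, 224, 224))
relu = network.add_activation(x, trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))

# Build and serialize an engine from the network definition;
# for a real model you would also load weights layer by layer.
config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)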
TensorRT feature support - such as precision - for NVIDIA GPUs is determined by their compute capability. You can check the compute capability of your card on the NVIDIA website.
TensorRT supports different precisions depending on that compute capability. You can check which features your compute capability supports in the TensorRT documentation.
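On reasonably recent drivers, you can also query the compute capability directly from the command line (the compute_cap query field requires a newer nvidia-smi, so this may not work everywhere):

!nvidia-smi --query-gpu=compute_cap --format=csv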
TF32 is the default training precision on cards with compute capability 8.0 and higher (e.g. NVIDIA A100 and later). Use it when you want to replicate your original model's performance as closely as possible on those cards.
TF32 is a precision designed to preserve the range of FP32 with the precision of FP16. In practice, this means that TF32 models train faster than FP32 models while still converging to the same accuracy. This feature is only available on newer GPUs.
FP32 is the default training precision on cards with compute capability below 8.0 (e.g. GPUs before the NVIDIA A100). Use it when you want to replicate your original model's performance as closely as possible on those cards.
FP16 is an inference-focused reduced precision. It trades some accuracy for faster models with lower latency and a lower memory footprint. In practice, the accuracy loss from FP16 is generally negligible, so FP16 is a fairly safe bet for inference in most cases. Cards that are focused on deep learning training often have strong FP16 capabilities, making FP16 a great choice for GPUs that are expected to be used for both training and inference.

INT8 is an inference-focused reduced precision. It further reduces memory requirements and latency compared to FP16. INT8 has the potential to lose more accuracy than FP16, but TensorRT provides tools to help you quantize your network's weights to INT8 while avoiding this as much as possible. INT8 requires the extra step of calibrating how TensorRT should quantize your weights to integers, which requires some sample data. With careful tuning and a good calibration dataset, the accuracy loss from INT8 is often minimal. This makes INT8 a great precision for lower-power environments such as those using T4 GPUs or AGX Jetson modules - both of which have strong INT8 capabilities.
FP8 is an inference-focused reduced precision: an 8-bit floating point type with 1 sign bit, 4 exponent bits, and 3 mantissa bits. Similar to INT8, it further reduces memory requirements and latency compared to FP16, but can suffer from precision loss, especially in deep models with large ranges of values. As with INT8, an extra calibration step is needed to minimize accuracy loss. FP8 is supported on NVIDIA GPUs with compute capabilities of 9.0 and higher (e.g. NVIDIA H100 and later).
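In practice, requesting a reduced precision at conversion time is often just a flag on trtexec. For example, the following sketch (the output filename is an illustrative choice) allows FP16 kernels when building the engine; trtexec likewise accepts an --int8 flag, though INT8 benefits from a calibration dataset as discussed above.

!trtexec --onnx=resnet50/model.onnx --saveEngine=resnet_engine_fp16.engine --fp16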
For a more in-depth discussion of these options and how they compare, see this notebook on TensorRT Runtimes!
Here are several steps you can try if your model is not converting to TensorRT properly:
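For example, one quick first check when the ONNX path fails is to validate the ONNX file itself with the onnx package, using the model downloaded earlier:

import onnx

# Load the ONNX file and run the official checker;
# check_model raises an exception if the model is malformed.
model = onnx.load("resnet50/model.onnx")
onnx.checker.check_model(model)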
You have now taken a model saved in ONNX format, converted it to an optimized TensorRT engine, and deployed it using the Python runtime. This is a great first step towards getting better performance out of your deep learning models at inference time!
Now, you can check out the remaining notebooks in this guide. See:
<h4>Profiling</h4>
This is a great next step for further optimizing and debugging models that you are preparing for production.
You can find it here: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#performance
<h4>TRT Dev Docs</h4>
Main documentation page for the ONNX, layer builder, C++, and legacy APIs.
You can find it here: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html
<h4>TRT OSS GitHub</h4>
Contains OSS TRT components, sample applications, and plugin examples.
You can find it here: https://github.com/NVIDIA/TensorRT