quickstart/IntroNotebooks/3. Understanding TensorRT Runtimes.ipynb
Remember that TensorRT consists of two main components: 1. a series of parsers and integrations that convert your model into an optimized engine, and 2. a set of TensorRT runtime APIs, with several associated tools, for deployment.
In this notebook, we will focus on the latter - various runtime options for TensorRT engines.
Each runtime targets different use cases for running TensorRT engines.
Generally speaking, there are a few major considerations when picking a runtime:
Use this when:
More info:
The TensorRT Python API gives you fine-grained control over the execution of your engine using a Python interface. It makes memory allocation, kernel execution, and copies to and from the GPU explicit, which can make integration into high-performance applications easier. It is also great for testing models in a Python environment, such as in a Jupyter notebook.
The ONNX notebook for PyTorch is a good example of using TensorRT to get great performance while staying in Python.
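To make the "explicit memory allocation and copies" point concrete, here is a minimal sketch of running a serialized engine with the Python API. It assumes an engine file named `model.engine` with one input and one output binding, assumed shapes of `(1, 3, 224, 224)` and `(1, 1000)`, and that the `tensorrt` and `pycuda` packages are installed on a CUDA-capable machine.

```python
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda

logger = trt.Logger(trt.Logger.WARNING)

# Deserialize a previously built engine from disk ("model.engine" is assumed).
with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()

# The Python API leaves memory management to you: allocate host buffers and
# matching device buffers for each binding (shapes here are assumptions).
input_batch = np.random.randn(1, 3, 224, 224).astype(np.float32)
output = np.empty((1, 1000), dtype=np.float32)
d_input = cuda.mem_alloc(input_batch.nbytes)
d_output = cuda.mem_alloc(output.nbytes)

# Copies to and from the GPU are explicit, issued on a CUDA stream.
stream = cuda.Stream()
cuda.memcpy_htod_async(d_input, input_batch, stream)   # host -> device
context.execute_async_v2([int(d_input), int(d_output)], stream.handle)
cuda.memcpy_dtoh_async(output, d_output, stream)       # device -> host
stream.synchronize()
```

Because every allocation and copy is visible, you can reuse buffers across calls and overlap copies with compute, which is where much of the integration flexibility comes from.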
Use this when:
More info:
The TensorRT C++ API gives you fine-grained control over the execution of your engine using a C++ interface. It makes memory allocation, kernel execution, and copies to and from the GPU explicit, which can make integration into high-performance C++ applications easier. The C++ API is generally the most performant option for running TensorRT engines, with the least overhead.
This NVIDIA Developer blog is a good example of taking an ONNX model and running it with dynamic batch size support using the C++ API.
Use this when:
More info:
Torch-TRT is the standard runtime for models that were converted with Torch-TRT. It works by taking groups of nodes in the PyTorch graph and replacing them with a single optimized engine that calls the TensorRT Python API behind the scenes. This optimized engine takes the form of a PyTorch operation, which means your graph is still a PyTorch graph and will essentially function like any other PyTorch model.
If your graph entirely converts to a single Torch-TRT engine, it can be more efficient to export the engine node and run it using one of the other APIs. You can find instructions to do this in the Torch-TRT documentation.
As an example, the Torch-TRT notebooks included with this guide use the Torch-TRT runtime.
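As a minimal sketch of this workflow, the following compiles a model with Torch-TRT and then calls it exactly like a normal PyTorch module. It assumes `torch`, `torchvision`, and `torch_tensorrt` are installed and a GPU is available; the input shape and precision are illustrative choices.

```python
import torch
import torch_tensorrt
import torchvision.models as models

# Any traceable PyTorch model works; ResNet-50 is used here as an example.
model = models.resnet50(pretrained=True).eval().cuda()

# Convertible subgraphs are replaced with TensorRT engine nodes embedded
# in the PyTorch graph (shape and precision below are assumptions).
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.float32)],
    enabled_precisions={torch.float32},
)

# The compiled result still behaves like any other PyTorch module.
x = torch.randn(1, 3, 224, 224).cuda()
with torch.no_grad():
    out = trt_model(x)
```

Because the output is still a PyTorch module, it composes with the rest of your PyTorch code (e.g. `torch.jit.save` for deployment) without changing the calling convention.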
Use this when:
More info:
Triton is an open-source inference serving software that lets teams deploy trained AI models from any framework (TensorFlow, TensorRT, PyTorch, ONNX Runtime, or a custom framework) and from local storage, Google Cloud Platform, or AWS S3, on any GPU- or CPU-based infrastructure (cloud, data center, or edge). It is a flexible project with several unique features, such as concurrent model execution of both heterogeneous models and multiple copies of the same model (multiple model copies can further reduce latency), as well as load balancing and model analysis. It is a good option if you need to serve your models over HTTP, such as in a cloud inferencing solution.
You can find the Triton home page here, and the documentation here.
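To show how a TensorRT engine plugs into Triton, here is a sketch of a model configuration file (`config.pbtxt`) for a model repository. All names, shapes, and counts are illustrative assumptions; the `instance_group` entry demonstrates the "multiple copies of the same model" feature mentioned above.

```
name: "my_trt_model"          # hypothetical model name
platform: "tensorrt_plan"     # serve a serialized TensorRT engine
max_batch_size: 8
input [
  {
    name: "input"             # assumed binding name
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]     # assumed shape (per-sample, batch dim implied)
  }
]
output [
  {
    name: "output"            # assumed binding name
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
# Run two copies of the model on the GPU to reduce latency under load.
instance_group [ { count: 2, kind: KIND_GPU } ]
```

With this file placed alongside the engine in a Triton model repository, clients can send inference requests over HTTP or gRPC without any TensorRT-specific code on the client side.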