TensorRT-LLM

TensorRT-LLM optimizes LLM inference on NVIDIA GPUs. It compiles models into a TensorRT engine with in-flight batching, paged KV caching, and tensor parallelism. AutoDeploy accepts Transformers models without requiring any changes and automatically converts them to an optimized runtime.

Pass a model id from the Hub to build_and_run_ad.py to run a Transformers model.

```bash
cd examples/auto_deploy
python build_and_run_ad.py --model meta-llama/Llama-3.2-1B
```

Under the hood, AutoDeploy creates an LLM class. It loads the model configuration with [AutoConfig.from_pretrained] and extracts any parallelism metadata stored in tp_plan. [AutoModelForCausalLM.from_pretrained] loads the model with the config and enables Transformers' built-in tensor parallelism.

```python
from tensorrt_llm._torch.auto_deploy import LLM

llm = LLM(model="meta-llama/Llama-3.2-1B")
```
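To make the role of tp_plan more concrete, the sketch below shows one way such parallelism metadata could be interpreted. It is a pure-Python illustration, not AutoDeploy's code: the plan entries, the `sharding_style` and `shard_shape` helpers, and the weight shapes are all assumptions chosen for the example. It assumes a plan that maps module-name patterns to a "colwise" or "rowwise" style, and that linear weights are stored as (out_features, in_features).

```python
from fnmatch import fnmatchcase

# Illustrative plan in the style of a Transformers tp_plan. These entries are
# assumptions for this sketch and are not read from a real checkpoint.
EXAMPLE_TP_PLAN = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}


def sharding_style(module_name, tp_plan):
    """Return the sharding style for a module, or None if it is not sharded."""
    for pattern, style in tp_plan.items():
        if fnmatchcase(module_name, pattern):
            return style
    return None


def shard_shape(weight_shape, style, world_size):
    """Per-rank shape of an (out_features, in_features) linear weight."""
    out_features, in_features = weight_shape
    if style == "colwise":
        # Column-parallel: each rank holds a slice of the output features.
        return (out_features // world_size, in_features)
    if style == "rowwise":
        # Row-parallel: each rank holds a slice of the input features.
        return (out_features, in_features // world_size)
    # No entry in the plan: the weight is kept whole on every rank.
    return weight_shape


style = sharding_style("layers.0.self_attn.q_proj", EXAMPLE_TP_PLAN)
print(style, shard_shape((2048, 2048), style, world_size=2))
```

Pairing column-parallel projections with a row-parallel output projection is the usual reason a plan alternates the two styles: the intermediate activations stay sharded, and only the row-parallel layer needs a cross-rank reduction.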

TensorRT-LLM extracts the model graph with torch.export and applies optimizations. It replaces Transformers attention with TensorRT-LLM attention kernels and compiles the model into an optimized execution backend.
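The attention replacement described above can be pictured as a rewrite pass over a graph of operator nodes. The following is a toy, pure-Python stand-in for that idea and does not use torch.export or any TensorRT-LLM API: the `Node` class, the operator names `sdpa` and `fused_paged_attention`, and the `replace_attention` helper are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Node:
    name: str
    op: str  # operator identifier, e.g. "linear" or "sdpa"
    inputs: tuple = ()


# Hypothetical operator names used only in this sketch.
SOURCE_ATTENTION_OP = "sdpa"
FUSED_ATTENTION_OP = "fused_paged_attention"


def replace_attention(graph):
    """Return a new node list with each attention op swapped for the fused op.

    Node names and inputs are preserved, so downstream nodes stay connected.
    """
    return [
        replace(node, op=FUSED_ATTENTION_OP)
        if node.op == SOURCE_ATTENTION_OP
        else node
        for node in graph
    ]


# A single attention block written as a flat list of nodes.
graph = [
    Node("q", "linear", ("x",)),
    Node("k", "linear", ("x",)),
    Node("v", "linear", ("x",)),
    Node("attn", "sdpa", ("q", "k", "v")),
    Node("out", "linear", ("attn",)),
]

rewritten = replace_attention(graph)
print([node.op for node in rewritten])
```

A real pass would match patterns in the exported graph rather than a single operator name, but the shape of the transformation is the same: find the attention nodes, substitute an optimized kernel, and leave the rest of the graph intact.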

Resources

See the TensorRT-LLM docs for more detailed usage guides.
The AutoDeploy guide explains how AutoDeploy works and includes advanced examples.