docs/en/guides/triton-inference-server.md
The Triton Inference Server (formerly known as TensorRT Inference Server) is an open-source software solution developed by NVIDIA. It provides a cloud inference solution optimized for NVIDIA GPUs. Triton simplifies the deployment of AI models at scale in production. Integrating Ultralytics YOLO26 with Triton Inference Server allows you to deploy scalable, high-performance deep learning inference workloads. This guide provides steps to set up and test the integration.
<p align="center">
  <iframe loading="lazy" width="720" height="405" src="https://www.youtube.com/embed/NQDtfSi5QF4" title="Getting Started with NVIDIA Triton Inference Server" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
  <br>
  <strong>Watch:</strong> Getting Started with NVIDIA Triton Inference Server.
</p>

Triton Inference Server is designed to deploy a variety of AI models in production. It supports a wide range of deep learning and machine learning frameworks, including PyTorch, TensorFlow, ONNX, OpenVINO, TensorRT, and many others. Its primary use cases are:

- Serving multiple models from a single server instance
- Dynamic model loading and unloading without restarting the server
- Ensemble inference, combining multiple models to produce a result
- Model versioning for A/B testing and rolling updates
Using Triton Inference Server with Ultralytics YOLO26 provides several advantages:

- **Scalability:** A single server instance can serve many models and handle many concurrent requests.
- **High performance:** Features such as dynamic batching and concurrent model execution maximize GPU utilization.
- **Flexibility:** Models exported to ONNX, TensorRT, and other formats are served through the same HTTP/gRPC endpoints.
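Dynamic batching is configured per model in its `config.pbtxt`. As a minimal sketch (assuming the `tmp/triton_repo/yolo` repository layout created in the setup below, and noting that the dynamic batcher requires the final model configuration to have a positive `max_batch_size`), you could append a `dynamic_batching` stanza like this:

```python
from pathlib import Path

# Assumed path: matches the model repository created in the setup below
config_path = Path("tmp/triton_repo/yolo/config.pbtxt")

# Append a dynamic_batching stanza so Triton groups concurrent requests;
# max_queue_delay_microseconds bounds how long requests wait to form a batch
config_path.write_text(
    config_path.read_text()
    + """
dynamic_batching {
  max_queue_delay_microseconds: 100
}
"""
)
```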
Ensure you have the following prerequisites before proceeding:

- Docker or Podman installed, for running the Triton Inference Server container
- The `ultralytics` package:

    ```bash
    pip install ultralytics
    ```

- The `tritonclient` package:

    ```bash
    pip install tritonclient[all]
    ```
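As a quick optional sanity check, you can confirm both packages are installed and importable:

```python
from importlib.metadata import version

import ultralytics

# Print the Ultralytics version plus environment details (Python, torch, CUDA)
ultralytics.checks()

# Confirm the Triton client package is installed
print("tritonclient", version("tritonclient"))
```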
Run this full setup block to export Ultralytics YOLO26 to ONNX, build the Triton model repository, and start Triton Inference Server:
!!! note

    Use the `runtime` switch in the script to choose your container engine:

    - Set `runtime = "docker"` for Docker
    - Set `runtime = "podman"` for Podman
```python
import contextlib
import subprocess
import time
from pathlib import Path

from tritonclient.http import InferenceServerClient

from ultralytics import YOLO

runtime = "docker"  # set to "podman" to use Podman

# 1) Exporting YOLO26 to ONNX Format

# Load a model
model = YOLO("yolo26n.pt")  # load an official model

# Retrieve metadata during export. Metadata needs to be added to config.pbtxt. See next section.
metadata = []


def export_cb(exporter):
    metadata.append(exporter.metadata)


model.add_callback("on_export_end", export_cb)

# Export the model
onnx_file = model.export(format="onnx", dynamic=True)

# 2) Setting Up Triton Model Repository

# Define paths
model_name = "yolo"
triton_repo_path = Path("tmp") / "triton_repo"
triton_model_path = triton_repo_path / model_name

# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)

# Move ONNX model to Triton Model path
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")

# Create config file
(triton_model_path / "config.pbtxt").touch()

data = """
# Add metadata
parameters {
  key: "metadata"
  value {
    string_value: "%s"
  }
}

# (Optional) Enable TensorRT for GPU inference
# First run will be slow due to TensorRT engine conversion
optimization {
  execution_accelerators {
    gpu_execution_accelerator {
      name: "tensorrt"
      parameters {
        key: "precision_mode"
        value: "FP16"
      }
      parameters {
        key: "max_workspace_size_bytes"
        value: "3221225472"
      }
      parameters {
        key: "trt_engine_cache_enable"
        value: "1"
      }
      parameters {
        key: "trt_engine_cache_path"
        value: "/models/yolo/1"
      }
    }
  }
}
""" % metadata[0]  # noqa

with open(triton_model_path / "config.pbtxt", "w") as f:
    f.write(data)

# 3) Running Triton Inference Server

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:26.02-py3"  # 16.17 GB (Compressed Size)

subprocess.call(f"{runtime} pull {tag}", shell=True)

# GPU flags differ between Docker and Podman
gpu_flags = "--device nvidia.com/gpu=all" if runtime == "podman" else "--runtime=nvidia --gpus all"

container_name = "triton_server"

# Note: The :z flag on the volume mount is necessary for systems with SELinux (like Fedora/RHEL)
subprocess.call(
    f"{runtime} run -d --rm --name {container_name} {gpu_flags} -v {triton_repo_path.absolute()}:/models:z -p 8000:8000 {tag} tritonserver --model-repository=/models",
    shell=True,
)

# Wait for the Triton server to start
triton_client = InferenceServerClient(url="127.0.0.1:8000", verbose=False, ssl=False)

# Wait until the model is ready
for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)
```
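Once the readiness loop above exits, it can be worth confirming explicitly that the server and model are ready before sending traffic. A small check using the same `tritonclient` API (assuming the default HTTP port 8000 used above):

```python
from tritonclient.http import InferenceServerClient

client = InferenceServerClient(url="127.0.0.1:8000", verbose=False, ssl=False)

# These calls return True only once Triton has loaded the model repository
print("Server live: ", client.is_server_live())
print("Server ready:", client.is_server_ready())
print("Model ready: ", client.is_model_ready("yolo"))
```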
Run inference using the Triton Server model:
```python
from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://127.0.0.1:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")
```
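The call returns the usual Ultralytics `Results` objects, so the standard accessors apply. For example, to print each detection's class name, confidence, and box coordinates:

```python
for result in results:
    for box in result.boxes:
        cls_id = int(box.cls)  # class index
        conf = float(box.conf)  # confidence score
        xyxy = box.xyxy.tolist()  # bounding box in xyxy format
        print(result.names[cls_id], conf, xyxy)
```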
Clean up the container when you are done (`runtime` and `container_name` are defined in the setup block above):
```python
import subprocess

runtime = "docker"  # set to "podman" to use Podman
container_name = "triton_server"

# Kill the named container
subprocess.call(f"{runtime} kill {container_name}", shell=True)
```
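If you are fully done, you can also delete the temporary model repository created during setup (this removes the `tmp` directory used above, including the exported model):

```python
import shutil

# Remove the temporary Triton model repository and the exported ONNX model
shutil.rmtree("tmp", ignore_errors=True)
```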
For even greater performance, you can use TensorRT with Triton Inference Server. TensorRT is a high-performance deep learning optimizer built specifically for NVIDIA GPUs that can significantly increase inference speed.
Key benefits of using TensorRT with Triton include:

- **Lower latency and higher throughput:** TensorRT applies optimizations such as layer fusion and reduced-precision (FP16/INT8) execution.
- **Better GPU utilization:** Optimized engines make more efficient use of GPU memory and compute.
- **Engine caching:** As configured earlier, Triton can cache compiled TensorRT engines so the conversion cost is only paid on the first run.
To use TensorRT directly, you can export your Ultralytics YOLO26 model to TensorRT format:
```python
from ultralytics import YOLO

# Load the YOLO26 model
model = YOLO("yolo26n.pt")

# Export the model to TensorRT format
model.export(format="engine")  # creates 'yolo26n.engine'
```
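The exported engine loads like any other Ultralytics model for direct local inference. Note that a TensorRT engine is specific to the GPU and TensorRT version it was built with, so run it on matching hardware:

```python
from ultralytics import YOLO

# Load the exported TensorRT engine
trt_model = YOLO("yolo26n.engine")

# Run local GPU inference with the optimized engine
results = trt_model("path/to/image.jpg")
```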
For more information on TensorRT optimization, see the TensorRT integration guide.
By following the above steps, you can deploy and run Ultralytics YOLO26 models efficiently on Triton Inference Server, providing a scalable and high-performance solution for deep learning inference tasks. If you face any issues or have further queries, refer to the official Triton documentation or reach out to the Ultralytics community for support.
Setting up Ultralytics YOLO26 with NVIDIA Triton Inference Server involves a few key steps:
Export YOLO26 to ONNX format:
```python
from ultralytics import YOLO

# Load a model
model = YOLO("yolo26n.pt")  # load an official model

# Export the model to ONNX format
onnx_file = model.export(format="onnx", dynamic=True)
```
Set up Triton Model Repository:
```python
from pathlib import Path

# Define paths
model_name = "yolo"
triton_repo_path = Path("tmp") / "triton_repo"
triton_model_path = triton_repo_path / model_name

# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)

# Move the ONNX model into place and create the config file
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")
(triton_model_path / "config.pbtxt").touch()
```
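An empty `config.pbtxt` works here because Triton can auto-complete the configuration for ONNX models. If you prefer an explicit configuration, a minimal sketch (values assumed to match the dynamic-axis ONNX export above) could look like:

```python
# Write a minimal explicit config instead of relying on Triton's auto-complete.
# max_batch_size: 0 passes tensors through with their full shapes, which suits
# the dynamic-axis ONNX export used in this guide.
(triton_model_path / "config.pbtxt").write_text('name: "yolo"\nplatform: "onnxruntime_onnx"\nmax_batch_size: 0\n')
```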
Run the Triton Server:
```python
import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:26.02-py3"

runtime = "docker"  # set to "podman" to use Podman
subprocess.call(f"{runtime} pull {tag}", shell=True)

# GPU flags differ between Docker and Podman
gpu_flags = "--device nvidia.com/gpu=all" if runtime == "podman" else "--runtime=nvidia --gpus all"
container_name = "triton_server"

subprocess.call(
    f"{runtime} run -d --rm --name {container_name} {gpu_flags} -v {triton_repo_path.absolute()}:/models:z -p 8000:8000 {tag} tritonserver --model-repository=/models",
    shell=True,
)

# Wait until the model is ready
triton_client = InferenceServerClient(url="127.0.0.1:8000", verbose=False, ssl=False)

for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)
```
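Once the model reports ready, you can inspect its input and output tensors from the same client, which helps when debugging shape or datatype mismatches:

```python
# Query the served model's metadata: names, datatypes, and shapes of I/O tensors
model_metadata = triton_client.get_model_metadata(model_name)
print(model_metadata)
```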
This setup can help you efficiently deploy Ultralytics YOLO26 models at scale on Triton Inference Server for high-performance AI model inference.
Integrating Ultralytics YOLO26 with NVIDIA Triton Inference Server provides several advantages:

- **Scalability:** A single Triton instance can serve many models and handle many concurrent requests.
- **Dynamic batching:** Triton can group incoming requests into batches to improve throughput.
- **Model versioning and management:** Models can be loaded, unloaded, and updated without restarting the server.
For detailed instructions on setting up and running Ultralytics YOLO26 with Triton, refer to Setting Up Triton Inference Server and Running Inference.
Using ONNX (Open Neural Network Exchange) format for your Ultralytics YOLO26 model before deploying it on NVIDIA Triton Inference Server offers several key benefits:

- **Interoperability:** ONNX is a framework-agnostic format, so the same exported model runs across many runtimes and hardware targets.
- **Runtime support in Triton:** Triton serves ONNX models through ONNX Runtime, with optional TensorRT acceleration as shown earlier in this guide.
- **Dynamic shapes:** Exporting with `dynamic=True` lets the served model accept variable batch sizes and image dimensions.
To export your model, use:
```python
from ultralytics import YOLO

model = YOLO("yolo26n.pt")
onnx_file = model.export(format="onnx", dynamic=True)
```
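Optionally, you can sanity-check the exported graph before copying it into the model repository (this assumes the `onnx` package is installed):

```python
import onnx

# Load and structurally validate the exported ONNX graph
onnx_model = onnx.load(onnx_file)
onnx.checker.check_model(onnx_model)
print("ONNX model is well-formed")
```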
You can follow the steps in the ONNX integration guide to complete the process.
You can run inference with the Ultralytics YOLO26 model directly on NVIDIA Triton Inference Server. Once your model is set up in the Triton Model Repository and the server is running, you can load it and run inference as follows:
```python
from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://127.0.0.1:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")
```
This approach allows you to leverage Triton's optimizations while using the familiar Ultralytics YOLO interface.
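The served model also accepts the usual Ultralytics predict arguments, for example a confidence threshold and inference image size:

```python
# Standard predict arguments work against the Triton-served model too
results = model("path/to/image.jpg", conf=0.5, imgsz=640)
for result in results:
    print(len(result.boxes), "objects detected")
```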
Ultralytics YOLO26 offers several unique advantages compared to deploying plain TensorFlow and PyTorch models:

- **Unified Python API:** The same interface covers training, validation, inference, and export, including loading Triton-served models by URL as shown above.
- **Broad export support:** A single command exports a model to ONNX, TensorRT, OpenVINO, and other formats, so you can match the format to your serving stack.
- **Deployment-ready exports:** Options such as `dynamic=True` produce models that run on Triton with minimal additional configuration.
For more details, compare the deployment options in the model export guide.