docs/user_guide/performance_tuning.md
Given a trained model, how do I deploy it at-scale with an optimal configuration using Triton Inference Server? This document is here to help answer that.
For those who like a high level overview, below is the common flow for most use cases.
For those who wish to jump right in, skip to the end-to-end example.
For additional material, see the Triton Conceptual Guide tutorial.
1. Is my model compatible with Triton?
   - For backends that support auto-completing the model configuration, a
     `config.pbtxt` may still be provided, but is not required
     unless you want to explicitly set certain parameters.
     Additionally, by enabling verbose logging via `--log-verbose=1`, you can see
     the complete config that Triton sees internally in the server log output.
     For other backends, refer to the Minimal Model Configuration
     required to get started. A quick way to inspect the config Triton is using
     is shown in the sketch after this list.
2. Can I run inference on my served model?
   - A quick smoke test is to point Perf Analyzer at the served model:
# NOTE: "my_model" represents a model currently being served by Triton
$ perf_analyzer -m my_model
...
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 482.8 infer/sec, latency 12613 usec
   - If inference fails, check that the `config.pbtxt` inputs/outputs match what
     the model expects. If the config is correct, check that the model runs
     successfully using its original framework directly. If you don't have your
     own script or tool to do so, Polygraphy is a useful tool to run sample
     inferences on your model via various frameworks. Currently, Polygraphy
     supports ONNXRuntime, TensorRT, and TensorFlow 1.x; a usage sketch follows
     this list.
3. How can I improve my model performance?
   - Triton's Model Analyzer can try out different model configurations
     (`config.pbtxt`) to obtain different results, as demonstrated in the
     end-to-end example below.
4. My model performs slowly when it is first loaded by Triton (cold-start penalty), what do I do?
5. Why doesn't my model perform significantly faster on GPU?
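For the first question above, a quick way to inspect the configuration Triton
generated or completed for a loaded model is the model configuration endpoint
of Triton's HTTP API. A minimal sketch, assuming a server running locally on
the default HTTP port 8000 and a loaded model named `densenet_onnx`:

```bash
# Fetch the complete model config that Triton is actually using
# (assumes default HTTP port 8000 and a loaded model named "densenet_onnx")
curl -s localhost:8000/v2/models/densenet_onnx/config | python3 -m json.tool
```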
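For the Polygraphy suggestion in the second question, a usage sketch, assuming
the model file is named `model.onnx` and that Polygraphy and ONNX Runtime are
installed from PyPI:

```bash
# Install Polygraphy plus the ONNX Runtime backend it will drive
pip install polygraphy onnxruntime

# Run a sample inference on the model via ONNX Runtime
polygraphy run model.onnx --onnxrt
```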
> **Note**: If you have never worked with Triton before, you may be interested
> in first checking out the Quickstart example. Some basic understanding of
> Triton may be useful for the following section, but this example is meant to
> be straightforward enough without prior experience.
Let's take an ONNX model as our example since ONNX is designed to be a format that can be easily exported from most other frameworks.
First, create a model repository and download the densenet_onnx model into it.

```bash
# Create model repository with placeholder for model and version 1
mkdir -p ./models/densenet_onnx/1

# Download model and place it in model repository
wget -O models/densenet_onnx/1/model.onnx \
    https://github.com/onnx/models/raw/main/validated/vision/classification/densenet-121/model/densenet-7.onnx
```
Next, create a minimal Model Configuration for the densenet_onnx model in our
Model Repository at `./models/densenet_onnx/config.pbtxt`.

> **Note**: This is a slightly simplified version of another example config that
> utilizes other Model Configuration features not necessary for this example.
name: "densenet_onnx"
backend: "onnxruntime"
max_batch_size: 0
input: [
{
name: "data_0",
data_type: TYPE_FP32,
dims: [ 1, 3, 224, 224]
}
]
output: [
{
name: "prob_1",
data_type: TYPE_FP32,
dims: [ 1, 1000, 1, 1 ]
}
]
> **Note**: As of the 22.07 release, both Triton and Model Analyzer support
> fully auto-completing the config file for backends that support it. So for an
> ONNX model, for example, this step can be skipped unless you want to
> explicitly set certain parameters.
To serve our model, we will use the server container, which comes pre-installed
with a `tritonserver` binary.
```bash
# Start server container
docker run -ti --rm --gpus=all --network=host -v $PWD:/mnt --name triton-server nvcr.io/nvidia/tritonserver:26.04-py3

# Start serving your models
tritonserver --model-repository=/mnt/models
```
> **Note**: The `-v $PWD:/mnt` flag mounts your current directory on the host
> into the `/mnt` directory inside the container. So if you created your model
> repository in `$PWD/models`, you will find it inside the container at
> `/mnt/models`. You can change these paths as needed. See the docker volume
> docs for more information on how this works.
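If you would like to confirm the mount before starting the server, a quick
check from another shell (a sketch, assuming the model repository was created
at `$PWD/models` on the host):

```bash
# List the mounted model repository from inside the server container
docker exec -ti triton-server ls -R /mnt/models
```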
To check if the model loaded successfully, we expect to see our model in a
READY state in the output of the previous command:
```
...
I0802 18:11:47.100537 135 model_repository_manager.cc:1345] successfully loaded 'densenet_onnx' version 1
...
+---------------+---------+--------+
| Model         | Version | Status |
+---------------+---------+--------+
| densenet_onnx | 1       | READY  |
+---------------+---------+--------+
...
```
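Alternatively, if you prefer a scriptable check over reading the log, Triton's
HTTP API exposes readiness endpoints. A sketch, assuming the default HTTP port
8000; both requests return HTTP 200 when ready:

```bash
# Server-level readiness
curl -v localhost:8000/v2/health/ready

# Model-level readiness for the densenet_onnx model
curl -v localhost:8000/v2/models/densenet_onnx/ready
```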
To verify our model can perform inference, we will use the triton-client
container, started below, which comes with perf_analyzer pre-installed.
In a separate shell, we use Perf Analyzer to sanity check that we can run inference and get a baseline for the kind of performance we expect from this model.
In the example below, Perf Analyzer is sending requests to models served on the
same machine (localhost from the server container via `--network=host`).
However, you may also test models being served remotely at some `<IP>:<PORT>`
by setting the `-u` flag, such as `perf_analyzer -m densenet_onnx -u 127.0.0.1:8000`.
```bash
# Start the SDK container interactively
docker run -ti --rm --gpus=all --network=host -v $PWD:/mnt --name triton-client nvcr.io/nvidia/tritonserver:26.04-py3-sdk

# Benchmark model being served from step 3
perf_analyzer -m densenet_onnx --concurrency-range 1:4
```
```
...
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 265.147 infer/sec, latency 3769 usec
Concurrency: 2, throughput: 890.793 infer/sec, latency 2243 usec
Concurrency: 3, throughput: 937.036 infer/sec, latency 3199 usec
Concurrency: 4, throughput: 965.21 infer/sec, latency 4142 usec
```
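Perf Analyzer measures over HTTP by default. To see whether protocol overhead
matters for your model, you can repeat the measurement over gRPC. A sketch,
assuming Triton's default gRPC port 8001:

```bash
# Benchmark the same model over gRPC instead of the default HTTP
perf_analyzer -m densenet_onnx -i grpc -u localhost:8001 --concurrency-range 1:4
```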
While Model Analyzer comes pre-installed in the SDK (client) container and
supports various modes of connecting to a Triton server, for simplicity we will
install Model Analyzer in our server container to use the local
(default) mode.
To learn more about other methods of connecting Model Analyzer to a running
Triton Server, see the `--triton-launch-mode` Model Analyzer flag.
```bash
# Enter server container interactively
docker exec -ti triton-server bash

# Stop existing tritonserver process if still running
# because model-analyzer will start its own server
SERVER_PID=`ps | grep tritonserver | awk '{ printf $1 }'`
kill ${SERVER_PID}

# Install model analyzer
pip install --upgrade pip
pip install triton-model-analyzer wkhtmltopdf

# Profile the model using local (default) mode
# NOTE: This may take some time, in this example it took ~10 minutes
model-analyzer profile \
  --model-repository=/mnt/models \
  --profile-models=densenet_onnx \
  --output-model-repository-path=results

# Summarize the profiling results
model-analyzer analyze --analysis-models=densenet_onnx
```
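Beyond the summary, Model Analyzer can also generate detailed per-configuration
reports. A sketch, assuming the profiling step above has completed and using a
hypothetical export directory:

```bash
# Generate detailed reports for configurations found during profiling
# (config names match the summary table; /tmp/ma_reports is a hypothetical path)
model-analyzer report \
  --report-model-configs densenet_onnx_config_default,densenet_onnx_config_3 \
  --export-path /tmp/ma_reports
```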
Example Model Analyzer output summary:
> In 51 measurements across 6 configurations, `densenet_onnx_config_3` provides
> the best throughput: 323 infer/sec.
>
> This is a 92% gain over the default configuration (168 infer/sec), under the
> given constraints.
| Model Config Name | Max Batch Size | Dynamic Batching | Instance Count | p99 Latency (ms) | Throughput (infer/sec) | Max GPU Memory Usage (MB) | Average GPU Utilization (%) |
|---|---|---|---|---|---|---|---|
| densenet_onnx_config_3 | 0 | Enabled | 4/GPU | 35.8 | 323.13 | 3695 | 58.6 |
| densenet_onnx_config_2 | 0 | Enabled | 3/GPU | 59.575 | 295.82 | 3615 | 58.9 |
| densenet_onnx_config_4 | 0 | Enabled | 5/GPU | 69.939 | 291.468 | 3966 | 58.2 |
| densenet_onnx_config_default | 0 | Disabled | 1/GPU | 12.658 | 167.549 | 3116 | 51.3 |
In the table above, we see that setting our GPU Instance Count to 4 allows us
to achieve the highest throughput and nearly the lowest latency on this system.

Also, note that this densenet_onnx model has a fixed batch size that is
explicitly specified in the first dimension of the Input/Output dims,
therefore the `max_batch_size` parameter is set to 0 as described
here.
For models that support dynamic batch size, Model Analyzer would also tune the
`max_batch_size` parameter.
> **Warning**: These results are specific to the system running the Triton
> server, so, for example, on a smaller GPU we may not see improvement from
> increasing the GPU instance count. In general, running the same configuration
> on systems with different hardware (CPU, GPU, RAM, etc.) may provide different
> results, so it is important to profile your model on a system that accurately
> reflects where you will deploy your models for your use case.
In our example above, `densenet_onnx_config_3` was the optimal configuration,
so let's extract that `config.pbtxt` and put it back in our model repository
for future use.
```bash
# (optional) Backup our original config.pbtxt (if any) to another directory
cp /mnt/models/densenet_onnx/config.pbtxt /tmp/original_config.pbtxt

# Copy over the optimal config.pbtxt from Model Analyzer results to our model repository
cp ./results/densenet_onnx_config_3/config.pbtxt /mnt/models/densenet_onnx/
```
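For reference, based on the summary table above (4 model instances per GPU),
the key change in the optimized config is the instance group setting. Below is
a sketch of what that section likely looks like; the file copied from the
results directory is authoritative and may differ:

```protobuf
# Hypothetical excerpt of densenet_onnx_config_3/config.pbtxt:
# four execution instances of the model per GPU
instance_group: [
  {
    count: 4
    kind: KIND_GPU
  }
]
```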
Now that we have an optimized Model Configuration, we are ready to take our model to deployment. For further manual tuning, read the Model Configuration and Optimization docs to learn more about Triton's complete set of capabilities.
In this example, we happened to get both the highest throughput and nearly the
lowest latency from the same configuration, but in some cases this is a
tradeoff that must be made. Certain models or configurations may achieve a
higher throughput but also incur a higher latency in return. It is worthwhile
to fully inspect the reports generated by Model Analyzer to ensure your model
performance meets your requirements.