TensorRT Engine Explorer Tutorial

tools/experimental/trt-engine-explorer/notebooks/tutorial.ipynb

Use this notebook to learn how to use trex to explore the structure and characteristics of a TensorRT engine plan. Starting with TensorRT 8.2, the engine-plan graph and profiling data can be exported to JSON files. trex loads these files and queries their content using a simple API that wraps Pandas' API.

1. Create Engine Plan And Profiling JSON Files

Using trtexec it is easy to create the plan graph and profiling JSON files. Three flags are required:

 --exportProfile=$profile_json
 --exportLayerInfo=$graph_json
 --profilingVerbosity=detailed

A utility Python script, utils/process_engine.py, can be used to create the JSON files. This script executes trtexec, so it can only be invoked in an environment that has trtexec installed and available in $PATH.
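For illustration, the three flags can be combined into a single trtexec invocation. This is only a sketch: the model and output file names below are placeholders, assembled here as a Python argument list.

```python
# Placeholder paths -- substitute your own model and output file names.
onnx_model = "model.onnx"
graph_json = "engine.graph.json"
profile_json = "engine.profile.json"

# The three flags required by trex, plus the model to build and profile.
cmd = [
    "trtexec",
    f"--onnx={onnx_model}",
    f"--exportProfile={profile_json}",
    f"--exportLayerInfo={graph_json}",
    "--profilingVerbosity=detailed",
]
print(" ".join(cmd))
```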

python
# !python3 ../utils/process_engine.py ../tests/inputs/mobilenet.qat.onnx ../tests/inputs best

2. Load JSON Files

python
%matplotlib inline
import matplotlib.pyplot as plt
import os
import pandas as pd
import trex
import trex.notebook
import trex.plotting
import trex.graphing
import trex.df_preprocessing

# Configure a wider output (for the wide graphs)
trex.notebook.set_wide_display()

# Choose an engine file to load.  This notebook assumes that you've saved the engine
# to one of the following paths (the last assignment takes effect).
engine_name = "../tests/inputs/mobilenet.qat.onnx.engine"
engine_name = "../tests/inputs/mobilenet_v2_residuals.qat.onnx.engine"

3. Create an EnginePlan instance and start exploring

python
assert engine_name is not None
plan = trex.EnginePlan(
    f'{engine_name}.graph.json',
    f'{engine_name}.profile.json',
    f'{engine_name}.profile.metadata.json')

Summary

It is helpful to look at a high-level summary of the engine plan before diving into the details.

  • By default an EnginePlan's name is derived from the graph file name, but you can also set the name explicitly at any time. This is useful when there are multiple plans with the same file name.
  • The data is summarized from the input JSON files and may include results from profiling layers separately as well as from profiling the entire engine. The latter provides more accurate values for latency and inference throughput, while the former provides per-layer latency information for understanding the engine's behavior.
  • "Average time": refers to the sum of the layer latencies, when profiling layers separately.
  • "Latency": refers to the min, max, mean, median, and 99th-percentile of the engine latency measurements, when timing the engine without profiling layers.
  • "Throughput": is measured in inferences per second (IPS).

python
print(f"Summary for {plan.name}:\n")
plan.summary()

Get to know the plan dataframe

An EnginePlan is an object that wraps a Pandas DataFrame data-structure. Most of the examples below utilize this dataframe (df) for querying, slicing and rendering information about the EnginePlan.

The dataframe captures the information from the plan engine graph and profiling JSON files. If both JSON files are available the latency data of each layer is added as three new columns: latency.time (the total latency of the layer summed across all measurement iterations), latency.avg_time (the average latency of the layer), latency.pct_time (the latency of the layer as a proportion of the overall engine latency).
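For intuition, latency.pct_time can be derived from the per-layer averages with plain Pandas. This is a toy sketch with made-up values, not trex's exact computation:

```python
import pandas as pd

# Toy stand-in for plan.df: two layers with per-layer average latencies (ms).
df = pd.DataFrame({
    "Name": ["conv1", "conv2"],
    "latency.avg_time": [0.30, 0.10],
})

# latency.pct_time expresses each layer's share of the summed layer latencies.
df["latency.pct_time"] = 100 * df["latency.avg_time"] / df["latency.avg_time"].sum()
print(df["latency.pct_time"].tolist())  # [75.0, 25.0]
```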

When the dataframe is constructed, several new columns that summarize footprint information are added: total_io_size_bytes, weights_size, total_footprint_bytes.

A few of the column names are changed from the original JSON files to make them clearer.

Accessing the dataframe is straight-forward:

python
df = plan.df

You can print the names of the columns in the dataframe.

python
available_cols = df.columns
print(f"These are the column names in the plan:\n{available_cols}")

A dataframe can be rendered as a table. The columns come from many different layer types, so the dataframe is very sparse.

Use the column controls to sort or filter layers.

An interesting view sorts the layers by latency.pct_time.
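As a quick illustration of that view, a toy dataframe can be sorted by latency.pct_time with plain Pandas (the values here are made up):

```python
import pandas as pd

# Toy layers with made-up latency shares.
df = pd.DataFrame({"Name": ["a", "b", "c"],
                   "latency.pct_time": [5.0, 40.0, 12.0]})

# Sort by latency share, largest first.
sorted_names = df.sort_values(by="latency.pct_time", ascending=False)["Name"].tolist()
print(sorted_names)  # ['b', 'c', 'a']
```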

The dtale toolbar makes it easy to open the table in a new tab (useful for large tables) and to export the data to CSV and HTML.

python
trex.notebook.display_df(plan.df)

When rendering engine plan dataframes we usually want to reduce the visual clutter and render only the important columns.

The function clean_for_display does exactly that.

The column order is changed to bring important columns to the front.

Columns Inputs and Outputs are reformatted to reduce verbosity.

Finally, a few columns are dropped and NaNs are replaced with zeros.
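For intuition, that kind of cleanup can be sketched with plain Pandas on a toy frame. This is an illustration of the idea, not trex's actual implementation:

```python
import pandas as pd

# Toy frame with a NaN value and an unimportant column.
df = pd.DataFrame({
    "Name": ["conv1", "conv2"],
    "scratch": [None, None],           # column to drop
    "latency.pct_time": [75.0, None],  # NaN to replace
})

important = ["Name", "latency.pct_time"]  # bring key columns to the front
df = df[important]                        # reorder and drop the rest
df = df.fillna(0)                         # replace NaNs with zeros
print(df.columns.tolist())
print(df["latency.pct_time"].tolist())
```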

python
df = trex.df_preprocessing.clean_for_display(plan.df)
print(f"These are the column names in the plan:\n{df.columns}")
trex.notebook.display_df(df)

Layer Types

This example shows how to create a bar diagram of the count of each layer type.

trex provides a utility wrapper around Pandas' API, but you can freely use the Pandas API to extract data from the plan dataframe.

python
layer_types = trex.group_count(plan.df, 'type')

# Simple DF print
print(layer_types)

# dtale DF display
trex.notebook.display_df(layer_types)
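For example, the same count can be computed with plain Pandas, without the trex wrapper (toy data; this sketch only assumes the dataframe has a 'type' column):

```python
import pandas as pd

# Toy stand-in for plan.df.
df = pd.DataFrame({"type": ["Convolution", "Convolution", "Pooling"]})

# Pure-Pandas equivalent of counting layers per type.
layer_types = df.groupby("type").size().reset_index(name="count")
print(layer_types.to_dict("list"))
```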

trex provides wrappers to plotly's plotting API. plotly_bar2 is the main utility for creating bar charts.

python
trex.plotting.plotly_bar2(
    df=layer_types, 
    title='Layer Count By Type', 
    values_col='count', 
    names_col='type',
    orientation='v',
    color='type',
    colormap=trex.colors.layer_colormap,
    show_axis_ticks=(True, True));

Performance

Pandas' powerful API can be used on the Plan dataframe. For example, we can easily query for the 3 layers that consume the most time:

python
top3 = plan.df.nlargest(3, 'latency.pct_time')
trex.notebook.display_df(top3)

The chart below provides a quick view of the layer latencies. The values of df[values_col] set the bar height and the values of df[names_col] provide the bar name. In this case, the latency of each layer is plotted vs. the name of the layer. The colors of the bars are determined by color and colormap, if provided.

For example, in the statement colormap[df['type']] the bar colors are determined by the layer type via layer_colormap, a trex dictionary that maps layer types to preset colors.

python
trex.plotting.plotly_bar2(
    df=plan.df, 
    title="% Latency Budget Per Layer",
    values_col="latency.pct_time",
    names_col="Name",
    color='type',
    use_slider=False,
    colormap=trex.colors.layer_colormap);

plotly_hist is a wrapper of Plotly's histograms chart. It has arguments similar to plotly_bar2, but not as many. plotly_hist plots the histogram of df[values_col].

Here's a look at how the layer latencies are distributed:

python
trex.plotting.plotly_hist(
    df=plan.df, 
    title="Layer Latency Distribution", 
    values_col="latency.pct_time",
    xaxis_title="Latency (% of engine latency)",
    color='type',
    colormap=trex.colors.layer_colormap);

Pandas' aggregation and reduction operations can be used to provide interesting information.

Here we group the layer latencies by the layer types. The data can be displayed as a chart or as a summary table, like the one below.

python
time_pct_by_type = plan.df.groupby(["type"])[["latency.pct_time", "latency.avg_time"]].sum().reset_index()
trex.notebook.display_df(time_pct_by_type)
trex.plotting.plotly_bar2(
    df=time_pct_by_type,
    title="% Latency Budget Per Layer Type",
    values_col="latency.pct_time",
    names_col="type",
    orientation='h',
    color='type',
    colormap=trex.colors.layer_colormap);

Treemaps provide a different view of the profiling data.

In this example we use a Plotly Express Treemap directly, without any wrappers.

python
import plotly.express as px

fig = px.treemap(
    plan.df,
    path=['type', 'Name'],
    values='latency.pct_time',
    title='Treemap Of Layer Latencies (Size & Color Indicate Latency)',
    color='latency.pct_time')
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

This is another view of how layer latencies interact with layer data size.

python
fig = px.treemap(
    plan.df,
    path=['type', 'Name'],
    values='latency.pct_time',
    title='Treemap Of Layer Latencies (Size Indicates Latency. Color Indicates Activations Size)',
    color='total_io_size_bytes')
fig.update_traces(root_color="white")
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

Memory Traffic

python
trex.plotting.plotly_bar2(
    plan.df, 
    "Weights Sizes Per Layer", 
    "weights_size", "Name", 
    color='type', 
    colormap=trex.colors.layer_colormap)

trex.plotting.plotly_bar2(
    plan.df, 
    "Activations Sizes Per Layer", 
    "total_io_size_bytes", 
    "Name", 
    color='type', 
    colormap=trex.colors.layer_colormap)

trex.plotting.plotly_hist(
    plan.df, 
    "Layer Activations Sizes Distribution", 
    "total_io_size_bytes", 
    "Size (bytes)", 
    color='type', 
    colormap=trex.colors.layer_colormap)

plan.df["total_io_size_bytes"].describe()

Layers Precision

trex provides a wrapper to Plotly's pie charts as well. Several charts can be plotted in a grid.

precision_colormap colors the pie slices by the value of df['precision'].

python
charts = []
layer_precisions = trex.group_count(plan.df, 'precision')
charts.append((layer_precisions, 'Layer Count By Precision', 'count', 'precision'))

layers_time_pct_by_precision = trex.group_sum_attr(plan.df, grouping_attr='precision', reduced_attr='latency.pct_time')
display(layers_time_pct_by_precision)

charts.append((layers_time_pct_by_precision, '% Latency Budget By Precision', 'latency.pct_time', 'precision'))
trex.plotting.plotly_pie2("Precision Statistics", charts, colormap=trex.colors.precision_colormap);

python
trex.plotting.plotly_bar2(
    plan.df, 
    "% Latency Budget Per Layer\n(bar color indicates precision)", 
    "latency.pct_time", 
    "Name",
    color='precision',
    colormap=trex.colors.precision_colormap);

Graph Rendering

It is very helpful to draw the graph of the engine plan.

A formatter can be used to configure the colors of nodes. trex provides layer_type_formatter which paints graph nodes by their layer type, and precision_formatter which paints graph nodes according to their precision.

to_dot converts an EnginePlan to a dot file, which can be rendered to SVG or PNG.

SVG files render faster than PNG files; they are searchable and stay sharp and crisp at all resolutions. Because the graphs are large, it is recommended to view the rendered graph file in a separate browser window.

python
# Change the condition to False to color nodes by precision instead of layer type.
formatter = trex.graphing.layer_type_formatter if True else trex.graphing.precision_formatter
graph = trex.graphing.to_dot(plan, formatter)
svg_name = trex.graphing.render_dot(graph, engine_name, 'svg')

PNG files can be rendered inside the notebook, but the graphs are usually very large and resolution suffers.

python
png_name = trex.graphing.render_dot(graph, engine_name, 'png')
from IPython.display import Image
display(Image(filename=png_name))

Convolution Layers

Sometimes it is interesting to look at all layers of a certain type. You can use Pandas' API (e.g. query) to slice the dataframe by layer type:

python
convs1 = plan.df.query("type == 'Convolution'")
convs2 = df[df.type == 'Convolution']

However, trex provides a get_layers_by_type API that performs some layer-type-specific preprocessing, which is often useful. In the case of convolutions:

python
convs = plan.get_layers_by_type('Convolution')
trex.notebook.display_df(convs)

trex.plotting.plotly_bar2(
    convs, 
    "Latency Per Layer (%)\n(bar color indicates precision)",
    "attr.arithmetic_intensity", "Name",
    color='precision', 
    colormap=trex.colors.precision_colormap)

trex.plotting.plotly_bar2(
    convs,
    "Convolution Data Sizes\n(bar color indicates latency)",
    "total_io_size_bytes", 
    "Name", 
    color='latency.pct_time');

Arithmetic Intensity

Arithmetic intensity (AI) is a measure of the amount of compute expended per byte of data. Layers with higher AI are in general more efficient, because moving data is much slower than computing an operation. For each unit of data fetched from memory we want to perform many computations, because the GPU can compute much faster than it can fetch data from memory. For more on AI, see https://en.wikipedia.org/wiki/Roofline_model#Arithmetic_intensity.

This is a simplistic model which assumes that we read the data only once, and write it out once. In practice, when computing Convolution and GEMM operations, memory tiles are read several times, usually from fast shared-memory, L1 or L2 cached memory.
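As a worked example of the one-read/one-write model, with made-up numbers for a small convolution (not taken from any real profile):

```python
# Hypothetical counts for a small convolution layer.
macs = 2_000_000       # multiply-accumulate operations
ops = 2 * macs         # 2 ops (multiply + add) per MAC
bytes_moved = 500_000  # inputs + weights read once, outputs written once

# Arithmetic intensity: operations performed per byte of data moved.
arithmetic_intensity = ops / bytes_moved
print(arithmetic_intensity)  # 8.0
```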

python
trex.plotting.plotly_bar2(
    convs, 
    "Convolution Arithmetic Intensity\n(bar color indicates activations size)",
    "attr.arithmetic_intensity", 
    "Name",
    color='total_io_size_bytes')

trex.plotting.plotly_bar2(
    convs, 
    "Convolution Arithmetic Intensity\n(bar color indicates latency)", 
    "attr.arithmetic_intensity", 
    "Name",
    color='latency.pct_time');

Efficiency

Another simplistic model measures the compute and memory efficiency. These indicators are calculated by taking the number of operations (or memory bytes) and dividing by the layer's execution time.
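A worked example with hypothetical numbers (these values are made up for illustration):

```python
# Hypothetical per-layer measurements.
ops = 4_000_000        # compute operations in the layer
bytes_moved = 500_000  # bytes read + written (one-time read/write assumption)
latency_ms = 0.25      # measured layer execution time

# Efficiency indicators: work done per millisecond of execution.
compute_efficiency = ops / latency_ms         # operations per ms
memory_efficiency = bytes_moved / latency_ms  # bytes per ms
print(compute_efficiency, memory_efficiency)  # 16000000.0 2000000.0
```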

python
# Memory accesses per ms (assuming one time read/write penalty)
trex.plotting.plotly_bar2(
    convs, 
    "Convolution Memory Efficiency\n(bar color indicates latency)", 
    "attr.memory_efficiency", 
    "Name", 
    color='latency.pct_time')

# Compute operations per ms (assuming one time read/write penalty)
trex.plotting.plotly_bar2(
    convs, 
    "Convolution Compute Efficiency\n(bar color indicates latency)",
    "attr.compute_efficiency",
    "Name",
    color='latency.pct_time');

Statistics

python
convs = plan.get_layers_by_type('Convolution')

charts = []
convs_count_by_type = trex.group_count(convs, 'subtype')
charts.append((convs_count_by_type, 'Count', 'count', 'subtype'))

convs_time_pct_by_type = trex.group_sum_attr(convs, grouping_attr='subtype', reduced_attr='latency.pct_time')
charts.append((convs_time_pct_by_type, '% Latency Budget', 'latency.pct_time', 'subtype'))
trex.plotting.plotly_pie2("Convolutions Statistics (Subtype)", charts)

charts = []
convs_count_by_group_size = trex.group_count(convs, 'attr.groups')
charts.append((convs_count_by_group_size, 'Count', 'count', 'attr.groups'))

convs_time_pct_by_grp_size = trex.group_sum_attr(convs, grouping_attr='attr.groups', reduced_attr='latency.pct_time')
charts.append((convs_time_pct_by_grp_size, '% Latency Budget', 'latency.pct_time', 'attr.groups'))
trex.plotting.plotly_pie2("Convolutions Statistics (Number of Groups)", charts)

charts = []
convs_count_by_precision = trex.group_count(convs, 'precision')
charts.append((convs_count_by_precision, 'Count', 'count', 'precision'))

convs_time_pct_by_precision = trex.group_sum_attr(convs, grouping_attr='precision', reduced_attr='latency.pct_time')
charts.append((convs_time_pct_by_precision, '% Latency Budget', 'latency.pct_time', 'precision'))

trex.plotting.plotly_pie2("Convolutions Statistics (Precision)", charts, colormap=trex.colors.precision_colormap);