[TOC]
This guide demonstrates how to use the tools available with the TensorFlow Profiler to track the performance of your TensorFlow models. You will learn how to analyze how your model performs on the host (CPU), the device (GPU), or on a combination of both the host and device(s).
Profiling helps you understand the hardware resource consumption (time and memory) of the various TensorFlow operations (ops) in your model, resolve performance bottlenecks, and ultimately make the model execute faster.
This guide will walk you through how to install the Profiler, the various tools available, the different modes of how the Profiler collects performance data, and some recommended best practices to optimize model performance.
If you want to profile your model performance on Cloud TPUs, refer to the Cloud TPU guide.
Install the Profiler plugin for TensorBoard with pip. Note that the Profiler requires the latest versions of TensorFlow and TensorBoard (>=2.2).
pip install -U tensorboard_plugin_profile
To profile on the GPU, you must:
Meet the NVIDIA® GPU drivers and CUDA® Toolkit requirements listed on TensorFlow GPU support software requirements.
Make sure the NVIDIA® CUDA® Profiling Tools Interface (CUPTI) exists on the path:
/sbin/ldconfig -N -v $(sed 's/:/ /g' <<< $LD_LIBRARY_PATH) | \
grep libcupti
If you don't have CUPTI on the path, prepend its installation directory to the
$LD_LIBRARY_PATH environment variable by running:
export LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
Then, run the ldconfig command above again to verify that the CUPTI library is
found.
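The same check can be scripted. The following is a minimal sketch, assuming a Linux host where the CUPTI shared library has a file name starting with `libcupti.so`. The helper name `find_cupti` is hypothetical, and unlike `ldconfig` it only scans the directories listed in `LD_LIBRARY_PATH`, not the system linker cache.

```python
import glob
import os


def find_cupti(ld_library_path=None):
    """Return libcupti shared-library paths found in LD_LIBRARY_PATH directories."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    matches = []
    for directory in ld_library_path.split(":"):
        if not directory:
            continue  # Skip empty entries such as a trailing ':'.
        matches.extend(sorted(glob.glob(os.path.join(directory, "libcupti.so*"))))
    return matches


found = find_cupti()
print("\n".join(found) if found else "CUPTI not found on LD_LIBRARY_PATH")
```

An empty result suggests that the CUPTI directory still needs to be prepended to `LD_LIBRARY_PATH` as described above.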
When you run profiling with CUDA® Toolkit in a Docker environment or on Linux,
you may encounter issues related to insufficient CUPTI privileges
(CUPTI_ERROR_INSUFFICIENT_PRIVILEGES). Go to the
NVIDIA Developer Docs
to learn more about how you can resolve these issues on Linux.
To resolve CUPTI privilege issues in a Docker environment, start the container with the following docker run option:
--privileged=true
<a name="profiler_tools"></a>
Access the Profiler from the Profile tab in TensorBoard, which appears only after you have captured some model data.
Note: The Profiler requires internet access to load the Google Chart libraries. Some charts and tables may be missing if you run TensorBoard entirely offline on your local machine, behind a corporate firewall, or in a data center.
The Profiler has a selection of tools to help with performance analysis:
<a name="overview_page"></a>
The overview page provides a top-level view of how your model performed during a profile run. The page shows an aggregated overview for your host and all devices, along with some recommendations to improve your model training performance. You can also select individual hosts in the Host dropdown.
The overview page displays data as follows:
Performance Summary: Displays a high-level summary of your model performance. The performance summary has two parts:
Step-time breakdown: Breaks down the average step time into multiple categories of where time is spent:
Device compute precisions - Reports the percentage of device compute time that uses 16 and 32-bit computations.
Step-time Graph: Displays a graph of device step time (in milliseconds) over all the steps sampled. Each step is broken into multiple categories (with different colors) of where time is spent. The red area corresponds to the portion of the step time the devices were sitting idle waiting for input data from the host. The green area shows how much of the time the device was actually working.
Top 10 TensorFlow operations on device (e.g. GPU): Displays the on-device ops that ran the longest.
Each row displays an op's self time (as the percentage of time taken by all ops), cumulative time, category, and name.
Run Environment: Displays a high-level summary of the model run environment including:
Recommendation for Next Step: Reports when a model is input bound and recommends tools you can use to locate and resolve model performance bottlenecks.
<a name="input_pipeline_analyzer"></a>
When a TensorFlow program reads data from a file, it begins at the top of the TensorFlow graph in a pipelined manner. The read process is divided into multiple data processing stages connected in series, where the output of one stage is the input to the next one. This system of reading data is called the input pipeline.
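To make the idea of stages connected in series concrete, here is a framework-free sketch using Python generators. The stage names (`read_records`, `preprocess`, `batch`) and the in-memory list standing in for a file are illustrative assumptions, not TensorFlow APIs; in a real program the `tf.data` API plays this role.

```python
def read_records(lines):
    """Stage 1: yield raw records one at a time (stands in for file reading)."""
    for line in lines:
        yield line


def preprocess(records):
    """Stage 2: transform each record as it arrives from the previous stage."""
    for record in records:
        yield int(record.strip()) * 2


def batch(items, batch_size):
    """Stage 3: group consecutive items into fixed-size batches."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current  # Emit the final, possibly smaller, batch.


raw_lines = ["1\n", "2\n", "3\n", "4\n", "5\n"]
# The output of each stage is the input of the next one.
pipeline = batch(preprocess(read_records(raw_lines)), batch_size=2)
batches = list(pipeline)
print(batches)  # [[2, 4], [6, 8], [10]]
```

Because each stage pulls from the previous one, a slow stage anywhere in the chain delays every batch that reaches the consumer.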
A typical pipeline for reading records from files has the following stages:
An inefficient input pipeline can severely slow down your application. An application is considered input bound when it spends a significant portion of time in the input pipeline. Use the insights obtained from the input pipeline analyzer to understand where the input pipeline is inefficient.
The input pipeline analyzer tells you immediately whether your program is input bound and walks you through device- and host-side analysis to debug performance bottlenecks at any stage in the input pipeline.
Check the guidance on input pipeline performance for recommended best practices to optimize your data input pipelines.
To open the input pipeline analyzer, select Profile, then select input_pipeline_analyzer from the Tools dropdown.
The dashboard contains three sections:
The Summary reports if your program is input bound by presenting the percentage of device time spent on waiting for input from the host. If you are using a standard input pipeline that has been instrumented, the tool reports where most of the input processing time is spent.
The device-side analysis provides insights on time spent on the device versus on the host and how much device time was spent waiting for input data from the host.
The host-side analysis reports a breakdown of the input processing time (the
time spent on tf.data API ops) on the host into several categories:
Expand Input Op Statistics to inspect the statistics for individual input ops and their categories broken down by execution time.
A source data table will appear with each entry containing the following information:
<a name="tf_stats"></a>
The TensorFlow Stats tool displays the performance of every TensorFlow operation (op) that is executed on the host or device during a profiling session.
The tool displays performance information in two panes:
The upper pane displays up to four pie charts:
The lower pane shows a table that reports data about TensorFlow ops with one row for each op and one column for each type of data (sort columns by clicking the heading of the column). Click the Export as CSV button on the right side of the upper pane to export the data from this table as a CSV file.
Note that:
If any ops have child ops:
If an op executes on the host:
If an op executes on the device:
You can choose to include or exclude Idle time in the pie charts and table.
<a name="trace_viewer"></a>
The trace viewer displays a timeline that shows:
The trace viewer allows you to identify performance problems in your model, then take steps to resolve them. For example, at a high level, you can identify whether input or model training is taking the majority of the time. Drilling down, you can identify which ops take the longest to execute. Note that the trace viewer is limited to 1 million events per device.
When you open the trace viewer, it appears displaying your most recent run:
This screen contains the following main elements:
The Timeline pane contains the following elements:
The trace viewer contains the following sections:
Device nodes: one section for each device node (for example, /device:GPU:0 (pid 0)). Each device node section contains the following tracks:
Events within the timeline are displayed in different colors; the colors themselves have no specific meaning.
The trace viewer can also display traces of Python function calls in your
TensorFlow program. If you use the tf.profiler.experimental.start API, you can
enable Python tracing by using the ProfilerOptions namedtuple when starting
profiling. Alternatively, if you use the sampling mode for profiling, you can
select the level of tracing by using the dropdown options in the Capture
Profile dialog.
<a name="gpu_kernel_stats"></a>
This tool shows performance statistics and the originating op for every GPU accelerated kernel.
The tool displays information in two panes:
The upper pane displays a pie chart which shows the CUDA kernels that have the highest total time elapsed.
The lower pane displays a table with the following data for each unique kernel-op pair:
blockDim.x, blockDim.y, blockDim.z.
gridDim.x, gridDim.y, gridDim.z.
<a name="memory_profile_tool"></a>
The Memory Profile tool monitors the memory usage of your device during the profiling interval. You can use this tool to:
The memory profile tool displays data in three sections:
This section displays a high-level summary of the memory profile of your TensorFlow program as shown below:
The memory profile summary has six fields:
(1 - Size of the largest chunk of free memory / Total free memory).
This section displays a plot of the memory usage (in GiBs) and the percentage of fragmentation versus time (in ms).
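The fragmentation value follows the formula 1 - (size of the largest chunk of free memory / total free memory). The sketch below only illustrates that arithmetic; the function name and the list-of-chunk-sizes input are assumptions, and the Profiler computes this value internally from its own allocator data.

```python
def fragmentation(free_chunk_sizes):
    """Return memory fragmentation as a fraction in [0, 1].

    Implements 1 - (largest free chunk / total free memory).
    """
    total_free = sum(free_chunk_sizes)
    if total_free == 0:
        return 0.0  # Assumption: no free memory is reported as 0 fragmentation.
    return 1.0 - max(free_chunk_sizes) / total_free


# One contiguous 4 GiB free block: no fragmentation.
print(fragmentation([4.0]))  # 0.0
# The same 4 GiB split into four 1 GiB chunks: 75% fragmented.
print(fragmentation([1.0, 1.0, 1.0, 1.0]))  # 0.75
```

A high value means that free memory exists but is split into small chunks, so a large allocation can fail even when the total free memory looks sufficient.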
The X-axis represents the timeline (in ms) of the profiling interval. The Y-axis on the left represents the memory usage (in GiBs) and the Y-axis on the right represents the percentage of fragmentation. At each point in time on the X-axis, the total memory is broken down into three categories: stack (in red), heap (in orange), and free (in green). Hover over a specific timestamp to view the details about the memory allocation/deallocation events at that point like below:
The pop-up window displays the following information:
temp for temporaries, output for activations and
gradients, and persist/dynamic for weights and constants.
This table shows the active memory allocations at the point of peak memory usage in the profiling interval.
There is one row for each TensorFlow Op and each row has the following columns:
temp for temporaries, output for activations and
gradients, and persist/dynamic for weights and constants.
Note: You can sort any column in the table and also filter rows by op name.
<a name="pod_viewer"></a>
The Pod Viewer tool shows the breakdown of a training step across all workers.
<a name="tf_data_bottleneck_analysis"></a>
Warning: This tool is experimental. Please open a GitHub Issue if the analysis result seems incorrect.
The tf.data bottleneck analysis tool automatically detects bottlenecks in
tf.data input pipelines in your program and provides recommendations on how to
fix them. It works with any program using tf.data regardless of the platform
(CPU/GPU/TPU). Its analysis and recommendations are based on this
guide.
It detects a bottleneck by following these steps:
tf.data input pipeline.
The UI is divided into three sections: Performance Analysis Summary, Summary of All Input Pipelines and Input Pipeline Graph.
This section provides the summary of the analysis. It reports on slow tf.data
input pipelines detected in the profile. This section also shows the most input
bound host and its slowest input pipeline with the max latency. Most
importantly, it identifies which part of the input pipeline is the bottleneck
and how to fix it. The bottleneck information is provided with the iterator type
and its long name.
A long name is formatted as Iterator::<Dataset_1>::...::<Dataset_n>. In the
long name, <Dataset_n> matches the iterator type and the other datasets in the
long name represent downstream transformations.
For example, consider the following input pipeline dataset:
dataset = tf.data.Dataset.range(10).map(lambda x: x).repeat(2).batch(5)
The long names for the iterators from the above dataset will be:
| Iterator Type | Long Name |
|---|---|
| Range | Iterator::Batch::Repeat::Map::Range |
| Map | Iterator::Batch::Repeat::Map |
| Repeat | Iterator::Batch::Repeat |
| Batch | Iterator::Batch |
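The naming rule in the table can be reproduced with a few lines of plain Python. This is only an illustrative sketch: the helper `iterator_long_names` is hypothetical, and it assumes each dataset type appears once in the pipeline.

```python
def iterator_long_names(transformations):
    """Map each iterator type to its long name.

    `transformations` lists dataset types in the order they are applied,
    for example ["Range", "Map", "Repeat", "Batch"].
    """
    names = {}
    for i, iterator_type in enumerate(transformations):
        # Downstream transformations come first; the iterator's own type is last.
        downstream_first = reversed(transformations[i:])
        names[iterator_type] = "::".join(["Iterator", *downstream_first])
    return names


long_names = iterator_long_names(["Range", "Map", "Repeat", "Batch"])
for iterator_type, long_name in long_names.items():
    print(f"{iterator_type}: {long_name}")
```

Reading a long name from right to left therefore walks the pipeline from the reported iterator toward the final transformation.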
This section provides the summary of all input pipelines across all hosts.
Typically there is one input pipeline. When using the distribution strategy,
there is one host input pipeline running the program's tf.data code and
multiple device input pipelines retrieving data from the host input pipeline and
transferring it to the devices.
For each input pipeline, it shows the statistics of its execution time. A call is counted as slow if it takes longer than 50 μs.
This section shows the input pipeline graph with the execution time information. You can use "Host" and "Input Pipeline" to choose which host and input pipeline to see. Executions of the input pipeline are sorted by the execution time in descending order which you can choose using the Rank dropdown.
The nodes on the critical path have bold outlines. The bottleneck node, which is the node with the longest self time on the critical path, has a red outline. The other non-critical nodes have gray dashed outlines.
In each node, Start Time indicates the start time of the execution. The same
node may be executed multiple times, for example, if there is a Batch op in
the input pipeline. If it is executed multiple times, it is the start time of
the first execution.
Total Duration is the wall time of the execution. If it is executed multiple times, it is the sum of the wall times of all executions.
Self Time is Total Duration without the time that overlaps with its immediate child nodes.
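As a rough illustration of the self-time definition, the sketch below subtracts the time covered by immediate children from a node's wall time for a single execution. The interval representation and the choice to count overlapping children only once are assumptions made for this example, not a description of the Profiler's internal computation.

```python
def self_time(node_interval, child_intervals):
    """Return the node's duration minus time overlapped by its immediate children.

    Intervals are (start, end) pairs in the same time unit.
    """
    start, end = node_interval
    # Clip each child to the node's interval and drop non-overlapping children.
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in child_intervals
        if min(e, end) > max(s, start)
    )
    overlapped = 0
    cursor = start
    for s, e in clipped:
        # Count only the part not already covered by an earlier child.
        s = max(s, cursor)
        if e > s:
            overlapped += e - s
            cursor = e
    return (end - start) - overlapped


# A node running from 0 to 100 us with children covering 10-40 us and 30-60 us.
print(self_time((0, 100), [(10, 40), (30, 60)]))  # 50
```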
"# Calls" is the number of times the input pipeline is executed.
<a name="collect_performance_data"></a>
The TensorFlow Profiler collects host activities and GPU traces of your TensorFlow model. You can configure the Profiler to collect performance data through either the programmatic mode or the sampling mode.
You can use the following APIs to perform profiling.
Programmatic mode using the TensorBoard Keras Callback
(tf.keras.callbacks.TensorBoard)
# Profile from batches 10 to 15
tb_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir,
                                             profile_batch='10, 15')
# Train the model and use the TensorBoard Keras callback to collect
# performance profiling data
model.fit(train_data,
          steps_per_epoch=20,
          epochs=5,
          callbacks=[tb_callback])
Programmatic mode using the tf.profiler Function API
tf.profiler.experimental.start('logdir')
# Train the model here
tf.profiler.experimental.stop()
Programmatic mode using the context manager
with tf.profiler.experimental.Profile('logdir'):
    # Train the model here
    pass
Note: Running the Profiler for too long can cause it to run out of memory. It is recommended to profile no more than 10 steps at a time. Avoid profiling the first few batches to avoid inaccuracies due to initialization overhead.
<a name="sampling_mode"></a>
Sampling mode: Perform on-demand profiling by using
tf.profiler.experimental.server.start to start a gRPC server with your
TensorFlow model run. After starting the gRPC server and running your model,
you can capture a profile through the Capture Profile button in the
TensorBoard profile plugin. Use the script in the Install profiler section
above to launch a TensorBoard instance if it is not already running.
As an example,
# Start a profiler server before your model runs.
tf.profiler.experimental.server.start(6009)
# (Model code goes here).
# Send a request to the profiler server to collect a trace of your model.
tf.profiler.experimental.client.trace('grpc://localhost:6009',
                                      'gs://your_tb_logdir', 2000)
An example for profiling multiple workers:
# E.g., your worker IP addresses are 10.0.0.2, 10.0.0.3, 10.0.0.4, and you
# would like to profile for a duration of 2 seconds.
tf.profiler.experimental.client.trace(
    'grpc://10.0.0.2:8466,grpc://10.0.0.3:8466,grpc://10.0.0.4:8466',
    'gs://your_tb_logdir',
    2000)
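When the worker list comes from configuration rather than being typed by hand, the comma-separated service address can be assembled programmatically. A small sketch, where the helper name `build_service_addr` is hypothetical and the IP addresses and port are the ones from the example above:

```python
def build_service_addr(worker_ips, port):
    """Join worker addresses into the comma-separated form used above."""
    return ",".join(f"grpc://{ip}:{port}" for ip in worker_ips)


service_addr = build_service_addr(["10.0.0.2", "10.0.0.3", "10.0.0.4"], 8466)
print(service_addr)
# grpc://10.0.0.2:8466,grpc://10.0.0.3:8466,grpc://10.0.0.4:8466

# The result can then be passed as the first argument, for example:
# tf.profiler.experimental.client.trace(service_addr, 'gs://your_tb_logdir', 2000)
```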
<a name="capture_dialog"></a>
Use the Capture Profile dialog to specify:
To profile custom training loops in your TensorFlow code, instrument the
training loop with the tf.profiler.experimental.Trace API to mark the step
boundaries for the Profiler.
The name argument is used as a prefix for the step names, the step_num
keyword argument is appended to the step names, and the _r keyword argument
makes this trace event get processed as a step event by the Profiler.
As an example,
for step in range(NUM_STEPS):
    with tf.profiler.experimental.Trace('train', step_num=step, _r=1):
        train_data = next(dataset)
        train_step(train_data)
This will enable the Profiler's step-based performance analysis and cause the step events to show up in the trace viewer.
Make sure that you include the dataset iterator within the
tf.profiler.experimental.Trace context for accurate analysis of the input
pipeline.
The code snippet below is an anti-pattern:
Warning: This will result in inaccurate analysis of the input pipeline.
for step, train_data in enumerate(dataset):
    with tf.profiler.experimental.Trace('train', step_num=step, _r=1):
        train_step(train_data)
The profiler covers a number of use cases along four different axes. Some of the combinations are currently supported and others will be added in the future. Some of the use cases are:
The table below provides a quick overview of the TensorFlow-supported use cases mentioned above:
<a name="profiling_api_table"></a>
| Profiling API | Local | Remote | Multiple workers | Hardware Platforms |
|---|---|---|---|---|
| TensorBoard Keras Callback | Supported | Not Supported | Not Supported | CPU, GPU |
| tf.profiler.experimental start/stop API | Supported | Not Supported | Not Supported | CPU, GPU |
| tf.profiler.experimental client.trace API | Supported | Supported | Supported | CPU, GPU, TPU |
| Context manager API | Supported | Not Supported | Not Supported | CPU, GPU |
<a name="performance_best_practices"></a>
Use the following recommendations as applicable for your TensorFlow models to achieve optimal performance.
In general, perform all transformations on the device and ensure that you use the latest compatible version of libraries like cuDNN and Intel MKL for your platform.
Use the data from the input pipeline analyzer to optimize your data input pipeline. An efficient data input pipeline can drastically improve the speed of your model execution by reducing device idle time. Try to incorporate the best practices detailed in the Better performance with the tf.data API guide and below to make your data input pipeline more efficient.
In general, parallelizing any ops that do not need to be executed sequentially can significantly optimize the data input pipeline.
In many cases, it helps to change the order of some calls or to tune the arguments such that it works best for your model. While optimizing the input data pipeline, benchmark only the data loader without the training and backpropagation steps to quantify the effect of the optimizations independently.
Try running your model with synthetic data to check if the input pipeline is a performance bottleneck.
Use tf.data.Dataset.shard for multi-GPU training. Ensure you shard very
early on in the input loop to prevent reductions in throughput. When working
with TFRecords, ensure you shard the list of TFRecords and not the contents
of the TFRecords.
Parallelize several ops by dynamically setting the value of
num_parallel_calls using tf.data.AUTOTUNE.
Consider limiting the usage of tf.data.Dataset.from_generator as it is
slower compared to pure TensorFlow ops.
Consider limiting the usage of tf.py_function as it cannot be serialized
and is not supported to run in distributed TensorFlow.
Use tf.data.Options to control static optimizations to the input pipeline.
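To illustrate the sharding recommendation above, the following plain-Python sketch emulates the selection rule of `tf.data.Dataset.shard(num_shards, index)`, which keeps the elements whose position modulo `num_shards` equals `index`. The file names are hypothetical, and the helper operates on a Python list rather than a real dataset.

```python
def shard(items, num_shards, index):
    """Keep every element whose position modulo num_shards equals index."""
    return items[index::num_shards]


# Hypothetical TFRecord file names, sharded across 2 workers before reading.
filenames = [f"train-{i:05d}.tfrecord" for i in range(5)]
for worker_index in range(2):
    print(worker_index, shard(filenames, num_shards=2, index=worker_index))
```

Sharding the file list this way means each worker opens only its own files, whereas sharding the records would make every worker read all files and discard most of what it reads.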
Also read the tf.data performance analysis
guide for more
guidance on optimizing your input pipeline.
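To benchmark only the data loader, as recommended above, it is enough to iterate over the dataset without running any training step. The sketch below is one possible way to do that with the standard library. The names `benchmark_loader` and `slow_source` are hypothetical, and `slow_source` is an artificial stand-in for a real loader. The function accepts any Python iterable, which includes a `tf.data.Dataset` when running eagerly.

```python
import itertools
import time


def benchmark_loader(dataset, num_steps, warmup_steps=2):
    """Return the average seconds per element when only iterating `dataset`."""
    iterator = iter(dataset)
    # Discard warm-up elements so one-time setup cost is not measured.
    for _ in itertools.islice(iterator, warmup_steps):
        pass
    start = time.perf_counter()
    consumed = 0
    for _ in itertools.islice(iterator, num_steps):
        consumed += 1
    elapsed = time.perf_counter() - start
    if consumed == 0:
        raise ValueError("dataset produced no elements after warm-up")
    return elapsed / consumed


def slow_source(n, delay_s=0.001):
    """Stand-in for a data loader: yields n items with an artificial delay."""
    for i in range(n):
        time.sleep(delay_s)
        yield i


avg_seconds = benchmark_loader(slow_source(50), num_steps=20)
print(f"{avg_seconds * 1e3:.2f} ms per element")
```

Running the same measurement before and after an input pipeline change quantifies the effect of that change independently of the model.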
When working with image data, make your data augmentation more efficient by casting to different data types <b><i>after</i></b> applying spatial transformations, such as flipping, cropping, rotating, etc.
Note: Some ops like tf.image.resize transparently change the dtype to
fp32. Make sure you normalize your data to lie between 0 and 1 if it's not
done automatically. Skipping this step could lead to NaN errors if you have
enabled AMP.
In some instances, such as when you have a system with a high GPU to CPU ratio, even all of the above optimizations may not be enough to eliminate bottlenecks in the data loader caused by limited CPU cycles.
If you are using NVIDIA® GPUs for computer vision and audio deep learning applications, consider using the Data Loading Library (DALI) to accelerate the data pipeline.
Check the NVIDIA® DALI: Operations documentation for a list of supported DALI ops.
Run ops on multiple CPU threads with the tf.config.threading API to execute
them faster.
TensorFlow automatically sets the number of parallelism threads by default. The thread pool available for running TensorFlow ops depends on the number of CPU threads available.
Control the maximum parallel speedup for a single op by using
tf.config.threading.set_intra_op_parallelism_threads. Note that if you run
multiple ops in parallel, they will all share the available thread pool.
If you have independent non-blocking ops (ops with no directed path between them
on the graph), use tf.config.threading.set_inter_op_parallelism_threads to run
them concurrently using the available thread pool.
When working with smaller models on NVIDIA® GPUs, you can set
tf.compat.v1.ConfigProto.force_gpu_compatible=True to force all CPU tensors to
be allocated with CUDA pinned memory to give a significant boost to model
performance. However, exercise caution while using this option for unknown/very
large models as this might negatively impact the host (CPU) performance.
Follow the best practices detailed here and in the GPU performance optimization guide to optimize on-device TensorFlow model performance.
If you are using NVIDIA GPUs, log the GPU and memory utilization to a CSV file by running:
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv
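Once the output has been saved, it can be post-processed with the standard library. The sketch below parses text in the CSV layout that `nvidia-smi` prints for this query; the sample numbers are made up for illustration, and the helper `parse_gpu_log` is hypothetical. It assumes every field is reported as a number followed by a unit, so entries such as `[N/A]` would need extra handling.

```python
import csv
import io

# Made-up sample in the layout printed by the nvidia-smi query above.
SAMPLE_OUTPUT = """\
utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
35 %, 12 %, 16160 MiB, 10000 MiB, 6160 MiB
80 %, 40 %, 16160 MiB, 4000 MiB, 12160 MiB
"""


def parse_gpu_log(text):
    """Parse nvidia-smi CSV text into a list of {field: number} dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    # Strip the unit suffix, e.g. "memory.used [MiB]" -> "memory.used".
    header = [name.split(" [")[0] for name in next(reader)]
    rows = []
    for values in reader:
        if not values:
            continue  # Skip blank lines.
        # Each value looks like "35 %" or "16160 MiB"; keep the numeric part.
        rows.append({k: float(v.split()[0]) for k, v in zip(header, values)})
    return rows


samples = parse_gpu_log(SAMPLE_OUTPUT)
peak_used = max(row["memory.used"] for row in samples)
print(f"Peak memory used: {peak_used:.0f} MiB")  # Peak memory used: 12160 MiB
```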
When working with data that contains channel information (like images), optimize the data layout format to prefer channels last (NHWC over NCHW).
Channel-last data formats improve Tensor Core utilization and provide significant performance improvements especially in convolutional models when coupled with AMP. NCHW data layouts can still be operated on by Tensor Cores, but introduce additional overhead due to automatic transpose ops.
You can optimize the data layout to prefer NHWC layouts by setting
data_format="channels_last" for layers such as tf.keras.layers.Conv2D,
tf.keras.layers.Conv3D, and
tf.keras.layers.RandomRotation.
Use tf.keras.backend.set_image_data_format to set the default data layout
format for the Keras backend API.
When working with NVIDIA® GPUs, execute the code snippet below before the training loop to max out the L2 fetch granularity to 128 bytes.
import ctypes
_libcudart = ctypes.CDLL('libcudart.so')
# Set device limit on the current device
# cudaLimitMaxL2FetchGranularity = 0x05
pValue = ctypes.cast((ctypes.c_int*1)(), ctypes.POINTER(ctypes.c_int))
_libcudart.cudaDeviceSetLimit(ctypes.c_int(0x05), ctypes.c_int(128))
_libcudart.cudaDeviceGetLimit(pValue, ctypes.c_int(0x05))
assert pValue.contents.value == 128
The GPU thread mode decides how GPU threads are used.
Set the thread mode to gpu_private to make sure that preprocessing does not
steal all the GPU threads. This will reduce the kernel launch delay during
training. You can also set the number of threads per GPU. Set these values using
environment variables.
import os
os.environ['TF_GPU_THREAD_MODE']='gpu_private'
os.environ['TF_GPU_THREAD_COUNT']='1'
In general, increase the batch size and scale the model to better utilize GPUs and get higher throughput. Note that increasing the batch size will change the model’s accuracy so the model needs to be scaled by tuning hyperparameters like the learning rate to meet the target accuracy.
Also, use tf.config.experimental.set_memory_growth to allow GPU memory to grow
to prevent all the available memory from being fully allocated to ops that
require only a fraction of the memory. This allows other processes which consume
GPU memory to run on the same device.
To learn more, check out the Limiting GPU memory growth guidance in the GPU guide.
Increase the training mini-batch size (number of training samples used per device in one iteration of the training loop) to the maximum amount that fits without an out of memory (OOM) error on the GPU. Increasing the batch size impacts the model's accuracy—so make sure you scale the model by tuning hyperparameters to meet the target accuracy.
Disable reporting OOM errors during tensor allocation in production code.
Set report_tensor_allocations_upon_oom=False in tf.compat.v1.RunOptions.
For models with convolution layers, remove bias addition if using batch normalization. Batch normalization shifts values by their mean and this removes the need to have a constant bias term.
Use TF Stats to find out how efficiently on-device ops run.
Use tf.function to perform computations and, optionally, enable the
jit_compile=True flag (tf.function(jit_compile=True)). To learn more, go
to Use XLA tf.function.
Minimize host Python operations between steps and reduce callbacks. Calculate metrics every few steps instead of at every step.
Keep the device compute units busy.
Send data to multiple devices in parallel.
Consider using 16-bit numerical representations, such as fp16 (the
half-precision floating point format specified by IEEE) or the Brain
floating-point bfloat16 format.
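The earlier recommendation to remove bias addition before batch normalization can be checked numerically. The sketch below uses a simplified normalization without the learned scale and shift parameters, and made-up activation and bias values, to show that a constant bias added before normalization cancels out.

```python
import statistics


def batch_norm(values, eps=1e-5):
    """Normalize values to zero mean and unit variance (no learned scale/shift)."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)
    return [(v - mean) / (variance + eps) ** 0.5 for v in values]


conv_outputs = [0.5, -1.2, 3.0, 0.7]  # Hypothetical pre-bias activations.
bias = 2.5  # Hypothetical constant bias term.

without_bias = batch_norm(conv_outputs)
with_bias = batch_norm([v + bias for v in conv_outputs])

# The bias shifts the mean by the same amount, so it cancels out.
max_diff = max(abs(a - b) for a, b in zip(without_bias, with_bias))
print(max_diff < 1e-9)  # True
```

Since the normalized outputs match, the bias adds parameters and computation without changing the result, and the shift parameter of batch normalization can take over its role.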
TensorFlow 2.2 and 2.3 support multiple GPU profiling for single host systems
only; multiple GPU profiling for multi-host systems is not supported. To profile
multi-worker GPU configurations, each worker has to be profiled independently.
From TensorFlow 2.4, multiple workers can be profiled using the
tf.profiler.experimental.client.trace API.
CUDA® Toolkit 10.2 or later is required to profile multiple GPUs. As TensorFlow
2.2 and 2.3 support CUDA® Toolkit versions only up to 10.1, you need to create
symbolic links to libcudart.so.10.1 and libcupti.so.10.1:
sudo ln -s /usr/local/cuda/lib64/libcudart.so.10.2 /usr/local/cuda/lib64/libcudart.so.10.1
sudo ln -s /usr/local/cuda/extras/CUPTI/lib64/libcupti.so.10.2 /usr/local/cuda/extras/CUPTI/lib64/libcupti.so.10.1