docs/框架设计/软件核心/CyberRT/Cyber RT Performance Analysis Tool.md
As the complexity of autonomous driving systems increases, more and more performance problems surface during development and road testing, and performance optimization has become an indispensable part of the whole process. However, finding, analyzing, locating, and solving performance problems has a certain learning curve and requires extra effort. To address these pain points, the new version of cyber modifies the framework, adds a performance visualization tool, and establishes a performance analysis capability. The cyber performance analysis toolchain has the following features:
Multi-dimensional data collection and comprehensive monitoring with no blind spots:
The cyber performance analysis tool monitors key performance metrics of autonomous driving systems, including:
<table> <thead> <tr> <th>type</th> <th colspan="2">metric</th> <th>system level</th> <th>process level</th> </tr> </thead> <tbody> <tr> <td>CPU</td> <td colspan="2">utilization</td> <td>√</td> <td>√</td> </tr> <tr> <td>GPU</td> <td colspan="2">utilization</td> <td>√</td> <td> </td> </tr> <tr> <td>memory</td> <td colspan="2">usage</td> <td>√</td> <td>√</td> </tr> <tr> <td>gpu memory</td> <td colspan="2">usage</td> <td>√</td> <td>√</td> </tr> <tr> <td rowspan="6">IO</td> <td rowspan="4">block device</td> <td>read/write rate</td> <td>√</td> <td>√</td> </tr> <tr> <td>device wait time</td> <td>√</td> <td> </td> </tr> <tr> <td>device IOPS</td> <td>√</td> <td> </td> </tr> <tr> <td>device wait queue length</td> <td>√</td> <td> </td> </tr> <tr> <td rowspan="2">ethernet</td> <td>read/write rate</td> <td>√</td> <td>√</td> </tr> <tr> <td>bandwidth</td> <td>√</td> <td> </td> </tr> </tbody> </table>

By tracking these key metrics in real time, you can quickly identify the bottleneck processes that affect performance in your autonomous driving system.
Intuitive performance visualization interface:
The cyber performance analysis toolchain provides a user-friendly visualization interface that gives you an intuitive view of all performance data. With clear charts and reports, you can easily interpret analysis results and identify directions for optimization.
Process-level analysis, precise targeting:
The visualization interface provides real-time data that lets you observe how each process is performing. This macro view allows you to determine which process or processes may be slowing down the entire system so you can tune accordingly.
Function-level analysis, deep tuning:
Furthermore, the cyber performance analysis toolchain provides function-level analysis within each module of the autonomous driving system. It can generate flame graphs, directed graphs, etc. for the cpu, memory, and gpu metrics of each process (gpu analysis relies on NVIDIA Nsight tools), so that you can pinpoint the parts of the code that need to be optimized. This granularity of analysis helps developers perform deep optimization.
To use the performance analysis features of cyber, simply deploy the latest version of Apollo; installation is the same as the usual Apollo installation process.
Clone the latest Apollo code from GitHub:
git clone https://github.com/ApolloAuto/apollo.git
Use the script to start and enter a docker container:
bash docker/scripts/dev_start.sh
Compile Apollo:
bash apollo.sh build
Clone the latest Apollo workspace from GitHub:
git clone https://github.com/ApolloAuto/application-core.git
Install aem, then start and enter a docker container:
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://apollo-pkg-beta.cdn.bcebos.com/neo/beta/key/deb.gpg.key | sudo gpg --dearmor -o /etc/apt/keyrings/apolloauto.gpg
sudo chmod a+r /etc/apt/keyrings/apolloauto.gpg
echo \
"deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/apolloauto.gpg] https://apollo-pkg-beta.cdn.bcebos.com/apollo/core"\
$(. /etc/os-release && echo "$VERSION_CODENAME") "main" | \
sudo tee /etc/apt/sources.list.d/apolloauto.list
sudo apt-get update
sudo apt install apollo-neo-env-manager-dev --reinstall
aem start
Install Apollo packages:
buildtool build
cyber records performance metrics for all cyber-based components and executable binaries, and they can be viewed in real time. With cyber's visualization tool, it is easy to find and locate the modules in the autonomous driving system that consume large amounts of hardware resources. Starting the visualization interface is simple:
cyber_performance
Once it is up and running, open localhost:5000 in your browser and you will see a page similar to the one below:
The button in the upper left corner selects the process to be displayed, usually in the format mainboard.xxx.dag, where xxx.dag is the name of the dag that starts the module; for example, the control module is shown as mainboard.control.dag. There are two exceptions: system represents the system-level performance metrics, and bvar.python3 represents the performance metrics of the cyber_performance tool itself.
After selecting a process, several charts named xx - yy - zz or xx - yy are shown, where xx is the metric type, yy is the metric name, and zz exists only for the BLOCK_DEVICE_IO and ETHERNET_DEVICE_IO types and represents the device name.
By using the above charts, you can quickly determine how much cpu, gpu, memory, gpu memory and IO are consumed by different processes in the autonomous driving system, which helps to locate the process bottlenecks.
The visualization interface shows real-time values, and cyber also dumps the performance data to the data directory. If you need to compute mean values, you can parse the dump with the performance_parse.py script in /apollo/scripts:
python3 /apollo/scripts/performance_parse.py -f data/performance_dumps.07-29-2024.json
where:
- -f specifies the performance monitoring file, which is currently generated in the data directory of the workspace directory.
cyber's performance analysis feature can sample cpu and memory usage and generate flame graphs or directed graphs for analysis. The relevant sampling options can be seen in the mainboard help output:
I0409 16:54:22.506175 18008 module_argument.cc:113] []command: mainboard -h
I0409 16:54:22.506390 18008 module_argument.cc:33] []Usage:
mainboard [OPTION]...
Description:
-h, --help: help information
-d, --dag_conf=CONFIG_FILE: module dag config file
-p, --process_group=process_group: the process namespace for running this module, default in manager process
-s, --sched_name=sched_name: sched policy conf for hole process, sched_name should be conf in cyber.pb.conf
--plugin=plugin_description_file_path: the description file of plugin
--disable_plugin_autoload : default enable autoload mode of plugins, use disable_plugin_autoload to ingore autoload
-c, --cpuprofile: enable gperftools cpu profile
-o, --profile_filename=filename: the filename to dump the profile to, default value is ${process_group}_cpu.prof. Only work with -c option
-H, --heapprofile: enable gperftools heap profile
-O, --heapprofile_filename=filename: the filename to dump the profile to, default value is ${process_group}_mem.prof. Only work with -c option
The -c option enables cpu sampling and the -H option enables memory sampling. The module can be started with the following command:
mainboard -d /apollo/modules/planning/planning_component/dag/planning.dag -c -H
By default, ${process_group}_cpu.prof and ${process_group}_mem.prof sampling files will be generated under the current path.
Sampling can also be enabled in the launch file; for example, the launch file of planning:
<cyber>
<module>
<name>planning</name>
<dag_conf>/apollo/modules/planning/planning_component/dag/planning.dag</dag_conf>
<cpuprofile>/apollo/planning_cpu.prof</cpuprofile>
<memprofile>/apollo/planning_mem.prof</memprofile>
<process_name>planning</process_name>
</module>
</cyber>
Add the cpuprofile and memprofile tags; running this launch file will generate the planning_cpu.prof and planning_mem.prof sampling files under the /apollo path.
While cpu sampling generates only one file, memory sampling generates multiple files, each representing the memory state of the process at the time the file was generated.
Due to the performance impact of memory and cpu sampling, it is not recommended to enable both types of sampling at the same time.
CPU: use the following command to visualize the sampling results and generate a flame graph:
buildtool sampling planning_cpu.prof
Memory: generate directed graphs or flame graphs from the memory samples as needed.
Analyzing memory hotspots: find out which code or functions are responsible for large memory allocations.
pprof --svg $(which mainboard) planning_mem.prof.0001.heap > a.svg # generate a directed graph
buildtool sampling planning_mem.prof.0001.heap # generate a flame graph
Directed graphs come with memory figures attached, making it clearer which function allocates how much memory, while flame graphs are more intuitive; developers can choose whichever type of graph suits their habits.
Analyze how much memory was allocated by which function between two points in time; this is often used to troubleshoot memory leaks.
pprof --svg --base planning_mem.prof.0005.heap $(which mainboard) planning_mem.prof.0001.heap > a.svg
cpu
The above is the flame graph of the cyber example communication program. From bottom to top it represents the function call stack, and the width of a frame represents the percentage of CPU time the function occupies within the whole process. For example, ReadMessage in the graph above accounts for 81.82%, so this part occupies the majority of the cpu time slices. The flame graph therefore tells us to optimize the wide (high-percentage) functions near the top of the graph in order to reduce the cpu consumption of the process.
memory
Analyzing memory hotspots
Finding memory hotspot allocations through flame graph or DAG
Both flame graphs and directed graphs can locate memory hotspot allocations. As with cpu, the flame graph represents the function call stack from bottom to top, and the width represents the percentage of memory allocated by the function within the whole process;
while the directed graph looks like this:
The above approach allows you to quickly locate the function where the memory hotspot was allocated and the corresponding call chain.
Memory Leak Analysis:
Memory leak analysis is usually based on the following command, which compares memory allocation diffs at different points in time:
pprof --svg --base planning_mem.prof.0005.heap $(which mainboard) planning_mem.prof.0001.heap > a.svg
The above command generates a DAG of the memory diff at two points in time. If you find that the size of a function's allocation keeps increasing at different points in time, it indicates that there may be a memory leak:
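A hypothetical illustration (not taken from an Apollo module) of the kind of pattern such a diff exposes: an allocation site whose cumulative size keeps growing between two heap snapshots because the buffers are retained and never released.

#include <cstddef>
#include <cstdint>
#include <vector>

// Buffers allocated every frame and kept forever: in the pprof diff between two
// heap snapshots, this "new uint8_t[bytes]" allocation site keeps growing.
static std::vector<uint8_t*> g_frame_cache;

void OnNewFrame(size_t bytes) {
  uint8_t* buf = new uint8_t[bytes];  // allocated on every cycle
  g_frame_cache.push_back(buf);       // retained, never freed -> steady growth in the diff
}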
GPU analysis needs to be based on the NVIDIA Nsight Systems tools:
Similar to gperftools, nsys can have a serious impact on process performance, so it should only be used in the R&D and analysis phases and should not be used during road testing.
Launch the tool to sample the process (taking perception as an example):
sudo -E nsys profile --gpu-metrics-device all --cuda-memory-usage true --delay=15 --duration=30 -f true -o perception --trace=cuda,cudnn,cublas,osrt,nvtx mainboard -d /apollo/modules/perception/msg_adapter/dag/msg_adapter.dag -d /apollo/modules/perception/pointcloud_preprocess/dag/pointcloud_preprocess.dag -d /apollo/modules/perception/pointcloud_map_based_roi/dag/pointcloud_map_based_roi.dag -d /apollo/modules/perception/pointcloud_ground_detection/dag/pointcloud_ground_detection.dag -d /apollo/modules/perception/lidar_detection/dag/lidar_detection.dag -d /apollo/modules/perception/lidar_detection_filter/dag/lidar_detection_filter.dag -d /apollo/modules/perception/lidar_tracking/dag/lidar_tracking.dag
Note: sampling requires sudo privileges
Open the nsys UI:
nsys-ui
Open the file generated by the sampling
When it opens it will have the following columns:
Among them:
A brief introduction to the logic of executing a cuda kernel function: when the application calls a kernel function, the hardware does not execute it immediately; instead, the launch is pushed into a queue and executed once the gpu is able to handle it.
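A minimal sketch of this launch-queue behavior (illustrative code, not part of cyber): the kernel launch call returns immediately, and the kernel only runs when the gpu gets to it, which is why the launch API call and the actual execution appear at different places on the nsys timeline.

#include <cstdio>
#include <cuda_runtime.h>

// A kernel that spins long enough for the asynchrony to be observable.
__global__ void BusyKernel() {
  for (volatile int i = 0; i < 100000000; ++i) {
  }
}

int main() {
  BusyKernel<<<1, 1>>>();   // only enqueues the kernel; returns right away
  printf("launch returned, kernel may not have started yet\n");
  cudaDeviceSynchronize();  // blocks until the queued kernel has finished
  printf("kernel finished\n");
  return 0;
}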
The horizontal axis is the timeline of the process run
When expanded:
The timeline can be divided into two parts. In the second part (green box) the gpu behavior repeats constantly, which indicates model inference; in the first part (blue box) the model is being initialized. Performance analysis mainly focuses on the model inference part.
Zooming in on a particular inference:
As mentioned earlier, the iGPU row contains several gpu metrics, of which we mostly care about SM occupancy (corresponding to the red box); in addition, the CUDA HW row records the kernel functions executed on the hardware along the timeline; and the green box corresponds to the cuda API call used to launch this kernel function.
With the SM Warp Occupancy row, when SM occupancy is high (in the figure above it reaches 95%), you can quickly locate the corresponding executing kernel function and obtain detailed information about it:
and the call chain at the launch of this kernel function:
With the above information, we can quickly locate the bottleneck for kernel functions we wrote ourselves; for model inference the situation is more complicated, because each kernel function corresponds to an operator and it is hard to match it precisely to a specific layer of the model.
For the model inference case, if you are using tensorRT, you can insert NVTX ranges around the layers you care about:
#include <nvToolsExt.h>
// start a new NVTX range; it will appear as a named span on the nsys timeline
nvtxRangePush("LayerName");
// ... code to be measured ...
// end the NVTX range
nvtxRangePop();
paddle: paddle's built-in profiler can also show the execution of each operator in each layer; for details refer to: https://www.paddlepaddle.org.cn/documentation/docs/zh/2.3/guides/performance_improving/profiling_model.html
For gpu memory analysis, the nsight system has a similar output:
Similarly, with the above information, the bottleneck can be quickly located for allocation functions such as cudaMalloc that we call ourselves; for model inference, it is necessary to determine which layer an allocation belongs to in a similar way, e.g. with NVTX.
The main idea of optimization is to target the wide (high-percentage) functions near the top of the flame graph. In general, the following points can be considered:
Can this type of function be removed:
This type of function involves complex calculation:
Consider the possibility of using algorithms with lower time complexity (a hash-lookup sketch follows this list)
Consider whether operations can be parallelized, for example with SIMD instructions (the ARM NEON snippet below processes 16 pixels per iteration when splitting the Y and UV components of an interleaved image):
for (int j = 0; j < height_; j += 2) {
for (int i = 0; i < width_; i += 16) {
// Load 16 pixels into a two-channel register: channel 1 holds the y component, channel 2 holds the uv component (32 bytes)
uint8x16x2_t row0 = vld2q_u8(usrc + j * width_ * 2 + i * 2);
uint8x16x2_t row1 = vld2q_u8(usrc + (j + 1) * width_ * 2 + i * 2);
// Extract the y component
uint8x16_t y0 = row0.val[0];
uint8x16_t y1 = row1.val[0];
// Store the y component into the output array
vst1q_u8(yPlane + yIndex, y0);
vst1q_u8(yPlane + yIndex + width_, y1);
yIndex += 16;
// Extract the uv component
uint8x16_t u0 = row0.val[1];
uint8x16_t v0 = row1.val[1];
// Average the uv components of the two rows (to reduce the data loss caused by downsampling)
uint16x8_t u16_low = vaddl_u8(vget_low_u8(u0), vget_low_u8(v0));
uint16x8_t u16_high = vaddl_u8(vget_high_u8(u0), vget_high_u8(v0));
uint8x8_t u8_avg_low = vshrn_n_u16(u16_low, 1);
uint8x8_t u8_avg_high = vshrn_n_u16(u16_high, 1);
uint16x8_t v16_low = vaddl_u8(vget_low_u8(row0.val[1]), vget_low_u8(row1.val[1]));
uint16x8_t v16_high = vaddl_u8(vget_high_u8(row0.val[1]), vget_high_u8(row1.val[1]));
uint8x8_t v8_avg_low = vshrn_n_u16(v16_low, 1);
uint8x8_t v8_avg_high = vshrn_n_u16(v16_high, 1);
uint8x16_t u_avg = vcombine_u8(u8_avg_low, u8_avg_high);
uint8x16_t v_avg = vcombine_u8(v8_avg_low, v8_avg_high);
// Store the uv components into the output arrays
vst1_u8(uPlane + uIndex, u8_avg_low);
vst1_u8(vPlane + vIndex, v8_avg_low);
uIndex += 8;
vIndex += 8;
}
yIndex += width_; // Skip to the next line
}
Consider whether operations can be offloaded to non-cpu hardware: for example, the nvjpeg chip on the orin is specialized for tasks such as image compression.
Along the same lines as cpu, we need to see which functions can be optimized. Flame graphs and directed graphs show both the approximate percentage and the exact amount of memory allocated by each function; taking the prediction process as an example:
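As a generic, hypothetical illustration of the lower-time-complexity point above (the function names are made up, not from an Apollo module): replacing a nested-loop membership test with a hash-set lookup turns an O(n*m) hotspot into O(n + m), which directly shrinks the corresponding frame in the flame graph.

#include <unordered_set>
#include <vector>

// O(n * m): every query scans the whole id list.
bool ContainsSlow(const std::vector<int>& ids, const std::vector<int>& queries) {
  for (int q : queries) {
    for (int id : ids) {
      if (id == q) return true;
    }
  }
  return false;
}

// O(n + m): build a hash set once, then each query is an O(1) lookup.
bool ContainsFast(const std::vector<int>& ids, const std::vector<int>& queries) {
  std::unordered_set<int> id_set(ids.begin(), ids.end());
  for (int q : queries) {
    if (id_set.count(q) > 0) return true;
  }
  return false;
}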
Generally speaking, you can consider the following points:
The gpu optimization is divided into latency optimization and utilization optimization, which correspond to different objectives:
Use of more efficient models/algorithms with lower time complexity
Phenomenon: blank space on timeline
Optimization approach: determine what the cpu is doing during this blank time and reduce it (refer to the cpu optimization approaches above)
Optimizes latency; little effect on utilization
Optimization approach:
unified memory code sample:
#include <cuda_runtime.h>
#include <cstdlib>
size_t nums = 1024;  // size of the buffer to share between cpu and gpu (example value)
char* ptr = (char*) malloc(nums * sizeof(char));
char* gpu_ptr = nullptr;
// register the host memory as pinned, mapped memory and obtain a device-side pointer to it
cudaError_t err = cudaHostRegister(ptr, nums * sizeof(char), cudaHostRegisterMapped);
err = cudaHostGetDevicePointer(&gpu_ptr, ptr, 0);
Optimized for both latency and utilization
Optimization approach:
kernel fusion: fuse small kernels into a larger kernel to reduce the number of launches (a sketch follows the cudaGraph sample below)
cudaGraph: combine multiple kernels into a cudaGraph:
cudaGraph sample:
#include <cuda_runtime.h>
__global__ void myKernel(int *data, int N) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < N) {
data[idx] += 1;
}
}
int main() {
const int N = 1024;
int *d_data;
cudaStream_t stream;
cudaGraph_t graph;
cudaGraphExec_t graphExec;
// allocate memory
cudaMalloc(&d_data, N * sizeof(int));
// create cuda stream
cudaStreamCreate(&stream);
// record cuda graph
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
dim3 blockSize(256);
dim3 gridSize((N + blockSize.x - 1) / blockSize.x);
myKernel<<<gridSize, blockSize, 0, stream>>>(d_data, N);
// end of recording cuda graph
cudaStreamEndCapture(stream, &graph);
// start a graph
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
cudaGraphLaunch(graphExec, stream);
// sync cuda stream
cudaStreamSynchronize(stream);
// clean resources
cudaGraphExecDestroy(graphExec);
cudaGraphDestroy(graph);
cudaStreamDestroy(stream);
cudaFree(d_data);
return 0;
}
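For the kernel fusion item above, a minimal sketch with illustrative kernels (not Apollo code): two element-wise kernels launched back to back are replaced by one fused kernel, halving the number of launches and keeping the intermediate value in a register instead of global memory.

#include <cuda_runtime.h>

// Unfused version: two launches, the intermediate result is written to and re-read from global memory.
__global__ void AddOne(int *data, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) data[idx] += 1;
}
__global__ void MulTwo(int *data, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) data[idx] *= 2;
}

// Fused version: one launch computes (x + 1) * 2 while the value stays in a register.
__global__ void AddOneMulTwo(int *data, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) data[idx] = (data[idx] + 1) * 2;
}

int main() {
  const int N = 1024;
  int *d_data;
  cudaMalloc(&d_data, N * sizeof(int));
  dim3 block(256), grid((N + block.x - 1) / block.x);
  // Before fusion: two kernel launches (in real code you would use one variant or the other).
  AddOne<<<grid, block>>>(d_data, N);
  MulTwo<<<grid, block>>>(d_data, N);
  // After fusion: a single launch producing the same transformation.
  AddOneMulTwo<<<grid, block>>>(d_data, N);
  cudaDeviceSynchronize();
  cudaFree(d_data);
  return 0;
}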
Optimized for latency
single stream:
multiple stream:
sample:
#include <cuda_runtime.h>
__global__ void myKernel(int *data, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
data[idx] = data[idx] + 1;
}
}
int main() {
const int N = 1024 * 1024;
const int numStreams = 4;
const int streamSize = N / numStreams;
const int blockSize = 256;
const int gridSize = streamSize / blockSize;
int *d_data;
cudaStream_t streams[numStreams];
cudaMalloc(&d_data, N * sizeof(int));
for (int i = 0; i < numStreams; ++i) {
cudaStreamCreate(&streams[i]);
}
for (int i = 0; i < numStreams; ++i) {
int offset = i * streamSize;
myKernel<<<gridSize, blockSize, 0, streams[i]>>>(d_data + offset, streamSize);
}
for (int i = 0; i < numStreams; ++i) {
cudaStreamSynchronize(streams[i]);
}
for (int i = 0; i < numStreams; ++i) {
cudaStreamDestroy(streams[i]);
}
cudaFree(d_data);
return 0;
}
Optimized for latency, while the utilization will be significantly increased
To reduce the latency of a kernel, you need to know the kernel's SOL (Speed Of Light) metrics, which can be sampled using nsight compute:
Two general scenarios may occur:
Optimization approach:
Calculation is limited:
if (threadIdx.x < 16) {
// do something A
} else {
// do something B
}
The first half of the warp (threads 0-15) performs operation A and the second half (threads 16-31) performs operation B. The GPU first executes the threads that perform operation A and then the threads that perform operation B, so the warp runs at only half of its maximum efficiency, which limits computation.
Therefore, it is important to ensure that all threads within a warp execute the same instructions as much as possible; reducing the number of intra-warp branches improves the SOL of the SM.
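A hedged sketch of that advice (hypothetical kernels, not from the Apollo codebase), applicable when the mapping of work to threads is flexible: if the branch condition is aligned with warp boundaries, every thread in a warp follows the same path and the divergence disappears.

#include <cuda_runtime.h>

// Divergent: within one 32-thread warp, threads 0-15 and 16-31 take different paths.
__global__ void Divergent(float *data) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (threadIdx.x % 32 < 16) {
    data[idx] *= 2.0f;   // operation A
  } else {
    data[idx] += 1.0f;   // operation B
  }
}

// Warp-uniform: branch on the warp index instead, so all 32 threads of a warp
// take the same path and no serialization occurs inside the warp.
__global__ void WarpUniform(float *data) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int warp_id = threadIdx.x / 32;
  if (warp_id % 2 == 0) {
    data[idx] *= 2.0f;   // operation A, executed by whole warps
  } else {
    data[idx] += 1.0f;   // operation B, executed by whole warps
  }
}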