docs/en/setup/backend/backend-ebpf-profiling.md
eBPF Profiling uses eBPF technology to monitor applications without requiring any modification to the applications themselves. It corresponds to Out-Process Profiling.
To use eBPF Profiling, the SkyWalking Rover application (eBPF Agent) needs to be installed on the host machine. When the agent receives a Profiling task, it starts that task against the specified application and analyzes performance bottlenecks according to the Profiling type.
Learn more about eBPF profiling in the following blogs:
OAP and the agent use a brand-new protocol to exchange eBPF Profiling data, so it is necessary to start OAP with the following configuration:

    receiver-ebpf:
      selector: ${SW_RECEIVER_EBPF:default}
      default:

eBPF Profiling leverages eBPF technology to provide support for the following types of tasks:
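- On CPU Profiling: samples the thread stacks of the target program using `PERF_COUNT_SW_CPU_CLOCK`.
- Off CPU Profiling: collects thread stacks when the kernel function `finish_task_switch` is executed.
- Network Profiling: analyzes the network requests related to a process.

On CPU Profiling periodically samples the thread stacks of the target program while it is executing on the CPU and aggregates the thread stacks to create a flame graph. The flame graph helps users identify performance bottlenecks.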
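The aggregation step can be illustrated with a minimal sketch. It is not Rover's or OAP's implementation: the sample data and the function names `fold_stacks` and `to_folded_lines` are hypothetical, and the output uses the "folded stack" text form that common flame graph tools accept.

```python
from collections import Counter


def fold_stacks(samples):
    """Count identical stacks. Each sample is a list of frames, root frame first."""
    return Counter(";".join(frames) for frames in samples)


def to_folded_lines(folded):
    """Render each distinct stack as 'frame;frame;... count', sorted by stack."""
    return [f"{stack} {count}" for stack, count in sorted(folded.items())]


# Hypothetical samples, as if taken at a fixed interval while the program was on CPU.
samples = [
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "write_response"],
    ["main", "gc"],
]

folded = fold_stacks(samples)
lines = to_folded_lines(folded)
for line in lines:
    print(line)
```

A stack that appears in more samples gets a wider bar in the flame graph, which is why counting identical stacks is enough to find the hot paths.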
When creating an On CPU Profiling task, you need to specify which eligible processes need to be sampled. The required configuration information is as follows:
The eBPF agent periodically asks the OAP whether any tasks match the processes that the agent has collected. When the eBPF agent receives a task, it starts the profiling task for the matching process.
Once the eBPF agent starts a profiling task for a specific process, it periodically collects data and reports it to the OAP. At this point, a schedule for the task is generated. The schedule data contains the following information:
Once the schedule is created, the schedule ID and a time range can be used to query the CPU execution of the specified process within that time period. The query contains the following fields:
After the query, the following data is returned. With the data, it's easy to generate a flame graph:
Stack types are `KERNEL_SPACE` and `USER_SPACE`, which represent kernel mode and user mode, respectively.

Off CPU Profiling analyzes the thread state when a thread switch occurs in the current process, and so determines the performance loss caused by blocking on I/O, locks, timers, paging/swapping, and other reasons. The execution flow between the eBPF agent and OAP in Off CPU Profiling is the same as in On CPU Profiling, but the data content being analyzed is different.
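As a rough illustration of what off-CPU analysis measures, the sketch below pairs each switch-out event of a thread with its next switch-in and attributes the gap to the stack that was blocked. The event tuple layout and the name `aggregate_off_cpu` are assumptions made for this example, not the format the eBPF agent reports.

```python
from collections import defaultdict


def aggregate_off_cpu(events):
    """Aggregate off-CPU time per stack.

    events: iterable of (timestamp_ns, tid, kind, stack), where kind is
    'out' (thread left the CPU) or 'in' (thread is running again).
    """
    pending = {}  # tid -> (switch-out timestamp, stack at switch-out)
    totals = defaultdict(lambda: {"count": 0, "duration_ns": 0})
    for ts, tid, kind, stack in sorted(events, key=lambda e: e[0]):
        if kind == "out":
            pending[tid] = (ts, stack)
        elif kind == "in" and tid in pending:
            start, blocked_stack = pending.pop(tid)
            totals[blocked_stack]["count"] += 1
            totals[blocked_stack]["duration_ns"] += ts - start
    return dict(totals)


# Hypothetical events for two threads.
events = [
    (100, 1, "out", "main;read_file"),
    (150, 2, "out", "main;lock"),
    (400, 1, "in", None),
    (450, 2, "in", None),
    (500, 1, "out", "main;read_file"),
    (900, 1, "in", None),
]

result = aggregate_off_cpu(events)
print(result)
```

Keeping both a count and a total duration per stack lets the same data answer two different questions: which stacks are switched out most often, and which stacks spend the most time waiting.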
The process of creating an Off CPU Profiling task is the same as creating an On CPU Profiling task, with the only difference being that the Profiling task type is changed to Off CPU Profiling. For specific parameters, please refer to the previous section.
When the eBPF agent receives an Off CPU Profiling task, it also collects data and generates a schedule. When analyzing data, unlike On CPU Profiling, Off CPU Profiling can generate different flame graphs based on the following two aggregation methods:
Network Profiling analyzes and monitors the network requests related to a process and, based on the data, generates topology diagrams, metrics, and other information. Furthermore, it can be integrated with existing Tracing systems to enhance the data content.
Unlike On/Off CPU Profiling, Network Profiling requires specifying the instance entity information when creating a task. For example, in a Service Mesh, there may be multiple processes under a single instance (Pod), such as an application and Envoy. In network analysis, they usually work together, so analyzing them together gives a better understanding of the network execution of the Pod. The following parameters are needed:
Sampling defines how the system samples raw data and combines it with the existing Tracing system, so that you can see the complete network data corresponding to a Span. Currently, it supports sampling raw data for Spans that use HTTP/1.x as RPC and parsing the SkyWalking and Zipkin protocols. The sampling configuration is as follows:
After starting the task, the following data can be analyzed:
The topology can generate two types of data:
For external nodes, eBPF can only collect the remote IP and port during data collection, so OAP can use Kubernetes cluster information to recognize the corresponding Service or Pod names.
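The name resolution idea can be sketched as a lookup from a remote address to a known Kubernetes entity. This is only an illustration under assumptions: the lookup tables, their contents, and the function `resolve_remote` are hypothetical, and the sketch does not show how OAP obtains or caches cluster information.

```python
def resolve_remote(ip, port, services_by_addr, pods_by_ip):
    """Map a remote (ip, port) to a ('service' | 'pod' | 'unknown', name) pair."""
    if (ip, port) in services_by_addr:
        return ("service", services_by_addr[(ip, port)])
    if ip in pods_by_ip:
        return ("pod", pods_by_ip[ip])
    # Nothing known about this address: fall back to the raw address.
    return ("unknown", f"{ip}:{port}")


# Hypothetical cluster data: Service cluster IPs with ports, and Pod IPs.
services_by_addr = {("10.96.0.10", 53): "kube-dns.kube-system"}
pods_by_ip = {"10.244.1.7": "checkout-7d9f.default"}

svc = resolve_remote("10.96.0.10", 53, services_by_addr, pods_by_ip)
pod = resolve_remote("10.244.1.7", 8080, services_by_addr, pods_by_ip)
ext = resolve_remote("93.184.216.34", 443, services_by_addr, pods_by_ip)
print(svc, pod, ext)
```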
Between two nodes, data flow direction can be detected, and the following types of data protocols can be identified:
OpenSSL.

In the TCP metrics, each metric includes both client-side and server-side data. The metrics are as follows:
| Name | Unit | Description |
|---|---|---|
| Write CPM | Count | Number of write requests initiated per minute |
| Write Total Bytes | B | Total data size written per minute |
| Write Avg Execute Time | ns | Average execution time for each write operation |
| Write RTT | ns | Round Trip Time (RTT) |
| Read CPM | Count | Number of read requests per minute |
| Read Total Bytes | B | Total data size read per minute |
| Read Avg Execute Time | ns | Average execution time for each read operation |
| Connect CPM | Count | Number of new connections established per minute |
| Connect Execute Time | ns | Time taken to establish a connection |
| Close CPM | Count | Number of connections closed per minute |
| Close Execute Time | ns | Time taken to close a connection |
| Retransmit CPM | Count | Number of data retransmissions per minute |
| Drop CPM | Count | Number of dropped packets per minute |
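To make the relationship between the write metrics concrete, the sketch below derives Write CPM, Write Total Bytes, and Write Avg Execute Time from the write events observed in one minute. The event layout and the name `write_metrics` are assumptions for illustration and do not reflect how the agent or OAP stores the data.

```python
def write_metrics(events):
    """Summarize write events observed within a single one-minute window.

    events: list of (bytes_written, execute_time_ns), one entry per write call.
    """
    count = len(events)
    total_bytes = sum(size for size, _ in events)
    total_time_ns = sum(elapsed for _, elapsed in events)
    return {
        "write_cpm": count,
        "write_total_bytes": total_bytes,
        # Avoid dividing by zero when no write happened in the window.
        "write_avg_execute_time_ns": total_time_ns / count if count else 0.0,
    }


# Hypothetical events: two writes in the same minute.
metrics = write_metrics([(100, 2000), (300, 4000)])
empty = write_metrics([])
print(metrics)
```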
If there is HTTP/1.x protocol communication between two nodes, the eBPF agent can recognize the request data and parse the following metric information:
| Name | Unit | Description |
|---|---|---|
| Request CPM | Count | Number of requests received per minute |
| Response Status CPM | Count | Number of occurrences of each response status code per minute |
| Request Package Size | B | Average request package data size |
| Response Package Size | B | Average response package data size |
| Client Duration | ns | Time taken for the client to receive a response |
| Server Duration | ns | Time taken for the server to send a response |
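As an example of how a per-status-code count could be derived, the sketch below reads the status line of raw HTTP/1.x responses and counts each status code. It is a simplified illustration: the function `status_code` and the sample responses are hypothetical, and real traffic would also need handling for partial reads and malformed data.

```python
from collections import Counter


def status_code(raw_response: bytes) -> int:
    """Extract the status code from the first line of an HTTP/1.x response."""
    status_line = raw_response.split(b"\r\n", 1)[0].decode("ascii")
    version, code, *_ = status_line.split(" ", 2)
    if not version.startswith("HTTP/1."):
        raise ValueError(f"not an HTTP/1.x response: {status_line!r}")
    return int(code)


# Hypothetical responses captured within one minute.
responses = [
    b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n",
    b"HTTP/1.1 404 Not Found\r\n\r\n",
    b"HTTP/1.0 200 OK\r\n\r\n",
]

counts = Counter(status_code(r) for r in responses)
print(counts)
```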
If two nodes communicate using the HTTP/1.x protocol and they employ a distributed tracing system, then the eBPF agent can collect raw data according to the sampling rules configured in the previous sections.
When the sampling conditions are met, the original request or response data is collected, including the following fields:
When sampling rules are applied, the related Syscall invocations for the request or response are also collected, including the following information:
read, write, readv, writev, etc.