.. SPDX-License-Identifier: GPL-2.0

The devlink health mechanism is targeted for Real Time Alerting, in
order to know when something bad happened to a PCI device.

The main idea is to unify and centralize driver health reports in the
generic devlink instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The devlink health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by devlink.
A device driver can provide specific callbacks for each "health reporter",
such as recovery, diagnose and dump procedures.

Different parts of the driver can register different types of health reporters
with different handlers.

Once an error is reported, devlink health will perform the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    auto-dump is set and there is no other dump which is already stored)
  * An auto recovery attempt is made, depending on the auto-recovery
    configuration and on the grace period versus the time passed since the
    last recovery

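
The sequence of actions above can be sketched as a small userspace model.
This is an illustration only, not the kernel implementation:
``struct model_reporter``, ``model_report()``, the ``demo_*`` callbacks and the
millisecond timestamps are hypothetical names chosen for this sketch. It
assumes that recovery is skipped while less than the grace period has passed
since the last recovery; the real devlink code may order or gate these steps
differently.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model types; these are NOT the kernel's structures. */
enum model_result {
	MODEL_RECOVERED,
	MODEL_RECOVERY_SKIPPED,
	MODEL_RECOVERY_FAILED,
};

struct model_reporter {
	const char *name;
	bool auto_dump;                 /* take a dump automatically on error */
	bool auto_recover;              /* attempt recovery automatically */
	unsigned long grace_period_ms;  /* minimum gap between auto recoveries */
	unsigned long last_recovery_ms; /* time of the last successful recovery */
	unsigned int error_count;
	unsigned int recover_count;
	bool healthy;
	bool dump_stored;
	int (*dump)(struct model_reporter *r);    /* driver callback, may be NULL */
	int (*recover)(struct model_reporter *r); /* driver callback, may be NULL */
};

/* Example driver callbacks that always succeed. */
static int demo_dump(struct model_reporter *r) { (void)r; return 0; }
static int demo_recover(struct model_reporter *r) { (void)r; return 0; }

static enum model_result model_report(struct model_reporter *r,
				      const char *msg, unsigned long now_ms)
{
	/* 1. Log the event (stderr stands in for the trace buffer). */
	fprintf(stderr, "health report: reporter=%s msg=%s\n", r->name, msg);

	/* 2. Update health status and statistics. */
	r->error_count++;
	r->healthy = false;

	/* 3. Store a dump only if auto-dump is set and none is stored yet. */
	if (r->auto_dump && !r->dump_stored && r->dump && r->dump(r) == 0)
		r->dump_stored = true;

	/* 4. Auto recovery, gated by configuration and grace period. */
	if (!r->auto_recover || !r->recover)
		return MODEL_RECOVERY_SKIPPED;
	if (r->recover_count > 0 &&
	    now_ms - r->last_recovery_ms < r->grace_period_ms)
		return MODEL_RECOVERY_SKIPPED;
	if (r->recover(r) != 0)
		return MODEL_RECOVERY_FAILED;

	r->recover_count++;
	r->last_recovery_ms = now_ms;
	r->healthy = true;
	return MODEL_RECOVERED;
}
```

With a 1000 ms grace period, a first report at 5000 ms is recovered, a second
report at 5500 ms leaves the reporter unhealthy because it falls inside the
grace period, and a report at 7000 ms is recovered again.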

To handle devlink health diagnose and health dump requests, devlink creates a
formatted message structure ``devlink_fmsg`` and sends it to the driver's
callback to fill the data in using the devlink fmsg API.

Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in
json-like format. The API allows the driver to add nested attributes such as
object, object pair and value array, in addition to attributes such as name and
value.

Driver should use this API to fill the fmsg context in a format which will be
translated by the devlink to the netlink message later. When devlink needs to
send the data using SKBs to the netlink layer, it fragments the data between
different SKBs. In order to do this fragmentation, it uses virtual nest
attributes, avoiding actual nesting, which cannot be divided between
different SKBs.
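
To make the json-like nesting concrete, the following is a toy userspace
builder. It only mimics the shape of such an API (object nest, name/value
pair, value array); every ``toy_*`` name and the fixed-size buffer are
assumptions made for this sketch, and it does not model the netlink encoding
or the SKB fragmentation described above.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical toy message; NOT the kernel's struct devlink_fmsg. */
struct toy_fmsg {
	char buf[256];
	size_t len;
	bool need_comma; /* true when the next item must be preceded by ',' */
};

/* Append text; text that does not fit is dropped (toy behaviour). */
static void toy_emit(struct toy_fmsg *m, const char *text)
{
	size_t n = strlen(text);

	if (m->len + n >= sizeof(m->buf))
		return;
	memcpy(m->buf + m->len, text, n + 1);
	m->len += n;
}

static void toy_open(struct toy_fmsg *m, const char *bracket)
{
	if (m->need_comma)
		toy_emit(m, ",");
	toy_emit(m, bracket);
	m->need_comma = false;
}

static void toy_close(struct toy_fmsg *m, const char *bracket)
{
	toy_emit(m, bracket);
	m->need_comma = true;
}

/* Object nest: { ... } */
static void toy_obj_nest_start(struct toy_fmsg *m) { toy_open(m, "{"); }
static void toy_obj_nest_end(struct toy_fmsg *m) { toy_close(m, "}"); }

/* Pair nest: "name": <value or nest> */
static void toy_pair_nest_start(struct toy_fmsg *m, const char *name)
{
	if (m->need_comma)
		toy_emit(m, ",");
	toy_emit(m, "\"");
	toy_emit(m, name);
	toy_emit(m, "\":");
	m->need_comma = false;
}

static void toy_pair_nest_end(struct toy_fmsg *m) { m->need_comma = true; }

/* Array pair nest: "name": [ ... ] */
static void toy_arr_pair_nest_start(struct toy_fmsg *m, const char *name)
{
	toy_pair_nest_start(m, name);
	toy_open(m, "[");
}

static void toy_arr_pair_nest_end(struct toy_fmsg *m)
{
	toy_close(m, "]");
	toy_pair_nest_end(m);
}

/* Plain value, usable inside a pair or an array. */
static void toy_u32_put(struct toy_fmsg *m, unsigned int value)
{
	char tmp[16];

	if (m->need_comma)
		toy_emit(m, ",");
	snprintf(tmp, sizeof(tmp), "%u", value);
	toy_emit(m, tmp);
	m->need_comma = true;
}

/* Convenience helper: "name": value */
static void toy_u32_pair_put(struct toy_fmsg *m, const char *name,
			     unsigned int value)
{
	toy_pair_nest_start(m, name);
	toy_u32_put(m, value);
	toy_pair_nest_end(m);
}
```

Starting from a zero-initialized ``struct toy_fmsg``, opening an object,
putting the pair ``rq`` = 3, adding an array pair ``cqe`` holding 1 and 2, and
closing the object leaves ``{"rq":3,"cqe":[1,2]}`` in ``buf``.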
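
User can access/change each reporter's parameters and driver specific callbacks
via devlink, e.g. per error type (per health reporter), to configure the
reporter's parameters, invoke recovery, run diagnostics or retrieve an object
dump.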
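
.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per device and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers the reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves the current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump for the specified reporter.

The following diagram provides a general overview of devlink-health::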

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                               devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+