Message-Id: <1546266733-9512-1-git-send-email-eranbe@mellanox.com>
Date: Mon, 31 Dec 2018 16:31:54 +0200
From: Eran Ben Elisha <eranbe@...lanox.com>
To: netdev@...r.kernel.org, "David S. Miller" <davem@...emloft.net>,
Jiri Pirko <jiri@...lanox.com>
Cc: Moshe Shemesh <moshe@...lanox.com>, Aya Levin <ayal@...lanox.com>,
Eran Ben Elisha <eranbe@...lanox.com>,
Tal Alon <talal@...lanox.com>,
Ariel Almog <ariela@...lanox.com>
Subject: [PATCH RFC net-next 00/19] Devlink health reporting and recovery system
The health mechanism is targeted at real-time alerting, in order to know when
something bad has happened to a PCI device. Its goals are:
- Provide alert debug information
- Self healing
- If a problem needs vendor support, provide a way to gather all needed
debugging information.
The main idea is to unify and centralize driver health reports in the
generic devlink instance and allow the user to set different
attributes of the health reporting and recovery procedures.
The devlink health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, FW error, RX/TX error)
or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All health report handling is done by devlink.
A device driver can provide specific callbacks for each "health reporter", e.g.
- Recovery procedures
- Diagnostics and object dump procedures
- OOB initial attributes
Different parts of the driver can register different types of health reporters
with different handlers.
Once an error is reported, devlink health performs the following actions:
* A log is sent to the kernel trace events buffer
* Health status and statistics are updated for the reporter instance
* An object dump is taken and saved at the reporter instance (as long as
no other objdump is already stored)
* An auto-recovery attempt is made, depending on:
- Auto-recovery configuration
- Grace period vs. time passed since the last recovery
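The report flow listed above can be modelled in user space as a rough sketch. It is not the code from this patchset: every name here (struct reporter, health_report(), the demo_* callbacks, HEALTH_RECOVERY_SKIPPED) is hypothetical, the trace event is replaced by a print to stderr, and the grace-period rule (skip auto recovery while less than grace_period_ms has passed since the last recovery) is an assumption about how the two conditions combine.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical user-space model; names are NOT taken from the patchset. */
struct reporter;

struct reporter_ops {
	const char *name;
	int (*recover)(struct reporter *r);                   /* optional */
	int (*objdump)(struct reporter *r, char *buf, size_t len); /* optional */
};

struct reporter {
	const struct reporter_ops *ops;
	bool auto_recover;            /* auto-recovery configuration */
	uint64_t grace_period_ms;     /* minimum gap between auto recoveries */
	uint64_t last_recovery_ms;
	unsigned int error_count;
	unsigned int recover_count;
	bool healthy;
	bool has_objdump;
	char objdump[64];
};

enum { HEALTH_RECOVERY_SKIPPED = 1 };

/* Returns 0 if recovered, HEALTH_RECOVERY_SKIPPED if no attempt was made,
 * or the (negative) error returned by the recover callback. */
int health_report(struct reporter *r, const char *msg, uint64_t now_ms)
{
	int err;

	/* 1. Log the event (stand-in for the kernel trace events buffer). */
	fprintf(stderr, "health report: %s: %s\n", r->ops->name, msg);

	/* 2. Update health status and statistics. */
	r->error_count++;
	r->healthy = false;

	/* 3. Take an object dump only if none is stored already. */
	if (!r->has_objdump && r->ops->objdump &&
	    r->ops->objdump(r, r->objdump, sizeof(r->objdump)) == 0)
		r->has_objdump = true;

	/* 4. Auto recovery: depends on configuration and grace period. */
	if (!r->auto_recover || !r->ops->recover)
		return HEALTH_RECOVERY_SKIPPED;
	if (r->recover_count > 0 &&
	    now_ms - r->last_recovery_ms < r->grace_period_ms)
		return HEALTH_RECOVERY_SKIPPED;

	err = r->ops->recover(r);
	if (err)
		return err;

	r->recover_count++;
	r->last_recovery_ms = now_ms;
	r->healthy = true;
	return 0;
}

/* Demo callbacks standing in for driver-specific handlers. */
static int demo_recover(struct reporter *r)
{
	(void)r;
	return 0; /* pretend the recovery always succeeds */
}

static int demo_objdump(struct reporter *r, char *buf, size_t len)
{
	snprintf(buf, len, "dump taken at error #%u", r->error_count);
	return 0;
}

static const struct reporter_ops demo_tx_ops = {
	.name = "tx",
	.recover = demo_recover,
	.objdump = demo_objdump,
};
```

With a reporter initialised as { .ops = &demo_tx_ops, .auto_recover = true, .grace_period_ms = 500 }, a report at t=1000 recovers, a second report at t=1200 is only logged and counted, and a report at t=1600 recovers again. The first objdump stays stored throughout.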
The user interface:
The user can access/change each reporter's attributes and invoke driver-specific
callbacks via devlink, per error type (per health reporter), e.g.
- Configure the reporter's generic attributes (e.g. disable/enable auto recovery)
- Invoke recovery procedure
- Run diagnostics
- Object dump
The devlink health interface (via netlink):
DEVLINK_CMD_HEALTH_REPORTER_GET
Retrieves status and configuration info per DEV and reporter.
DEVLINK_CMD_HEALTH_REPORTER_SET
Allows reporter-related configuration setting.
DEVLINK_CMD_HEALTH_REPORTER_RECOVER
Triggers a reporter's recovery procedure.
DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
Retrieves diagnostics data from a reporter on a device.
DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_GET
Retrieves the last stored objdump. Devlink health saves a single objdump.
If an objdump is not already stored by devlink for this reporter, devlink
generates a new objdump.
Objdump output is defined by the reporter.
DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_CLEAR
Clears the last saved objdump file for the specified reporter.
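The OBJDUMP_GET / OBJDUMP_CLEAR semantics described above (a single stored objdump, generated on demand when none is stored) could look roughly like the following user-space sketch. The names struct dump_slot, objdump_get() and objdump_clear() are made up for illustration and are not the devlink API; the generation counter exists only to make it visible when a new dump was produced.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical single-slot objdump storage for one reporter. */
struct dump_slot {
	bool stored;
	unsigned int generation; /* how many dumps were generated so far */
	char data[64];
};

/* Stand-in for the reporter-defined objdump callback. */
static void generate_objdump(struct dump_slot *s)
{
	s->generation++;
	snprintf(s->data, sizeof(s->data), "objdump #%u", s->generation);
	s->stored = true;
}

/* GET: return the stored dump, generating one first if the slot is empty. */
const char *objdump_get(struct dump_slot *s)
{
	if (!s->stored)
		generate_objdump(s);
	return s->data;
}

/* CLEAR: drop the stored dump so the next GET generates a fresh one. */
void objdump_clear(struct dump_slot *s)
{
	s->stored = false;
	s->data[0] = '\0';
}
```

Two consecutive objdump_get() calls on the same slot return the same stored dump; only after objdump_clear() does a further objdump_get() produce a new one.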
                                                netlink
                                      +--------------------------+
                                      |                          |
                                      |            +             |
                                      |            |             |
                                      +--------------------------+
                                                   |request for ops
                                                   |(diagnose,
mlx5_core                               devlink    |recover,
                                                   |dump)
+--------+                            +--------------------------+
|        |                            |    reporter|             |
|        |                            |  +---------v----------+  |
|        |       ops execution        |  |                    |  |
|     <----------------------------------+                    |  |
|        |                            |  |                    |  |
|        |                            |  + ^------------------+  |
|        |                            |    | request for ops     |
|        |                            |    | (recover, dump)     |
|        |                            |    |                     |
|        |                            |  +-+------------------+  |
|        |       health report        |  | health handler     |  |
|        +------------------------------->                    |  |
|        |                            |  +--------------------+  |
|        |   health reporter create   |                          |
|        +---------------------------->                          |
+--------+                            +--------------------------+
Available reporters:
In this patchset, three reporters of the mlx5e driver are included. The FW
reporters implement devlink_health_reporter diagnostic, object dump and
recovery procedures. The TX reporter implements devlink_health_reporter
diagnostic and recovery procedures.
This RFC is based on the same concepts as the previous RFC I sent earlier this
year: "[RFC PATCH iproute2-next] System specification health API". The API has
changed; in addition, the devlink code and mlx5e reporters were not available in
the previous RFC.
Aya Levin (1):
devlink: Add Documentation/networking/devlink-health.txt
Eran Ben Elisha (11):
devlink: Add health buffer support
devlink: Add health reporter create/destroy functionality
devlink: Add health report functionality
devlink: Add health get command
devlink: Add health set command
devlink: Add health recover command
devlink: Add health diagnose command
devlink: Add health objdump {get,clear} commands
net/mlx5e: Add TX reporter support
net/mlx5e: Add TX timeout support for mlx5e TX reporter
net/mlx5: Move all devlink related functions calls to devlink.c
Moshe Shemesh (7):
net/mlx5: Refactor print health info
net/mlx5: Create FW devlink_health_reporter
net/mlx5: Add core dump register access functions
net/mlx5: Add support for FW reporter objdump
net/mlx5: Report devlink health on FW issues
net/mlx5: Add FW fatal devlink_health_reporter
net/mlx5: Report devlink health on FW fatal issues
Documentation/networking/devlink-health.txt | 86 ++
.../net/ethernet/mellanox/mlx5/core/Makefile | 4 +-
.../net/ethernet/mellanox/mlx5/core/devlink.c | 310 +++++
.../net/ethernet/mellanox/mlx5/core/devlink.h | 22 +
.../mellanox/mlx5/core/diag/fw_tracer.c | 75 ++
.../mellanox/mlx5/core/diag/fw_tracer.h | 13 +
drivers/net/ethernet/mellanox/mlx5/core/en.h | 18 +-
.../ethernet/mellanox/mlx5/core/en/reporter.h | 15 +
.../mellanox/mlx5/core/en/reporter_tx.c | 356 ++++++
.../net/ethernet/mellanox/mlx5/core/en_main.c | 186 +--
.../net/ethernet/mellanox/mlx5/core/en_tx.c | 2 +-
.../net/ethernet/mellanox/mlx5/core/health.c | 79 +-
.../net/ethernet/mellanox/mlx5/core/main.c | 10 +-
.../ethernet/mellanox/mlx5/core/mlx5_core.h | 7 +
include/linux/mlx5/device.h | 1 +
include/linux/mlx5/driver.h | 5 +
include/linux/mlx5/mlx5_ifc.h | 21 +-
include/net/devlink.h | 142 +++
include/trace/events/devlink.h | 62 +
include/uapi/linux/devlink.h | 25 +
net/core/devlink.c | 1037 +++++++++++++++++
21 files changed, 2265 insertions(+), 211 deletions(-)
create mode 100644 Documentation/networking/devlink-health.txt
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.c
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.h
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/reporter.h
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
--
2.17.1