Date:   Thu, 17 Jan 2019 17:04:54 +0200
From:   Eran Ben Elisha <eranbe@...lanox.com>
To:     netdev@...r.kernel.org, Jiri Pirko <jiri@...lanox.com>,
        "David S. Miller" <davem@...emloft.net>,
        Ariel Almog <ariela@...lanox.com>,
        Aya Levin <ayal@...lanox.com>,
        Eran Ben Elisha <eranbe@...lanox.com>,
        Moshe Shemesh <moshe@...lanox.com>
Subject: [PATCH net-next 00/27] Devlink health reporting and recovery system

The health mechanism is targeted at real-time alerting, so that the user knows
when something bad has happened to a PCI device. It aims to:
- Provide alert debug information
- Enable self-healing
- Provide a way to gather all needed debugging information if the problem
  needs vendor support.

The main idea is to unify and centralize driver health reports in the
generic devlink instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The devlink health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, FW error,
RX/TX error) or unknown (driver-specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by devlink.
A device driver can provide specific callbacks for each "health reporter", e.g.:
 - Recovery procedures
 - Diagnostics and object dump procedures
 - OOB initial attributes
Different parts of the driver can register different types of health reporters
with different handlers.
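
As a rough illustration of the reporter model described above, here is a
minimal user-space sketch in C. It is not the API added by this patchset:
struct health_reporter_ops, struct health_reporter, health_reporter_init() and
the tx_* names are all hypothetical, and treating auto-recovery and the grace
period as the initial attributes is an assumption. The example reporter
provides recovery and diagnostics callbacks and leaves the dump callback unset.

```c
#include <stddef.h>
#include <stdio.h>

struct health_reporter;

/* Hypothetical per-reporter callback table; any callback may be NULL. */
struct health_reporter_ops {
	const char *name;	/* error/health type, e.g. "tx" */
	int (*recover)(struct health_reporter *r);
	int (*diagnose)(struct health_reporter *r, char *out, size_t len);
	int (*dump)(struct health_reporter *r, char *out, size_t len);
};

/* Hypothetical reporter instance with assumed initial attributes. */
struct health_reporter {
	const struct health_reporter_ops *ops;
	void *priv;			/* driver context for the callbacks */
	int auto_recover;		/* assumed attribute */
	unsigned long grace_period_ms;	/* assumed attribute */
};

/* Set up a reporter; auto-recovery makes no sense without a recover op. */
int health_reporter_init(struct health_reporter *r,
			 const struct health_reporter_ops *ops, void *priv,
			 int auto_recover, unsigned long grace_period_ms)
{
	if (!r || !ops || !ops->name)
		return -1;
	if (auto_recover && !ops->recover)
		return -1;
	r->ops = ops;
	r->priv = priv;
	r->auto_recover = auto_recover;
	r->grace_period_ms = grace_period_ms;
	return 0;
}

/* Made-up driver state used by the example reporter below. */
struct tx_queue_state {
	int stuck;
};

int tx_recover(struct health_reporter *r)
{
	struct tx_queue_state *q = r->priv;

	q->stuck = 0;	/* pretend the queue was reset */
	return 0;
}

int tx_diagnose(struct health_reporter *r, char *out, size_t len)
{
	struct tx_queue_state *q = r->priv;

	snprintf(out, len, "tx queue stuck=%d", q->stuck);
	return 0;
}

/* Example reporter: recovery and diagnostics only, no dump callback. */
const struct health_reporter_ops tx_reporter_ops = {
	.name = "tx",
	.recover = tx_recover,
	.diagnose = tx_diagnose,
};
```

A second reporter in the same driver would simply use a different ops table
and private context, which is how different parts of a driver can register
different handlers.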

Once an error is reported, devlink health performs the following actions:
  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored)
  * An auto-recovery attempt is made, depending on:
    - Auto-recovery configuration
    - Grace period vs. time passed since the last recovery
The user interface:
The user can access/change each reporter's attributes and invoke its
driver-specific callbacks via devlink, per error type (per health reporter),
e.g.:
 - Configure the reporter's generic attributes (e.g. disable/enable auto
   recovery)
 - Invoke recovery procedure
 - Run diagnostics
 - Object dump

The devlink health interface (via netlink):
DEVLINK_CMD_HEALTH_REPORTER_GET
  Retrieves status and configuration info per DEV and reporter.
DEVLINK_CMD_HEALTH_REPORTER_SET
  Allows reporter-related configuration setting.
DEVLINK_CMD_HEALTH_REPORTER_RECOVER
  Triggers a reporter's recovery procedure.
DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
  Retrieves diagnostics data from a reporter on a device.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
  Retrieves the last stored dump. Devlink health
  saves a single dump. If a dump is not already stored by devlink
  for this reporter, devlink generates a new dump.
  Dump output is defined by the reporter.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
  Clears the last saved dump file for the specified reporter.
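
The single-dump semantics of DUMP_GET and DUMP_CLEAR described above can be
modelled with a small user-space sketch. The names struct dump_slot,
take_dump(), dump_get() and dump_clear() are hypothetical and only illustrate
the behaviour; they are not the handlers added by this patchset, and the
"dump #N" payload is a made-up stand-in for reporter-defined output.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical storage for the one dump kept per reporter. */
struct dump_slot {
	char data[64];
	int valid;		/* non-zero while a dump is stored */
	unsigned int taken;	/* how many dumps were generated so far */
};

/* Stand-in for the reporter's dump callback. */
int take_dump(struct dump_slot *s)
{
	s->taken++;
	snprintf(s->data, sizeof(s->data), "dump #%u", s->taken);
	s->valid = 1;
	return 0;
}

/* DUMP_GET: return the stored dump, generating one only if none exists. */
const char *dump_get(struct dump_slot *s)
{
	if (!s->valid && take_dump(s) != 0)
		return NULL;
	return s->data;
}

/* DUMP_CLEAR: drop the stored dump so the next get generates a new one. */
void dump_clear(struct dump_slot *s)
{
	s->valid = 0;
	s->data[0] = '\0';
}
```

Calling dump_get() twice in a row returns the same stored dump; only after
dump_clear() does the next dump_get() produce a fresh one.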


                                               netlink
                                      +--------------------------+
                                      |                          |
                                      |            +             |
                                      |            |             |
                                      +--------------------------+
                                                   |request for ops
                                                   |(diagnose,
 mlx5_core                             devlink     |recover,
                                                   |dump)
+--------+                            +--------------------------+
|        |                            |    reporter|             |
|        |                            |  +---------v----------+  |
|        |   ops execution            |  |                    |  |
|     <----------------------------------+                    |  |
|        |                            |  |                    |  |
|        |                            |  + ^------------------+  |
|        |                            |    | request for ops     |
|        |                            |    | (recover, dump)     |
|        |                            |    |                     |
|        |                            |  +-+------------------+  |
|        |     health report          |  | health handler     |  |
|        +------------------------------->                    |  |
|        |                            |  +--------------------+  |
|        |     health reporter create |                          |
|        +---------------------------->                          |
+--------+                            +--------------------------+

Available reporters:
In this patchset, three reporters of the mlx5e driver are included. The FW
reporters implement the devlink_health_reporter diagnostic, dump and
recovery procedures. The TX reporter implements the devlink_health_reporter
diagnostic and recovery procedures.

To support a CR-space dump as part of the FW fatal reporter, mlx5 support
for devlink regions was added.

Alex Vesker (2):
  net/mlx5: Add Vendor Specific Capability access gateway
  net/mlx5: Add Crdump FW snapshot support

Aya Levin (1):
  devlink: Add Documentation/networking/devlink-health.txt

Eran Ben Elisha (11):
  devlink: Add health buffer support
  devlink: Add health reporter create/destroy functionality
  devlink: Add health report functionality
  devlink: Add health get command
  devlink: Add health set command
  devlink: Add health recover command
  devlink: Add health diagnose command
  devlink: Add health dump {get,clear} commands
  net/mlx5e: Add TX reporter support
  net/mlx5e: Add TX timeout support for mlx5e TX reporter
  net/mlx5: Move all devlink related functions calls to devlink.c

Feras Daoud (4):
  Documentation: mlx5: Update kernel documentation
  net/mlx5: Handle SW reset of FW in error flow
  net/mlx5: Control CR-space access by different PFs
  net/mlx5: Issue SW reset on FW assert

Moshe Shemesh (9):
  net/mlx5: Use devlink region_snapshot parameter
  net/mlx5: Refactor print health info
  net/mlx5: Create FW devlink_health_reporter
  net/mlx5: Add core dump register access functions
  net/mlx5: Add support for FW reporter dump
  net/mlx5: Report devlink health on FW issues
  net/mlx5: Add FW fatal devlink_health_reporter
  net/mlx5: Add support for FW fatal reporter dump
  net/mlx5: Report devlink health on FW fatal issues

 Documentation/networking/devlink-health.txt   |   86 ++
 Documentation/networking/mlx5.rst             |   39 +
 MAINTAINERS                                   |    1 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |    5 +-
 .../net/ethernet/mellanox/mlx5/core/devlink.c |  460 +++++++
 .../net/ethernet/mellanox/mlx5/core/devlink.h |   22 +
 .../ethernet/mellanox/mlx5/core/diag/crdump.c |  208 ++++
 .../mellanox/mlx5/core/diag/fw_tracer.c       |   76 ++
 .../mellanox/mlx5/core/diag/fw_tracer.h       |   11 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   18 +-
 .../ethernet/mellanox/mlx5/core/en/reporter.h |   15 +
 .../mellanox/mlx5/core/en/reporter_tx.c       |  356 ++++++
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  186 +--
 .../ethernet/mellanox/mlx5/core/en_selftest.c |    2 +-
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |    2 +-
 .../net/ethernet/mellanox/mlx5/core/health.c  |  291 ++++-
 .../ethernet/mellanox/mlx5/core/lib/mlx5.h    |    6 +
 .../ethernet/mellanox/mlx5/core/lib/pci_vsc.c |  311 +++++
 .../ethernet/mellanox/mlx5/core/lib/pci_vsc.h |   33 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |   18 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   15 +-
 include/linux/mlx5/device.h                   |   11 +-
 include/linux/mlx5/driver.h                   |   13 +-
 include/linux/mlx5/mlx5_ifc.h                 |   21 +-
 include/net/devlink.h                         |  144 +++
 include/trace/events/devlink.h                |   62 +
 include/uapi/linux/devlink.h                  |   25 +
 net/core/devlink.c                            | 1054 +++++++++++++++++
 28 files changed, 3262 insertions(+), 229 deletions(-)
 create mode 100644 Documentation/networking/devlink-health.txt
 create mode 100644 Documentation/networking/mlx5.rst
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/diag/crdump.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/reporter.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.h

-- 
2.17.1
