Message-ID: <20240617200411.1426554-1-terry.bowman@amd.com>
Date: Mon, 17 Jun 2024 15:04:02 -0500
From: Terry Bowman <terry.bowman@....com>
To: <dan.j.williams@...el.com>, <ira.weiny@...el.com>, <dave@...olabs.net>,
<dave.jiang@...el.com>, <alison.schofield@...el.com>, <ming4.li@...el.com>,
<vishal.l.verma@...el.com>, <jim.harris@...sung.com>,
<ilpo.jarvinen@...ux.intel.com>, <ardb@...nel.org>,
<sathyanarayanan.kuppuswamy@...ux.intel.com>, <linux-cxl@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <terry.bowman@....com>,
<Yazen.Ghannam@....com>, <Robert.Richter@....com>
Subject: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
This patchset provides RAS logging for CXL root ports, CXL downstream
switch ports, and CXL upstream switch ports. This includes changes to
use a portdrv notifier chain to communicate CXL AER/RAS errors to a
cxl_pci callback.
The first 3 patches prepare for and add an atomic notifier chain to the
portdrv driver. The portdrv's notifier chain reports the port device's
AER internal errors to the registered callback(s). The preparation changes
include a portdrv update to call the uncorrectable handler for PCIe root
ports and PCIe downstream switch ports. Also, the AER correctable error
(CE) status is made available to the AER CE handler.
The next 4 patches are in preparation for adding an atomic notification
callback in the cxl_pci driver. This is for receiving AER internal error
events from the portdrv notifier chain. Preparation includes adding RAS
register block mapping, adding trace functions for logging, and
refactoring cxl_pci RAS functions for reuse.
The final 2 patches enable the AER internal error interrupts.
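
As a rough illustration of what enabling the internal error interrupts
involves: the AER corrected and uncorrectable internal errors (CIE/UIE)
are, as far as I know, masked by default, so reporting them means
clearing the matching bits in the AER mask registers. The sketch below
is a userspace model, not code from the patches. The two bit values
match PCI_ERR_UNC_INTN and PCI_ERR_COR_INTERNAL in the kernel's
pci_regs.h, while struct aer_masks and unmask_internal_errors() are
hypothetical names standing in for the port's config space accesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AER mask/status bits; values as defined in the kernel's pci_regs.h. */
#define PCI_ERR_UNC_INTN     0x00400000u /* Uncorrectable Internal Error, bit 22 */
#define PCI_ERR_COR_INTERNAL 0x00004000u /* Corrected Internal Error, bit 14 */

/* Hypothetical stand-in for a port's AER mask registers. */
struct aer_masks {
	uint32_t uncor_mask; /* models the Uncorrectable Error Mask register */
	uint32_t cor_mask;   /* models the Correctable Error Mask register */
};

/* Clear only the internal error mask bits; every other bit is preserved. */
static void unmask_internal_errors(struct aer_masks *m)
{
	assert(m != NULL);
	m->uncor_mask &= ~PCI_ERR_UNC_INTN;
	m->cor_mask &= ~PCI_ERR_COR_INTERNAL;
}
```

In the kernel the same read-modify-write would go through the PCI
config accessors rather than a plain struct.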
Testing RAS CE/UCE:
QEMU was used for testing the CXL root port, CXL downstream switch port,
and CXL upstream switch port. The aer-inject tool was used to inject AER
errors, and a test patch was used to set the AER CIE/UIE and RAS CE/UCE
status during testing. Testing passed with no issues.
An AMD platform with the AMD RAS error injection tool was used for
testing CXL root port injection. Testing passed with no issues.
TODO - regression test CXL 1.1 RCH handling.
Solutions Considered (1-4):
Below are solutions that were considered. Solution #4 is
implemented in this patchset.
1.) Reassigning portdrv error handler for CXL port devices
This solution was based on reassigning the portdrv's CE/UCE err_handler
to be CXL cxl_pci driver functions.
I started with this solution and, once the flow was working, realized
that endpoint removal would have to be addressed as well. While this
could be resolved, it highlights the odd coupling and dependency
between the CXL port devices' error handling and the cxl_pci endpoint's
handlers. Also, the err_handler reassignment at runtime required
ignoring the 'const' definition. I don't believe this should be
considered as a possible solution.
2.) Update the AER driver to call cxl_pci driver's error handler before
calling pci_aer_handle_error()
This is similar to the existing RCH port error approach in aer.c.
In this solution the AER driver searches for a downstream CXL endpoint
to 'handle' detected CXL port protocol errors.
This is a good solution to consider if the one presented in this patchset
is not acceptable. I was initially reluctant to take this approach because
it adds more CXL coupling to the AER driver. But I think this solution
would technically work. I believe Ming was working towards this
solution.
3.) Refactor portdrv
The portdrv refactoring solution is to change the portdrv service drivers
into PCIe auxiliary drivers. With this change the facility drivers can be
associated with a PCIe driver instead of being bound exclusively to the
portdrv driver. In this case the CXL port functionality would be added
either as a CXL auxiliary driver or as a CXL-specific port driver
(PCI_CLASS_BRIDGE_PCI_NORMAL).
This solution has challenges in the interrupt allocation by separate
auxiliary drivers and in binding of a specific driver. Binding is
currently based on PCIe class and would require extending the binding
logic to support multiple drivers for the same class.
Jonathan Cameron is working towards this solution by initially solving
for the PMU service driver.[1] It uses the auxiliary bus to associate
what were service drivers with the portdrv driver. Using a CXL auxiliary
driver for handling CXL port RAS errors would result in RAS logic being
called from both the cxl_pci and CXL auxiliary drivers. This may need a
library driver.
4.) Using a portdrv notifier chain/callback for CIE/UIE
(Implemented in this patchset)
This solution uses a portdrv atomic chain notifier and a cxl_pci
callback to handle and log CXL port RAS errors.
I chose this after trying solution #1 above. The advantages of this
solution are:
- It is a general port implementation for the CIE/UIE-specific handling
mentioned in the PCIe spec.[2]
- The RAS MCE driver already uses a notifier, which serves as an
existing example.
- Does not introduce further CXL dependencies into the AER driver.
- The notifier chain provides registration/unregistration and
synchronization.
A disadvantage of this approach is that coupling still exists between the
CXL port's driver (portdrv) and the cxl_pci driver. The CXL port device's
RAS is handled by a notifier callback in the cxl_pci endpoint driver.
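
To make the register/call/unregister flow of solution #4 concrete, here
is a minimal userspace model of a notifier chain. It is only a sketch
under stated assumptions: the kernel's real API is
atomic_notifier_chain_register(), atomic_notifier_chain_unregister(),
and atomic_notifier_call_chain(), which also handle priority ordering
and synchronization that this model leaves out. The chain_*() helpers,
the event codes, struct ras_counts, and cxl_port_ras_cb() are
hypothetical names and are not taken from the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes; values as in the kernel's notifier.h. */
#define NOTIFY_DONE 0x0000 /* callback did not act on the event */
#define NOTIFY_OK   0x0001 /* callback handled the event */

/* Hypothetical event codes for AER internal errors. */
enum port_aer_event { PORT_AER_CIE = 1, PORT_AER_UIE = 2 };

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long event, void *data);
	struct notifier_block *next;
};

struct notifier_head {
	struct notifier_block *head;
};

/* Add a callback at the front of the chain (no locking in this model). */
static void chain_register(struct notifier_head *nh, struct notifier_block *nb)
{
	assert(nh && nb && nb->notifier_call);
	nb->next = nh->head;
	nh->head = nb;
}

/* Unlink a callback, e.g. when the endpoint driver is removed. */
static void chain_unregister(struct notifier_head *nh, struct notifier_block *nb)
{
	struct notifier_block **pp = &nh->head;

	while (*pp) {
		if (*pp == nb) {
			*pp = nb->next;
			nb->next = NULL;
			return;
		}
		pp = &(*pp)->next;
	}
}

/* Port driver side: report an event to every registered callback. */
static int chain_call(struct notifier_head *nh, unsigned long event, void *data)
{
	int ret = NOTIFY_DONE;

	for (struct notifier_block *nb = nh->head; nb; nb = nb->next)
		ret = nb->notifier_call(nb, event, data);
	return ret;
}

/* Hypothetical endpoint-side callback: counts events by severity. */
struct ras_counts {
	unsigned int ce;
	unsigned int uce;
};

static int cxl_port_ras_cb(struct notifier_block *nb,
			   unsigned long event, void *data)
{
	struct ras_counts *c = data;

	(void)nb;
	if (event == PORT_AER_CIE)
		c->ce++;
	else if (event == PORT_AER_UIE)
		c->uce++;
	else
		return NOTIFY_DONE;
	return NOTIFY_OK;
}
```

The port driver side only ever calls chain_call(). It needs no
knowledge of who is listening, and with no callback registered the
event is simply not consumed.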
Most of the patches in this patchset could be reused to work with
solution #3 or solution #2. The atomic notifier could be dropped and
replaced with an auxiliary device or AER driver awareness. The other
changes in this patchset could possibly be reused.
[1] Kernel.org -
https://lore.kernel.org/all/f4b23710-059a-51b7-9d27-b62e8b358b54@linux.intel.com
[2] PCIe 6.0 - 6.2.10 Internal Errors
drivers/cxl/core/core.h | 4 +
drivers/cxl/core/pci.c | 153 ++++++++++++++++++++++++++++++++-----
drivers/cxl/core/port.c | 6 +-
drivers/cxl/core/trace.h | 34 +++++++++
drivers/cxl/cxl.h | 10 +++
drivers/cxl/cxlpci.h | 2 +
drivers/cxl/mem.c | 32 +++++++-
drivers/cxl/pci.c | 19 ++++-
drivers/pci/pcie/aer.c | 10 ++-
drivers/pci/pcie/err.c | 20 +++++
drivers/pci/pcie/portdrv.c | 32 ++++++++
drivers/pci/pcie/portdrv.h | 2 +
include/linux/aer.h | 6 ++
13 files changed, 303 insertions(+), 27 deletions(-)
base-commit: ca3d4767c8054447ac2a58356080e299a59e05b8
--
2.34.1