Message-ID: <ZsOMUi_dMhakCkit@fan>
Date: Mon, 19 Aug 2024 11:17:54 -0700
From: Fan Ni <nifan.cxl@...il.com>
To: Terry Bowman <Terry.Bowman@....com>
Cc: fan <nifan.cxl@...il.com>, dan.j.williams@...el.com,
ira.weiny@...el.com, dave@...olabs.net, dave.jiang@...el.com,
alison.schofield@...el.com, ming4.li@...el.com,
vishal.l.verma@...el.com, jim.harris@...sung.com,
ilpo.jarvinen@...ux.intel.com, ardb@...nel.org,
sathyanarayanan.kuppuswamy@...ux.intel.com,
linux-cxl@...r.kernel.org, linux-kernel@...r.kernel.org,
Yazen.Ghannam@....com, Robert.Richter@....com,
a.manzanares@...sung.com
Subject: Re: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL
downstream switch ports, and CXL upstream switch ports
On Mon, Aug 19, 2024 at 11:21:01AM -0500, Terry Bowman wrote:
> Hi Fan
>
> On 7/25/24 13:49, fan wrote:
> > On Mon, Jun 17, 2024 at 03:04:02PM -0500, Terry Bowman wrote:
> >> This patchset provides RAS logging for CXL root ports, CXL downstream
> >> switch ports, and CXL upstream switch ports. This includes changes to
> >> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
> >> cxl_pci callback.
> >>
> >> The first 3 patches prepare for and add an atomic notifier chain to the
> >> portdrv driver. The portdrv's notifier chain reports the port device's
> >> AER internal errors to the registered callback(s). The preparation changes
> >> include a portdrv update to call the uncorrectable handler for PCIe root
> >> ports and PCIe downstream switch ports. Also, the AER correctable error
> >> (CE) status is made available to the AER CE handler.
> >>
> >> The next 4 patches are in preparation for adding an atomic notification
> >> callback in the cxl_pci driver. This is for receiving AER internal error
> >> events from the portdrv notifier chain. Preparation includes adding RAS
> >> register block mapping, adding trace functions for logging, and
> >> refactoring cxl_pci RAS functions for reuse.
> >>
> >> The final 2 patches enable the AER internal error interrupts.
> >>
> >> Testing RAS CE/UCE:
> >> QEMU was used for testing CXL root port, CXL downstream switch port, and
> >> CXL upstream switch port. The aer-inject tool was used to inject AER
> >> errors, and a test patch was used to set the AER CIE/UIE and RAS CE/UCE
> >> status during testing. Testing passed with no issues.
> >
> > Hi Terry,
> >
> > Could you share a little more about the qemu test setup?
> > From what I see, QEMU can currently only inject errors into type3
> > devices; is that true? If not, how is that done for port devices?
> > Do we need a hack there?
> >
> > Also, is the aer-inject tool you mentioned the one currently in the kernel
> > or something else?
> > https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/aer_inject.c
> >
> > Thanks,
> > Fan
> >
> Sorry for the late response.
>
> I used AMD RAS injection for testing HW root ports.
>
> I used QEMU and the legacy aer-inject userspace tool to test switch ports (USP/DSP).[1]
> I added a couple of test patches to set the AER UIE/CIE because the tool doesn't support
> injecting UIE or CIE bits. I used a test patch for assigning the RAS status as well.
>
> Regards,
> Terry
>
> [1] - https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git/about/
>
Hi Terry,
Thanks for the reply. I was able to inject AER errors using the aer-inject
kernel module and the userspace tool.
I am now trying to exercise the code in this patchset.
Fan
> >
> >>
> >> An AMD platform with the AMD RAS error injection tool was used for
> >> testing CXL root port injection. Testing passed with no issues.
> >>
> >> TODO - regression test CXL1.1 RCH handling.
> >>
> >> Solutions Considered (1-4):
> >> Below are solutions that were considered. Solution #4 is
> >> implemented in this patchset.
> >>
> >> 1.) Reassigning portdrv error handler for CXL port devices
> >>
> >> This solution was based on reassigning the portdrv's CE/UCE err_handler
> >> to be CXL cxl_pci driver functions.
> >>
> >> I started with this solution and once the flow was working I realized
> >> the endpoint removal would have to be addressed as well. While this
> >> could be resolved, it does highlight the odd coupling and dependency
> >> between the CXL port devices' error handling and the cxl_pci endpoint's
> >> handlers. Also, the err_handler re-assignment at runtime required
> >> ignoring the 'const' definition. I don't believe this should be
> >> considered as a possible solution.
> >>
> >> 2.) Update the AER driver to call cxl_pci driver's error handler before
> >> calling pci_aer_handle_error()
> >>
> >> This is similar to the existing RCH port error approach in aer.c.
> >> In this solution the AER driver searches for a downstream CXL endpoint
> >> to 'handle' detected CXL port protocol errors.
> >>
> >> This is a good solution to consider if the one presented in this patchset
> >> is not acceptable. I was initially reluctant to take this approach because it
> >> adds more CXL coupling to the AER driver. But, I think this solution
> >> would technically work. I believe Ming was working towards this
> >> solution.
> >>
> >> 3.) Refactor portdrv
> >> The portdrv refactoring solution is to change the portdrv service drivers
> >> into PCIe auxiliary drivers. With this change the facility drivers can be
> >> associated with a PCIe driver instead of being fixed-bound to the portdrv driver.
> >>
> >> In this case the CXL port functionality would be added either as a CXL
> >> auxiliary driver or as a CXL specific port driver
> >> (PCI_CLASS_BRIDGE_PCI_NORMAL).
> >>
> >> This solution has challenges in the interrupt allocation by separate
> >> auxiliary drivers and in binding of a specific driver. Binding is
> >> currently based on PCIe class and would require extending the binding
> >> logic to support multiple drivers for the same class.
> >>
> >> Jonathan Cameron is working towards this solution by initially solving
> >> for the PMU service driver.[1] It is using the auxiliary bus to associate
> >> what were service drivers with the portdrv driver. Using a CXL auxiliary
> >> for handling CXL port RAS errors would result in RAS logic called from
> >> the cxl_pci and CXL auxiliary drivers. This may need a library driver.
> >>
> >> 4.) Using a portdrv notifier chain/callback for CIE/UIE
> >> (Implemented in this patchset)
> >>
> >> This solution uses a portdrv atomic chain notifier and a cxl_pci
> >> callback to handle and log CXL port RAS errors.
> >>
> >> I chose this after trying solution#1 above. I see several advantages to
> >> this solution:
> >> - It is a general port implementation for the CIE/UIE-specific handling
> >> mentioned in the PCIe spec.[2]
> >> - A notifier is already used in the RAS MCE driver as an existing example.
> >> - Does not introduce further CXL dependencies into the AER driver.
> >> - The notifier chain provides registration/unregistration and
> >> synchronization.
> >>
> >> A disadvantage of this approach is coupling still exists between the CXL
> >> port's driver (portdrv) and the cxl_pci driver. The CXL port device's RAS
> >> is handled by a notifier callback in the cxl_pci endpoint driver.
> >>
> >> Most of the patches in this patchset could be reused to work with
> >> solution#3 or solution#2. The atomic notifier could be dropped and
> >> replaced with an auxiliary device or AER driver awareness. The other
> >> changes in this patchset could possibly be reused.
> >>
> >> [1] Kernel.org -
> >> https://lore.kernel.org/all/f4b23710-059a-51b7-9d27-b62e8b358b54@linux.intel.com
> >> [2] PCI6.0 - 6.2.10 Internal errors
> >>
> >> drivers/cxl/core/core.h | 4 +
> >> drivers/cxl/core/pci.c | 153 ++++++++++++++++++++++++++++++++-----
> >> drivers/cxl/core/port.c | 6 +-
> >> drivers/cxl/core/trace.h | 34 +++++++++
> >> drivers/cxl/cxl.h | 10 +++
> >> drivers/cxl/cxlpci.h | 2 +
> >> drivers/cxl/mem.c | 32 +++++++-
> >> drivers/cxl/pci.c | 19 ++++-
> >> drivers/pci/pcie/aer.c | 10 ++-
> >> drivers/pci/pcie/err.c | 20 +++++
> >> drivers/pci/pcie/portdrv.c | 32 ++++++++
> >> drivers/pci/pcie/portdrv.h | 2 +
> >> include/linux/aer.h | 6 ++
> >> 13 files changed, 303 insertions(+), 27 deletions(-)
> >>
> >>
> >> base-commit: ca3d4767c8054447ac2a58356080e299a59e05b8
> >> --
> >> 2.34.1
> >>