Message-ID: <6675cece52eaf_57ac294ea@dwillia2-xfh.jf.intel.com.notmuch>
Date: Fri, 21 Jun 2024 12:04:46 -0700
From: Dan Williams <dan.j.williams@...el.com>
To: Terry Bowman <terry.bowman@....com>, <dan.j.williams@...el.com>,
<ira.weiny@...el.com>, <dave@...olabs.net>, <dave.jiang@...el.com>,
<alison.schofield@...el.com>, <ming4.li@...el.com>,
<vishal.l.verma@...el.com>, <jim.harris@...sung.com>,
<ilpo.jarvinen@...ux.intel.com>, <ardb@...nel.org>,
<sathyanarayanan.kuppuswamy@...ux.intel.com>, <linux-cxl@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <Yazen.Ghannam@....com>,
<Robert.Richter@....com>
Subject: Re: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL
downstream switch ports, and CXL upstream switch ports
Terry Bowman wrote:
> This patchset provides RAS logging for CXL root ports, CXL downstream
> switch ports, and CXL upstream switch ports. This includes changes to
> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
> cxl_pci callback.
>
> The first 3 patches prepare for and add an atomic notifier chain to the
> portdrv driver. The portdrv's notifier chain reports the port device's
> AER internal errors to the registered callback(s). The preparation changes
> include a portdrv update to call the uncorrectable handler for PCIe root
> ports and PCIe downstream switch ports. Also, the AER correctable error
> (CE) status is made available to the AER CE handler.
>
> The next 4 patches are in preparation for adding an atomic notification
> callback in the cxl_pci driver. This is for receiving AER internal error
> events from the portdrv notifier chain. Preparation includes adding RAS
> register block mapping, adding trace functions for logging, and
> refactoring cxl_pci RAS functions for reuse.
>
> The final 2 patches enable the AER internal error interrupts.
[..]
>
> Solutions Considered (1-4):
> Below are solutions that were considered. Solution #4 is
> implemented in this patchset.
[..]
> 2.) Update the AER driver to call cxl_pci driver's error handler before
> calling pci_aer_handle_error()
>
> This is similar to the existing RCH port error approach in aer.c.
> In this solution the AER driver searches for a downstream CXL endpoint
> to 'handle' detected CXL port protocol errors.
>
> This is a good solution to consider if the one presented in this patchset
> is not acceptable. I was initially reluctant to this approach because it
> adds more CXL coupling to the AER driver. But, I think this solution
> would technically work. I believe Ming was working towards this
> solution.
I feel like the coupling is warranted because these things *are* PCIe
and CXL ports, but it means solving the interrupt distribution problem.
> 3.) Refactor portdrv
> The portdrv refactoring solution is to change the portdrv service drivers
> into PCIe auxiliary drivers. With this change the facility drivers can be
> associated with a PCIe driver instead fixed bound to the portdrv driver.
>
> In this case the CXL port functionality would be added either as a CXL
> auxiliary driver or as a CXL specific port driver
> (PCI_CLASS_BRIDGE_PCI_NORMAL).
>
> This solution has challenges in the interrupt allocation by separate
> auxiliary drivers and in binding of a specific driver. Binding is
> currently based on PCIe class and would require extending the binding
> logic to support multiple drivers for the same class.
>
> Jonathan Cameron is working towards this solution by initially solving
> for the PMU service driver.[1] It is using the auxiliary bus to associate
> what were service drivers with the portdrv driver. Using a CXL auxiliary
> for handling CXL port RAS errors would result in RAS logic called from
> the cxl_pci and CXL auxiliary drivers. This may need a library driver.
I don't think the auxiliary bus is a fundamental step forward from the
PCIe portdrv. It is just a s/pcie_port_bus_type/auxiliary_bus_type/
rename, with all the same problems around how to distribute interrupt
services to different interested parties.
So I think notifiers are interesting from the perspective of a software
hack to enable interrupt distribution. However, given that dynamic MSI-X
support is within reach I am interested in exploring that path and
mandating that archs that want to handle CXL protocol errors natively
need to enable dynamic MSI-X. Otherwise, those platforms should disclaim
native protocol error handling support via CXL _OSC.
In other words, I expect native dynamic MSI-X support is more
maintainable in the sense of keeping all the code in one notification
domain.
> 4.) Using a portdrv notifier chain/callback for CIE/UIE
> (Implemented in this patchset)
>
> This solution uses a portdrv atomic chain notifier and a cxl_pci
> callback to handle and log CXL port RAS errors.
Oh, I will need to look at the cxl_pci tie-in for this. I was
expecting cxl_pci to only get involved in the RCH case because the port
and the endpoint are one and the same object. In the VH case I would only
expect cxl_pci to get involved for its own observed protocol errors, not
those reported upstream from that endpoint.
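For reference, the notifier-chain pattern that solution 4 relies on can be
sketched in plain userspace C. This is only an illustration of the
register / call / unregister flow, not the code from the patchset. The
event codes (PORT_EVENT_CIE, PORT_EVENT_UIE), the callback name
cxl_ras_notify(), and the "port0" identifier are hypothetical. The kernel's
atomic notifier chain additionally takes a spinlock on registration and
uses RCU on the call path, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Return codes loosely modeled on the kernel's NOTIFY_* values. */
#define NOTIFY_DONE 0   /* callback did not care about the event */
#define NOTIFY_OK   1   /* callback handled the event */

/* Hypothetical event codes: correctable/uncorrectable internal error. */
enum port_event { PORT_EVENT_CIE, PORT_EVENT_UIE };

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long event, void *data);
	struct notifier_block *next;
	int priority;
};

struct notifier_head { struct notifier_block *head; };

/* Insert in descending priority order; no locking in this sketch. */
static void chain_register(struct notifier_head *nh, struct notifier_block *nb)
{
	struct notifier_block **p = &nh->head;

	while (*p && (*p)->priority >= nb->priority)
		p = &(*p)->next;
	nb->next = *p;
	*p = nb;
}

static void chain_unregister(struct notifier_head *nh, struct notifier_block *nb)
{
	struct notifier_block **p;

	for (p = &nh->head; *p; p = &(*p)->next) {
		if (*p == nb) {
			*p = nb->next;
			nb->next = NULL;
			return;
		}
	}
}

/* Call every registered callback; return how many handled the event. */
static int chain_call(struct notifier_head *nh, unsigned long event, void *data)
{
	struct notifier_block *nb;
	int handled = 0;

	for (nb = nh->head; nb; nb = nb->next)
		if (nb->notifier_call(nb, event, data) == NOTIFY_OK)
			handled++;
	return handled;
}

/* Hypothetical listener standing in for the cxl_pci side. */
struct ras_stats { unsigned int ce; unsigned int ue; };
static struct ras_stats stats;

static int cxl_ras_notify(struct notifier_block *nb, unsigned long event,
			  void *data)
{
	const char *port = data;

	(void)nb;
	switch (event) {
	case PORT_EVENT_CIE: stats.ce++; break;
	case PORT_EVENT_UIE: stats.ue++; break;
	default: return NOTIFY_DONE;
	}
	printf("%s: internal error event %lu\n", port, event);
	return NOTIFY_OK;
}

/* The port side raises two events, then one more after unregistration. */
static int run_demo(void)
{
	struct notifier_head chain = { NULL };
	struct notifier_block nb = { .notifier_call = cxl_ras_notify };
	char port[] = "port0";
	int handled = 0;

	stats.ce = stats.ue = 0;
	chain_register(&chain, &nb);
	handled += chain_call(&chain, PORT_EVENT_CIE, port);
	handled += chain_call(&chain, PORT_EVENT_UIE, port);
	chain_unregister(&chain, &nb);
	handled += chain_call(&chain, PORT_EVENT_UIE, port); /* no listener */
	return handled;
}
```

Calling run_demo() delivers the first two events to the listener and drops
the third, since nothing is registered by then. The sketch also makes the
coupling concern visible: the port side only sees an opaque chain, while
all of the CXL-specific handling lives in whichever driver registered the
callback.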
> I chose this after trying solution#1 above. I see a couple advantages to
> this solution are:
> - Is general port implementation for CIE/UIE specific handling mentioned
> in the PCIe spec.[2]
> - Notifier is used in RAS MCE driver as an existing example.
> - Does not introduce further CXL dependencies into the AER driver.
> - The notifier chain provides registration/unregistration and
> synchronization.
>
> A disadvantage of this approach is coupling still exists between the CXL
> port's driver (portdrv) and the cxl_pci driver. The CXL port device's RAS
> is handled by a notifier callback in the cxl_pci endpoint driver.
>
> Most of the patches in this patchset could be reused to work with
> solution#3 or solution#2. The atomic notifier could be dropped and
> instead use an auxiliary device or AER driver awareness. The other
> changes in this patchset could possibly be reused.
I appreciate the discussion of tradeoffs, thanks Terry!