Message-ID: <6679dc345fd4c_5639294a5@dwillia2-xfh.jf.intel.com.notmuch>
Date: Mon, 24 Jun 2024 13:51:00 -0700
From: Dan Williams <dan.j.williams@...el.com>
To: Terry Bowman <Terry.Bowman@....com>, Dan Williams
	<dan.j.williams@...el.com>, <ira.weiny@...el.com>, <dave@...olabs.net>,
	<dave.jiang@...el.com>, <alison.schofield@...el.com>, <ming4.li@...el.com>,
	<vishal.l.verma@...el.com>, <jim.harris@...sung.com>,
	<ilpo.jarvinen@...ux.intel.com>, <ardb@...nel.org>,
	<sathyanarayanan.kuppuswamy@...ux.intel.com>, <linux-cxl@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <Yazen.Ghannam@....com>,
	<Robert.Richter@....com>
Subject: Re: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL
 downstream switch ports, and CXL upstream switch ports

Terry Bowman wrote:
> Hi Dan,
> 
> I added responses below.
> 
> On 6/21/24 14:04, Dan Williams wrote:
> > Terry Bowman wrote:
> >> This patchset provides RAS logging for CXL root ports, CXL downstream
> >> switch ports, and CXL upstream switch ports. This includes changes to
> >> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
> >> cxl_pci callback.
> >>
> >> The first 3 patches prepare for and add an atomic notifier chain to the
> >> portdrv driver. The portdrv's notifier chain reports the port device's
> >> AER internal errors to the registered callback(s). The preparation changes
> >> include a portdrv update to call the uncorrectable handler for PCIe root
> >> ports and PCIe downstream switch ports. Also, the AER correctable error
> >> (CE) status is made available to the AER CE handler.
> >>
> >> The next 4 patches are in preparation for adding an atomic notification
> >> callback in the cxl_pci driver. This is for receiving AER internal error
> >> events from the portdrv notifier chain. Preparation includes adding RAS
> >> register block mapping, adding trace functions for logging, and
> >> refactoring cxl_pci RAS functions for reuse.
> >>
> >> The final 2 patches enable the AER internal error interrupts.
> > [..] 
> >>
> >> Solutions Considered (1-4):
> >>   Below are solutions that were considered. Solution #4 is
> >>   implemented in this patchset. 
> > [..]
> >>  2.) Update the AER driver to call cxl_pci driver's error handler before
> >>  calling pci_aer_handle_error()
> >>
> >>  This is similar to the existing RCH port error approach in aer.c.
> >>  In this solution the AER driver searches for a downstream CXL endpoint
> >>  to 'handle' detected CXL port protocol errors.
> >>
> >>  This is a good solution to consider if the one presented in this patchset
> >>  is not acceptable. I was initially reluctant to take this approach because it
> >>  adds more CXL coupling to the AER driver. But I think this solution
> >>  would technically work. I believe Ming was working towards this
> >>  solution.
> > 
> > I feel like the coupling is warranted because these things *are* PCIe
> > and CXL ports, but it means solving the interrupt distribution problem.
> > 
> 
> I understand the service driver interrupt issue, but it is not clear how it
> applies to CXL port error handling. Can you help me understand how the
> interrupt issue affects CXL port AER UIE/CIE handling in the AER driver?

It is just the case of the AER MSI/MSI-X vector being multiplexed with other
CXL functionality on the same device. If the CXL interrupt vector is to be
enabled later, then MSI/MSI-X vector enabling needs to be dynamic.

...but yeah, not a problem now as we are only talking about PCIe AER
events and not multiplexing yet. I.e. that problem can be solved later.
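To illustrate why a late-enabled CXL vector forces dynamic MSI-X, here is a
minimal userspace model (not kernel code). It assumes a port that allocates
one vector at probe and multiplexes AER and PME onto it; SRC_CXL stands for a
hypothetical CXL source that shows up after probe. All names and the table
size are illustrative assumptions. Without dynamic allocation, the new source
can only share the existing vector or force MSI-X to be torn down and
re-enabled for the whole device.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of a port's MSI-X vector bookkeeping.
 * Not kernel API; only the allocation policy is modelled.
 */
#define TABLE_SIZE 8 /* assumed MSI-X table size of the device */

#define SRC_AER (1u << 0)
#define SRC_PME (1u << 1)
#define SRC_CXL (1u << 2) /* hypothetical source enabled after probe */

struct port_irqs {
	bool dynamic;                    /* may vectors be added after probe? */
	int nvec;                        /* vectors allocated so far */
	unsigned int owners[TABLE_SIZE]; /* bitmask of sources per vector */
};

/* Probe time: allocate one vector and multiplex the given sources on it. */
static void port_probe(struct port_irqs *p, bool dynamic, unsigned int sources)
{
	p->dynamic = dynamic;
	p->nvec = 1;
	p->owners[0] = sources;
}

/*
 * After probe: a new source wants a dedicated vector. Returns its index,
 * or -1 if it cannot be added without re-enabling MSI-X for the device.
 */
static int port_add_source(struct port_irqs *p, unsigned int source)
{
	assert(p->nvec >= 1); /* port_probe() must have run first */
	if (!p->dynamic || p->nvec >= TABLE_SIZE)
		return -1;
	p->owners[p->nvec] = source;
	return p->nvec++;
}

/* Which vector's handler must demultiplex events for this source? */
static int vector_for(const struct port_irqs *p, unsigned int source)
{
	for (int i = 0; i < p->nvec; i++)
		if (p->owners[i] & source)
			return i;
	return -1;
}
```

With a fixed allocation, port_add_source() fails and the CXL source has no
vector of its own; with dynamic allocation it gets vector 1 while AER stays
on vector 0.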

> 
> 
> >>   3.) Refactor portdrv
> >>   The portdrv refactoring solution is to change the portdrv service drivers
> >>   into PCIe auxiliary drivers. With this change the facility drivers can be
> >>   associated with a PCIe driver instead of being fixed-bound to the portdrv driver.
> >>
> >>   In this case the CXL port functionality would be added either as a CXL
> >>   auxiliary driver or as a CXL-specific port driver
> >>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
> >>
> >>   This solution has challenges in the interrupt allocation by separate
> >>   auxiliary drivers and in binding of a specific driver. Binding is
> >>   currently based on PCIe class and would require extending the binding
> >>   logic to support multiple drivers for the same class.
> >>
> >>   Jonathan Cameron is working towards this solution by initially solving
> >>   for the PMU service driver.[1] It is using the auxiliary bus to associate
> >>   what were service drivers with the portdrv driver. Using a CXL auxiliary
> >>   driver for handling CXL port RAS errors would result in RAS logic called
> >>   from both the cxl_pci and CXL auxiliary drivers. This may need a library driver.
> > 
> > I don't think auxiliary bus is a fundamental step forward from pcie
> > portdrv, it's just a s/pcie_port_bus_type/auxiliary_bus_type/ rename,
> > but with all the same problems around how to distribute interrupt
> > services to different interested parties.
> > 
> > So I think notifiers are interesting from the perspective of a software
> > hack to enable interrupt distribution. However, given that dynamic MSI-X
> > support is within reach I am interested in exploring that path and
> > mandating that archs that want to handle CXL protocol errors natively
> > need to enable dynamic MSI-X. Otherwise, those platforms should disclaim
> > native protocol error handling support via CXL _OSC.
> > 
> > In other words, I expect native dynamic MSI-X support is more
> > maintainable in the sense of keeping all the code in one notification
> > domain.
> > 
> >>   4.) Using a portdrv notifier chain/callback for CIE/UIE
> >>   (Implemented in this patchset)
> >>
> >>   This solution uses a portdrv atomic chain notifier and a cxl_pci
> >>   callback to handle and log CXL port RAS errors.
> > 
> > Oh, I will need to look at the cxl_pci tie-in for this. I was
> > expecting cxl_pci to get involved only in the RCH case, because the port
> > and the endpoint are one and the same object. In the VH case I would only
> > expect cxl_pci to get involved for its own observed protocol errors, not
> > those reported upstream from that endpoint.
> > 
> 
> The CXL port error handling needs a place to live, and there are few options at the moment.
> Where do you want the CXL port error handlers to reside?

I need to go understand exactly why cxl_pci is involved in this current
proposal, but I was thinking it is probably more natural for cxl_port to
have error handlers.
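For readers weighing solution #4, here is a minimal userspace model of the
notifier-chain pattern it describes: the port driver publishes AER internal
errors (CIE/UIE) to a priority-ordered chain, and a registered callback claims
the ones it cares about. The return codes loosely mirror the kernel's NOTIFY_*
semantics (callbacks run in descending priority, and a stop bit ends the
walk), but every name below is hypothetical, and the model has none of the
locking or RCU of a real atomic notifier chain. Whether the registered
callback lives in cxl_pci (as in the patchset) or in cxl_port does not
change the mechanism.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes loosely modelled on the kernel's NOTIFY_* values. */
#define EV_DONE      0x0000 /* not interested */
#define EV_OK        0x0001 /* handled */
#define EV_STOP_MASK 0x8000 /* stop walking the chain */

enum port_err { PORT_ERR_CIE, PORT_ERR_UIE }; /* corrected / uncorrected internal */

struct port_err_event {
	enum port_err type;
	unsigned int status; /* AER status captured by the reporting driver */
	int is_cxl_port;     /* hypothetical: port is operating in CXL mode */
};

struct err_notifier {
	int (*call)(struct err_notifier *nb, struct port_err_event *ev);
	int priority;              /* higher priority runs first */
	struct err_notifier *next;
};

struct err_chain {
	struct err_notifier *head;
};

/* Insert in descending priority; equal priorities keep registration order. */
static void chain_register(struct err_chain *c, struct err_notifier *nb)
{
	struct err_notifier **pp = &c->head;

	assert(nb->call != NULL);
	while (*pp && (*pp)->priority >= nb->priority)
		pp = &(*pp)->next;
	nb->next = *pp;
	*pp = nb;
}

/* Walk the chain and return the last callback's result. */
static int chain_call(struct err_chain *c, struct port_err_event *ev)
{
	int ret = EV_DONE;

	for (struct err_notifier *nb = c->head; nb; nb = nb->next) {
		ret = nb->call(nb, ev);
		if (ret & EV_STOP_MASK)
			break;
	}
	return ret;
}

/* Hypothetical CXL callback: only claims errors reported by CXL ports. */
static int cxl_logged;

static int cxl_port_err_cb(struct err_notifier *nb, struct port_err_event *ev)
{
	(void)nb;
	if (!ev->is_cxl_port)
		return EV_DONE;
	cxl_logged++; /* stand-in for reading RAS registers and tracing */
	return EV_OK | EV_STOP_MASK;
}
```

A plain PCIe port's event passes through unclaimed (EV_DONE), while a CXL
port's event is claimed and stops the walk.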
