lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241224183122.00004cf4@huawei.com>
Date: Tue, 24 Dec 2024 18:31:22 +0000
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: Terry Bowman <terry.bowman@....com>
CC: <linux-cxl@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<linux-pci@...r.kernel.org>, <nifan.cxl@...il.com>, <ming4.li@...el.com>,
	<dave@...olabs.net>, <dave.jiang@...el.com>, <alison.schofield@...el.com>,
	<vishal.l.verma@...el.com>, <dan.j.williams@...el.com>,
	<bhelgaas@...gle.com>, <mahesh@...ux.ibm.com>, <ira.weiny@...el.com>,
	<oohall@...il.com>, <Benjamin.Cheatham@....com>, <rrichter@....com>,
	<nathan.fontenot@....com>, <Smita.KoralahalliChannabasappa@....com>,
	<lukas@...ner.de>, <PradeepVineshReddy.Kodamati@....com>
Subject: Re: [PATCH v4 07/15] PCI/AER: Add CXL PCIe Port Uncorrectable Error
 recovery in AER service driver

On Wed, 11 Dec 2024 17:39:54 -0600
Terry Bowman <terry.bowman@....com> wrote:

> Existing recovery procedure for PCIe Uncorrectable Errors (UCE) does not
> apply to CXL devices. Recovery can not be used for CXL devices because of
> potential corruption on what can be system memory. Also, current PCIe UCE
> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
> does not begin at the RP/DSP but begins at the first downstream device.
> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
> CXL recovery is needed because of the different handling requirements
> 
> Add a new function, cxl_do_recovery() using the following.
> 
> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
> will begin iteration at the RP or DSP rather than beginning at the
> first downstream device.
I'm still not keen on this just from a subtlety making maintenance
harder point of view.  Guess I should find the time to make
the change to the PCI walker and see if anyone shouts that it is
a problem.

From other branch of thread sounds like you are reworking this patch
anyway so I'll wait for next version before reviewing closely.


Jonathan

> 
> Add cxl_report_error_detected() as an analog to report_error_detected().
> It will call pci_driver::cxl_err_handlers for each iterated downstream
> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
> indicating if there was a UCE error detected during handling.
> 
> cxl_do_recovery() uses the status from cxl_report_error_detected() to
> determine how to proceed. Non-fatal CXL UCE errors will be treated as
> fatal. If a UCE was present during handling then cxl_do_recovery()
> will kernel panic.
> 
> Signed-off-by: Terry Bowman <terry.bowman@....com>


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ