Message-ID: <20250115113756.00005561@huawei.com>
Date: Wed, 15 Jan 2025 11:37:56 +0000
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: "Bowman, Terry" <terry.bowman@....com>
CC: <linux-cxl@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<linux-pci@...r.kernel.org>, <nifan.cxl@...il.com>, <dave@...olabs.net>,
<dave.jiang@...el.com>, <alison.schofield@...el.com>,
<vishal.l.verma@...el.com>, <dan.j.williams@...el.com>,
<bhelgaas@...gle.com>, <mahesh@...ux.ibm.com>, <ira.weiny@...el.com>,
<oohall@...il.com>, <Benjamin.Cheatham@....com>, <rrichter@....com>,
<nathan.fontenot@....com>, <Smita.KoralahalliChannabasappa@....com>,
<lukas@...ner.de>, <ming.li@...omail.com>,
<PradeepVineshReddy.Kodamati@....com>, <alucerop@....com>
Subject: Re: [PATCH v5 07/16] PCI/AER: Add CXL PCIe Port uncorrectable error
recovery in AER service driver

On Tue, 14 Jan 2025 14:28:13 -0600
"Bowman, Terry" <terry.bowman@....com> wrote:
> On 1/14/2025 5:33 AM, Jonathan Cameron wrote:
> > On Tue, 7 Jan 2025 08:38:43 -0600
> > Terry Bowman <terry.bowman@....com> wrote:
> >
> >> Existing recovery procedure for PCIe uncorrectable errors (UCE) does not
> >> apply to CXL devices. Recovery cannot be used for CXL devices because of
> >> potential corruption on what can be system memory. Also, current PCIe UCE
> >> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
> >> does not begin at the RP/DSP but begins at the first downstream device.
> >> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
> >> CXL recovery is needed because of the different handling requirements.
> >>
> >> Add a new function, cxl_do_recovery(), using the following.
> >>
> >> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
> >> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
> >> will begin iteration at the RP or DSP rather than beginning at the
> >> first downstream device.
> > I'm still holding out for making pci_walk_bridge() do the same and seeing
> > what if anything breaks.
>
> I can test AER fatal UCE on a PCIe device. Do you have any other ideas for specific
> testing? A specific device or topology in mind?
It's the interaction with runtime power management usage that worries me and
might need wider testing. Maybe it is just a case of sending a patch marked
RFT.

The other paths are no-ops where it matters.

Jonathan
>
> Regards,
> Terry
>
> > Other than that I'm fine with this patch.
> >
> >> Add cxl_report_error_detected() as an analog to report_error_detected().
> >> It will call pci_driver::cxl_err_handlers for each iterated downstream
> >> device. The pci_driver::cxl_err_handlers UCE handler returns a boolean
> >> indicating whether a UCE was detected during handling.
> >>
> >> cxl_do_recovery() uses the status from cxl_report_error_detected() to
> >> determine how to proceed. Non-fatal CXL UCEs will be treated as
> >> fatal. If a UCE was present during handling then cxl_do_recovery()
> >> will trigger a kernel panic.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@....com>
>