Message-ID: <20250129180459.00007e2e@huawei.com>
Date: Wed, 29 Jan 2025 18:04:59 +0000
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: "Bowman, Terry" <terry.bowman@....com>
CC: <linux-cxl@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<linux-pci@...r.kernel.org>, <nifan.cxl@...il.com>, <dave@...olabs.net>,
<dave.jiang@...el.com>, <alison.schofield@...el.com>,
<vishal.l.verma@...el.com>, <dan.j.williams@...el.com>,
<bhelgaas@...gle.com>, <mahesh@...ux.ibm.com>, <ira.weiny@...el.com>,
<oohall@...il.com>, <Benjamin.Cheatham@....com>, <rrichter@....com>,
<nathan.fontenot@....com>, <Smita.KoralahalliChannabasappa@....com>,
<lukas@...ner.de>, <ming.li@...omail.com>,
<PradeepVineshReddy.Kodamati@....com>, <alucerop@....com>, Shuai Xue
<xueshuai@...ux.alibaba.com>
Subject: Re: [PATCH v5 06/16] PCI/AER: Change AER driver to read UCE fatal
status for all CXL PCIe Port devices

On Tue, 28 Jan 2025 14:25:54 -0600
"Bowman, Terry" <terry.bowman@....com> wrote:
> On 1/14/2025 5:32 AM, Jonathan Cameron wrote:
> > On Tue, 7 Jan 2025 08:38:42 -0600
> > Terry Bowman <terry.bowman@....com> wrote:
> >
> >> The AER service driver's aer_get_device_error_info() function doesn't read
> >> uncorrectable (UCE) fatal error status from PCIe Upstream Port devices,
> >> including CXL Upstream Switch Ports. As a result, fatal errors are not
> >> logged or handled as needed for CXL PCIe Upstream Switch Port devices.
> >>
> >> Update the aer_get_device_error_info() function to read the UCE fatal
> >> status for all CXL PCIe devices. Make the change such that non-CXL devices
> >> are not affected.
> >>
> >> The fatal error status will be used in future patches implementing
> >> CXL PCIe Port uncorrectable error handling and logging.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@....com>
> > This clashes with Shuai's series adding link healthy checks.
> > Maybe we can reuse that logic to incorporate the condition we
> > care about here?
> >
>
> Hi Jonathan, et al.,
>
> After looking at this more closely and considering the situation, I believe
> we should remove this patch from the patchset and defer adding these
> changes to log USP AER and RAS UCE.
>
> I propose we reintroduce this later as an RFC or RFT in a future patchset.
> This will allow more time for testing.
>
> The only downside of deferring is the CXL USP fatal UCE case: AER and RAS
> information will not be logged. That is the AER driver's existing behavior,
> though, so it isn't a regression.

If we have doubts and it is complex then sure. Let's do this in stages.

Jonathan

>
> Your thoughts?
>
> Regards,
> Terry
>
> >> ---
> >> drivers/pci/pcie/aer.c | 3 ++-
> >> 1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index 62be599e3bee..79c828bdcb6d 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -1253,7 +1253,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
> >>          } else if (type == PCI_EXP_TYPE_ROOT_PORT ||
> >>                     type == PCI_EXP_TYPE_RC_EC ||
> >>                     type == PCI_EXP_TYPE_DOWNSTREAM ||
> >> -                   info->severity == AER_NONFATAL) {
> >> +                   info->severity == AER_NONFATAL ||
> >> +                   (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)) {
> >>
> >>                  /* Link is still healthy for IO reads */
> >>                  pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
>
>
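
As a reading aid for the quoted hunk, the condition it touches can be
modelled as a small standalone predicate. The sketch below is user-space C,
not kernel code: every sketch_* name is hypothetical, the enum values are
stand-ins rather than the kernel's constants, and is_cxl merely models what
pcie_is_cxl(dev) from this series would report. The branch that precedes
this test in aer_get_device_error_info() is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration, not the kernel's definitions. */
enum sketch_port_type {
	SKETCH_ENDPOINT,
	SKETCH_ROOT_PORT,
	SKETCH_UPSTREAM,
	SKETCH_DOWNSTREAM,
	SKETCH_RC_EC,
};

enum sketch_severity {
	SKETCH_NONFATAL,
	SKETCH_FATAL,
};

struct sketch_dev {
	enum sketch_port_type type;
	bool is_cxl;	/* models what pcie_is_cxl(dev) would report */
};

/*
 * Models only the "else if" test from the quoted hunk: would the
 * uncorrectable status be read for this device and severity?
 * 'patched' selects the condition with the added CXL USP clause.
 */
static bool sketch_reads_uce_status(const struct sketch_dev *dev,
				    enum sketch_severity sev, bool patched)
{
	if (dev->type == SKETCH_ROOT_PORT ||
	    dev->type == SKETCH_RC_EC ||
	    dev->type == SKETCH_DOWNSTREAM ||
	    sev == SKETCH_NONFATAL)
		return true;

	return patched && dev->is_cxl && dev->type == SKETCH_UPSTREAM;
}

/* Walks the cases discussed in the thread. */
static void sketch_selfcheck(void)
{
	const struct sketch_dev cxl_usp = { SKETCH_UPSTREAM, true };
	const struct sketch_dev pcie_usp = { SKETCH_UPSTREAM, false };
	const struct sketch_dev dsp = { SKETCH_DOWNSTREAM, false };

	/* Existing behaviour: fatal UCE status on a USP is not read. */
	assert(!sketch_reads_uce_status(&cxl_usp, SKETCH_FATAL, false));
	/* With the proposed clause, only the CXL USP case changes. */
	assert(sketch_reads_uce_status(&cxl_usp, SKETCH_FATAL, true));
	assert(!sketch_reads_uce_status(&pcie_usp, SKETCH_FATAL, true));
	/* Non-fatal errors and downstream ports are read either way. */
	assert(sketch_reads_uce_status(&pcie_usp, SKETCH_NONFATAL, false));
	assert(sketch_reads_uce_status(&dsp, SKETCH_FATAL, false));
}
```

With the patch dropped, a fatal UCE on a CXL USP stays in the "not read"
case, which matches the existing behavior described above; calling
sketch_selfcheck() from a test program exercises each case.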