Message-ID: <3536d4ae-719c-4aab-b0bb-5c8a3781ca8f@amd.com>
Date: Wed, 27 Nov 2024 14:53:32 -0600
From: "Bowman, Terry" <terry.bowman@....com>
To: Jonathan Cameron <Jonathan.Cameron@...wei.com>
Cc: Lukas Wunner <lukas@...ner.de>, linux-cxl@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org,
nifan.cxl@...il.com, ming4.li@...el.com, dave@...olabs.net,
dave.jiang@...el.com, alison.schofield@...el.com, vishal.l.verma@...el.com,
dan.j.williams@...el.com, bhelgaas@...gle.com, mahesh@...ux.ibm.com,
ira.weiny@...el.com, oohall@...il.com, Benjamin.Cheatham@....com,
rrichter@....com, nathan.fontenot@....com,
Smita.KoralahalliChannabasappa@....com,
Shuai Xue <xueshuai@...ux.alibaba.com>, Keith Busch <kbusch@...nel.org>
Subject: Re: [PATCH v3 06/15] PCI/AER: Change AER driver to read UCE fatal
status for all CXL PCIe port devices
On 11/27/2024 11:05 AM, Jonathan Cameron wrote:
> On Thu, 21 Nov 2024 14:24:17 -0600
> "Bowman, Terry" <terry.bowman@....com> wrote:
>
>> On 11/15/2024 3:35 AM, Lukas Wunner wrote:
>>> On Wed, Nov 13, 2024 at 03:54:20PM -0600, Terry Bowman wrote:
>>>> The AER service driver's aer_get_device_error_info() function doesn't read
>>>> uncorrectable (UCE) fatal error status from PCIe upstream port devices,
>>>> including CXL upstream switch ports. As a result, fatal errors are not
>>>> logged or handled as needed for CXL PCIe upstream switch port devices.
>>>>
>>>> Update the aer_get_device_error_info() function to read the UCE fatal
>>>> status for all CXL PCIe port devices. Make the change to not affect
>>>> non-CXL PCIe devices.
>>>>
>>>> The fatal error status will be used in future patches implementing
>>>> CXL PCIe port uncorrectable error handling and logging.
>>> [...]
>>>> --- a/drivers/pci/pcie/aer.c
>>>> +++ b/drivers/pci/pcie/aer.c
>>>> @@ -1250,7 +1250,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>>>> } else if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>>> type == PCI_EXP_TYPE_RC_EC ||
>>>> type == PCI_EXP_TYPE_DOWNSTREAM ||
>>>> - info->severity == AER_NONFATAL) {
>>>> + info->severity == AER_NONFATAL ||
>>>> + (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)) {
>>>>
>>>> /* Link is still healthy for IO reads */
>>>> pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
>>> Just a heads-up, there's another patch pending by Shuai Xue (+cc)
>>> which touches the same code lines. It re-enables error reporting
>>> for PCIe Upstream Ports (as well as Endpoints) under certain
>>> conditions:
>>>
>>> https://lore.kernel.org/all/20241112135419.59491-3-xueshuai@linux.alibaba.com/
>>>
>>> That was originally disabled by Keith Busch (+cc) with commit
>>> 9d938ea53b26 ("PCI/AER: Don't read upstream ports below fatal errors").
>>>
>>> There's some merge conflict potential here if your series goes into
>>> the cxl tree and Shuai's patch into the pci tree in the next cycle.
>>>
>>> Thanks,
>>>
>>> Lukas
>> Thanks Lukas I took a look at the patchset and reached out to Shuai (you're CC'd). Sorry, I thought
>> I responded here earlier.
> I'm guessing we might not need this change if we can base querying on the
> link being good. If the error is on the CXL protocol side, the link should
> still be fine I think?
>
> Jonathan
Hi Jonathan,

Shuai determines upstream link viability with a call to
pciehp_check_link_active() in dpc.c. However, his patchset does not
determine link viability dynamically for the call to
aer_get_device_error_info(). I suppose we could add that check for CXL
devices and continue to isolate the new logic from non-CXL PCIe devices.
Your thoughts?

The brief discussion with Shuai is here:
https://lore.kernel.org/linux-pci/11282df5-9126-4b5b-82ae-5f1ef3b8aaf5@linux.alibaba.com/

Regards,
Terry
>> Regards,
>> Terry
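
For reference, the condition changed by the quoted hunk can be modeled in a
standalone user-space sketch. This is not kernel code: the enums and the
reads_uncor_status() helper are hypothetical stand-ins for the kernel's
PCI_EXP_TYPE_* and AER_* constants and for pcie_is_cxl(), and only the
boolean expression from the patch is reproduced. It assumes the correctable
case has already been handled by an earlier branch, as in the real function.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for PCI_EXP_TYPE_* (not the kernel's values). */
enum port_type {
	TYPE_ENDPOINT,
	TYPE_ROOT_PORT,
	TYPE_RC_EC,
	TYPE_UPSTREAM,
	TYPE_DOWNSTREAM,
};

/* Hypothetical stand-ins for AER_NONFATAL / AER_FATAL. */
enum severity {
	SEV_NONFATAL,
	SEV_FATAL,
};

/*
 * Mirrors the else-if condition from the patch: return true when the
 * uncorrectable status register would be read for this device.
 * is_cxl models the result of pcie_is_cxl(dev).
 */
static bool reads_uncor_status(enum port_type type, enum severity sev,
			       bool is_cxl)
{
	return type == TYPE_ROOT_PORT ||
	       type == TYPE_RC_EC ||
	       type == TYPE_DOWNSTREAM ||
	       sev == SEV_NONFATAL ||
	       (is_cxl && type == TYPE_UPSTREAM);
}
```

Under this model, a fatal error on an upstream port is read only when the
port is CXL, so behavior for non-CXL upstream ports and for endpoints is
unchanged by the added clause.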