[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <SJ0PR11MB674479B3386C96FF2CF85F7692C22@SJ0PR11MB6744.namprd11.prod.outlook.com>
Date: Fri, 14 Jun 2024 03:32:10 +0000
From: "Duan, Zhenzhong" <zhenzhong.duan@...el.com>
To: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@...ux.intel.com>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>
CC: "linuxppc-dev@...ts.ozlabs.org" <linuxppc-dev@...ts.ozlabs.org>,
"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
"rafael@...nel.org" <rafael@...nel.org>, "lenb@...nel.org" <lenb@...nel.org>,
"james.morse@....com" <james.morse@....com>, "Luck, Tony"
<tony.luck@...el.com>, "bp@...en8.de" <bp@...en8.de>, "dave@...olabs.net"
<dave@...olabs.net>, "jonathan.cameron@...wei.com"
<jonathan.cameron@...wei.com>, "Jiang, Dave" <dave.jiang@...el.com>,
"Schofield, Alison" <alison.schofield@...el.com>, "Verma, Vishal L"
<vishal.l.verma@...el.com>, "Weiny, Ira" <ira.weiny@...el.com>,
"bhelgaas@...gle.com" <bhelgaas@...gle.com>, "helgaas@...nel.org"
<helgaas@...nel.org>, "mahesh@...ux.ibm.com" <mahesh@...ux.ibm.com>,
"oohall@...il.com" <oohall@...il.com>, "linmiaohe@...wei.com"
<linmiaohe@...wei.com>, "shiju.jose@...wei.com" <shiju.jose@...wei.com>,
"Preble, Adam C" <adam.c.preble@...el.com>, "lukas@...ner.de"
<lukas@...ner.de>, "Smita.KoralahalliChannabasappa@....com"
<Smita.KoralahalliChannabasappa@....com>, "rrichter@....com"
<rrichter@....com>, "linux-cxl@...r.kernel.org" <linux-cxl@...r.kernel.org>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Tsaur, Erwin"
<erwin.tsaur@...el.com>, "Williams, Dan J" <dan.j.williams@...el.com>,
"Wanyan, Feiting" <feiting.wanyan@...el.com>, "Wang, Yudong"
<yudong.wang@...el.com>, "Peng, Chao P" <chao.p.peng@...el.com>,
"qingshun.wang@...ux.intel.com" <qingshun.wang@...ux.intel.com>
Subject: RE: [PATCH v4 3/3] PCI/AER: Clear UNCOR_STATUS bits that might be
ANFE
>-----Original Message-----
>From: Kuppuswamy Sathyanarayanan
><sathyanarayanan.kuppuswamy@...ux.intel.com>
>Subject: Re: [PATCH v4 3/3] PCI/AER: Clear UNCOR_STATUS bits that might
>be ANFE
>
>
>On 6/13/24 7:40 PM, Duan, Zhenzhong wrote:
>>
>>> -----Original Message-----
>>> From: Kuppuswamy, Sathyanarayanan
>>> Subject: Re: [PATCH v4 3/3] PCI/AER: Clear UNCOR_STATUS bits that
>might
>>> be ANFE
>>>
>>>
>>> On 5/9/24 1:48 AM, Zhenzhong Duan wrote:
>>>> When processing an ANFE, ideally both correctable error(CE) status and
>>>> uncorrectable error(UE) status should be cleared. However, there is no
>>>> way to fully identify the UE associated with ANFE. Even worse, Non-Fatal
>>>> Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as
>>>> NFE will bring some issues, i.e., breaking softwore probing; treating
>>> /s/softwore/software
>> Good catch, will fix. It's strange 'checkpatch --codespell' doesn't catch this.
>>
>>> May be this is already discussed. But can you explain why treating
>>> AFNE as non-fatal error will bring probing issues?
>> Copied below from spec 6.1, 6.2.3.2.4, says it can results in a System Error.
>>
>> In some cases the detector of a non-fatal error is not the most appropriate
>agent to determine whether the error is
>> recoverable or not, or if it even needs any recovery action at all. For
>example, if software attempts to perform a
>> configuration read from a non-existent device or Function, the resulting UR
>Status in the Completion will signal the error
>> to software, and software does not need for the Completer in addition to
>signal the error by sending an ERR_NONFATAL
>> Message. In fact, on some platforms, signaling the error with
>ERR_NONFATAL results in a System Error, which breaks
>> normal software probing.
>>
>>>> NFE as ANFE will make us ignoring some UEs which need active recover
>>> /s/ignoring/ignore
>> Will fix.
>>
>>>> operation. To avoid clearing UEs that are not ANFE by accident, the
>>>> most conservative route is taken here: If any of the NFE Detected bits
>>>> is set in Device Status, do not touch UE status, they should be cleared
>>>> later by the UE handler. Otherwise, a specific set of UEs that may be
>>>> raised as ANFE according to the PCIe specification will be cleared if
>>>> their corresponding severity is Non-Fatal.
>>>>
>>>> For instance, previously when kernel receives an ANFE with Poisoned
>TLP
>>>> in OS native AER mode, only status of CE will be reported and cleared:
>>>>
>>>> AER: Correctable error message received from 0000:b7:02.0
>>>> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver
>ID)
>>>> device [8086:0db0] error status/mask=00002000/00000000
>>>> [13] NonFatalErr
>>>>
>>>> If the kernel receives a Malformed TLP after that, two UEs will be
>>>> reported, which is unexpected. Malformed TLP Header is lost since
>>>> the previous ANFE gated the TLP header logs:
>>>>
>>>> PCIe Bus Error: severity="Uncorrectable (Fatal), type=Transaction Layer,
>>> (Receiver ID)
>>>> device [8086:0db0] error status/mask=00041000/00180020
>>>> [12] TLP (First)
>>>> [18] MalfTLP
>>>>
>>>> Now, for the same scenario, both CE status and related UE status will be
>>>> reported and cleared after ANFE:
>>>>
>>>> AER: Correctable error message received from 0000:b7:02.0
>>>> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver
>ID)
>>>> device [8086:0db0] error status/mask=00002000/00000000
>>>> [13] NonFatalErr
>>>> Uncorrectable errors that may cause Advisory Non-Fatal:
>>>> [18] TLP
>>>>
>>>> Tested-by: Yudong Wang <yudong.wang@...el.com>
>>>> Co-developed-by: "Wang, Qingshun" <qingshun.wang@...ux.intel.com>
>>>> Signed-off-by: "Wang, Qingshun" <qingshun.wang@...ux.intel.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@...el.com>
>>>> ---
>>>> drivers/pci/pcie/aer.c | 7 ++++++-
>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>>> index ed435f09ac27..6a6a3a40569a 100644
>>>> --- a/drivers/pci/pcie/aer.c
>>>> +++ b/drivers/pci/pcie/aer.c
>>>> @@ -1115,9 +1115,14 @@ static void pci_aer_handle_error(struct
>>> pci_dev *dev, struct aer_err_info *info)
>>>> * Correctable error does not need software intervention.
>>>> * No need to go through error recovery process.
>>>> */
>>>> - if (aer)
>>>> + if (aer) {
>>>> pci_write_config_dword(dev, aer +
>>> PCI_ERR_COR_STATUS,
>>>> info->status);
>>>> + if (info->anfe_status)
>>>> + pci_write_config_dword(dev,
>>>> + aer +
>>> PCI_ERR_UNCOR_STATUS,
>>>> + info->anfe_status);
>>>> + }
>>> Why split the handling part and storing part into two patches? Why not
>>> merge
>>> this part of patch 1/3.
>> This is based on Bjorn's suggestion at https://www.spinics.net/lists/linux-
>pci/msg149012.html,
>> clearing UNCOR_STATUS might be more important, deserve to raise out.
>
>I think Bjorn's suggestion is to divide it into two logical patches.
>One for printing the error and another to clear the UNCOR_STATUS
>properly. But currently you have split the UNCOR_STATUS status caching and
>clearing process into two patches. IMO, your first patch can store ANFE
>status and clear it. You can add print support in the second patch.
OK, I'll merge patch1 with patch3 in next version.
>
>
>Code wise it looks fine to me. You can add my Reviewed-by after fixing
>the typos
>
>Reviewed-by: Kuppuswamy Sathyanarayanan
><sathyanarayanan.kuppuswamy@...ux.intel.com>
Thanks
zhenzhong
>
>>
>> Thanks
>> Zhenzhong
>
>--
>Sathyanarayanan Kuppuswamy
>Linux Kernel Developer
Powered by blists - more mailing lists