Message-ID: <SJ0PR11MB6744795B3426AA104D504B9992F22@SJ0PR11MB6744.namprd11.prod.outlook.com>
Date: Wed, 29 May 2024 05:32:50 +0000
From: "Duan, Zhenzhong" <zhenzhong.duan@...el.com>
To: "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>
CC: "linuxppc-dev@...ts.ozlabs.org" <linuxppc-dev@...ts.ozlabs.org>,
"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
"rafael@...nel.org" <rafael@...nel.org>, "lenb@...nel.org" <lenb@...nel.org>,
"james.morse@....com" <james.morse@....com>, "Luck, Tony"
<tony.luck@...el.com>, "bp@...en8.de" <bp@...en8.de>, "dave@...olabs.net"
<dave@...olabs.net>, "jonathan.cameron@...wei.com"
<jonathan.cameron@...wei.com>, "Jiang, Dave" <dave.jiang@...el.com>,
"Schofield, Alison" <alison.schofield@...el.com>, "Verma, Vishal L"
<vishal.l.verma@...el.com>, "Weiny, Ira" <ira.weiny@...el.com>,
"bhelgaas@...gle.com" <bhelgaas@...gle.com>, "helgaas@...nel.org"
<helgaas@...nel.org>, "mahesh@...ux.ibm.com" <mahesh@...ux.ibm.com>,
"oohall@...il.com" <oohall@...il.com>, "linmiaohe@...wei.com"
<linmiaohe@...wei.com>, "shiju.jose@...wei.com" <shiju.jose@...wei.com>,
"Preble, Adam C" <adam.c.preble@...el.com>, "lukas@...ner.de"
<lukas@...ner.de>, "Smita.KoralahalliChannabasappa@....com"
<Smita.KoralahalliChannabasappa@....com>, "rrichter@....com"
<rrichter@....com>, "linux-cxl@...r.kernel.org" <linux-cxl@...r.kernel.org>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Tsaur, Erwin"
<erwin.tsaur@...el.com>, "Kuppuswamy, Sathyanarayanan"
<sathyanarayanan.kuppuswamy@...el.com>, "Williams, Dan J"
<dan.j.williams@...el.com>, "Wanyan, Feiting" <feiting.wanyan@...el.com>,
"Wang, Yudong" <yudong.wang@...el.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "qingshun.wang@...ux.intel.com"
<qingshun.wang@...ux.intel.com>
Subject: RE: [PATCH v4 0/3] PCI/AER: Handle Advisory Non-Fatal error
Hi,
Gentle ping on this series.
I would appreciate any comments or suggestions so that I can move forward.
Thanks
Zhenzhong
>-----Original Message-----
>From: Duan, Zhenzhong <zhenzhong.duan@...el.com>
>Subject: [PATCH v4 0/3] PCI/AER: Handle Advisory Non-Fatal error
>
>Hi,
>
>This continues Qingshun's v2 [1], but narrows the focus to ANFE processing,
>as the subject suggests, and drops the trace event for now. Doing extra
>I/O to read PCIe registers only for tracing seems a bit heavy, and I have
>not seen a request for it from the community so far.
>
>According to PCIe Base Specification Revision 6.1, Sections 6.2.3.2.4 and
>6.2.4.3, certain uncorrectable errors will signal ERR_COR instead of
>ERR_NONFATAL, are logged as Advisory Non-Fatal Error (ANFE), and set bits
>in both the Correctable Error (CE) Status register and the Uncorrectable
>Error (UE) Status register. Currently, when handling AER events, the
>kernel looks at either the CE status or the UE status, but never both. In
>the ANFE case, bits set in the UE status register are neither reported
>nor cleared until the next Fatal or Non-Fatal error (FE/NFE) arrives.
>
>For instance, when the kernel previously received an ANFE with Poisoned
>TLP in OS native AER mode, only the CE status was reported and cleared:
>
> AER: Correctable error message received from 0000:b7:02.0
> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=00002000/00000000
> [13] NonFatalErr
>
>If the kernel receives a Malformed TLP after that, two UEs are reported,
>which is unexpected. The Malformed TLP header is also lost, since the
>previous ANFE gated the TLP header logs:
>
> PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=00041000/00180020
> [12] TLP (First)
> [18] MalfTLP
>
>To handle this case properly, calculate the UE status bits that are
>potentially related to the ANFE and save them in aer_err_info. Use this
>information to determine which status bits need to be cleared.
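
As a rough illustration only (this is not the code from the patches), the calculation described above could be sketched in standalone C as follows. The helper name anfe_candidate_bits() and the ANFE_CAPABLE mask are hypothetical. The bit values follow the AER register layout in the PCIe spec, and the set of error types that may be advisory is my reading of Section 6.2.3.2.4, so please treat it as an assumption:

```c
#include <stdint.h>

/* Correctable Error Status: Advisory Non-Fatal Error, bit 13. */
#define COR_ADV_NFAT    0x00002000u

/*
 * Uncorrectable Error Status bits that may be signaled as advisory.
 * Assumption: list taken from PCIe r6.1, Section 6.2.3.2.4.
 */
#define UNC_POISON_TLP  0x00001000u  /* bit 12: Poisoned TLP */
#define UNC_COMP_TIME   0x00004000u  /* bit 14: Completion Timeout */
#define UNC_COMP_ABORT  0x00008000u  /* bit 15: Completer Abort */
#define UNC_UNX_COMP    0x00010000u  /* bit 16: Unexpected Completion */
#define UNC_UNSUP       0x00100000u  /* bit 20: Unsupported Request */

#define ANFE_CAPABLE (UNC_POISON_TLP | UNC_COMP_TIME | UNC_COMP_ABORT | \
                      UNC_UNX_COMP | UNC_UNSUP)

/*
 * Hypothetical helper: given raw register values, return the UE status
 * bits that could have caused an ANFE. A bit qualifies when it is set,
 * not masked, configured as Non-Fatal (severity bit clear), and belongs
 * to an error type that can be advisory.
 */
uint32_t anfe_candidate_bits(uint32_t cor_status, uint32_t unc_status,
                             uint32_t unc_mask, uint32_t unc_severity)
{
    if (!(cor_status & COR_ADV_NFAT))
        return 0;   /* no ANFE logged, nothing to attribute */

    return unc_status & ~unc_mask & ~unc_severity & ANFE_CAPABLE;
}
```

With CE status 0x00002000, UE status 0x00001000, the UE mask 0x00180020 from the log above, and an assumed UE severity of 0x00462030, the helper returns 0x00001000, i.e. only the Poisoned TLP bit. A Fatal bit such as Malformed TLP is never selected, so it would still be handled through the normal UE path.
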
>
>Now, for the previous scenario, both CE status and related UE status will
>be reported and cleared after ANFE:
>
> AER: Correctable error message received from 0000:b7:02.0
> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=00002000/00000000
> [13] NonFatalErr
> Uncorrectable errors that may cause Advisory Non-Fatal:
> [12] TLP
>
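
For anyone cross-checking the status words in the logs above against the bit names, a small standalone helper that lists the set bit positions of a 32-bit status value might look like the sketch below. It is hypothetical and not kernel code; the name status_set_bits() is made up for this example:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Store the positions of the set bits of 'status' in 'bits', in ascending
 * order, and return how many were found. 'bits' needs room for 32 entries.
 */
size_t status_set_bits(uint32_t status, unsigned int bits[32])
{
    size_t n = 0;

    for (unsigned int i = 0; i < 32; i++) {
        if (status & (UINT32_C(1) << i))
            bits[n++] = i;
    }
    return n;
}
```

For 0x00002000 this yields bit 13, and for 0x00041000 it yields bits 12 and 18, matching the bracketed numbers printed in the first two logs.
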
>Note:
>checkpatch.pl produces the following warnings on PATCH 2/3:
>
>WARNING: 'UE' may be misspelled - perhaps 'USE'?
>#22:
>uncorrectable error(UE) status should be cleared. However, there is no
>
>...similar warnings omitted...
>
>These are false positives, so they are not fixed.
>
>WARNING: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
>#10:
> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>
>...similar warnings omitted...
>
>For readability reasons, these warnings are not fixed.
>
>
>
>[1] https://lore.kernel.org/linux-pci/20240125062802.50819-1-qingshun.wang@...ux.intel.com
>
>Thanks
>Qingshun, Zhenzhong
>
>Changelog:
>v4:
> - Fix a race in anfe_get_uc_status() (Jonathan)
> - Add a comment to explain side effect of processing ANFE as NFE (Jonathan)
> - Drop the check for PCI_EXP_DEVSTA_NFED
>
>v3:
> - Split ANFE print and processing to two patches (Bjorn)
> - Simplify ANFE handling, drop trace event
> - Polish comments and patch description
> - Add Tested-by
>
>v2:
> - Reference to the latest PCIe Specification in both commit messages
> and comments, as suggested by Bjorn Helgaas.
> - Describe the reason for storing additional information in
> aer_err_info in the commit message of PATCH 1, as suggested by Bjorn
> Helgaas.
> - Add more details of behavior changes in the commit message of PATCH
> 2, as suggested by Bjorn Helgaas.
>
>v3: https://lore.kernel.org/lkml/20240417061407.1491361-1-zhenzhong.duan@...el.com
>v2: https://lore.kernel.org/linux-pci/20240125062802.50819-1-qingshun.wang@...ux.intel.com
>v1: https://lore.kernel.org/linux-pci/20240111073227.31488-1-qingshun.wang@...ux.intel.com
>
>Zhenzhong Duan (3):
> PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info
> PCI/AER: Print UNCOR_STATUS bits that might be ANFE
> PCI/AER: Clear UNCOR_STATUS bits that might be ANFE
>
> drivers/pci/pci.h | 1 +
> drivers/pci/pcie/aer.c | 75 +++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 75 insertions(+), 1 deletion(-)
>
>--
>2.34.1