Message-ID: <SJ0PR11MB67441DAC71325558C8881EEF92A62@SJ0PR11MB6744.namprd11.prod.outlook.com>
Date: Fri, 12 Jul 2024 09:56:36 +0000
From: "Duan, Zhenzhong" <zhenzhong.duan@...el.com>
To: "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
"bhelgaas@...gle.com" <bhelgaas@...gle.com>
CC: "mahesh@...ux.ibm.com" <mahesh@...ux.ibm.com>, "oohall@...il.com"
<oohall@...il.com>, "linuxppc-dev@...ts.ozlabs.org"
<linuxppc-dev@...ts.ozlabs.org>, "linux-acpi@...r.kernel.org"
<linux-acpi@...r.kernel.org>, "rafael@...nel.org" <rafael@...nel.org>,
"lenb@...nel.org" <lenb@...nel.org>, "james.morse@....com"
<james.morse@....com>, "Luck, Tony" <tony.luck@...el.com>, "bp@...en8.de"
<bp@...en8.de>, "dave@...olabs.net" <dave@...olabs.net>,
"jonathan.cameron@...wei.com" <jonathan.cameron@...wei.com>, "Jiang, Dave"
<dave.jiang@...el.com>, "Schofield, Alison" <alison.schofield@...el.com>,
"Verma, Vishal L" <vishal.l.verma@...el.com>, "Weiny, Ira"
<ira.weiny@...el.com>, "helgaas@...nel.org" <helgaas@...nel.org>,
"linmiaohe@...wei.com" <linmiaohe@...wei.com>, "shiju.jose@...wei.com"
<shiju.jose@...wei.com>, "Preble, Adam C" <adam.c.preble@...el.com>,
"lukas@...ner.de" <lukas@...ner.de>, "Smita.KoralahalliChannabasappa@....com"
<Smita.KoralahalliChannabasappa@....com>, "rrichter@....com"
<rrichter@....com>, "linux-cxl@...r.kernel.org" <linux-cxl@...r.kernel.org>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Tsaur, Erwin"
<erwin.tsaur@...el.com>, "Kuppuswamy, Sathyanarayanan"
<sathyanarayanan.kuppuswamy@...el.com>, "Williams, Dan J"
<dan.j.williams@...el.com>, "Wanyan, Feiting" <feiting.wanyan@...el.com>,
"Wang, Yudong" <yudong.wang@...el.com>, "Peng, Chao P"
<chao.p.peng@...el.com>, "qingshun.wang@...ux.intel.com"
<qingshun.wang@...ux.intel.com>
Subject: RE: [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
Hi Bjorn,
Gentle ping: this series has Reviewed-by tags and has had no further comments for a month.
Would you consider picking it up, or are further improvements needed?
I look forward to your suggestions.
Thanks
Zhenzhong
>-----Original Message-----
>From: Duan, Zhenzhong <zhenzhong.duan@...el.com>
>Subject: [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
>
>Hi,
>
>This series continues Qingshun's v2 [1], but is narrowed to ANFE processing,
>as the subject suggests, and drops the trace event for now. I think doing
>extra I/O to read PCIe registers purely for tracing is a bit heavy, and I
>have not seen a community request for it so far.
>
>According to PCIe Base Specification Revision 6.1, Sections 6.2.3.2.4 and
>6.2.4.3, certain uncorrectable errors signal ERR_COR instead of
>ERR_NONFATAL, are logged as Advisory Non-Fatal Errors (ANFE), and set bits
>in both the Correctable Error (CE) Status register and the Uncorrectable
>Error (UE) Status register. Currently, when handling AER events, the kernel
>looks at either the CE status or the UE status, but never both. In the ANFE
>case, bits set in the UE status register are neither reported nor cleared
>until the next FE/NFE arrives.
>
>For instance, before this series, when the kernel receives an ANFE caused
>by a Poisoned TLP in OS native AER mode, only the CE status is reported and
>cleared:
>
> AER: Correctable error message received from 0000:b7:02.0
> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=00002000/00000000
> [13] NonFatalErr
>
>If the kernel receives a Malformed TLP after that, two UEs will be
>reported, which is unexpected. The Malformed TLP Header is lost since
>the previous ANFE gated the TLP header logs:
>
> PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=00041000/00180020
> [12] TLP (First)
> [18] MalfTLP
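
For readers following the register values: the status words in the two logs above decode by bit position. 0x00002000 is bit 13 alone, and 0x00041000 is bits 12 and 18, which is why two UEs show up in the second log. A minimal user-space sketch of that decoding follows. status_bits() is a hypothetical helper written for illustration only; it is not a kernel function and not part of this series.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): store the positions of the set
 * bits of a 32-bit AER status word in out[], lowest bit first, and return
 * how many positions were stored. At most max entries are written.
 */
static size_t status_bits(uint32_t status, int *out, size_t max)
{
	size_t n = 0;

	for (int bit = 0; bit < 32 && n < max; bit++) {
		if (status & (UINT32_C(1) << bit))
			out[n++] = bit;
	}
	return n;
}
```

Calling status_bits(0x00041000, buf, 32) fills buf with 12 and 18, the two bit numbers printed in the log.
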
>
>To handle this case properly, calculate the UE status bits that may belong
>to an ANFE and save them in aer_err_info. Use this information to determine
>which status bits need to be cleared.
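
The calculation described above can be sketched in user space roughly as follows. This is an illustration under stated assumptions, not the code in the patches: the function name anfe_candidates() and the macro ANFE_UNC_CANDIDATES are hypothetical, and the choice of candidate bits (Poisoned TLP, Completion Timeout, Completer Abort, Unexpected Completion, Unsupported Request) as well as the filtering by the UE mask and severity registers are my reading of the spec sections cited. The PCI_ERR_* values match those in include/uapi/linux/pci_regs.h.

```c
#include <stdint.h>

/* Bit values as defined in Linux's include/uapi/linux/pci_regs.h. */
#define PCI_ERR_UNC_POISON_TLP	0x00001000u	/* Poisoned TLP */
#define PCI_ERR_UNC_COMP_TIME	0x00004000u	/* Completion Timeout */
#define PCI_ERR_UNC_COMP_ABORT	0x00008000u	/* Completer Abort */
#define PCI_ERR_UNC_UNX_COMP	0x00010000u	/* Unexpected Completion */
#define PCI_ERR_UNC_UNSUP	0x00100000u	/* Unsupported Request */
#define PCI_ERR_COR_ADV_NFAT	0x00002000u	/* Advisory Non-Fatal */

/* Assumption: the UE bits that can be signaled as an ANFE. */
#define ANFE_UNC_CANDIDATES	(PCI_ERR_UNC_POISON_TLP | \
				 PCI_ERR_UNC_COMP_TIME |  \
				 PCI_ERR_UNC_COMP_ABORT | \
				 PCI_ERR_UNC_UNX_COMP |   \
				 PCI_ERR_UNC_UNSUP)

/*
 * Hypothetical sketch: return the UE status bits that may have caused the
 * Advisory Non-Fatal bit in the CE status register. A bit qualifies only
 * if it is set, not masked, configured as non-fatal (severity bit clear),
 * and in the candidate set. Returns 0 when the CE status does not report
 * an Advisory Non-Fatal error at all.
 */
static uint32_t anfe_candidates(uint32_t cor_status, uint32_t uncor_status,
				uint32_t uncor_mask, uint32_t uncor_severity)
{
	if (!(cor_status & PCI_ERR_COR_ADV_NFAT))
		return 0;

	return uncor_status & ~uncor_mask & ~uncor_severity &
	       ANFE_UNC_CANDIDATES;
}
```

With a CE status of 0x00002000, a UE status of 0x00001000, a UE mask of 0x00180020, and a severity register in which bit 12 is clear, the sketch returns 0x00001000, i.e. bit 12 (Poisoned TLP). Those are the bits that would be reported and cleared together with the CE status.
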
>
>Now, for the previous scenario, both CE status and related UE status will
>be reported and cleared after ANFE:
>
> AER: Correctable error message received from 0000:b7:02.0
> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=00002000/00000000
> [13] NonFatalErr
> Uncorrectable errors that may cause Advisory Non-Fatal:
> [12] TLP
>
>Note:
>checkpatch.pl produces the following warnings on patches 1 and 2:
>
>WARNING: 'UE' may be misspelled - perhaps 'USE'?
>#22:
>uncorrectable error(UE) status should be cleared. However, there is no
>
>...similar warnings omitted...
>
>These are false positives, so they are not fixed.
>
>WARNING: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
>#10:
> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>
>...similar warnings omitted...
>
>For readability reasons, these warnings are not fixed.
>
>
>
>[1] https://lore.kernel.org/linux-pci/20240125062802.50819-1-qingshun.wang@...ux.intel.com
>
>Thanks
>Qingshun, Zhenzhong
>
>Changelog:
>v5:
> - squash patch 1 and 3 (Kuppuswamy)
> - add comment about avoiding race and fix typo error (Kuppuswamy)
> - collect Jonathan and Kuppuswamy's R-b
>
>v4:
> - Fix a race in anfe_get_uc_status() (Jonathan)
> - Add a comment to explain side effect of processing ANFE as NFE (Jonathan)
> - Drop the check for PCI_EXP_DEVSTA_NFED
>
>v3:
> - Split ANFE print and processing to two patches (Bjorn)
> - Simplify ANFE handling, drop trace event
> - Polish comments and patch description
> - Add Tested-by
>
>v2:
> - Reference to the latest PCIe Specification in both commit messages
> and comments, as suggested by Bjorn Helgaas.
> - Describe the reason for storing additional information in
> aer_err_info in the commit message of PATCH 1, as suggested by Bjorn
> Helgaas.
> - Add more details of behavior changes in the commit message of PATCH
> 2, as suggested by Bjorn Helgaas.
>
>v4: https://lkml.org/lkml/2024/5/9/247
>v3: https://lore.kernel.org/lkml/20240417061407.1491361-1-zhenzhong.duan@...el.com
>v2: https://lore.kernel.org/linux-pci/20240125062802.50819-1-qingshun.wang@...ux.intel.com
>v1: https://lore.kernel.org/linux-pci/20240111073227.31488-1-qingshun.wang@...ux.intel.com
>
>
>Zhenzhong Duan (2):
> PCI/AER: Clear UNCOR_STATUS bits that might be ANFE
> PCI/AER: Print UNCOR_STATUS bits that might be ANFE
>
> drivers/pci/pci.h | 1 +
> drivers/pci/pcie/aer.c | 79 +++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 79 insertions(+), 1 deletion(-)
>
>--
>2.34.1