Message-Id: <20240417061407.1491361-4-zhenzhong.duan@intel.com>
Date: Wed, 17 Apr 2024 14:14:07 +0800
From: Zhenzhong Duan <zhenzhong.duan@...el.com>
To: linux-pci@...r.kernel.org,
	linuxppc-dev@...ts.ozlabs.org,
	linux-acpi@...r.kernel.org
Cc: rafael@...nel.org,
	lenb@...nel.org,
	james.morse@....com,
	tony.luck@...el.com,
	bp@...en8.de,
	dave@...olabs.net,
	jonathan.cameron@...wei.com,
	dave.jiang@...el.com,
	alison.schofield@...el.com,
	vishal.l.verma@...el.com,
	ira.weiny@...el.com,
	bhelgaas@...gle.com,
	helgaas@...nel.org,
	mahesh@...ux.ibm.com,
	oohall@...il.com,
	linmiaohe@...wei.com,
	shiju.jose@...wei.com,
	adam.c.preble@...el.com,
	leoyang.li@....com,
	lukas@...ner.de,
	Smita.KoralahalliChannabasappa@....com,
	rrichter@....com,
	linux-cxl@...r.kernel.org,
	linux-edac@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	erwin.tsaur@...el.com,
	sathyanarayanan.kuppuswamy@...el.com,
	dan.j.williams@...el.com,
	feiting.wanyan@...el.com,
	yudong.wang@...el.com,
	chao.p.peng@...el.com,
	qingshun.wang@...ux.intel.com,
	Zhenzhong Duan <zhenzhong.duan@...el.com>
Subject: [PATCH v3 3/3] PCI/AER: Clear UNCOR_STATUS bits that might be ANFE

When processing an Advisory Non-Fatal Error (ANFE), ideally both the
correctable error (CE) status and the uncorrectable error (UE) status
should be cleared. However, there is no way to fully identify the UE
associated with an ANFE. Even worse, a Fatal Error (FE) or Non-Fatal
Error (NFE) may set the same UE status bit as an ANFE. Treating an ANFE
as an NFE will reproduce the above-mentioned issue, i.e., breaking
software probing; treating an NFE as an ANFE will make us ignore some UEs
that need active recovery. To avoid clearing UEs that are not ANFE by
accident, the most conservative route is taken here: if any of the FE/NFE
Detected bits is set in Device Status, do not touch the UE status; it
should be cleared later by the UE handler. Otherwise, a specific set of
UEs that may be raised as ANFE according to the PCIe specification will
be cleared if their corresponding severity is Non-Fatal.
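
The rule above can be sketched in plain C as follows. This is an
illustration only, not code from this series: the sk_/SK_ names are
hypothetical, the bit values are written out locally rather than taken
from kernel headers, and the exact membership of the candidate mask is an
assumption that should be checked against the PCIe specification.

```c
#include <stdint.h>

/* Device Status "error detected" bits (hypothetical local names). */
#define SK_DEVSTA_NFED		0x0002u	/* Non-Fatal Error Detected */
#define SK_DEVSTA_FED		0x0004u	/* Fatal Error Detected */

/* Uncorrectable Error Status bits assumed to be possible ANFE causes. */
#define SK_UNC_POISON_TLP	(UINT32_C(1) << 12)
#define SK_UNC_COMP_TIME	(UINT32_C(1) << 14)
#define SK_UNC_COMP_ABORT	(UINT32_C(1) << 15)
#define SK_UNC_UNX_COMP		(UINT32_C(1) << 16)
#define SK_UNC_UNSUP		(UINT32_C(1) << 20)

#define SK_ANFE_CANDIDATES	(SK_UNC_POISON_TLP | SK_UNC_COMP_TIME | \
				 SK_UNC_COMP_ABORT | SK_UNC_UNX_COMP | \
				 SK_UNC_UNSUP)

/*
 * Return the UE status bits that look safe to clear while handling an
 * ANFE, or 0 if the UE status should be left for the UE handler.
 */
uint32_t sk_anfe_status(uint16_t devsta, uint32_t uncor_status,
			uint32_t uncor_severity)
{
	/* A real FE/NFE may be pending: do not touch the UE status. */
	if (devsta & (SK_DEVSTA_NFED | SK_DEVSTA_FED))
		return 0;

	/* Keep only candidate bits whose severity is Non-Fatal (bit clear). */
	return uncor_status & ~uncor_severity & SK_ANFE_CANDIDATES;
}
```

With devsta == 0, uncor_status == 0x00001000 and uncor_severity == 0, the
sketch returns 0x00001000 (Poisoned TLP); with the NFE Detected bit set it
returns 0, so the UE status is left alone.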

For instance, previously when the kernel receives an ANFE with Poisoned
TLP in OS native AER mode, only the CE status is reported and cleared:

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr

If the kernel receives a Malformed TLP after that, two UEs will be
reported, which is unexpected: the stale Poisoned TLP bit left over from
the ANFE and the new Malformed TLP. The Malformed TLP header is lost
because the previous ANFE gated the TLP header log:

  PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00041000/00180020
     [12] TLP                    (First)
     [18] MalfTLP
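
As a reading aid for the status words in these logs, a small stand-alone
C sketch can confirm which bit positions a status value carries. The
helper names are hypothetical and are not part of the kernel.

```c
#include <stdint.h>

/* Return non-zero if bit position 'bit' (0..31) is set in 'status'. */
int sk_bit_is_set(uint32_t status, unsigned int bit)
{
	return bit < 32 && (status & (UINT32_C(1) << bit)) != 0;
}

/* Count the set bits by repeatedly clearing the lowest one. */
unsigned int sk_count_set_bits(uint32_t status)
{
	unsigned int n = 0;

	while (status) {
		status &= status - 1;
		n++;
	}
	return n;
}
```

For the UE status 0x00041000 shown above, exactly two bits are set, 12
and 18, matching the two decoded lines; the CE status 0x00002000 has only
bit 13 set.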

Now, for the same scenario, both CE status and related UE status will be
reported and cleared after ANFE:

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr
    Uncorrectable errors that may cause Advisory Non-Fatal:
     [12] TLP

Tested-by: Yudong Wang <yudong.wang@...el.com>
Co-developed-by: "Wang, Qingshun" <qingshun.wang@...ux.intel.com>
Signed-off-by: "Wang, Qingshun" <qingshun.wang@...ux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@...el.com>
---
 drivers/pci/pcie/aer.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 870e1d1a5159..6ebe320eb0f7 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1115,9 +1115,14 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 		 * Correctable error does not need software intervention.
 		 * No need to go through error recovery process.
 		 */
-		if (aer)
+		if (aer) {
 			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
 					info->status);
+			if (info->anfe_status)
+				pci_write_config_dword(dev,
+						       aer + PCI_ERR_UNCOR_STATUS,
+						       info->anfe_status);
+		}
 		if (pcie_aer_is_native(dev)) {
 			struct pci_driver *pdrv = dev->driver;
 
-- 
2.34.1

