lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20171213015722.6722-60-alexander.levin@verizon.com>
Date:   Wed, 13 Dec 2017 01:57:43 +0000
From:   alexander.levin@...izon.com
To:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "stable@...r.kernel.org" <stable@...r.kernel.org>
Cc:     Gabriele Paoloni <gabriele.paoloni@...wei.com>,
        Dongdong Liu <liudongdong3@...wei.com>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        alexander.levin@...izon.com
Subject: [PATCH AUTOSEL for 4.9 085/100] PCI/AER: Report non-fatal errors only
 to the affected endpoint

From: Gabriele Paoloni <gabriele.paoloni@...wei.com>

[ Upstream commit 86acc790717fb60fb51ea3095084e331d8711c74 ]

Previously, if an non-fatal error was reported by an endpoint, we
called report_error_detected() for the endpoint, every sibling on the
bus, and their descendents.  If any of them did not implement the
.error_detected() method, do_recovery() failed, leaving all these
devices unrecovered.

For example, the system described in the bugzilla below has two devices:

  0000:74:02.0 [19e5:a230] SAS controller, driver has .error_detected()
  0000:74:03.0 [19e5:a235] SATA controller, driver lacks .error_detected()

When a device such as 74:02.0 reported a non-fatal error, do_recovery()
failed because 74:03.0 lacked an .error_detected() method.  But per PCIe
r3.1, sec 6.2.2.2.2, such an error does not compromise the Link and
does not affect 74:03.0:

  Non-fatal errors are uncorrectable errors which cause a particular
  transaction to be unreliable but the Link is otherwise fully functional.
  Isolating Non-fatal from Fatal errors provides Requester/Receiver logic
  in a device or system management software the opportunity to recover from
  the error without resetting the components on the Link and disturbing
  other transactions in progress.  Devices not associated with the
  transaction in error are not impacted by the error.

Report non-fatal errors only to the endpoint that reported them.  We really
want to check for AER_NONFATAL here, but the current code structure doesn't
allow that.  Looking for pci_channel_io_normal is the best we can do now.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=197055
Fixes: 6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver")
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@...wei.com>
Signed-off-by: Dongdong Liu <liudongdong3@...wei.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@...gle.com>

Signed-off-by: Sasha Levin <alexander.levin@...izon.com>
---
 drivers/pci/pcie/aer/aerdrv_core.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index b1303b32053f..057465adf0b6 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -390,7 +390,14 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 		 * If the error is reported by an end point, we think this
 		 * error is related to the upstream link of the end point.
 		 */
-		pci_walk_bus(dev->bus, cb, &result_data);
+		if (state == pci_channel_io_normal)
+			/*
+			 * the error is non fatal so the bus is ok, just invoke
+			 * the callback for the function that logged the error.
+			 */
+			cb(dev, &result_data);
+		else
+			pci_walk_bus(dev->bus, cb, &result_data);
 	}
 
 	return result_data.result;
-- 
2.11.0

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ