Date: Thu, 1 Sep 2016 07:52:00 +0000
From: Yuval Mintz <Yuval.Mintz@...gic.com>
To: "Guilherme G. Piccoli" <gpiccoli@...ux.vnet.ibm.com>
CC: netdev <netdev@...r.kernel.org>, Ariel Elior <Ariel.Elior@...gic.com>
Subject: RE: [PATCH net v2] bnx2x: don't reset chip on cleanup if PCI function is offline

> When PCI error is detected, in some architectures (like PowerPC) a slot reset is
> performed - the driver's error handlers are in charge of "disable"
> device before the reset, and re-enable it after a successful slot reset.
>
> There are two cases though that another path is taken on the code: if the slot
> reset is not successful or if too many errors already happened in the specific
> adapter (meaning that possibly the device is experiencing a HW failure that slot
> reset is not able to solve), the core PCI error mechanism (called EEH in PowerPC)
> will remove the adapter from the system, since it will consider this as a
> permanent failure on device. In this case, a path is taken that leads to
> bnx2x_chip_cleanup() calling bnx2x_reset_hw(), which then tries to perform a
> HW reset on chip. This reset won't succeed since the HW is in a fault state,
> which can be seen by multiple messages on kernel log like below:
>
> bnx2x: [bnx2x_issue_dmae_with_comp:552(eth1)]DMAE timeout!
> bnx2x: [bnx2x_write_dmae:600(eth1)]DMAE returned failure -1
>
> After some time, the PCI error mechanism gives up on waiting the driver's
> correct removal procedure and forcibly remove the adapter from the system.
> We can see soft lockup while core PCI error mechanism is waiting for driver to
> accomplish the right removal process.
>
> This patch adds a verification to avoid a chip reset whenever the function is in
> PCI error state - since this case is only reached when we have a device being
> removed because of a permanent failure, the HW chip reset is not expected to
> work fine neither is necessary.
>
> Also, as a minor improvement in error path, we avoid the MCP information
> dump in case of non-recoverable PCI error (when adapter is about to be
> removed), since it will certainly fail.
>
> Reported-by: Harsha Thyagaraja <hathyaga@...ibm.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@...ux.vnet.ibm.com>

Thanks.

Acked-By: Yuval Mintz <Yuval.Mintz@...gic.com>
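
For readers without the diff at hand, the guard described in the quoted commit message can be sketched roughly as follows. This is a user-space model, not the patch itself: the diff is not part of this message, so every name below (struct fake_dev, fake_pci_channel_offline(), fake_reset_hw(), fake_mcp_dump(), fake_chip_cleanup(), fake_error_dump()) is a hypothetical stand-in. In the kernel, the offline state would presumably be queried with pci_channel_offline() from <linux/pci.h>, but how the real patch does it is an assumption here.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the driver's per-device state. */
struct fake_dev {
    bool pci_offline;    /* true once the PCI core treats the failure as permanent */
    int hw_reset_calls;  /* counts attempted chip resets */
    int mcp_dump_calls;  /* counts attempted MCP information dumps */
};

/* Models a "is this PCI function offline?" query (assumed, not the real API). */
static bool fake_pci_channel_offline(const struct fake_dev *dev)
{
    return dev->pci_offline;
}

/* Models the HW reset that times out when the chip is in a fault state. */
static void fake_reset_hw(struct fake_dev *dev)
{
    dev->hw_reset_calls++;
}

/* Models the MCP information dump done in the error path. */
static void fake_mcp_dump(struct fake_dev *dev)
{
    dev->mcp_dump_calls++;
}

/* Cleanup path: only reset the chip if the PCI function is still reachable. */
static void fake_chip_cleanup(struct fake_dev *dev)
{
    if (fake_pci_channel_offline(dev)) {
        printf("PCI function offline, skipping chip reset\n");
        return;
    }
    fake_reset_hw(dev);
}

/* Error path: skip the MCP dump when the adapter is about to be removed. */
static void fake_error_dump(struct fake_dev *dev)
{
    if (fake_pci_channel_offline(dev))
        return;
    fake_mcp_dump(dev);
}
```

Calling fake_chip_cleanup() and fake_error_dump() on a device with pci_offline set leaves both counters at zero, which mirrors the intent stated above: no DMAE timeouts from a reset that cannot succeed, and no dump that would certainly fail.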