lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Ys2CeIqv//5ZGJTM@kbusch-mbp>
Date:   Tue, 12 Jul 2022 08:17:28 -0600
From:   Keith Busch <kbusch@...nel.org>
To:     Niklas Schnelle <schnelle@...ux.ibm.com>
Cc:     Christoph Hellwig <hch@....de>, Stefan Roese <sr@...x.de>,
        Matthew Rosato <mjrosato@...ux.ibm.com>,
        linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/1] nvme-pci: fix hang during error recovery when the
 PCI device is isolated

On Tue, Jul 12, 2022 at 02:44:53PM +0200, Niklas Schnelle wrote:
> On s390 and powerpc PCI devices are isolated when an error is detected
> and driver->err_handler->error_detected is called with an inaccessible
> PCI device and PCI channel state set to pci_channel_io_frozen
> (see Step 1 in Documentation/PCI/pci-error-recovery.rst).
> 
> In the case of NVMe devices nvme_error_detected() then calls
> nvme_dev_disable(dev, false) and requests a reset. After a successful
> reset the device is accessible again and nvme_slot_reset() resets the
> controller and queues nvme_reset_work() which then recovers the
> controller.
> 
> Since commit b98235d3a471 ("nvme-pci: harden drive presence detect in
> nvme_dev_disable()") however nvme_dev_disable() no longer freezes the
> queues if pci_device_is_present() returns false. This is the case for an
> isolated PCI device. In principle this makes sense as there are no
> accessible hardware queues to run. The problem though is that for
> a previously live reset controller with online queues nvme_reset_work()
> calls nvme_wait_freeze() which, without the freeze having been
> initiated, then hangs forever. Fix this by starting the freeze in
> nvme_slot_reset() which is the earliest point where we know the device
> should be accessible again.
> 
> Fixes: b98235d3a471 ("nvme-pci: harden drive presence detect in nvme_dev_disable()")
> Signed-off-by: Niklas Schnelle <schnelle@...ux.ibm.com>

Oh, we've messed up the expected sequence. The mistaken assumption is a device
not present means we're about to unbind from it, but it could appear that way
just for normal error handling and reset, so we need to preserve the previous
handling.

The offending commit really just wants to avoid the register access (which we
shouldn't have to do, but hey, broken hardware...). So let's keep the sequence
the same as before and just skip the register read. Does this work for you?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fdfee3e590db..c40e82cee735 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
 static void nvme_dev_remove_admin(struct nvme_dev *dev)
@@ -2690,9 +2772,11 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
 	mutex_lock(&dev->shutdown_lock);
-	if (pci_device_is_present(pdev) && pci_is_enabled(pdev)) {
-		u32 csts = readl(dev->bar + NVME_REG_CSTS);
+	if (pci_is_enabled(pdev)) {
+		u32 csts = ~0;
 
+		if (pci_device_is_present(pdev))
+			csts = readl(dev->bar + NVME_REG_CSTS);
 		if (dev->ctrl.state == NVME_CTRL_LIVE ||
 		    dev->ctrl.state == NVME_CTRL_RESETTING) {
 			freeze = true;
--

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ