Message-Id: <20220712124453.2227362-1-schnelle@linux.ibm.com>
Date:   Tue, 12 Jul 2022 14:44:52 +0200
From:   Niklas Schnelle <schnelle@...ux.ibm.com>
To:     Christoph Hellwig <hch@....de>, Keith Busch <kbusch@...nel.org>
Cc:     Stefan Roese <sr@...x.de>, Matthew Rosato <mjrosato@...ux.ibm.com>,
        linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: [PATCH 0/1] nvme-pci: fix hang during error recovery when the PCI device is isolated

Hi Christoph, Hi Keith,

I found a regression when recovering NVMe devices after a simulated PCI
error on s390, though I believe at least some POWER systems should be
affected as well. I tracked this down to commit b98235d3a471 ("nvme-pci:
harden drive presence detect in nvme_dev_disable()"). With that commit,
nvme_start_freeze() is not called before nvme_reset_work() calls
nvme_wait_freeze(), which then hangs forever. The detailed analysis is
included in the commit message and is not too complex, but I'm not entirely
sure my proposed solution is the correct one.
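
To illustrate the hang described above, here is a toy model in plain C. It
is not kernel code and not the proposed patch: every toy_* name is
hypothetical, and the model rests on the assumption that a freeze wait can
only complete once a freeze has been started and all in-flight requests
have drained.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a queue's freeze state (hypothetical, not kernel code). */
struct toy_queue {
	int freeze_depth;	/* freezes started and not yet undone */
	int in_flight;		/* requests still holding the queue */
};

/* Models starting a freeze: from here on the queue can drain to zero. */
static void toy_start_freeze(struct toy_queue *q)
{
	q->freeze_depth++;
}

/* Models outstanding requests completing or being cancelled. */
static void toy_drain(struct toy_queue *q)
{
	q->in_flight = 0;
}

/*
 * Models the wait: returns true if it would complete, false if it would
 * block forever. Assumption: without a started freeze the wait condition
 * is never satisfied, no matter how many requests have drained.
 */
static bool toy_wait_freeze_would_return(const struct toy_queue *q)
{
	return q->freeze_depth > 0 && q->in_flight == 0;
}

/* Models undoing one freeze after the wait has completed. */
static void toy_unfreeze(struct toy_queue *q)
{
	assert(q->freeze_depth > 0);
	q->freeze_depth--;
}

/*
 * Runs one simplified recovery: the disable step optionally starts the
 * freeze, then the reset step waits for it. Returns whether the reset
 * step would get past the wait.
 */
bool toy_recovery_completes(bool disable_started_freeze)
{
	struct toy_queue q = { .freeze_depth = 0, .in_flight = 3 };
	bool done;

	if (disable_started_freeze)
		toy_start_freeze(&q);
	toy_drain(&q);

	done = toy_wait_freeze_would_return(&q);
	if (done)
		toy_unfreeze(&q);
	return done;
}
```

In this model toy_recovery_completes(true) gets past the wait, while
toy_recovery_completes(false) corresponds to the case where the freeze was
never started and the wait never returns.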

The patch I'm sending here works for me and should at least only affect
platforms using the explicit driver->err_handler->slot_reset callback. As
far as I understand, the nvme_dev_disable() call in nvme_error_detected()
still does the necessary quiescing towards upper layers, and I assume that
nvme_start_freeze() won't do anything useful if the controller is
inaccessible, but I'm not an expert in this. In particular, I'm not sure it
makes sense to start freezing the queues right after a reset.

Also note that I will be travelling for about 3 weeks starting July 14th
and won't have access to s390 machines or my work mail address, so
apologies if I don't answer. Feel free to do your own fix. Matt (on CC)
might also be able to test fixes for this.

Best regards,
Niklas


Niklas Schnelle (1):
  nvme-pci: fix hang during error recovery when the PCI device is
    isolated

 drivers/nvme/host/pci.c | 1 +
 1 file changed, 1 insertion(+)

-- 
2.34.1
