[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <39098f08-696d-db4c-36ac-1199da95bc7c@suse.de>
Date: Tue, 12 Jul 2022 15:49:29 +0200
From: Hannes Reinecke <hare@...e.de>
To: Niklas Schnelle <schnelle@...ux.ibm.com>,
Christoph Hellwig <hch@....de>, Keith Busch <kbusch@...nel.org>
Cc: Stefan Roese <sr@...x.de>, Matthew Rosato <mjrosato@...ux.ibm.com>,
linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/1] nvme-pci: fix hang during error recovery when the PCI
device is isolated
On 7/12/22 14:44, Niklas Schnelle wrote:
> On s390 and powerpc PCI devices are isolated when an error is detected
> and driver->err_handler->error_detected is called with an inaccessible
> PCI device and PCI channel state set to pci_channel_io_frozen
> (see Step 1 in Documentation/PCI/pci-error-recovery.rst).
>
> In the case of NVMe devices nvme_error_detected() then calls
> nvme_dev_disable(dev, false) and requests a reset. After a successful
> reset the device is accessible again and nvme_slot_reset() resets the
> controller and queues nvme_reset_work() which then recovers the
> controller.
>
> Since commit b98235d3a471 ("nvme-pci: harden drive presence detect in
> nvme_dev_disable()") however nvme_dev_disable() no longer freezes the
> queues if pci_device_is_present() returns false. This is the case for an
> isolated PCI device. In principle this makes sense as there are no
> accessible hardware queues to run. The problem though is that for
> a previously live reset controller with online queues nvme_reset_work()
> calls nvme_wait_freeze() which, without the freeze having been
> initiated, then hangs forever. Fix this by starting the freeze in
> nvme_slot_reset() which is the earliest point where we know the device
> should be accessible again.
>
> Fixes: b98235d3a471 ("nvme-pci: harden drive presence detect in nvme_dev_disable()")
> Signed-off-by: Niklas Schnelle <schnelle@...ux.ibm.com>
> ---
> drivers/nvme/host/pci.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 193b44755662..7c0c61b74c30 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3399,6 +3399,7 @@ static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
> dev_info(dev->ctrl.device, "restart after slot reset\n");
> pci_restore_state(pdev);
> nvme_reset_ctrl(&dev->ctrl);
> + nvme_start_freeze(&dev->ctrl);
> return PCI_ERS_RESULT_RECOVERED;
> }
>
I am not sure if that's the right fix.
From your description the hang occurs as nvme_reset_ctrl() is calling
nvme_wait_freeze() without an corresponding nvme_start_freeze().
So why are you calling it _after_ the call to nvme_reset_ctrl()?
Cheers,
Hannes
Powered by blists - more mailing lists