linux-kernel - Re: [PATCH 1/1] nvme-pci: fix hang during error recovery when the PCI device is isolated

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <39098f08-696d-db4c-36ac-1199da95bc7c@suse.de>
Date:   Tue, 12 Jul 2022 15:49:29 +0200
From:   Hannes Reinecke <hare@...e.de>
To:     Niklas Schnelle <schnelle@...ux.ibm.com>,
        Christoph Hellwig <hch@....de>, Keith Busch <kbusch@...nel.org>
Cc:     Stefan Roese <sr@...x.de>, Matthew Rosato <mjrosato@...ux.ibm.com>,
        linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/1] nvme-pci: fix hang during error recovery when the PCI
 device is isolated

On 7/12/22 14:44, Niklas Schnelle wrote:
> On s390 and powerpc PCI devices are isolated when an error is detected
> and driver->err_handler->error_detected is called with an inaccessible
> PCI device and PCI channel state set to pci_channel_io_frozen
> (see Step 1 in Documentation/PCI/pci-error-recovery.rst).
> 
> In the case of NVMe devices nvme_error_detected() then calls
> nvme_dev_disable(dev, false) and requests a reset. After a successful
> reset the device is accessible again and nvme_slot_reset() resets the
> controller and queues nvme_reset_work() which then recovers the
> controller.
> 
> Since commit b98235d3a471 ("nvme-pci: harden drive presence detect in
> nvme_dev_disable()") however nvme_dev_disable() no longer freezes the
> queues if pci_device_is_present() returns false. This is the case for an
> isolated PCI device. In principle this makes sense as there are no
> accessible hardware queues to run. The problem though is that for
> a previously live reset controller with online queues nvme_reset_work()
> calls nvme_wait_freeze() which, without the freeze having been
> initiated, then hangs forever. Fix this by starting the freeze in
> nvme_slot_reset() which is the earliest point where we know the device
> should be accessible again.
> 
> Fixes: b98235d3a471 ("nvme-pci: harden drive presence detect in nvme_dev_disable()")
> Signed-off-by: Niklas Schnelle <schnelle@...ux.ibm.com>
> ---
>   drivers/nvme/host/pci.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 193b44755662..7c0c61b74c30 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3399,6 +3399,7 @@ static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
>   	dev_info(dev->ctrl.device, "restart after slot reset\n");
>   	pci_restore_state(pdev);
>   	nvme_reset_ctrl(&dev->ctrl);
> +	nvme_start_freeze(&dev->ctrl);
>   	return PCI_ERS_RESULT_RECOVERED;
>   }
>   
I am not sure if that's the right fix.
 From your description the hang occurs as nvme_reset_ctrl() is calling 
nvme_wait_freeze() without an corresponding nvme_start_freeze().
So why are you calling it _after_ the call to nvme_reset_ctrl()?

Cheers,

Hannes