linux-kernel - Re: [RESEND PATCH] vfio/pci: Skip hot reset on Link-Down

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251223153534.0968cc15.alex@shazbot.org>
Date: Tue, 23 Dec 2025 15:35:34 -0700
From: Alex Williamson <alex@...zbot.org>
To: "Jinhui Guo" <guojinhui.liam@...edance.com>
Cc: <kvm@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
 "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>
Subject: Re: [RESEND PATCH] vfio/pci: Skip hot reset on Link-Down

On Mon, 15 Dec 2025 20:30:29 +0800
"Jinhui Guo" <guojinhui.liam@...edance.com> wrote:

> On hot-pluggable ports, simultaneous surprise removal of multiple
> PCIe endpoints whether by pulling the card, powering it off, or
> dropping the link can trigger a system deadlock.

I think this only identifies one small aspect of the problems with
surprise removal and vfio-pci.  It's not just the release path of the
device that can trigger a reset, there are various user accessible
paths as well, ex. the vfio reset and hot-reset ioctls.  I think those
can trigger this same deadlock.

Beyond reset, CPU and DMA mappings to the device are still present after
a surprise removal.  The latter can really only be revoked using the
new dma-buf support for MMIO regions.

I think we should take a more comprehensive look at enabling vfio-pci to
support surprise removal beyond this one case where a cooperative guest
promptly released the device and encountered a deadlock.

In doing so, I think we're going to see several more cases where we
should test for a disconnected device before reset, some of those may
suggest that PCI-core is actually the better place for the test rather
than the leaf caller.  Thanks,

Alex

> Example: two PCIe endpoints are bound to vfio-pci and opened by
> the same process (fdA for device A, fdB for device B).
> 
> 1. A PCIe-fault brings B's link down, then A's.
> 2. The PCI core starts removing B:
>    - pciehp_unconfigure_device() takes pci_rescan_remove_lock
>    - vfio-pci's remove routine waits for fdB to be closed
> 3. While B is stuck, the core removes A:
>    - pciehp_ist() takes the read side of reset_lock A
>    - It blocks on pci_rescan_remove_lock already held by B
> 4. Killing the process closes fdA first (because it was opened first).
>    vfio_pci_core_close_device() tries to hot-reset A, so it needs
>    the write side of reset_lock A.
> 5. The write request sleeps until the read lock from step 3 is
>    released, but that reader is itself waiting for B's lock
>    -> deadlock.  
> 
> The stuck thread's backtrace is as follows:
>   /proc/1909/stack
>     [<0>] vfio_unregister_group_dev+0x99/0xf0 [vfio]
>     [<0>] vfio_pci_core_unregister_device+0x19/0xb0 [vfio_pci_core]
>     [<0>] vfio_pci_remove+0x15/0x20 [vfio_pci]
>     [<0>] pci_device_remove+0x3e/0xb0
>     [<0>] device_release_driver_internal+0x19b/0x200
>     [<0>] pci_stop_bus_device+0x6d/0x90
>     [<0>] pci_stop_and_remove_bus_device+0xe/0x20
>     [<0>] pciehp_unconfigure_device+0x8c/0x150
>     [<0>] pciehp_disable_slot+0x68/0x140
>     [<0>] pciehp_handle_presence_or_link_change+0x246/0x4c0
>     [<0>] pciehp_ist+0x244/0x280
>     [<0>] irq_thread_fn+0x1f/0x60
>     [<0>] irq_thread+0x1ac/0x290
>     [<0>] kthread+0xfa/0x240
>     [<0>] ret_from_fork+0x209/0x260
>     [<0>] ret_from_fork_asm+0x1a/0x30
>   /proc/1910/stack
>     [<0>] pciehp_unconfigure_device+0x43/0x150
>     [<0>] pciehp_disable_slot+0x68/0x140
>     [<0>] pciehp_handle_presence_or_link_change+0x246/0x4c0
>     [<0>] pciehp_ist+0x244/0x280
>     [<0>] irq_thread_fn+0x1f/0x60
>     [<0>] irq_thread+0x1ac/0x290
>     [<0>] kthread+0xfa/0x240
>     [<0>] ret_from_fork+0x209/0x260
>     [<0>] ret_from_fork_asm+0x1a/0x30
>   /proc/6765/stack
>     [<0>] pciehp_reset_slot+0x2c/0x70
>     [<0>] pci_reset_hotplug_slot+0x3e/0x60
>     [<0>] pci_reset_bus_function+0xcd/0x180
>     [<0>] cxl_reset_bus_function+0xc8/0x110
>     [<0>] __pci_reset_function_locked+0x4f/0xd0
>     [<0>] vfio_pci_core_disable+0x381/0x400 [vfio_pci_core]
>     [<0>] vfio_pci_core_close_device+0x63/0xd0 [vfio_pci_core]
>     [<0>] vfio_df_close+0x48/0x80 [vfio]
>     [<0>] vfio_df_group_close+0x32/0x70 [vfio]
>     [<0>] vfio_device_fops_release+0x1d/0x40 [vfio]
>     [<0>] __fput+0xe6/0x2b0
>     [<0>] task_work_run+0x58/0x90
>     [<0>] do_exit+0x29b/0xa80
>     [<0>] do_group_exit+0x2c/0x80
>     [<0>] get_signal+0x8f9/0x900
>     [<0>] arch_do_signal_or_restart+0x29/0x210
>     [<0>] exit_to_user_mode_loop+0x8e/0x4f0
>     [<0>] do_syscall_64+0x262/0x630
>     [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> The deadlock blocks PCI operations (lspci, sysfs removal, unplug, etc.) and
> sends system load soaring as PCI-related work stalls, preventing the system
> from isolating the fault.
> 
> The ctrl->reset_lock was added by commit 5b3f7b7d062b ("PCI: pciehp:
> Avoid slot access during reset") to serialize threads that perform or
> observe a bus reset, including pciehp_ist() and pciehp_reset_slot().
> 
> When a PCIe device is surprise-removed (card pulled or link fault)
> and generates a link-down event, pciehp_ist() handles it first; only
> later, after the device has already vanished, does vfio_pci_core_disable()
> invoke pciehp_reset_slot() - because it must take ctrl->reset_lock.
> Thus the hot reset (FLR, slot, or bus) is performed after the device
> is gone, which is pointless.
> 
> For surprise removal, the device state is set to
> pci_channel_io_perm_failure in pciehp_unconfigure_device(), indicating
> the device is already gone (disconnected).
> 
> pciehp_ist()
>   pciehp_handle_presence_or_link_change()
>     pciehp_disable_slot()
>       remove_board()
>         pciehp_unconfigure_device(presence) {
> 	  if (!presence)
> 	      pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
> 	}
> 
> Commit 39714fd73c6b ("PCI: Make pci_dev_is_disconnected() helper public for
> other drivers") adds pci_dev_is_disconnected() to let drivers check whether
> a device is gone.
> 
> Fix the deadlock by using pci_dev_is_disconnected() in
> vfio_pci_core_disable() to detect a gone PCIe device and skip the hot
> reset.
> 
> Signed-off-by: Jinhui Guo <guojinhui.liam@...edance.com>
> ---
> 
> Hi, alex
> 
> Sorry for the noise.
> I've just resent the patch to expand the commit message; no code changes.
> I hope the additional context helps the review.
> 
> Best Regards,
> Jinhui
> 
>  drivers/vfio/pci/vfio_pci_core.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 3a11e6f450f7..f42051552dd4 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -678,6 +678,16 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  		if (!vdev->reset_works)
>  			goto out;
>  
> +		/*
> +		 * Skip hot reset on Link-Down. This avoids the reset_lock
> +		 * deadlock in pciehp_reset_slot() when multiple PCIe devices
> +		 * go down at the same time.
> +		 */
> +		if (pci_dev_is_disconnected(pdev)) {
> +			vdev->needs_reset = false;
> +			goto out;
> +		}
> +
>  		pci_save_state(pdev);
>  	}
>