lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aGFBW7wet9V4WENC@wunner.de>
Date: Sun, 29 Jun 2025 15:36:27 +0200
From: Lukas Wunner <lukas@...ner.de>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: linux-kernel@...r.kernel.org, Bjorn Helgaas <bhelgaas@...gle.com>,
	linux-pci@...r.kernel.org, Parav Pandit <parav@...dia.com>,
	virtualization@...ts.linux.dev, stefanha@...hat.com,
	alok.a.tiwari@...cle.com
Subject: Re: [PATCH RFC] pci: report surprise removal events

On Sat, Jun 28, 2025 at 02:58:49PM -0400, Michael S. Tsirkin wrote:
> At the moment, in case of a surprise removal, the regular
> remove callback is invoked, exclusively.
> This works well, because mostly, the cleanup would be the same.
> 
> However, there's a race: imagine device removal was initiated by a user
> action, such as driver unbind, and it in turn initiated some cleanup and
> is now waiting for an interrupt from the device. If the device is now
> surprise-removed, that never arrives and the remove callback hangs
> forever.
> 
> Drivers can artificially add timeouts to handle that, but it can be
> flaky.
> 
> Instead, let's add a way for the driver to be notified about the
> disconnect. It can then do any necessary cleanup, knowing that the
> device is inactive.
[...]
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -549,6 +549,15 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
>  	pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
>  	pci_doe_disconnected(dev);
>  
> +	/* Notify driver of surprise removal */
> +	device_lock(&dev->dev);
> +
> +	if (dev->driver && dev->driver->err_handler &&
> +	    dev->driver->err_handler->disconnect)
> +		dev->driver->err_handler->disconnect(dev);
> +
> +	device_unlock(&dev->dev);
> +
>  	return 0;
>  }

No, that's not good:

1/ The device_lock() will reintroduce the issues solved by 74ff8864cc84.

2/ pci_dev_set_disconnected() needs to be fast so that devices are marked
   unplugged as quickly as possible.  We want to minimize the time window
   where MMIO and Config Space reads already return "all ones" and writes
   go to nirvana, but pci_dev_is_disconnected() still returns false.
   Hence invoking some driver callback which may take arbitrarily long or
   even sleeps is not an option.

The driver is already notified of removal through invocation of the
->remove() callback.  The use case you're describing is arguably
a corner case.  I do think that a timeout is a better approach
than the one proposed here.  How long does it take for the interrupt
to arrive?  If it's not just a few msec, consider polling the device
and breaking out of the pool loop as soon as pci_dev_is_disconnected()
returns true (or the MMIO read returns PCI_POSSIBLE_ERROR()).

If/when respinning, please explain the use case in more detail,
i.e. which driver, which device, pointers to code...

Thanks!

Lukas

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ