linux-kernel - Re: [PATCH RFC v5 1/5] pci: report surprise removal event

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250716222900.GA2556670@bhelgaas>
Date: Wed, 16 Jul 2025 17:29:00 -0500
From: Bjorn Helgaas <helgaas@...nel.org>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: linux-kernel@...r.kernel.org, Lukas Wunner <lukas@...ner.de>,
	Keith Busch <kbusch@...nel.org>,
	Bjorn Helgaas <bhelgaas@...gle.com>,
	Parav Pandit <parav@...dia.com>, virtualization@...ts.linux.dev,
	stefanha@...hat.com, alok.a.tiwari@...cle.com,
	linux-pci@...r.kernel.org
Subject: Re: [PATCH RFC v5 1/5] pci: report surprise removal event

On Tue, Jul 15, 2025 at 02:28:20AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 14, 2025 at 04:13:51PM -0500, Bjorn Helgaas wrote:
> > On Mon, Jul 14, 2025 at 02:26:19AM -0400, Michael S. Tsirkin wrote:
> > > On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> > > > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > > > At the moment, in case of a surprise removal, the regular
> > > > > remove callback is invoked, exclusively.  This works well,
> > > > > because mostly, the cleanup would be the same.
> > > > > 
> > > > > However, there's a race: imagine device removal was
> > > > > initiated by a user action, such as driver unbind, and it in
> > > > > turn initiated some cleanup and is now waiting for an
> > > > > interrupt from the device. If the device is now
> > > > > surprise-removed, that never arrives and the remove callback
> > > > > hangs forever.
> > > > > 
> > > > > For example, this was reported for virtio-blk:
> > > > > 
> > > > > 	1. the graceful removal is ongoing in the remove() callback, where disk
> > > > > 	   deletion del_gendisk() is ongoing, which waits for the requests +to
> > > > > 	   complete,
> > > > > 
> > > > > 	2. Now few requests are yet to complete, and surprise removal started.
> > > > > 
> > > > > 	At this point, virtio block driver will not get notified by the driver
> > > > > 	core layer, because it is likely serializing remove() happening by
> > > > > 	+user/driver unload and PCI hotplug driver-initiated device removal.  So
> > > > > 	vblk driver doesn't know that device is removed, block layer is waiting
> > > > > 	for requests completions to arrive which it never gets.  So
> > > > > 	del_gendisk() gets stuck.
> > > > > 
> > > > > Drivers can artificially add timeouts to handle that, but it can be
> > > > > flaky.
> > > > > 
> > > > > Instead, let's add a way for the driver to be notified about the
> > > > > disconnect. It can then do any necessary cleanup, knowing that the
> > > > > device is inactive.
> > > > 
> > > > This relies on somebody (typically pciehp, I guess) calling
> > > > pci_dev_set_disconnected() when a surprise remove happens.
> > > > 
> > > > Do you think it would be practical for the driver's .remove() method
> > > > to recognize that the device may stop responding at any point, even if
> > > > no hotplug driver is present to call pci_dev_set_disconnected()?
> > > > 
> > > > Waiting forever for an interrupt seems kind of vulnerable in general.
> > > > Maybe "artificially adding timeouts" is alluding to *not* waiting
> > > > forever for interrupts?  That doesn't seem artificial to me because
> > > > it's just a fact of life that devices can disappear at arbitrary
> > > > times.
> > > > 
> > > > It seems a little fragile to me to depend on some other part of the
> > > > system to notice the surprise removal and tell you about it or
> > > > schedule your work function.  I think it would be more robust for the
> > > > driver to check directly, i.e., assume writes to the device may be
> > > > lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
> > > > never wait for an interrupt without a timeout.
> > > 
> > > virtio is ... kind of special, in that users already take it for
> > > granted that having a device as long as they want to respond
> > > still does not lead to errors and data loss.
> > > 
> > > Makes it a bit harder as our timeout would have to
> > > check for presence and retry, we can't just fail as a
> > > normal hardware device does.
> > 
> > Sorry, I don't know enough about virtio to follow what you said about 
> > "having a device as long as they want to respond".
> > 
> > We started with a graceful remove.  That must mean the user no longer
> > needs the device.
> 
> I'll try to clarify:
> 
> Indeed, the user will not submit new requests,
> but users might also not know that there are some old requests
> being in progress of being processed by the device.
> The driver, currently, waits for that processsing to be complete.
> Cancelling that with a reset on a timeout might be a regression,
> unless the timeout is very long.

This seems like a corner case and maybe rare enough that simply making
the timeout very long would be a possibility.

> > > And there's the overhead thing - poking at the device a lot
> > > puts a high load on the host.
> > 
> > Checking for PCI_POSSIBLE_ERROR() doesn't touch the device.  If you
> > did a config read already, and the result happened to be ~0, *then* we
> > have the problem of figuring out whether the actual data from the
> > device was ~0, or if the read failed and the Root Complex synthesized
> > the ~0.  In many cases a driver knows that ~0 is not a possible
> > register value.  Otherwise it might have to read another register that
> > is known not to be ~0.
> 
> To clarify, virtio generally is designed to operate solely
> by means of DMA and interrupt, completely avoiding any PCI
> reads. This, due to PCI reads being very expensive in virtualized
> scenarios.
> 
> The extra overhead I refer to is exactly initiating such a read
> where there would not be one in normal operation.

Thanks, this part is very helpful.  And since config accesses are very
expensive in *all* environments, I expect most drivers for
high-performance devices work the same way and only do config accesses
during at probe time.

If that's true, it will make this more understandable if the commit
log approaches it from that direction and omits virtio specifics.

Bjorn