linux-kernel - Re: [PATCH RFC v5 1/5] pci: report surprise removal event

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250717111154-mutt-send-email-mst@kernel.org>
Date: Thu, 17 Jul 2025 11:15:43 -0400
From: "Michael S. Tsirkin" <mst@...hat.com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: linux-kernel@...r.kernel.org, Lukas Wunner <lukas@...ner.de>,
	Keith Busch <kbusch@...nel.org>,
	Bjorn Helgaas <bhelgaas@...gle.com>,
	Parav Pandit <parav@...dia.com>, virtualization@...ts.linux.dev,
	stefanha@...hat.com, alok.a.tiwari@...cle.com,
	linux-pci@...r.kernel.org
Subject: Re: [PATCH RFC v5 1/5] pci: report surprise removal event

On Wed, Jul 16, 2025 at 05:29:00PM -0500, Bjorn Helgaas wrote:
> On Tue, Jul 15, 2025 at 02:28:20AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 14, 2025 at 04:13:51PM -0500, Bjorn Helgaas wrote:
> > > On Mon, Jul 14, 2025 at 02:26:19AM -0400, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> > > > > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > > > > At the moment, in case of a surprise removal, the regular
> > > > > > remove callback is invoked, exclusively.  This works well,
> > > > > > because mostly, the cleanup would be the same.
> > > > > > 
> > > > > > However, there's a race: imagine device removal was
> > > > > > initiated by a user action, such as driver unbind, and it in
> > > > > > turn initiated some cleanup and is now waiting for an
> > > > > > interrupt from the device. If the device is now
> > > > > > surprise-removed, that never arrives and the remove callback
> > > > > > hangs forever.
> > > > > > 
> > > > > > For example, this was reported for virtio-blk:
> > > > > > 
> > > > > > 	1. the graceful removal is ongoing in the remove() callback, where disk
> > > > > > 	   deletion del_gendisk() is ongoing, which waits for the requests +to
> > > > > > 	   complete,
> > > > > > 
> > > > > > 	2. Now few requests are yet to complete, and surprise removal started.
> > > > > > 
> > > > > > 	At this point, virtio block driver will not get notified by the driver
> > > > > > 	core layer, because it is likely serializing remove() happening by
> > > > > > 	+user/driver unload and PCI hotplug driver-initiated device removal.  So
> > > > > > 	vblk driver doesn't know that device is removed, block layer is waiting
> > > > > > 	for requests completions to arrive which it never gets.  So
> > > > > > 	del_gendisk() gets stuck.
> > > > > > 
> > > > > > Drivers can artificially add timeouts to handle that, but it can be
> > > > > > flaky.
> > > > > > 
> > > > > > Instead, let's add a way for the driver to be notified about the
> > > > > > disconnect. It can then do any necessary cleanup, knowing that the
> > > > > > device is inactive.
> > > > > 
> > > > > This relies on somebody (typically pciehp, I guess) calling
> > > > > pci_dev_set_disconnected() when a surprise remove happens.
> > > > > 
> > > > > Do you think it would be practical for the driver's .remove() method
> > > > > to recognize that the device may stop responding at any point, even if
> > > > > no hotplug driver is present to call pci_dev_set_disconnected()?
> > > > > 
> > > > > Waiting forever for an interrupt seems kind of vulnerable in general.
> > > > > Maybe "artificially adding timeouts" is alluding to *not* waiting
> > > > > forever for interrupts?  That doesn't seem artificial to me because
> > > > > it's just a fact of life that devices can disappear at arbitrary
> > > > > times.
> > > > > 
> > > > > It seems a little fragile to me to depend on some other part of the
> > > > > system to notice the surprise removal and tell you about it or
> > > > > schedule your work function.  I think it would be more robust for the
> > > > > driver to check directly, i.e., assume writes to the device may be
> > > > > lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
> > > > > never wait for an interrupt without a timeout.
> > > > 
> > > > virtio is ... kind of special, in that users already take it for
> > > > granted that having a device as long as they want to respond
> > > > still does not lead to errors and data loss.
> > > > 
> > > > Makes it a bit harder as our timeout would have to
> > > > check for presence and retry, we can't just fail as a
> > > > normal hardware device does.
> > > 
> > > Sorry, I don't know enough about virtio to follow what you said about 
> > > "having a device as long as they want to respond".
> > > 
> > > We started with a graceful remove.  That must mean the user no longer
> > > needs the device.
> > 
> > I'll try to clarify:
> > 
> > Indeed, the user will not submit new requests,
> > but users might also not know that there are some old requests
> > being in progress of being processed by the device.
> > The driver, currently, waits for that processsing to be complete.
> > Cancelling that with a reset on a timeout might be a regression,
> > unless the timeout is very long.
> 
> This seems like a corner case and maybe rare enough that simply making
> the timeout very long would be a possibility.

Indeed the timeout needs to be very long, and the average would still be
reasonable, but the worst case is terrible and the user can't insert a
replacement card all this time. The system is perceived as flaky.


> > > > And there's the overhead thing - poking at the device a lot
> > > > puts a high load on the host.
> > > 
> > > Checking for PCI_POSSIBLE_ERROR() doesn't touch the device.  If you
> > > did a config read already, and the result happened to be ~0, *then* we
> > > have the problem of figuring out whether the actual data from the
> > > device was ~0, or if the read failed and the Root Complex synthesized
> > > the ~0.  In many cases a driver knows that ~0 is not a possible
> > > register value.  Otherwise it might have to read another register that
> > > is known not to be ~0.
> > 
> > To clarify, virtio generally is designed to operate solely
> > by means of DMA and interrupt, completely avoiding any PCI
> > reads. This, due to PCI reads being very expensive in virtualized
> > scenarios.
> > 
> > The extra overhead I refer to is exactly initiating such a read
> > where there would not be one in normal operation.
> 
> Thanks, this part is very helpful.  And since config accesses are very
> expensive in *all* environments, I expect most drivers for
> high-performance devices work the same way and only do config accesses
> during at probe time.
> 
> If that's true, it will make this more understandable if the commit
> log approaches it from that direction and omits virtio specifics.
> 
> Bjorn

Will do, thanks a lot!

-- 
MST