

Message-ID: <20250717193122-mutt-send-email-mst@kernel.org>
Date: Thu, 17 Jul 2025 19:31:57 -0400
From: "Michael S. Tsirkin" <mst@...hat.com>
To: Lukas Wunner <lukas@...ner.de>
Cc: linux-kernel@...r.kernel.org, Keith Busch <kbusch@...nel.org>,
	Bjorn Helgaas <bhelgaas@...gle.com>,
	Parav Pandit <parav@...dia.com>, virtualization@...ts.linux.dev,
	stefanha@...hat.com, alok.a.tiwari@...cle.com,
	linux-pci@...r.kernel.org
Subject: Re: [PATCH RFC v5 1/5] pci: report surprise removal event

On Thu, Jul 17, 2025 at 10:12:03PM +0200, Lukas Wunner wrote:
> On Thu, Jul 17, 2025 at 11:11:44AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 14, 2025 at 08:11:04AM +0200, Lukas Wunner wrote:
> > > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > > > At the moment, in case of a surprise removal, the regular remove
> > > > callback is invoked, exclusively.  This works well, because mostly, the
> > > > cleanup would be the same.
> > > > 
> > > > However, there's a race: imagine device removal was initiated by a user
> > > > action, such as driver unbind, and it in turn initiated some cleanup and
> > > > is now waiting for an interrupt from the device. If the device is now
> > > > surprise-removed, that never arrives and the remove callback hangs
> > > > forever.
> > > 
> > > For PCI devices in a hotplug slot, user space can initiate "safe removal"
> > > by writing "0" to the hotplug slot's "power" file in sysfs.
> > > 
> > > If the PCI device is yanked from the slot while safe removal is ongoing,
> > > there is likewise no way for the driver to know that the device is
> > > suddenly gone.  That's because pciehp_unconfigure_device() only calls
> > > pci_dev_set_disconnected() in the surprise removal case, not for
> > > safe removal.
> > > 
> > > The solution proposed here is thus not a complete one:  It may work
> > > if user space initiated *driver* removal, but not if it initiated *safe*
> > > removal of the entire device.  For virtio, that may be sufficient.
> > 
> > So just as an idea, something like this can work I guess?  I'm yet to
> > test this - wrote this on the go -
> 
> Don't bother, it won't work:
> 
> pciehp_handle_presence_or_link_change() is called from pciehp_ist(),
> the IRQ thread.  During safe removal the IRQ thread is busy in
> pciehp_unconfigure_device() and waiting for the driver to unbind
> from devices being safe-removed.

I'm confused. I thought safe removal happens in the context of the
userspace thread that wrote to sysfs?

> An IRQ thread is always single-threaded.  There's no second instance
> of the IRQ thread being run when another interrupt is signaled.
> Rather, the IRQ thread is re-run when it has finished.
> 
> In *theory* what would be possible is to plumb this into pciehp_isr().
> That's the hardirq handler.  This one will indeed be run when an
> interrupt comes in while the IRQ thread is running.  Normally the
> hardirq handler would just collect the events for later consumption
> by the IRQ thread.  The hardirq handler could *theoretically* mark
> devices gone while they're being safe-removed.
> 
> I'm saying "theoretically" because in reality I don't think this is
> a viable approach either:  pciehp_ist() contains code to *ignore*
> link or presence changes if they were caused by a Secondary Bus Reset
> or Downstream Port Containment.  In that case we do *not* want to mark
> devices disconnected because they're only *temporarily* inaccessible.
> This requires waiting for the SBR or DPC to conclude, which can take
> several seconds.  We can't wait in the hardirq handler.
> 
> So this cannot be solved with the current architecture of pciehp,
> at least not easily or in an elegant way.  Sorry!
> 
> Thanks,
> 
> Lukas
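
The split described above, where the hardirq handler only collects events
and a single IRQ thread instance consumes them later, can be illustrated
with a small userspace toy model. This is a sketch under stated
assumptions, not pciehp code: every name in it (toy_ctrl, toy_isr,
toy_ist_begin, toy_ist_end, the EV_* bits) is hypothetical, and a plain
"busy" flag stands in for the IRQ thread being blocked in a long-running
removal.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical event bits; not the real Slot Status register layout. */
#define EV_PRESENCE_CHANGE (1u << 0)
#define EV_LINK_CHANGE     (1u << 1)

/* Toy controller state (all fields are illustrative assumptions). */
struct toy_ctrl {
	atomic_uint pending;	/* events recorded by the hardirq half */
	bool thread_busy;	/* true while the one IRQ thread is running */
	unsigned int handled;	/* events the IRQ thread has consumed */
};

static void toy_init(struct toy_ctrl *c)
{
	atomic_init(&c->pending, 0);
	c->thread_busy = false;
	c->handled = 0;
}

/* Hardirq half: may fire at any time; it only records events. */
static void toy_isr(struct toy_ctrl *c, unsigned int events)
{
	atomic_fetch_or(&c->pending, events);
}

/*
 * Threaded half, start of one run. There is only ever one instance:
 * if the previous run has not finished, a new one cannot start and
 * the recorded events simply stay pending.
 */
static bool toy_ist_begin(struct toy_ctrl *c)
{
	if (c->thread_busy)
		return false;
	c->thread_busy = true;
	/* Consume everything collected so far in one go. */
	c->handled |= atomic_exchange(&c->pending, 0);
	return true;
}

/* Threaded half, end of one run (e.g. the driver finally unbound). */
static void toy_ist_end(struct toy_ctrl *c)
{
	assert(c->thread_busy);
	c->thread_busy = false;
}
```

In this model, an event passed to toy_isr() while a run is in progress
is recorded in "pending" but does not reach "handled" until
toy_ist_end() has been called and toy_ist_begin() runs again. That is
the limitation described above: a presence change arriving while the
thread is busy with safe removal is only acted on after that removal
completes.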