[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250714022128-mutt-send-email-mst@kernel.org>
Date: Mon, 14 Jul 2025 02:26:19 -0400
From: "Michael S. Tsirkin" <mst@...hat.com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: linux-kernel@...r.kernel.org, Lukas Wunner <lukas@...ner.de>,
Keith Busch <kbusch@...nel.org>,
Bjorn Helgaas <bhelgaas@...gle.com>,
Parav Pandit <parav@...dia.com>, virtualization@...ts.linux.dev,
stefanha@...hat.com, alok.a.tiwari@...cle.com,
linux-pci@...r.kernel.org
Subject: Re: [PATCH RFC v5 1/5] pci: report surprise removal event
On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote:
> Housekeeping: Note subject line convention. Indent with spaces in
> commit log. Remove spurious plus signs.
Thanks!
> On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote:
> > At the moment, in case of a surprise removal, the regular remove
> > callback is invoked, exclusively. This works well, because mostly, the
> > cleanup would be the same.
> >
> > However, there's a race: imagine device removal was initiated by a user
> > action, such as driver unbind, and it in turn initiated some cleanup and
> > is now waiting for an interrupt from the device. If the device is now
> > surprise-removed, that never arrives and the remove callback hangs
> > forever.
> >
> > For example, this was reported for virtio-blk:
> >
> > 1. the graceful removal is ongoing in the remove() callback, where disk
> > deletion del_gendisk() is ongoing, which waits for the requests +to
> > complete,
> >
> > 2. Now few requests are yet to complete, and surprise removal started.
> >
> > At this point, virtio block driver will not get notified by the driver
> > core layer, because it is likely serializing remove() happening by
> > +user/driver unload and PCI hotplug driver-initiated device removal. So
> > vblk driver doesn't know that device is removed, block layer is waiting
> > for requests completions to arrive which it never gets. So
> > del_gendisk() gets stuck.
> >
> > Drivers can artificially add timeouts to handle that, but it can be
> > flaky.
> >
> > Instead, let's add a way for the driver to be notified about the
> > disconnect. It can then do any necessary cleanup, knowing that the
> > device is inactive.
>
> This relies on somebody (typically pciehp, I guess) calling
> pci_dev_set_disconnected() when a surprise remove happens.
>
> Do you think it would be practical for the driver's .remove() method
> to recognize that the device may stop responding at any point, even if
> no hotplug driver is present to call pci_dev_set_disconnected()?
>
> Waiting forever for an interrupt seems kind of vulnerable in general.
> Maybe "artificially adding timeouts" is alluding to *not* waiting
> forever for interrupts? That doesn't seem artificial to me because
> it's just a fact of life that devices can disappear at arbitrary
> times.
>
> It seems a little fragile to me to depend on some other part of the
> system to notice the surprise removal and tell you about it or
> schedule your work function. I think it would be more robust for the
> driver to check directly, i.e., assume writes to the device may be
> lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and
> never wait for an interrupt without a timeout.
virtio is ... kind of special, in that users already take it for
granted that having a device as long as they want to respond
still does not lead to errors and data loss.
Makes it a bit harder as our timeout would have to
check for presence and retry, we can't just fail as a
normal hardware device does.
And there's the overhead thing - poking at the device a lot
puts a high load on the host.
So I can imagine a very long timeout (minutes?), and then something like
the WQ I am trying to propose here as a shortcut.
> > Since cleanups can take a long time, this takes an approach
> > of a work struct that the driver initiates and enables
> > on probe, and tears down on remove.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@...hat.com>
> > ---
> > drivers/pci/pci.h | 6 ++++++
> > include/linux/pci.h | 45 +++++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 51 insertions(+)
> >
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 12215ee72afb..3ca4ebfd46be 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -553,6 +553,12 @@ static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
> > pci_dev_set_io_state(dev, pci_channel_io_perm_failure);
> > pci_doe_disconnected(dev);
> >
> > + if (READ_ONCE(dev->disconnect_work_enable)) {
> > + /* Make sure work is up to date. */
> > + smp_rmb();
> > + schedule_work(&dev->disconnect_work);
> > + }
> > +
> > return 0;
> > }
> >
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 05e68f35f392..723b17145b62 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -548,6 +548,10 @@ struct pci_dev {
> > /* These methods index pci_reset_fn_methods[] */
> > u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */
> >
> > + /* Report disconnect events. 0x0 - disable, 0x1 - enable */
> > + u8 disconnect_work_enable;
> > + struct work_struct disconnect_work;
> > +
> > #ifdef CONFIG_PCIE_TPH
> > u16 tph_cap; /* TPH capability offset */
> > u8 tph_mode; /* TPH mode */
> > @@ -1993,6 +1997,47 @@ pci_release_mem_regions(struct pci_dev *pdev)
> > pci_select_bars(pdev, IORESOURCE_MEM));
> > }
> >
> > +/*
> > + * Run this first thing after getting a disconnect work, to prevent it from
> > + * running multiple times.
> > + * Returns: true if disconnect was enabled, proceed. false if disabled, abort.
> > + */
> > +static inline bool pci_test_and_clear_disconnect_enable(struct pci_dev *pdev)
> > +{
> > + u8 enable = 0x1;
> > + u8 disable = 0x0;
> > + return try_cmpxchg(&pdev->disconnect_work_enable, &enable, disable);
> > +}
> > +
> > +/*
> > + * Caller must initialize @pdev->disconnect_work before invoking this.
> > + * The work function must run and check pci_test_and_clear_disconnect_enable.
> > + * Note that device can go away right after this call.
> > + */
> > +static inline void pci_set_disconnect_work(struct pci_dev *pdev)
> > +{
> > + /* Make sure WQ has been initialized already */
> > + smp_wmb();
> > +
> > + WRITE_ONCE(pdev->disconnect_work_enable, 0x1);
> > +
> > + /* check the device did not go away meanwhile. */
> > + mb();
> > +
> > + if (!pci_device_is_present(pdev))
> > + schedule_work(&pdev->disconnect_work);
> > +}
> > +
> > +static inline void pci_clear_disconnect_work(struct pci_dev *pdev)
> > +{
> > + WRITE_ONCE(pdev->disconnect_work_enable, 0x0);
> > +
> > + /* Make sure to stop using work from now on. */
> > + smp_wmb();
> > +
> > + cancel_work_sync(&pdev->disconnect_work);
> > +}
> > +
> > #else /* CONFIG_PCI is not enabled */
> >
> > static inline void pci_set_flags(int flags) { }
> > --
> > MST
> >
Powered by blists - more mailing lists