[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
<CY8PR12MB7195583E429203129577B51ADC46A@CY8PR12MB7195.namprd12.prod.outlook.com>
Date: Mon, 30 Jun 2025 13:52:26 +0000
From: Parav Pandit <parav@...dia.com>
To: Keith Busch <kbusch@...nel.org>
CC: "Michael S. Tsirkin" <mst@...hat.com>, Lukas Wunner <lukas@...ner.de>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, Bjorn Helgaas
<bhelgaas@...gle.com>, "linux-pci@...r.kernel.org"
<linux-pci@...r.kernel.org>, "virtualization@...ts.linux.dev"
<virtualization@...ts.linux.dev>, "stefanha@...hat.com"
<stefanha@...hat.com>, "alok.a.tiwari@...cle.com" <alok.a.tiwari@...cle.com>
Subject: RE: [PATCH RFC] pci: report surprise removal events
> From: Keith Busch <kbusch@...nel.org>
> Sent: 30 June 2025 07:14 PM
>
> On Mon, Jun 30, 2025 at 04:07:55AM +0000, Parav Pandit wrote:
> >
> > > From: Keith Busch <kbusch@...nel.org>
> > > Sent: 30 June 2025 05:10 AM
> > >
> > > On Sun, Jun 29, 2025 at 01:28:08PM -0400, Michael S. Tsirkin wrote:
> > > > On Sun, Jun 29, 2025 at 03:36:27PM +0200, Lukas Wunner wrote:
> > > > > On Sat, Jun 28, 2025 at 02:58:49PM -0400, Michael S. Tsirkin wrote:
> > > > >
> > > > > 1/ The device_lock() will reintroduce the issues solved by
> 74ff8864cc84.
> > > >
> > > > I see. What other way is there to prevent dev->driver from going
> > > > away, though? I guess I can add a new spinlock and take it both
> > > > here and when
> > > > dev->driver changes? Acceptable?
> > >
> > > You're already holding the pci_bus_sem here, so the final device 'put'
> > > can't have been called yet, so the device is valid and thread safe
> > > in this context. I think maintaining the desired lifetime of the
> > > instantiated driver is just a matter of reference counting within your driver.
> > >
> > > Just a thought on your patch, instead of introducing a new callback,
> > > you could call the existing '->error_detected()' callback with the
> > > previously set 'pci_channel_io_perm_failure' status. That would
> > > totally work for nvme to kick its cleanup much quicker than the
> > > blk_mq timeout handling we currently rely on for this scenario.
> >
> > error_detected() callback is also called while holding the device_lock() by
> report_error_detected().
> > So when remove() callback is ongoing for graceful removal and driver
> > is waiting for the request completions,
> >
> > If the error_detected() will be stuck on device lock.
>
> But I didn't suggest calling error_detected from report_error_detected.
> Just call it directly without device_lock. It's not very feasible to enforce a non-
> blocking callback, though, if speed is really a concern here.
Yeah, it would better to either always call a callback with or without the lock.
In some flows with lock and in some flows without lock would likely be
very bad as one cannot establish a sane locking order.
For time efficiency, large part of the processing is
happening in the individual driver itself as opposed to the core place.
So as long as individual driver can reliably know about error, and act swiftly, it should be good enough.
Powered by blists - more mailing lists