[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20251107145156.GG20064@p1gen4-pw042f0m.boeblingen.de.ibm.com>
Date: Fri, 7 Nov 2025 15:51:56 +0100
From: Benjamin Block <bblock@...ux.ibm.com>
To: Niklas Schnelle <schnelle@...ux.ibm.com>
Cc: Bjorn Helgaas <helgaas@...nel.org>, Lukas Wunner <lukas@...ner.de>,
Keith Busch <kbusch@...nel.org>, Gerd Bayer <gbayer@...ux.ibm.com>,
Matthew Rosato <mjrosato@...ux.ibm.com>,
Halil Pasic <pasic@...ux.ibm.com>, Farhan Ali <alifm@...ux.ibm.com>,
Julian Ruess <julianr@...ux.ibm.com>,
Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>,
Alexander Gordeev <agordeev@...ux.ibm.com>, linux-pci@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] PCI/IOV: Fix race between SR-IOV enable/disable and
hotplug
Hey Niklas,
On Thu, Oct 30, 2025 at 11:26:02AM +0100, Niklas Schnelle wrote:
> Commit 05703271c3cd ("PCI/IOV: Add PCI rescan-remove locking when
> enabling/disabling SR-IOV") tried to fix a race between the VF removal
> inside sriov_del_vfs() and concurrent hot unplug by taking the PCI
> rescan/remove lock in sriov_del_vfs(). Similarly the PCI rescan/remove
> lock was also taken in sriov_add_vfs() to protect addition of VFs.
>
> This approach however causes deadlock on trying to remove PFs with
> SR-IOV enabled because PFs disable SR-IOV during removal and this
> removal happens under the PCI rescan/remove lock. So the original fix
> had to be reverted.
>
> Instead of taking the PCI rescan/remove lock in sriov_add_vfs() and
> sriov_del_vfs(), fix the race that occurs with SR-IOV enable and disable
> vs hotplug higher up in the callchain by taking the lock in
> sriov_numvfs_store() before calling into the driver's sriov_configure()
> callback.
>
> Cc: stable@...r.kernel.org
> Fixes: 05703271c3cd ("PCI/IOV: Add PCI rescan-remove locking when enabling/disabling SR-IOV")
> Reported-by: Benjamin Block <bblock@...ux.ibm.com>
> Signed-off-by: Niklas Schnelle <schnelle@...ux.ibm.com>
> ---
> drivers/pci/iov.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
It fixes the original bug (kernel panic) AFAICT, insofar:
Reviewed-by: Benjamin Block <bblock@...ux.ibm.com>
But tbh., I find it a bit "scary" to take one global lock at such a high
level, when it is also used at much lower levels throughout the stack.
IIRC my original (internal) report boiled down to that pci_destroy_dev() is
called essentially unlocked, and without taking any of the attached reference
counts into account.. which seems fishy IMHO. The actual freeing of the memory
is hidden behind a reference, but stuff like device_del() is just.. done..
and that seems strange to me.
Having a bit more granularity here might not hurt; but I can't say I have a
good idea on top of my mind in how to do that, so... rather fix the panic.
--
Best Regards, Benjamin Block / Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH / https://www.ibm.com/privacy
Vors. Aufs.-R.: Wolfgang Wendt / Geschäftsführung: David Faller
Sitz der Ges.: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Powered by blists - more mailing lists