[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALbr=LZ4v7N=tO1vgOsyj9AS+XuNbn6kG-QcF+PacdMjSo0iyw@mail.gmail.com>
Date: Sat, 24 Jan 2026 02:53:32 +0800
From: Gui-Dong Han <hanguidong02@...il.com>
To: Danilo Krummrich <dakr@...nel.org>
Cc: Jon Hunter <jonathanh@...dia.com>, Marek Szyprowski <m.szyprowski@...sung.com>,
Mark Brown <broonie@...nel.org>, gregkh@...uxfoundation.org, rafael@...nel.org,
linux-kernel@...r.kernel.org, baijiaju1990@...il.com,
Qiu-ji Chen <chenqiuji666@...il.com>, Aishwarya.TCV@....com,
"linux-tegra@...r.kernel.org" <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH v5] driver core: enforce device_lock for driver_match_device()
On Sat, Jan 24, 2026 at 12:54 AM Danilo Krummrich <dakr@...nel.org> wrote:
>
> On Fri Jan 23, 2026 at 3:29 PM CET, Jon Hunter wrote:
> > I can fix this by either:
> >
> > 1. Reverting this patch.
> > 2. Disabling the QSPI driver.
> >
> > Now the QSPI driver has issues which need to be fixed which I am
> > wondering once fix will avoid this problem.
> >
> > However, I guess regardless of the QSPI issue, should this patch be
> > having such an impact?
>
> So, this patch by itself is correct, but it reveals when drivers do the wrong
> thing, that is register drivers from contexts where it neither makes sense nor
> it is supported by the driver core.
>
> The deadlock happens when a driver (A) registers another driver (B) from a
> context where the device lock of the device bound to (A) is held, e.g. from bus
> callbacks, such as probe(). See also [1].
>
> While never valid, the deadlock does only occur when (A) and (B) are on the same
> bus, e.g. when a platform driver registers another platform driver in its
> probe() callback.
>
> However, it is a bit more tricky than that: Let's say a platform driver
> registers an SPI controller, then spi_register_controller() might scan the SPI
> bus and register SPI devices (not drivers), which are then probed as well. So
> far this is all fine, but if now in one of the SPI drivers probe() callbacks a
> platform driver is registered, you have a deadlock condition as well.
>
> So it seems that something of this kind is going on with
> drivers/spi/spi-tegra210-quad.c.
>
> I did already run quite thorough analysis throughout the whole kernel tree with
> various static analyzers and also played around with LLMs for finding this
> pattern.
>
> The tools gave me two results:
>
> (1) The IOMMU one I already fixed [2].
> (2) The GPIO driver I posted a patch for in [3].
>
> I specifically also looked for all drivers that are required to run all the
> peripherals in the tegra194-p3509-0000+p3668-0000.dts hierarchy, but couldn't
> catch anything.
>
> (This is also why I asked about OOT, because there are quite some compatible
> strings that are not supported by any upstream driver.)
>
> I think to really see what's going in with spi-tegra210-quad.c, we need the
> dumps of the sysrq-triggers I provided in a previous mail.
It seems the issue is simpler than a recursive registration deadlock.
Looking at the logs, tegra_qspi_probe triggers a NULL pointer
dereference (Oops) while holding the device_lock. The mutex likely
remains marked as held/orphaned, blocking subsequent driver bindings
on the same bus.
This likely explains why lockdep was silent. Since this is not a lock
dependency cycle or a recursive locking violation, but rather a lock
remaining held by a terminated task, lockdep would not flag it as a
deadlock pattern.
This is indeed a side effect of enforcing the lock here—it amplifies
the impact of a crash. However, an Oops while holding the device_lock
is generally catastrophic regardless.
Following up on our previous discussion [1], refactoring
driver_override would resolve this. We could move driver_override to
struct device and protect it with a dedicated lock (e.g.,
driver_override_lock). We would then replace driver_set_override with
dev_set_driver_override and add dev_access_driver_override with
internal lock assertions. This allows us to remove device_lock from
the 2 match paths, reducing contention and preventing a single crash
from stalling the whole bus.
However, this deviates from the current paradigm where device_lock
protects sysfs attributes (like waiting_for_supplier and
power/control). If other sysfs attributes are found to share similar
constraints or would benefit from finer-grained locking (which
requires further investigation), we might have a stronger argument for
introducing a more generic sysfs_lock to handle this class of
attributes. We would also need to carefully verify safety during
device removal.
Danilo, what are your thoughts on this refactoring plan? I am willing
to attempt it, but since it touches the driver core, documentation,
and 10+ bus drivers, and I haven't submitted such a large series
before, it may take me a few weeks to get an initial version out, and
additional time to iterate based on review feedback until it is ready
for merging. If you prefer to handle it yourself to expedite things,
please let me know so we don't duplicate efforts.
[1] https://lore.kernel.org/all/DFNI14L1K1I0.3FZ84OAWXY0LP@kernel.org/
Thanks.
Powered by blists - more mailing lists