[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260129010822.GA3310904@killaraus>
Date: Thu, 29 Jan 2026 03:08:22 +0200
From: Laurent Pinchart <laurent.pinchart@...asonboard.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Bartosz Golaszewski <bartosz.golaszewski@....qualcomm.com>,
Johan Hovold <johan@...nel.org>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Danilo Krummrich <dakr@...nel.org>,
"Rafael J . Wysocki" <rafael@...nel.org>,
Tzung-Bi Shih <tzungbi@...nel.org>,
Linus Walleij <linusw@...nel.org>, Jonathan Corbet <corbet@....net>,
Shuah Khan <shuah@...nel.org>,
Wolfram Sang <wsa+renesas@...g-engineering.com>,
Simona Vetter <simona.vetter@...ll.ch>,
Dan Williams <dan.j.williams@...el.com>, linux-doc@...r.kernel.org,
linux-kselftest@...r.kernel.org, linux-kernel@...r.kernel.org,
Bartosz Golaszewski <brgl@...nel.org>
Subject: Re: [PATCH 0/3] Revert "revocable: Revocable resource management"
On Tue, Jan 27, 2026 at 07:52:32PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 27, 2026 at 10:18:27PM +0100, Bartosz Golaszewski wrote:
> > maintainers to settle an important question. It seems there are two
> > camps in this discussion: one whose perception of the problem is
> > limited to character devices being referenced from user-space at the
> > time of the driver unbind (favoring fixing the issues at the vfs
> > level) and another extending the problem to any driver unbinding where
> > we cannot ensure a proper ordering of the teardown (for whatever
> > reason:
>
> I don't think you can defend any position where the user can do
>
> echo XYZ > /sys/.../YY
>
> and the kernel has an oops.
>
> Meaing in this discussion if the user does
> echo ".." > /sys/bus/XX/drivers/YY/unbind
>
> The kernel shouldn't oops or warn.
>
> I've seen many kinds of bogus arguments over the years (especially
> misunderstanding what the module refcount does!!), but ultimately I
> think this is the generally agreed expectation.
>
> However, in practice this isn't a common work flow, it is quite alot
> of tricky work to understand how a subsystem works and put in the
> necessary protections, and frankly many subsystems have had these bugs
> for their entire existance. It isn't urgent.
I completely agree with this approach. I don't think anyone really
advocates for considering that causing an oops from userspace is a
normal behaviour. While the various discussions on this topic that
escalated into heated arguments over time may have created an impression
that we have two camps, I think we are all fundamentally in the same
camp: we want to kernel to run reliably without crashing, with
performant technical solutions. And let's remember that we achieve this
goal best when we work together.
> Several subsystems do get it right, it is very possible and the best
> practices are much more aligned with the Device<Bound> stuff in
> Rust. ie guarantee in most contexts that remove() can't run.
>
> I'm not surprised to hear pushback on trying to fix it, especially if
> the proposed fixes are not subsystem comphrenesive in
> nature. Sprinkling SRCU around as partial patches, especially in
> drivers, is not a good idea, IMHO.
>
> The reason cdev keeps coming up is because there are few common ways a
> typical driver can actually generate concurrent operations during and
> after remove that would be problematic.
>
> File descriptors, subsystem callbacks, work queues, timers,
> interrupts, and notifiers.
>
> The latter already have robust schemes to help the driver shutdown and
> end the concurrent operations. ie cancel_work_sync(),
> del_timer_sync(), free_irq(), and *notifier_unregister().
One a side note, devm_request_irq() is another of the devm_* helpers
that cause race conditions, as interrupt handlers can run right after
.remove() returns, which drivers will most likely not handle correctly.
> Many wrappered file descriptors are safe. For example the sysfs usage
> in a driver is sync stopped during device_del's calls to sysfs remove
> functions.
>
> IMHO the largest systemic issue in this space is people making their
> own fops without understanding the lifecycle model and without
> hand-rolling a special a "_sync" kind of shutdown around it.
>
> A managed fops with a sync destruction operation would go a long way
> to closing these issues.
That's what I've been advocating for. The best way to ensure that driver
code will not accessed data freed at .remove() time is to prevent the
code to run at all.
We have different classes of problems. It's tempting to think that a
single solution can handle them all, but that's often not the best
option. Based on what I've seen so far, handling the fops vs. .remove()
race is best done by ensuring that fops do not race with .remove(). Dan
Williams showed that it can be done quite simply. Conceptually speaking
(not using (S)RCU or other specific kernel primitives, to try and
describe the concepts clearly), it works as follows.
1. At the beginning of .remove(), flag that the device is being removed.
2. At the entry point of all fops, check the flag and return with an
error if set. This prevents *new* fops from being entered once
.remove() has started.
3. At the entry point of all fops, flag that a fop is in progress (with
a counter).
4. In .remove(), wake up all threads sleeping in fops.
5. At the exit point of all fops, decrease the fop-in-progress counter.
6. In .remove(), wait until the fop-in-progress counter reaches 0. At
that point, it is guaranteed that all fops will have completed and
that no new fop can run.
7. Complete .remove(), freeing resources.
Most of this of course would need to be handled in helpers at the cdev
level, and wrapped in subsystem helpers. Drivers will need to be
modified to
- Call a helper at the beginning of .remove() to handle (1).
- Wake up threads sleeping in fops (3), as only drivers know about these
(in the general case, in some subsystems this may be handled in
helpers).
- Call a helper to handle (6).
The only fop that this doesn't handle is .release(), as we can't block
.remove() until userspace closes all open file descriptors and unmaps
all memory. This is also the race condition that is the easiest to
trigger (along with the race conditions related to threads sleeping in
fops).
Solving the .release() race also requires driver involvement too. By
definition drivers have tasks to perform in .release() that need to
access resources (MMIO, data structures, ...) - otherwise there would be
no .release() operation. Drivers need to synchronize between .remove()
and .release() and skip release tasks that have been performed by
.remove() already. Strategy depends on subsystems, in the V4L2 case
things can be more complex as it's common for drivers to create multiple
cdevs, but it's entirely maangeable with clear concepts that can be
documented, implemented, and checked during review.
> ie the gpiolib example was basically all fops, one work and a bunch of
> places where the protection was redundant.
>
> Yes there are other cases, and certainly I've commonly seen cases of
> drivers reaching into other drivers, and subsystem non-file ops, but
> these cases usualy have other more fundamental issues with remove
> races :(
Drivers using resources provided by other drivers is a more complex
problem. I'm thinking about a driver acquiring a GPIO in .probe(), and
the GPIO provider disappearing at a random point of time. Or a clock, or
a regulator. These issues are more rare if we ignore unbinding drivers
forcefully through sysfs, but they can still occur, so we should try to
find a solution here too (and the sysfs unbind issue is also worth
fixing). There, unlike in the cdev case, some sort of revocable API
could make sense.
> So I would probably also take a strong position that introducing
> something like DevRes where you try to wrapper MMIO or other device
> resources is adamently not something we want to do. Not because I
> don't care about these removal races, but because I want the drivers
> to run in a Device<Bound> context with very few exceptions.
--
Regards,
Laurent Pinchart
Powered by blists - more mailing lists