[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMRc=MfP1_fH91AVjerbnwVfhA0oqBKV+CTX0_1n32g1t-Pvuw@mail.gmail.com>
Date: Wed, 28 Jan 2026 01:40:02 -0800
From: Bartosz Golaszewski <brgl@...nel.org>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Johan Hovold <johan@...nel.org>, Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Danilo Krummrich <dakr@...nel.org>, "Rafael J . Wysocki" <rafael@...nel.org>, Tzung-Bi Shih <tzungbi@...nel.org>,
Linus Walleij <linusw@...nel.org>, Jonathan Corbet <corbet@....net>, Shuah Khan <shuah@...nel.org>,
Laurent Pinchart <laurent.pinchart@...asonboard.com>,
Wolfram Sang <wsa+renesas@...g-engineering.com>, Simona Vetter <simona.vetter@...ll.ch>,
Dan Williams <dan.j.williams@...el.com>, linux-doc@...r.kernel.org,
linux-kselftest@...r.kernel.org, linux-kernel@...r.kernel.org,
Bartosz Golaszewski <brgl@...nel.org>,
Bartosz Golaszewski <bartosz.golaszewski@....qualcomm.com>
Subject: Re: [PATCH 0/3] Revert "revocable: Revocable resource management"
On Wed, 28 Jan 2026 00:52:32 +0100, Jason Gunthorpe <jgg@...dia.com> said:
> On Tue, Jan 27, 2026 at 10:18:27PM +0100, Bartosz Golaszewski wrote:
>> maintainers to settle an important question. It seems there are two
>> camps in this discussion: one whose perception of the problem is
>> limited to character devices being referenced from user-space at the
>> time of the driver unbind (favoring fixing the issues at the vfs
>> level) and another extending the problem to any driver unbinding where
>> we cannot ensure a proper ordering of the teardown (for whatever
>> reason:
>
> I don't think you can defend any position where the user can do
>
> echo XYZ > /sys/.../YY
>
> and the kernel has an oops.
>
> Meaing in this discussion if the user does
> echo ".." > /sys/bus/XX/drivers/YY/unbind
>
> The kernel shouldn't oops or warn.
>
> I've seen many kinds of bogus arguments over the years (especially
> misunderstanding what the module refcount does!!), but ultimately I
> think this is the generally agreed expectation.
>
> However, in practice this isn't a common work flow, it is quite alot
> of tricky work to understand how a subsystem works and put in the
> necessary protections, and frankly many subsystems have had these bugs
> for their entire existance. It isn't urgent.
>
In general, that's true. However I have first bumped into this can of worms
at work when a client's custom device using a USB-to-I2C adapter and creating
devices dynamically from user-space would crash the kernel or freeze a thread
when the device would be disconnected with processes still referencing file
descriptors. It may not be common but I hope we do care about corner-cases
too.
So I'd say: it's fine as is until it isn't. :)
> Several subsystems do get it right, it is very possible and the best
> practices are much more aligned with the Device<Bound> stuff in
> Rust. ie guarantee in most contexts that remove() can't run.
>
> I'm not surprised to hear pushback on trying to fix it, especially if
> the proposed fixes are not subsystem comphrenesive in
> nature. Sprinkling SRCU around as partial patches, especially in
> drivers, is not a good idea, IMHO.
>
For sure. I've been trying to address it exclusively in subsystem code, where
this kind of logic belongs. After all, subsystem code is where we do complex
things to make drivers simpler. One exception is I2C where the logic is so
broken we need to first rework a lot of drivers. Wolfram is on board with that
though.
> The reason cdev keeps coming up is because there are few common ways a
> typical driver can actually generate concurrent operations during and
> after remove that would be problematic.
>
> File descriptors, subsystem callbacks, work queues, timers,
> interrupts, and notifiers.
>
> The latter already have robust schemes to help the driver shutdown and
> end the concurrent operations. ie cancel_work_sync(),
> del_timer_sync(), free_irq(), and *notifier_unregister().
>
> Many wrappered file descriptors are safe. For example the sysfs usage
> in a driver is sync stopped during device_del's calls to sysfs remove
> functions.
>
> IMHO the largest systemic issue in this space is people making their
> own fops without understanding the lifecycle model and without
> hand-rolling a special a "_sync" kind of shutdown around it.
>
> A managed fops with a sync destruction operation would go a long way
> to closing these issues.
>
> ie the gpiolib example was basically all fops, one work and a bunch of
> places where the protection was redundant.
>
> Yes there are other cases, and certainly I've commonly seen cases of
> drivers reaching into other drivers, and subsystem non-file ops, but
> these cases usualy have other more fundamental issues with remove
> races :(
>
> So I would probably also take a strong position that introducing
> something like DevRes where you try to wrapper MMIO or other device
> resources is adamently not something we want to do. Not because I
> don't care about these removal races, but because I want the drivers
> to run in a Device<Bound> context with very few exceptions.
>
> Jason
>
Rust makes it very clear who owns what. That's something that we struggle
a lot with in C, where we have drivers that create objects and then pass
them on to the subsystem for management but the latter still references
objects that are destroyed when the driver is unbound. Devres when used
right is great and the lifetime is very clear: from binding to unbinding.
The using right is the hard part though. :)
Bartosz
Powered by blists - more mailing lists