Message-ID: <aPJp3hP44n96Rug9@tzungbi-laptop>
Date: Sat, 18 Oct 2025 00:07:58 +0800
From: Tzung-Bi Shih <tzungbi@...nel.org>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Benson Leung <bleung@...omium.org>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
"Rafael J . Wysocki" <rafael@...nel.org>,
Danilo Krummrich <dakr@...nel.org>,
Jonathan Corbet <corbet@....net>, Shuah Khan <shuah@...nel.org>,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
chrome-platform@...ts.linux.dev, linux-kselftest@...r.kernel.org,
Laurent Pinchart <laurent.pinchart@...asonboard.com>,
Bartosz Golaszewski <brgl@...ev.pl>,
Wolfram Sang <wsa+renesas@...g-engineering.com>,
Simona Vetter <simona.vetter@...ll.ch>,
Dan Williams <dan.j.williams@...el.com>
Subject: Re: [PATCH v5 5/7] revocable: Add fops replacement
On Fri, Oct 17, 2025 at 10:49:16AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 17, 2025 at 02:36:58AM +0000, Tzung-Bi Shih wrote:
> > Imagine the following example:
> >
> > /* res1 and res2 are provided by hot-pluggable devices. */
> > struct filp_priv {
> > void *res1;
> > void *res2;
> > };
> >
> > /* In .open() fops */
> > priv = kzalloc(sizeof(struct filp_priv), ...);
> > priv->res1 = ...;
> > priv->res2 = ...;
> > filp->private_data = priv;
> >
> > /* In .read() fops */
> > priv = filp->private_data;
> > priv->res1 // could result in a UAF if the device is gone
> > priv->res2 // could result in a UAF if the device is gone
> >
> >
> > How does the bool * work for the example?
>
> You are thinking about it completely wrong, you are trying to keep the
> driver running concurrently after its remove returns - but that
> isn't how Linux drivers are designed.
>
> We have a whole family of synchronous fencing APIs that drivers call
> in their remove() callback to shut down their concurrency. Think of
> things like free_irq(), cancel_work_sync(), timer_shutdown_sync(),
> sysfs_remove_files(). All of these guarantee the concurrent callbacks
> are fenced before returning.
>
> The only issue with cros_ec is this:
>
> static void cros_ec_chardev_remove(struct platform_device *pdev)
> {
> struct miscdevice *misc = dev_get_drvdata(&pdev->dev);
>
> misc_deregister(misc);
> }
>
> It doesn't fence the cdevs! Misc is a hard API to use because it
> doesn't have a misc_deregister_sync() variation!
>
> Dan/Laurent's point and proposal was that misc_deregister() does not
> work like this! It is an anomaly that driver authors typically
> overlook.
>
> So the proposal was to add some way to get a:
> misc_deregister_sync()
>
> Which gives the fence. Under your proposal it would lock the SRCU and
> change the bool. After it returns no cdev related threads are running
> in fops touching res1/res2. I think your proposal to replace the fops
> and that related machinery is smart and has a chance to succeed.
>
> From this perspective your example is malformed. Resources should not
> become revoked concurrently *while a driver is bound*. The driver
> should be unbound, call misc_deregister_sync()/etc, and return from
> remove() guaranteeing it no longer touches any resources.
Imagine:
- Driver X provides res1.
- Driver Y provides res2.
- Driver Z provides the chardev /dev/zzz. Its file operations use res1
  and res2.
- A userspace program has opened /dev/zzz.

In that approach, what is .remove() of driver X supposed to do when driver X
is unbinding (e.g. due to device unplug)?

If it ends up calling misc_deregister_sync(), should that synchronous function
wait for the userspace program to close the FD?

The design behind revocable: driver X waits via synchronize_srcu(), and
then it is free to go. Subsequent accesses to res1 will get NULL and
should fail gracefully.
> For this specific cros_ec driver its "res" is this:
>
> struct cros_ec_dev *ec = dev_get_drvdata(pdev->dev.parent);
> struct cros_ec_platform *ec_platform = dev_get_platdata(ec->dev);
In fact, no, the "res" we are concerned with is struct cros_ec_device, e.g. [1].
(I know the cros_ec_device vs. cros_ec_dev naming is easy to confuse.)
[1] https://elixir.bootlin.com/linux/v6.17/source/drivers/platform/chrome/cros_ec_spi.c#L752
> This is already properly lifetime controlled!
>
> It *HAS* to be, and even your patches are assuming it by blindly
> reaching into the parent's memory!
>
> + misc->rps[0] = ec->ec_dev->revocable_provider;
>
> If the parent driver has been racily unbound at this point the
> ec->ec_dev is already a UAF!
Not really: it relies on the fact that the caller is probe(). I think the
driver can't be unbound while it is still in probe().
(Probe protocol device -> register the MFD device ->
add cros-ec-chardev device and probe.)
> For cros it is safe because the cros_ec driver is a child of a MFD and
> the MFD logic ensures that the children are unbound as part of
> destroying the parent. So 'ec' is guaranteed valid from probe() to
> remove() return.
>
> IMHO auto-revoke is a terrible idea, if you go down that path then why
> is misc special? You need to auto-revoke irqs, timers, work queues,
> etc too? That's a mess.
To be clear, I'm using misc as an example, which is also the real crash we
observed. If the file operations use other resources provided by a
hot-pluggable device, they'd need to use revocable APIs to prevent the UAFs.