linux-kernel - Re: [PATCH v2] driver core: Fix concurrent problem of deferred_probe_extend

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAGETcx_-otRyDknVs4SFVWzf5-Zi07TiKUEpetDJJ3r0BTVqmw@mail.gmail.com>
Date: Thu, 14 Aug 2025 11:16:01 -0700
From: Saravana Kannan <saravanak@...gle.com>
To: Greg KH <gregkh@...uxfoundation.org>
Cc: Wang Wensheng <wangwensheng4@...wei.com>, rafael@...nel.org, dakr@...nel.org, 
	robh@...nel.org, broonie@...nel.org, linux-kernel@...r.kernel.org, 
	chenjun102@...wei.com
Subject: Re: [PATCH v2] driver core: Fix concurrent problem of deferred_probe_extend_timeout()

On Thu, Aug 14, 2025 at 7:20 AM Greg KH <gregkh@...uxfoundation.org> wrote:
>
> On Thu, Aug 14, 2025 at 08:29:49PM +0800, Wang Wensheng wrote:
> > The deferred_probe_timeout_work may be canceled forever unexpected when
> > deferred_probe_extend_timeout() executes concurrently. Start with
> > deferred_probe_timeout_work pending, and the problem would
> > occur after the following sequence.
> >
> >          CPU0                                 CPU1
> > deferred_probe_extend_timeout
> >   -> cancel_delayed_work => true
> >                                      deferred_probe_extend_timeout
> >                                        -> cancel_delayed_wrok
> >                                          -> __cancel_work
> >                                            -> try_grab_pending
> >   -> schedule_delayed_work
> >    -> queue_delayed_work_on
> > since pending bit is grabbed,
> > just return without doing anything
> >                                         -> set_work_pool_and_clear_pending
> >                                      this __cancel_work return false and
> >                                      the work would never be queued again
> >
> > The root cause is that the PENDING_BIT of the work_struct would be set
> > temporaily in __cancel_work and this bit could prevent the work_struct
> > to be queued in another CPU.

This feels more like a workqueue API issue (this isn't too obvious
from the documentation) or me misusing the workqueue API.

Is this issue still there if you use cancel_delayed_work_sync()
instead of cancel_delayed_work()? If so, just switch to that and add
proper comment on why it needs to by "sync".

-Saravana

> >
> > Use deferred_probe_mutex to protect the cancel and queue operations for
> > the deferred_probe_timeout_work to fix this problem.
> >
> > Fixes: 2b28a1a84a0e ("driver core: Extend deferred probe timeout on driver registration")
> > Cc: <stable@...r.kernel.org>
> > Signed-off-by: Wang Wensheng <wangwensheng4@...wei.com>
> > ---
> >  drivers/base/dd.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> > index 13ab98e033ea..00419d2ee910 100644
> > --- a/drivers/base/dd.c
> > +++ b/drivers/base/dd.c
> > @@ -323,6 +323,7 @@ static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_
> >
> >  void deferred_probe_extend_timeout(void)
> >  {
> > +     guard(mutex)(&deferred_probe_mutex);
>
> But if you grab the lock here, in the probe timeout function, the lock
> will be grabbed again, causing a deadlock, right?  If not, why not?
>
> Have you run this patch with lockdep enabled?
>
> This feels broken to me, what am I missing?
>
> thanks,
>
> greg k-h