linux-kernel - Re: [PATCH v2] driver core: Fix concurrent problem of deferred_probe_extend

Message-ID: <40fa16cf-950b-4ca7-9935-dbce75e46eb9@huawei.com>
Date: Fri, 15 Aug 2025 09:56:29 +0800
From: "wangwensheng (C)" <wangwensheng4@...wei.com>
To: Saravana Kannan <saravanak@...gle.com>, Greg KH
	<gregkh@...uxfoundation.org>
CC: <rafael@...nel.org>, <dakr@...nel.org>, <robh@...nel.org>,
	<broonie@...nel.org>, <linux-kernel@...r.kernel.org>, <chenjun102@...wei.com>
Subject: Re: [PATCH v2] driver core: Fix concurrent problem of
 deferred_probe_extend_timeout()



On 2025/8/15 2:16, Saravana Kannan wrote:
> On Thu, Aug 14, 2025 at 7:20 AM Greg KH <gregkh@...uxfoundation.org> wrote:
>>
>> On Thu, Aug 14, 2025 at 08:29:49PM +0800, Wang Wensheng wrote:
>>> The deferred_probe_timeout_work may be unexpectedly canceled forever
>>> when deferred_probe_extend_timeout() runs concurrently on two CPUs.
>>> Starting with deferred_probe_timeout_work pending, the problem occurs
>>> after the following sequence:
>>>
>>>           CPU0                                 CPU1
>>> deferred_probe_extend_timeout
>>>    -> cancel_delayed_work => true
>>>                                       deferred_probe_extend_timeout
>>>                                         -> cancel_delayed_work
>>>                                           -> __cancel_work
>>>                                             -> try_grab_pending
>>>    -> schedule_delayed_work
>>>     -> queue_delayed_work_on
>>> since the PENDING bit is held by CPU1,
>>> it returns without doing anything
>>>                                             -> set_work_pool_and_clear_pending
>>>                                       this __cancel_work returns false, so
>>>                                       the work is never queued again
>>>
>>> The root cause is that __cancel_work() temporarily sets the PENDING_BIT
>>> of the work_struct, and while that bit is set, the work_struct cannot
>>> be queued from another CPU.
> 
> This feels more like a workqueue API issue (this isn't too obvious
> from the documentation) or me misusing the workqueue API.
> 
> Is this issue still there if you use cancel_delayed_work_sync()
> instead of cancel_delayed_work()? If not, just switch to that and add a
> proper comment on why it needs to be "sync".
> 
> -Saravana
> 
cancel_delayed_work_sync() cannot solve the issue, because the problem lies
in the interaction between the cancel and queue operations on the same
work. Making the single cancel operation synchronous doesn't help.

>>>
>>> Use deferred_probe_mutex to protect the cancel and queue operations for
>>> the deferred_probe_timeout_work to fix this problem.
>>>
>>> Fixes: 2b28a1a84a0e ("driver core: Extend deferred probe timeout on driver registration")
>>> Cc: <stable@...r.kernel.org>
>>> Signed-off-by: Wang Wensheng <wangwensheng4@...wei.com>
>>> ---
>>>   drivers/base/dd.c | 1 +
>>>   1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
>>> index 13ab98e033ea..00419d2ee910 100644
>>> --- a/drivers/base/dd.c
>>> +++ b/drivers/base/dd.c
>>> @@ -323,6 +323,7 @@ static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_
>>>
>>>   void deferred_probe_extend_timeout(void)
>>>   {
>>> +     guard(mutex)(&deferred_probe_mutex);
>>
>> But if you grab the lock here, in the probe timeout function, the lock
>> will be grabbed again, causing a deadlock, right?  If not, why not?
>>
>> Have you run this patch with lockdep enabled?
>>
>> This feels broken to me, what am I missing?
>>
>> thanks,
>>
>> greg k-h
>