linux-kernel - Re: workqueue: race in mod_delayed_work

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <57389F01.309@yandex-team.ru>
Date:	Sun, 15 May 2016 19:08:33 +0300
From:	Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
To:	Tejun Heo <tj@...nel.org>
Cc:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Sasha Levin <sasha.levin@...cle.com>
Subject: Re: workqueue: race in mod_delayed_work_on?

On 13.05.2016 16:49, Konstantin Khlebnikov wrote:
> On 12.05.2016 16:06, Konstantin Khlebnikov wrote:
>> On 10.05.2016 20:20, Konstantin Khlebnikov wrote:
>>> On 10.05.2016 19:36, Tejun Heo wrote:
>>>> Hello,
>>>>
>>>> On Tue, May 10, 2016 at 07:28:08PM +0300, Konstantin Khlebnikov wrote:
>>>>> On 10.05.2016 11:21, Konstantin Khlebnikov wrote:
>>>>>> I've got plenty warnings, bugs and oops around trivial use of mod_delayed_work in drivers/infiniband/core/addr.c
>>>>>
>>>>> Looks like problem in mod_delayed_work_on was hidden because add_timer is equal to mod_timer
>>>>
>>>> The timer usages are gated behind PENDING bit, so whether add_timer()
>>>> is equal to mod_timer() shouldn't matter.
>>>
>>> Hmm... this looks little bit more complicated than one bit.
>>
>> Yep, problem was here - both timer and work can be active at the same time.
>
> Nope, this is impossible. This will be a bug itself.
>
>>
>> So try_to_grab_pending can return success for two coRecentncurrent callers:
>> first get del_timer, second removes work from workqueue. After that
>> both call add timer and one of them either catch BUG_ON or corrupt timer list.
>>
>> I see two possible fixes: always remove timer and work in try_to_grab_pending
>> but this must be carefully synchronized. This will make it slower for sure.
>> Or always use mod_timer in  __queue_delayed_work() - both callers will modify timer,
>> but here is no mod_timer_on().
>>
>>>
>>>>
>>>>> but Sasha accidentally backported 874bbfe600a660cba9c776b3957b1ce393151b76
>>>>> (workqueue: make sure delayed work run in local cpu) into 3.18.25
>>>>>
>>>>> I don't see reason why that commit could break delayed work,
>>>>> most likely it highlighted some other problem.

Indeed, my problem was caused by that backported commit because branch 3.18.y
has no your fix for add_timer_on() -- 22b886dd1018093920c4250dee2a9a3cb7cff7b8
("timers: Use proper base migration in add_timer_on()"). Somehow I've missed that.

>>>>
>>>> What are you running?  Can you reproduce the issue on upstream kernel?
>>>>
>>>
>>> This is slight patched 3.18.y. Looks like this started when we upgraded kernel to 3.18.25 and
>>> somebody have loaded module ib_addr (ip in infiniband or something) which actually unused
>>> because these machines have no infiniband at all. But this code is poked from ethernet arp
>>> sometimes. So, it crashes somewhere from time to time. I'll try to stresstest this piece.
>>>
>>
>

-- 
Konstantin