linux-kernel - Re: [PATCH wq/for-4.0-fixes v2] workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAAObsKBX+h8PNB-81-pjnvoSDA1sgOyckT57inMi_FPX+k=azQ@mail.gmail.com>
Date:	Thu, 5 Mar 2015 10:24:50 +0100
From:	Tomeu Vizoso <tomeu.vizoso@...il.com>
To:	Tejun Heo <tj@...nel.org>
Cc:	Jesper Nilsson <jesper.nilsson@...s.com>,
	Rabin Vincent <rabinv@...s.com>,
	Jesper Nilsson <jespern@...s.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH wq/for-4.0-fixes v2] workqueue: fix hang involving racing
 cancel[_delayed]_work_sync()'s for PREEMPT_NONE

On 3 March 2015 at 14:57, Tomeu Vizoso <tomeu.vizoso@...il.com> wrote:
> On 3 March 2015 at 14:45, Tejun Heo <tj@...nel.org> wrote:
>> Hello,
>>
>> Found the culprit.  Plain wake_up() shouldn't be used on
>> bit_waitqueues.  The following patch should fix the issue.  I replaced
>> the patch in the wq branches.
>
> Yup, this looks good from here.

Actually, I'm getting this when unloading mwifiex_sdio:

[  317.201212] Unable to handle kernel paging request at virtual
address f081a680
[  317.208522] pgd = ebb70000
[  317.211302] [f081a680] *pgd=ab868811, *pte=00000000, *ppte=00000000
[  317.217692] Internal error: Oops: 7 [#1] SMP ARM
[  317.222328] Modules linked in: cfg80211(-) bluetooth ipv6 [last
unloaded: mwifiex]
[  317.230019] CPU: 2 PID: 682 Comm: modprobe Not tainted
4.0.0-rc2-next-20150305ccu-00013-g2b3b1a8 #589
[  317.239271] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
[  317.245568] task: eb9cd580 ti: ebcea000 task.ti: ebcea000
[  317.251001] PC is at bit_waitqueue+0x38/0x6c
[  317.255312] LR is at __cancel_work_timer+0x28/0x1b0
[  317.260215] pc : [<c028fe18>]    lr : [<c0270d34>]    psr: 20000113
[  317.260215] sp : ebcebe90  ip : ee838000  fp : ebcebe9c
[  317.271737] r10: 00000000  r9 : ebcea000  r8 : c0210ba4
[  317.276991] r7 : 00000081  r6 : 2a07e3b0  r5 : bf134948  r4 : 00000000
[  317.283552] r3 : 9e370001  r2 : 0000c640  r1 : e2692904  r0 : 000ff134
[  317.290112] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  317.297286] Control: 10c5387d  Table: abb7006a  DAC: 00000015
[  317.303060] Process modprobe (pid: 682, stack limit = 0xebcea220)
[  317.309186] Stack: (0xebcebe90 to 0xebcec000)
[  317.313568] be80:                                     ebcebefc
ebcebea0 c0270d34 c028fdec
[  317.321781] bea0: c05062d0 c034e57c c1210824 ebe94be0 ebcebedc
eb8f080c bf134830 2a07e3b0
[  317.329998] bec0: 00000081 c0210ba4 ebcebef4 ebcebed8 c0506428
bf139830 bf1348dc 2a07e3b0
[  317.338220] bee0: 00000081 c0210ba4 ebcea000 00000000 ebcebf0c
ebcebf00 c0270ed8 c0270d18
[  317.346437] bf00: ebcebf34 ebcebf10 bf0ed138 c0270ec8 c11cb4f0
bf139830 bf134800 2a07e3b0
[  317.354655] bf20: 00000081 c0210ba4 ebcebf4c ebcebf38 bf125184
bf0ed120 bf1396b8 2a07f76c
[  317.362872] bf40: ebcebfa4 ebcebf50 c02c8b4c bf125158 ebcebf6c
38676663 31313230 00000000
[  317.372482] bf60: ebcebf8c ebcebf70 c0273e98 c0adcdb0 ebcea010
c0210ba4 ebcebfb0 ebcea000
[  317.382086] bf80: ebcebfac ebcebf90 c0214484 00273de0 2a07f738
2a07f738 00000000 ebcebfa8
[  317.391706] bfa0: c0210a00 c02c89a4 2a07f738 2a07f738 2a07f76c
00000800 0fea3100 00000000
[  317.401319] bfc0: 2a07f738 2a07f738 2a07e3b0 00000081 00000000
00000002 2a07e310 becc9db4
[  317.410948] bfe0: 2a07d12c becc89ac 2a061623 b6eb82e6 60000030
2a07f76c 00000000 00000000
[  317.420658] [<c028fe18>] (bit_waitqueue) from [<c0270d34>]
(__cancel_work_timer+0x28/0x1b0)
[  317.430598] [<c0270d34>] (__cancel_work_timer) from [<c0270ed8>]
(cancel_work_sync+0x1c/0x20)
[  317.440672] [<c0270ed8>] (cancel_work_sync) from [<bf0ed138>]
(regulatory_exit+0x24/0x17c [cfg80211])
[  317.451396] [<bf0ed138>] (regulatory_exit [cfg80211]) from
[<bf125184>] (cfg80211_exit+0x38/0x4c [cfg80211])
[  317.462726] [<bf125184>] (cfg80211_exit [cfg80211]) from
[<c02c8b4c>] (SyS_delete_module+0x1b4/0x1f8)
[  317.473411] [<c02c8b4c>] (SyS_delete_module) from [<c0210a00>]
(ret_fast_syscall+0x0/0x34)

Regards,

Tomeu

> Thanks,
>
> Tomeu
>
>> Thanks a lot.
>>
>> ----- 8< -----
>> From bba5a377507efef69214108c0d3790cadfab21d4 Mon Sep 17 00:00:00 2001
>> From: Tejun Heo <tj@...nel.org>
>> Date: Tue, 3 Mar 2015 08:43:09 -0500
>>
>> cancel[_delayed]_work_sync() are implemented using
>> __cancel_work_timer() which grabs the PENDING bit using
>> try_to_grab_pending() and then flushes the work item with PENDING set
>> to prevent the on-going execution of the work item from requeueing
>> itself.
>>
>> try_to_grab_pending() can always grab PENDING bit without blocking
>> except when someone else is doing the above flushing during
>> cancelation.  In that case, try_to_grab_pending() returns -ENOENT.  In
>> this case, __cancel_work_timer() currently invokes flush_work().  The
>> assumption is that the completion of the work item is what the other
>> canceling task would be waiting for too and thus waiting for the same
>> condition and retrying should allow forward progress without excessive
>> busy looping
>>
>> Unfortunately, this doesn't work if preemption is disabled or the
>> latter task has real time priority.  Let's say task A just got woken
>> up from flush_work() by the completion of the target work item.  If,
>> before task A starts executing, task B gets scheduled and invokes
>> __cancel_work_timer() on the same work item, its try_to_grab_pending()
>> will return -ENOENT as the work item is still being canceled by task A
>> and flush_work() will also immediately return false as the work item
>> is no longer executing.  This puts task B in a busy loop possibly
>> preventing task A from executing and clearing the canceling state on
>> the work item leading to a hang.
>>
>> task A                  task B                  worker
>>
>>                                                 executing work
>> __cancel_work_timer()
>>   try_to_grab_pending()
>>   set work CANCELING
>>   flush_work()
>>     block for work completion
>>                                                 completion, wakes up A
>>                         __cancel_work_timer()
>>                         while (forever) {
>>                           try_to_grab_pending()
>>                             -ENOENT as work is being canceled
>>                           flush_work()
>>                             false as work is no longer executing
>>                         }
>>
>> This patch removes the possible hang by updating __cancel_work_timer()
>> to explicitly wait for clearing of CANCELING rather than invoking
>> flush_work() after try_to_grab_pending() fails with -ENOENT.  The
>> explicit wait uses the matching bit waitqueue for the CANCELING bit.
>>
>> Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com
>>
>> v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
>>     the target bit waitqueue has wait_bit_queue's on it.  Use
>>     DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
>>     Vizoso.
>>
>> Signed-off-by: Tejun Heo <tj@...nel.org>
>> Reported-by: Rabin Vincent <rabin.vincent@...s.com>
>> Cc: Tomeu Vizoso <tomeu.vizoso@...il.com>
>> Cc: stable@...r.kernel.org
>> Tested-by: Jesper Nilsson <jesper.nilsson@...s.com>
>> Tested-by: Rabin Vincent <rabin.vincent@...s.com>
>> ---
>>  include/linux/workqueue.h |  3 ++-
>>  kernel/workqueue.c        | 37 +++++++++++++++++++++++++++++++++----
>>  2 files changed, 35 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
>> index 74db135..f597846 100644
>> --- a/include/linux/workqueue.h
>> +++ b/include/linux/workqueue.h
>> @@ -70,7 +70,8 @@ enum {
>>         /* data contains off-queue information when !WORK_STRUCT_PWQ */
>>         WORK_OFFQ_FLAG_BASE     = WORK_STRUCT_COLOR_SHIFT,
>>
>> -       WORK_OFFQ_CANCELING     = (1 << WORK_OFFQ_FLAG_BASE),
>> +       __WORK_OFFQ_CANCELING   = WORK_OFFQ_FLAG_BASE,
>> +       WORK_OFFQ_CANCELING     = (1 << __WORK_OFFQ_CANCELING),
>>
>>         /*
>>          * When a work item is off queue, its high bits point to the last
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index f288493..cfbae1d 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -2730,17 +2730,37 @@ EXPORT_SYMBOL_GPL(flush_work);
>>
>>  static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
>>  {
>> +       wait_queue_head_t *waitq = bit_waitqueue(&work->data,
>> +                                                __WORK_OFFQ_CANCELING);
>>         unsigned long flags;
>>         int ret;
>>
>>         do {
>>                 ret = try_to_grab_pending(work, is_dwork, &flags);
>>                 /*
>> -                * If someone else is canceling, wait for the same event it
>> -                * would be waiting for before retrying.
>> +                * If someone else is already canceling, wait for it to
>> +                * finish.  flush_work() doesn't work for PREEMPT_NONE
>> +                * because we may get scheduled between @work's completion
>> +                * and the other canceling task resuming and clearing
>> +                * CANCELING - flush_work() will return false immediately
>> +                * as @work is no longer busy, try_to_grab_pending() will
>> +                * return -ENOENT as @work is still being canceled and the
>> +                * other canceling task won't be able to clear CANCELING as
>> +                * we're hogging the CPU.
>> +                *
>> +                * Explicitly wait for completion using a bit waitqueue.
>> +                * We can't use wait_on_bit() as the CANCELING bit may get
>> +                * recycled to point to pwq if @work gets re-queued.
>>                  */
>> -               if (unlikely(ret == -ENOENT))
>> -                       flush_work(work);
>> +               if (unlikely(ret == -ENOENT)) {
>> +                       DEFINE_WAIT_BIT(wait, &work->data,
>> +                                       __WORK_OFFQ_CANCELING);
>> +                       prepare_to_wait(waitq, &wait.wait,
>> +                                       TASK_UNINTERRUPTIBLE);
>> +                       if (work_is_canceling(work))
>> +                               schedule();
>> +                       finish_wait(waitq, &wait.wait);
>> +               }
>>         } while (unlikely(ret < 0));
>>
>>         /* tell other tasks trying to grab @work to back off */
>> @@ -2749,6 +2769,15 @@ static bool __cancel_work_timer(struct work_struct *work, bool is_dwork)
>>
>>         flush_work(work);
>>         clear_work_data(work);
>> +
>> +       /*
>> +        * Paired with prepare_to_wait() above so that either
>> +        * __wake_up_bit() sees busy waitq here or !work_is_canceling() is
>> +        * visible there.
>> +        */
>> +       smp_mb();
>> +       __wake_up_bit(waitq, &work->data, __WORK_OFFQ_CANCELING);
>> +
>>         return ret;
>>  }
>>
>> --
>> 2.1.0
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/