linux-kernel - Re: [RFC] workqueue: Fix irq inversion deadlock in manage

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 9 Oct 2017 17:40:43 +0800
From:   Lai Jiangshan <jiangshanlai@...il.com>
To:     Boqun Feng <boqun.feng@...il.com>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Josef Bacik <josef@...icpanda.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Tejun Heo <tj@...nel.org>
Subject: Re: [RFC] workqueue: Fix irq inversion deadlock in manage_workers()

On Sun, Oct 8, 2017 at 5:02 PM, Boqun Feng <boqun.feng@...il.com> wrote:
> Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
> lockdep:
>
> | [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
> | [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
> | [ 1270.473240] -----------------------------------------------------
> | [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> | [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
> | [ 1270.474994]
> | [ 1270.474994] and this task is already holding:
> | [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
> | [ 1270.476046] which would create a new lock dependency:
> | [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
> | [ 1270.476949]
> | [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
> | [ 1270.477553]  (&pool->lock/1){-.-.}
> ...
> | [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
> | [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
> ...
> | [ 1270.494735]  Possible interrupt unsafe locking scenario:
> | [ 1270.494735]
> | [ 1270.495250]        CPU0                    CPU1
> | [ 1270.495600]        ----                    ----
> | [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
> | [ 1270.496295]                                local_irq_disable();
> | [ 1270.496753]                                lock(&pool->lock/1);
> | [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
> | [ 1270.497744]   <Interrupt>
> | [ 1270.497948]     lock(&pool->lock/1);
>
> , which will cause a irq inversion deadlock if the above lock scenario
> happens.
>
> The root cause of this safe -> unsafe lock order is the
> mutex_unlock(pool::manager_arb) in manage_workers() with pool::lock
> held.

I didn't thought this kind of pattern is very seldom.  I remember I saw several.
  mutex_lock();
  do_something();
  spin_lock_irq();
  record_the_state_for_ do_something().
   // keep the spin lock held to hold the state for do_more_things().
  mutex_unlock(); // unlock() is suggested to be called when just exiting C.S.
  do_more_things();
  spin_unlock_irq();

Was all code of this pattern removed?
Could it be possible that mutex will be changed to allow this?

(If the mutex can't be changed...)
And I think the little more proper fix is to move the 'mutex_unlock();'
down. In the case for manager_arb,  'mutex_unlock();' can be called
in process_one_work() and before the worker sleeps. However,
a variable might be needed to indicate whether it should be called.

It doesn't means I don't like this fix. 'spin_unlock_irq(&pool->lock);'
before 'mutex_unlock();' makes the things more complicated.
More time is needed for understanding it. And I leave one concern
about the worker_leave_idle() below.

> An obvious fix is dropping the pool::lock before mutex_unlock()
> and re-grabing afterwards, which however will introduce a race condition
> between worker_thread() and put_unbound_pool():
>
> put_unbound_pool() will grab both pool::manager_arb and pool::lock to
> set all current IDLE workers to DIE, and may wait on the
> pool::detach_completion for the last worker to detach from the pool.
>
> And when manage_workers() is called, the caller worker_thread is in
> non-ILDE state, so if the worker dropped both pool::{manager_arb, lock}
> and got delayed for a while long enough for a put_unbound_pool(), the
> put_unbound_pool() would not switch that worker to DIE. As a result, the
> worker will not detach from the pool as it's not DIE and the
> put_unbound_pool() will not proceed as it's waiting for the last worker
> to detach, therefore deadlock.
>
> To overcome this, put the worker back to IDLE state before it drops
> pool::lock in manage_workers(), and make the worker check again whether
> it's DIE after it re-grabs the pool::lock. In this way, we fix the
> potential deadlock reported by lockdep without introducing another.
>
> Reported-by: Josef Bacik <josef@...icpanda.com>
> Signed-off-by: Boqun Feng <boqun.feng@...il.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> ---
>  kernel/workqueue.c | 35 ++++++++++++++++++++++++++++++++++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 64d0edf428f8..2ea7b04cc48b 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1997,7 +1997,40 @@ static bool manage_workers(struct worker *worker)
>         maybe_create_worker(pool);
>
>         pool->manager = NULL;
> +
> +       /*
> +        * Put the manager back to ->idle_list, this allows us to drop the
> +        * pool->lock safely without racing with put_unbound_pool()
> +        *
> +        *                                              <in "manager worker" thread>
> +        *                                              worker_thread():
> +        *                                                spin_lock_irq(&pool->lock);
> +        *                                                worker_leave_idle();
> +        *                                                manage_workers(): // return true
> +        *                                                  mutex_trylock(&pool->manager_arb);
> +        *                                                  <without entering idle here>
> +        *                                                  spin_unlock_irq(&pool->lock);
> +        *                                                  mutex_unlock(&pool->manager_arb);
> +        *
> +        *      put_unbound_pool():
> +        *        mutex_lock(&pool->manager_arb);
> +        *        spin_lock_irq(&pool->lock);
> +        *        <set ILDE worker to DIE>
> +        *        <the manager worker is not set to be DIE, because it's not IDLE>
> +        *        ...
> +        *        wait_for_completion(&pool->detach_completion);
> +        *        <no one will complete() because pool->workers is not empty>
> +        *
> +        *                                                spin_lock_irq(&pool->lock);
> +        *                                                <pool->worklist is empty, go to sleep>
> +        *
> +        * No one is going to wake up the manager worker, even so, it won't
> +        * complete(->detach_completion), since it's not a DIE worker.
> +        */
> +       worker_enter_idle(worker);
> +       spin_unlock_irq(&pool->lock);
>         mutex_unlock(&pool->manager_arb);
> +       spin_lock_irq(&pool->lock);
>         return true;
>  }
>
> @@ -2202,6 +2235,7 @@ static int worker_thread(void *__worker)
>  woke_up:
>         spin_lock_irq(&pool->lock);
>
> +recheck:
>         /* am I supposed to die? */
>         if (unlikely(worker->flags & WORKER_DIE)) {
>                 spin_unlock_irq(&pool->lock);
> @@ -2216,7 +2250,6 @@ static int worker_thread(void *__worker)
>         }
>
>         worker_leave_idle(worker);

I think worker_leave_idle() might be called multiple times,
which might cause bugs, since recheck is moved up.

> -recheck:
>         /* no more worker necessary? */
>         if (!need_more_worker(pool))
>                 goto sleep;
> --
> 2.14.1
>