linux-kernel - Re: [PATCH workqueue/for-4.14-fixes] workqueue: replace pool->manager

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJhGHyAQuArNBTWcAr5Dq_Lj58yMYUs_01ObNGTJrCt=3Ni3rg@mail.gmail.com>
Date:   Mon, 9 Oct 2017 23:02:34 +0800
From:   Lai Jiangshan <jiangshanlai@...il.com>
To:     Tejun Heo <tj@...nel.org>
Cc:     Boqun Feng <boqun.feng@...il.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Josef Bacik <josef@...icpanda.com>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH workqueue/for-4.14-fixes] workqueue: replace
 pool->manager_arb mutex with a flag

On Mon, Oct 9, 2017 at 9:21 PM, Tejun Heo <tj@...nel.org> wrote:
> Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
> lockdep:
>
>  [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
>  [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
>  [ 1270.473240] -----------------------------------------------------
>  [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
>  [ 1270.474239]  (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
>  [ 1270.474994]
>  [ 1270.474994] and this task is already holding:
>  [ 1270.475440]  (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
>  [ 1270.476046] which would create a new lock dependency:
>  [ 1270.476436]  (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
>  [ 1270.476949]
>  [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
>  [ 1270.477553]  (&pool->lock/1){-.-.}
>  ...
>  [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
>  [ 1270.489327]  (&(&lock->wait_lock)->rlock){+.+.}
>  ...
>  [ 1270.494735]  Possible interrupt unsafe locking scenario:
>  [ 1270.494735]
>  [ 1270.495250]        CPU0                    CPU1
>  [ 1270.495600]        ----                    ----
>  [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
>  [ 1270.496295]                                local_irq_disable();
>  [ 1270.496753]                                lock(&pool->lock/1);
>  [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
>  [ 1270.497744]   <Interrupt>
>  [ 1270.497948]     lock(&pool->lock/1);
>
> , which will cause a irq inversion deadlock if the above lock scenario
> happens.
>
> The root cause of this safe -> unsafe lock order is the
> mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
> held.
>
> Unlocking mutex while holding an irq spinlock was never safe and this
> problem has been around forever but it never got noticed because the
> only time the mutex is usually trylocked while holding irqlock making
> actual failures very unlikely and lockdep annotation missed the
> condition until the recent b9c16a0e1f73 ("locking/mutex: Fix
> lockdep_assert_held() fail").
>
> Using mutex for pool->manager_arb has always been a bit of stretch.
> It primarily is an mechanism to arbitrate managership between workers
> which can easily be done with a pool flag.  The only reason it became
> a mutex is that pool destruction path wants to exclude parallel
> managing operations.
>
> This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
> and make the destruction path wait for the current manager on a wait
> queue.
>
> Signed-off-by: Tejun Heo <tj@...nel.org>
> Reported-by: Josef Bacik <josef@...icpanda.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Boqun Feng <boqun.feng@...il.com>
> Cc: stable@...r.kernel.org
> ---
> Hello,
>
> Boqun, thanks for the patch and explanation and I shamelessly lifted
> parts of your patch description.  I took an alternative approach
> because there's no reason to make this any more complex when the issue
> is essentially caused by abusing mutex.
>
> The patch seems to work fine but, Lai, can you please review the
> patch?
>
> Thanks.
>
>  kernel/workqueue.c |   39 ++++++++++++++++++---------------------
>  1 file changed, 18 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 64d0edf..8739b6de 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -68,6 +68,7 @@ enum {
>          * attach_mutex to avoid changing binding state while
>          * worker_attach_to_pool() is in progress.
>          */
> +       POOL_MANAGER_ACTIVE     = 1 << 0,       /* being managed */
>         POOL_DISASSOCIATED      = 1 << 2,       /* cpu can't serve workers */
>
>         /* worker flags */
> @@ -165,7 +166,6 @@ struct worker_pool {
>                                                 /* L: hash of busy workers */
>
>         /* see manage_workers() for details on the two manager mutexes */
> -       struct mutex            manager_arb;    /* manager arbitration */
>         struct worker           *manager;       /* L: purely informational */
>         struct mutex            attach_mutex;   /* attach/detach exclusion */
>         struct list_head        workers;        /* A: attached workers */
> @@ -299,6 +299,7 @@ static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
>
>  static DEFINE_MUTEX(wq_pool_mutex);    /* protects pools and workqueues list */
>  static DEFINE_SPINLOCK(wq_mayday_lock);        /* protects wq->maydays list */
> +static DECLARE_WAIT_QUEUE_HEAD(wq_manager_wait); /* wait for manager to go away */
>
>  static LIST_HEAD(workqueues);          /* PR: list of all workqueues */
>  static bool workqueue_freezing;                /* PL: have wqs started freezing? */
> @@ -801,7 +802,7 @@ static bool need_to_create_worker(struct worker_pool *pool)
>  /* Do we have too many workers and should some go away? */
>  static bool too_many_workers(struct worker_pool *pool)
>  {
> -       bool managing = mutex_is_locked(&pool->manager_arb);
> +       bool managing = pool->flags & POOL_MANAGER_ACTIVE;
>         int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
>         int nr_busy = pool->nr_workers - nr_idle;
>
> @@ -1980,24 +1981,17 @@ static bool manage_workers(struct worker *worker)
>  {
>         struct worker_pool *pool = worker->pool;
>
> -       /*
> -        * Anyone who successfully grabs manager_arb wins the arbitration
> -        * and becomes the manager.  mutex_trylock() on pool->manager_arb
> -        * failure while holding pool->lock reliably indicates that someone
> -        * else is managing the pool and the worker which failed trylock
> -        * can proceed to executing work items.  This means that anyone
> -        * grabbing manager_arb is responsible for actually performing
> -        * manager duties.  If manager_arb is grabbed and released without
> -        * actual management, the pool may stall indefinitely.
> -        */
> -       if (!mutex_trylock(&pool->manager_arb))
> +       if (pool->flags & POOL_MANAGER_ACTIVE)
>                 return false;
> +
> +       pool->flags |= POOL_MANAGER_ACTIVE;
>         pool->manager = worker;

Hello, tj

I don't find anything wrong with this patch so far.
It is good though I haven't tested it.

I was also thinking alternative code when reviewing.
The first is quite obvious. Testing POOL_MANAGER_ACTIVE
can be replaced by testing pool->manager.
And POOL_MANAGER_ACTIVE  is not needed. Isn't it?

The second thing is to make manage_workers()
and put_unbound_pool() exclusive.
Waiting event on POOL_MANAGER_ACTIVE(or pool->manager)
is one way. However, the pool's refcnt is not possible to
be dropped to zero now since the caller still hold the pool->lock
and some pwds of the works in the worklist. So the other way
to enforce the exclusive could be just doing
get_pwq(the first pwd of the worklist) and put_pwq() when
the manage_workers() done. And the code about
pool->manager_arb in put_unbound_pool() can be
simply removed.

I just want to send my thinking earlier and I didn't
think deeply enough.

What do you think.

thx
Lai


>
>         maybe_create_worker(pool);
>
>         pool->manager = NULL;
> -       mutex_unlock(&pool->manager_arb);
> +       pool->flags &= ~POOL_MANAGER_ACTIVE;
> +       wake_up(&wq_manager_wait);
>         return true;
>  }
>
> @@ -3248,7 +3242,6 @@ static int init_worker_pool(struct worker_pool *pool)
>         setup_timer(&pool->mayday_timer, pool_mayday_timeout,
>                     (unsigned long)pool);
>
> -       mutex_init(&pool->manager_arb);
>         mutex_init(&pool->attach_mutex);
>         INIT_LIST_HEAD(&pool->workers);
>
> @@ -3318,13 +3311,14 @@ static void put_unbound_pool(struct worker_pool *pool)
>         hash_del(&pool->hash_node);
>
>         /*
> -        * Become the manager and destroy all workers.  Grabbing
> -        * manager_arb prevents @pool's workers from blocking on
> -        * attach_mutex.
> +        * Become the manager and destroy all workers.  Becoming manager
> +        * prevents @pool's workers from blocking on attach_mutex.
>          */
> -       mutex_lock(&pool->manager_arb);
> -
>         spin_lock_irq(&pool->lock);
> +       wait_event_lock_irq(wq_manager_wait,
> +                           !(pool->flags & POOL_MANAGER_ACTIVE), pool->lock);
> +       pool->flags |= POOL_MANAGER_ACTIVE;
> +
>         while ((worker = first_idle_worker(pool)))
>                 destroy_worker(worker);
>         WARN_ON(pool->nr_workers || pool->nr_idle);
> @@ -3338,7 +3332,10 @@ static void put_unbound_pool(struct worker_pool *pool)
>         if (pool->detach_completion)
>                 wait_for_completion(pool->detach_completion);
>
> -       mutex_unlock(&pool->manager_arb);
> +       spin_lock_irq(&pool->lock);
> +       pool->flags &= ~POOL_MANAGER_ACTIVE;
> +       wake_up(&wq_manager_wait);
> +       spin_unlock_irq(&pool->lock);
>
>         /* shut down the timers */
>         del_timer_sync(&pool->idle_timer);