linux-kernel - Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the reader waiter to nil

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20181129131232.GN2131@hirez.programming.kicks-ass.net>
Date:   Thu, 29 Nov 2018 14:12:32 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Yongji Xie <elohimes@...il.com>
Cc:     mingo@...hat.com, will.deacon@....com,
        linux-kernel@...r.kernel.org, xieyongji@...du.com,
        zhangyu31@...du.com, liuqi16@...du.com, yuanlinsi01@...du.com,
        nixun@...du.com, lilin24@...du.com,
        Davidlohr Bueso <dave@...olabs.net>,
        Waiman Long <longman@...hat.com>
Subject: Re: [RFC] locking/rwsem: Avoid issuing wakeup before setting the
 reader waiter to nil


+Cc davidlohr and waiman

On Thu, Nov 29, 2018 at 08:50:30PM +0800, Yongji Xie wrote:
> From: Xie Yongji <xieyongji@...du.com>
> 
> Our system encountered a problem recently, the khungtaskd detected
> some process hang on mmap_sem. But the odd thing was that one task which
> is not on mmap_sem.wait_list still sleeps in rwsem_down_read_failed().
> Through code inspection, we found a potential bug can lead to this.
> 
> Imaging this:
> 
> Thread 1                                  Thread 2
>                                           down_write();
> rwsem_down_read_failed()
>  raw_spin_lock_irq(&sem->wait_lock);
>  list_add_tail(&waiter.list, &wait_list);
>  raw_spin_unlock_irq(&sem->wait_lock);
>                                           __up_write();
>                                            rwsem_wake();
>                                             __rwsem_mark_wake();
>                                              wake_q_add();
>                                              list_del(&waiter->list);
>                                              waiter->task = NULL;
>  while (true) {
>   set_current_state(TASK_UNINTERRUPTIBLE);
>   if (!waiter.task) // true
>       break;
>  }
>  __set_current_state(TASK_RUNNING);
> 
> Now Thread 1 is queued in Thread 2's wake_q without sleeping. Then
> Thread 1 call rwsem_down_read_failed() again because Thread 3
> hold the lock, if Thread 3 tries to queue Thread 1 before Thread 2
> do wakeup, it will fail and miss wakeup:
> 
> Thread 1                                  Thread 2      Thread 3
>                                                         down_write();
> rwsem_down_read_failed()
>  raw_spin_lock_irq(&sem->wait_lock);
>  list_add_tail(&waiter.list, &wait_list);
>  raw_spin_unlock_irq(&sem->wait_lock);
>                                                         __rwsem_mark_wake();
>                                                          wake_q_add();
>                                           wake_up_q();
>                                                          waiter->task = NULL;
>  while (true) {
>   set_current_state(TASK_UNINTERRUPTIBLE);
>   if (!waiter.task) // false
>       break;
>   schedule();
>  }
>                                                         wake_up_q(&wake_q);
> 
> In another word, that means we might issue the wakeup before setting the reader
> waiter to nil. If so, the wakeup may do nothing when it was called before reader
> set task state to TASK_UNINTERRUPTIBLE. Then we would have no chance to wake up
> the reader any more, and cause other writers such as "ps" command stuck on it.
> 
> This patch is not verified because we still have no way to reproduce the problem.
> But I'd like to ask for some comments from community firstly.

Urgh; so the case where the cmpxchg() fails because it already has a
wakeup in progress, which then 'violates' our expectation of when the
wakeup happens.

Yes, I think this is real, and worse, I think we need to go audit all
wake_q_add() users and document this behaviour.

In the ideal case we'd delay the actual wakeup to the last wake_up_q(),
but I don't think we can easily fix that.

> Signed-off-by: Xie Yongji <xieyongji@...du.com>
> Signed-off-by: Zhang Yu <zhangyu31@...du.com>
> ---
>  kernel/locking/rwsem-xadd.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index 09b1800..50d9af6 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -198,15 +198,22 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
>  		woken++;
>  		tsk = waiter->task;
>  
> -		wake_q_add(wake_q, tsk);
> +		get_task_struct(tsk);
>  		list_del(&waiter->list);
>  		/*
> -		 * Ensure that the last operation is setting the reader
> +		 * Ensure calling get_task_struct() before setting the reader
>  		 * waiter to nil such that rwsem_down_read_failed() cannot
>  		 * race with do_exit() by always holding a reference count
>  		 * to the task to wakeup.
>  		 */
>  		smp_store_release(&waiter->task, NULL);
> +		/*
> +		 * Ensure issuing the wakeup (either by us or someone else)
> +		 * after setting the reader waiter to nil.
> +		 */
> +		wake_q_add(wake_q, tsk);
> +		/* wake_q_add() already take the task ref */
> +		put_task_struct(tsk);
>  	}
>  
>  	adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
> -- 
> 2.2.3
>