linux-kernel - Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20161123192005.GA3107@twins.programming.kicks-ass.net>
Date:   Wed, 23 Nov 2016 20:20:05 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     mingo@...nel.org, juri.lelli@....com, rostedt@...dmis.org,
        xlpang@...hat.com, bigeasy@...utronix.de,
        linux-kernel@...r.kernel.org, mathieu.desnoyers@...icios.com,
        jdesfossez@...icios.com, bristot@...hat.com
Subject: Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI

On Thu, Oct 27, 2016 at 10:36:00PM +0200, Thomas Gleixner wrote:
> > +	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
> > +	if (!new_owner) {
> > +		/*
> > +		 * This is the case where futex_lock_pi() has not yet or failed
> > +		 * to acquire the lock but still has the futex_q enqueued. So
> > +		 * the futex state has a 'waiter' while the rt_mutex state does
> > +		 * not.
> > +		 *
> > +		 * Even though there still is pi_state for this futex, we can
> > +		 * clear FUTEX_WAITERS. Either:
> > +		 *
> > +		 *  - we or futex_lock_pi() will drop the last reference and
> > +		 *    clean up this pi_state,
> > +		 *
> > +		 *  - userspace acquires the futex through its fastpath
> > +		 *    and the above pi_state cleanup still happens,
> > +		 *
> > +		 *  - or futex_lock_pi() will re-set the WAITERS bit in
> > +		 *    fixup_owner().
> > +		 */
> > +		newval = 0;
> > +		/*
> > +		 * Since pi_state->owner must point to a valid task, and
> > +		 * task_pid_vnr(pi_state->owner) must match TID_MASK, use
> > +		 * init_task.
> > +		 */
> > +		new_owner = &init_task;
> 
> So that waiter which is now spinning on pi_mutex->lock has already set the
> waiters bit, which you undo. So you created the following scenario:
> 
> CPU0	     	       	  CPU1				CPU2
> 
> TID 0x1001		  TID 0x1000			TID 0x1002
> 
> 			  Acquires futex in user space
>     			  futex value = 0x1000;
> 
> futex_lock_pi()
>   
>   lock_hb()
>   set_waiter_bit()
>   --> futex value = 0x40001000;
> 
> 			  futex_unlock_pi()
> 			   lock_hb()
>   attach_to_pi_owner()
>     rt_mutex_init_proxy_locked()
>   queue_me()
>     unlock_hb()
> 			   unlock_hb()
>   rt_mutex_lock()	   wake_futex_pi()
>    			   lock(pi_mutex->lock);
>    lock(pi_mutex->lock)	   new_owner is NULL;
> 		 	    --> futex value = 0;
> 			   rt_mutex_futex_unlock(pi_mutex);
> 			   unlock(pi_mutex->lock);
>    acquire_rt_mutex()	   return to user space
> 							Acquires futex in user space
> 							--> futex value = 0x1002
>   fixup_owner()
>     fixup_pi_state_owner()
>        uval = 0x1002;
>        newval = 0x40001001;
>        cmpxchg(uval, newval) succeeds
>        --> futex value = 0x40001001
> 
> Voila. Inconsistent state .... TID 0x1002 is borked now.

Urgh, right.

> The other option we have is to set the futex value to FUTEX_WAITERS instead
> of 0.

Yeah, I initially did that but didn't really like it, I then went on to
convince myself setting it to 0 was good, silly me. Leaking the WAITERS
bit is _by_far_ the simplest option though.

Ok, I went and implemented various broken and discarded alternatives
while re-learning all about futexes that I forgot the past few weeks,
while trying to figure out wtf the below does.

I also tried to create some 4+ CPU races that would hit holes in the
below, no luck so far.

> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2246,16 +2246,27 @@ static int fixup_pi_state_owner(u32 __us
>  	if (get_futex_value_locked(&uval, uaddr))
>  		goto handle_fault;
>  
> +	/*
> +	 * If wake_futex_pi() set the futex to 0 and made init_task the
> +	 * transient owner another task might have acquired the futex
> +	 * in user space.
> +	 */

True, but that doesn't explain why we do this. Naively leaving
pi_state->owner set while returning EAGAIN shouldn't be a problem
because put_pi_state() should clean that up.

_However_, that does rt_mutex_proxy_unlock() on it as well, and _that_
is a problem, because it's not the rt_mutex owner.

Then again, we can hit this very problem through any of the other
put_pi_state() calls after setting ->owner = &init_task I think. Which
would argue to special case this in put_pi_state() instead.

> +	if (oldowner == &init_task && uval != 0) {
> +		raw_spin_lock(&pi_state->owner->pi_lock);
> +		list_del_init(&pi_state->list);
> +		raw_spin_unlock(&pi_state->owner->pi_lock);
> +		pi_state->owner = NULL;
> +		return -EAGAIN;
>  	}
>  
> +	newval = (uval & FUTEX_OWNER_DIED) | newtid;
> +
> +	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
> +		goto handle_fault;
> +
> +	if (curval != uval)
> +		goto retry;

This is slightly weird in that we loose the obvious cmpxchg loop
construct. So I'd write it differently, also that
get_futex_value_locked() call is entirely superfluous the second time
around, we got curval after all.

> +
>  	/*
>  	 * We fixed up user space. Now we need to fix the pi_state
>  	 * itself.
> @@ -2679,6 +2690,10 @@ static int futex_lock_pi(u32 __user *uad
>  
>  out_put_key:
>  	put_futex_key(&q.key);
> +
> +	if (ret == -EAGAIN)
> +		goto retry;
> +

And this is far too clever and really needs a comment. So the crucial
point is that this is after unqueue_me_pi(), which drops the pi_state
and loops back to lookup the pi_state again, which, hopefully, has now
been completely destroyed -- and therefore we hit the regular
attach_to_pi_owner() path, fixing up our 'funny' state.

This is where I was playing with 4+ CPU scenarios, to see if we could
somehow keep the pi_state alive and not make progress.

I think we can do the retry slightly earlier, right after
unqueue_me_pi() and then add retry_queue: right before
futex_lock_pi_atomic(), that would avoid dropping the hb->lock (and
avoids put/get_futex_key).

>  out:
>  	if (to)
>  		destroy_hrtimer_on_stack(&to->timer);

Also, since fixup_pi_state_owner() can now return -EAGAIN, all users
need handling for that.


I'll try and do a patch that does all that and attempt to write coherent
comments on our fancy new state.