linux-kernel - Re: [PATCH] [RESEND 2] Take over futex of dead task only if FUTEX

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5088C10C.80503@linux.intel.com>
Date:	Wed, 24 Oct 2012 21:33:16 -0700
From:	Darren Hart <dvhart@...ux.intel.com>
To:	Thomas Gleixner <tglx@...utronix.de>
CC:	Siddhesh Poyarekar <siddhesh.poyarekar@...il.com>,
	LKML <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] [RESEND 2] Take over futex of dead task only if FUTEX_WAITERS
 is not set



On 10/23/2012 01:29 PM, Thomas Gleixner wrote:
> Darren, Siddhesh,
> 
> On Tue, 23 Oct 2012, Darren Hart wrote:
> 
>> Hi Siddesh,
>>
>> Thanks for the patch and your work to isolate it in the glibc bug 14076.
>>
>> On 10/21/2012 08:20 PM, Siddhesh Poyarekar wrote:
>>> In futex_lock_pi_atomic, we consider that if the value in the futex
>>> variable is 0 with additional flags, then it is safe for takeover
>>> since the owner of the futex is dead.  However, when FUTEX_WAITERS is
>>> set in the futex value, handle_futex_death calls futex_wake to wake up
>>> one task. 
>>
>> It shouldn't for PI mutexes. It should just set the FUTEX_OWNER_DIED flag,
>> maintaining the FUTEX_WAITERS flag, and exit.
>>
>> int handle_futex_death(...
>> ...
>> 		/*
>> 		 * Wake robust non-PI futexes here. The wakeup of
>> 		 * PI futexes happens in exit_pi_state():
>> 		 */
>> 		if (!pi && (uval & FUTEX_WAITERS))
>> 			futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
> 
> Yes, the description of the problem is slightly wrong, but it still
> pinpoints the real wreckage.
> 
>>> Hence the assumption in futex_lock_pi_atomic is not correct.
>>> The correct assumption is that a futex may be considered safe for a
>>> takeover if The FUTEX_OWNER_DIED bit is set, the TID bits are 0 and
>>> the FUTEX_WAITERS bit is not set.
> ...
>>> -	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
>>> +	if (unlikely(ownerdied ||
>>> +			!(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {
> 
> This solves the problem at hand, but I'm not too happy with the
> solution. One of the real possible scenarios which expose the problem
> is:
> 
> Futex F is initialized with PTHREAD_PRIO_INHERIT and
> PTHREAD_MUTEX_ROBUST_NP attributes.
> 
> T1 lock_futex_pi(F);
> 
> T2 lock_futex_pi(F);
> 
>    --> T2 blocks on the futex and creates pi_state which is associated
>        to T1.
> 
> T1 exits
> 
>    --> exit_robust_list() runs
> 
>        --> Futex F userspace value TID field is set to 0 and
>            FUTEX_OWNER_DIED bit is set.
>        	   
> T3 lock_futex_pi(F);
> 
>    --> Succeeds due to the check for F's userspace TID field == 0
> 
>    --> Claims ownership of the futex and sets its own TID into the
>        userspace TID field of futex F
> 
>    --> returns to user space  
> 
> T1 --> exit_pi_state_list()
> 
>        --> Transfers pi_state to waiter T2 and wakes T2 via
>        	   rt_mutex_unlock(&pi_state->mutex)
> 
> T2 --> acquires pi_state->mutex and gains real ownership of the
>        pi_state
> 
>    --> Claims ownership of the futex and sets its own TID into the
>        userspace TID field of futex F
> 
>    --> returns to user space  
> 
> T3 --> observes inconsistent state
> 
> This problem is independent of UP/SMP, preemptible/non preemptible
> kernels, or process shared vs. private. The only difference is that
> certain configurations are more likely to expose it.
> 
> So as Siddhesh correctly analyzed the following check in
> futex_lock_pi_atomic() is the culprit:
> 
> 	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
> 
> We check the userspace value for a TID value of 0 and take over the
> futex unconditionally if that's true.
> 
> AFAICT this check is there as it is correct for a different corner
> case of futexes: the WAITERS bit became stale.
> 
> Now the proposed change
> 
> -	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
> +       if (unlikely(ownerdied ||
> +                       !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {
> 
> solves the problem, but it's not obvious why and it wreckages the
> "stale WAITERS bit" case.


In what scenario does the WAITERS bit become stale for pi futexes? This
corner case seems rather core to your solution, so I would like to
understand it a bit better.


> 
> What happens is, that due to the WAITERS bit being set (T2 is blocked
> on that futex) it enforces T3 to go through lookup_pi_state(), which
> in the above case returns an existing pi_state and therefor forces T3
> to legitimately fight with T2 over the ownership of the pi_state (via
> pi_state->mutex). Probelm solved!
> 
> Though that does not work for the "WAITERS bit is stale" problem
> because if lookup_pi_state() does not find existing pi_state it
> returns -ERSCH (due to TID == 0) which causes futex_lock_pi() to
> return -ESRCH to user space because the OWNER_DIED bit is not set.
> 
> Now there is a different solution to that problem. Do not look at the
> user space value at all and enforce a lookup of possibly available
> pi_state. If pi_state can be found, then the new incoming locker T3
> blocks on that pi_state and legitimately races with T2 to acquire the
> rt_mutex and the pi_state and therefor the proper ownership of the
> user space futex.


My first concern here is performance impact by forcing the pi_state
lookup, however, if we got this far, we already took the syscall, and
our performance sucks anyway. Correctness obviously trumps performance here.


> 
> lookup_pi_state() has the correct order of checks. It first tries to
> find a pi_state associated with the user space futex and only if that
> fails it checks for futex TID value = 0. If no pi_state is available
> nothing can create new state at that point because this happens with
> the hash bucket lock held.
> 
> So the above scenario changes to:
> 
> T1 lock_futex_pi(F);
> 
> T2 lock_futex_pi(F);
> 
>    --> T2 blocks on the futex and creates pi_state which is associated
>        to T1.
> 
> T1 exits
> 
>    --> exit_robust_list() runs
> 
>        --> Futex F userspace value TID field is set to 0 and
>            FUTEX_OWNER_DIED bit is set.
>        	   
> T3 lock_futex_pi(F);
> 
>    --> Finds pi_state and blocks on pi_state->rt_mutex
> 
> T1 --> exit_pi_state_list()
> 
>        --> Transfers pi_state to waiter T2 and wakes it via
>        	   rt_mutex_unlock(&pi_state->mutex)
> 
> T2 --> acquires pi_state->mutex and gains ownership of the pi_state
> 
>    --> Claims ownership of the futex and sets its own TID into the
>        userspace TID field of futex F
> 
>    --> returns to user space 
> 
> This covers all gazillion points on which T3 might come in between
> T1's exit_robust_list() clearing the TID field and T2 fixing it up. It
> also solves the "WAITERS bit stale" problem by forcing the take over.
> 
> Another benefit of changing the code this way is that it makes it less
> dependent on untrusted user space values and therefor minimizes the
> possible wreckage which might be inflicted.


That's a definite plus!


> As usual after staring for too long at the futex code my brain hurts
> so much that I really want to ditch that whole optimization of
> avoiding the syscall for the non contended case for PI futexes and rip
> out the maze of corner case handling code. Unfortunately we can't as
> user space relies on that existing behaviour, but at least thinking
> about it helps me to preserve my mental sanity. Maybe we should
> nevertheless :)


I was surprised at how fast you were able to page all this in after all
that travel - or is this what you did for 12 hours on the plane?

:-)

--
Darren


> Thanks,
> 
> 	tglx
> 
> -------------->
> 
> Index: linux/kernel/futex.c
> ===================================================================
> --- linux.orig/kernel/futex.c
> +++ linux/kernel/futex.c
> @@ -716,7 +716,7 @@ static int futex_lock_pi_atomic(u32 __us
>  				struct futex_pi_state **ps,
>  				struct task_struct *task, int set_waiters)
>  {
> -	int lock_taken, ret, ownerdied = 0;
> +	int lock_taken, ret, force_take = 0;
>  	u32 uval, newval, curval, vpid = task_pid_vnr(task);
>  
>  retry:
> @@ -755,17 +755,15 @@ retry:
>  	newval = curval | FUTEX_WAITERS;
>  
>  	/*
> -	 * There are two cases, where a futex might have no owner (the
> -	 * owner TID is 0): OWNER_DIED. We take over the futex in this
> -	 * case. We also do an unconditional take over, when the owner
> -	 * of the futex died.
> -	 *
> -	 * This is safe as we are protected by the hash bucket lock !
> +	 * Should we force take the futex? See below.
>  	 */
> -	if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
> -		/* Keep the OWNER_DIED bit */
> +	if (unlikely(force_take)) {
> +		/*
> +		 * Keep the OWNER_DIED and the WAITERS bit and set the
> +		 * new TID value.
> +		 */
>  		newval = (curval & ~FUTEX_TID_MASK) | vpid;
> -		ownerdied = 0;
> +		force_take = 0;
>  		lock_taken = 1;
>  	}
>  
> @@ -775,7 +773,7 @@ retry:
>  		goto retry;
>  
>  	/*
> -	 * We took the lock due to owner died take over.
> +	 * We took the lock due to forced take over.
>  	 */
>  	if (unlikely(lock_taken))
>  		return 1;
> @@ -790,20 +788,25 @@ retry:
>  		switch (ret) {
>  		case -ESRCH:
>  			/*
> -			 * No owner found for this futex. Check if the
> -			 * OWNER_DIED bit is set to figure out whether
> -			 * this is a robust futex or not.
> +			 * We failed to find an owner for this
> +			 * futex. So we have no pi_state to block
> +			 * on. This can happen in two cases:
> +			 *
> +			 * 1) The owner died
> +			 * 2) A stale FUTEX_WAITERS bit
> +			 *
> +			 * Re-read the futex value.
>  			 */
>  			if (get_futex_value_locked(&curval, uaddr))
>  				return -EFAULT;
>  
>  			/*
> -			 * We simply start over in case of a robust
> -			 * futex. The code above will take the futex
> -			 * and return happy.
> +			 * If the owner died or we have a stale
> +			 * WAITERS bit the owner TID in the user space
> +			 * futex is 0.
>  			 */
> -			if (curval & FUTEX_OWNER_DIED) {
> -				ownerdied = 1;
> +			if (!(curval & FUTEX_TID_MASK)) {
> +				force_take = 1;
>  				goto retry;
>  			}
>  		default:
> 

-- 
Darren Hart
Intel Open Source Technology Center
Yocto Project - Technical Lead - Linux Kernel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/