linux-kernel - Re: [RFC][PATCH] fixup pi_state in futex

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Thu, 06 Aug 2009 17:36:59 -0700
From:	Darren Hart <dvhltc@...ibm.com>
To:	Steven Rostedt <rostedt@...dmis.org>
CC:	Peter Zijlstra <peterz@...radead.org>,
	"lkml," <linux-kernel@...r.kernel.org>,
	linux-rt-users <linux-rt-users@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...e.hu>, John Kacur <jkacur@...hat.com>,
	Dinakar Guniguntala <dino@...ibm.com>,
	John Stultz <johnstul@...ux.vnet.ibm.com>
Subject: Re: [RFC][PATCH] fixup pi_state in futex_requeue on lock steal

Steven Rostedt wrote:
> On Thu, 6 Aug 2009, Darren Hart wrote:
> 
>> Peter Zijlstra wrote:
>>> On Wed, 2009-08-05 at 17:01 -0700, Darren Hart wrote:
>>>> NOT FOR INCLUSION
>>>>
>>>> Fixup the uval and pi_state owner in futex_requeue(requeue_pi=1) in the
>>>> event
>>>> of a lock steal or owner died.  I had hoped to leave it up to the new
>>>> owner to
>>>> fix up the userspace value since we can't really handle a fault here
>>>> gracefully.  This should be safe as the lock is contended and should force
>>>> all
>>>> userspace attempts to lock or unlock into the kernel where they'll block
>>>> on the
>>>> hb lock.  However, when I don't update the uaddr, I hit the WARN_ON(pid !=
>>>> pi_state->owner->pid) as expected, and the userspace testcase deadlocks.
>>>>
>>>> I need to try and better understand what's happening to hang userspace.
>>>> In the
>>>> meantime I thought I'd share what I'm working with atm.  This is a
>>>> complete HACK
>>>> and is ugly, non-modular, etc etc.  However, it currently works.  It would
>>>> explode
>>>> in a most impressive fashion should we happen to fault.  So the remaining
>>>> questions
>>>> are:
>>>>
>>>> o Why does userspace deadlock if we leave the uval updating to the new
>>>> owner
>>>>   waking up in futex_wait_requeue_pi()?
>>>>
>>>> o If we have to handle a fault in futex_requeue(), how can we best cleanup
>>>> the
>>>>   proxy lock acquisition and get things back into a sane state.  We
>>>> faulted, so
>>>>   determinism is out the window anyway, we just need to recover
>>>> gracefully.
>>>
>>> Do you have a trace of the thing going down?
>> I finally did get a trace... but learned something in the process. Elaborating
>> below.
> 
> I'm assuming you used ftrace as your tracing infrastructure?

:-)  Yes, ftrace with the nop tracer and some trace_printk worked 
nicely.  I turned on the trace from userspace and stopped the trace from 
within the kernel using tracing_off().  But, since the bug didn't 
actually exist, I never actually hit tracing_off().  The trace_printk's 
however nicely highlighted the "race window" that wasn't actually a 
window ;-)

Thanks for the ring buffer fixes btw.

--
Darren

> 
> -- Steve
> 
>>> Tglx and me usually use sched_switch and a few trace_printk()s sprinkled
>>> around, the typical one would be in sys_futex, printing the futex cmd
>>> and arg.
>>>
>>> OK, so run me through this one more time.
>>>
>>> A condvar has two futexes, an inner and an outer. The inner futex is
>>> always locked and the waiting threads are stacked on that.
>> 3 actually:
>>
>> cond->data->futex (the waitqueue)
>> cond->data->lock (the lock protecting the internal data)
>> outer mutex (the pthread_mutex)
>>
>>> Then on signal/broadcast, we lock the outer lock and requeue all the
>>> blocked tasks from the inner to the outer, then we release the outer
>>> lock and let them rip.
>> Yes - and in requeue_pi with a PI mutex we only let 1 rip, and requeue the
>> rest, rather than wake them all as the old implementation for PI mutexes did.
>>
>>> Since we're seeing lock steals, I'm thinking the outer lock isn't taken
>>> when we're doing the requeue?
>> Correct.  Unfortunately this is "valid" usage.
>>
>>> Anyway, during the requeue we lock-steal because the owner isn't running
>>> yet and we iterate a higher prio task in the requeue loop?
>> I believe so.
>>
>>> This leaves the outer lock's futex field messed up because it points to
>>> the wrong TID.
>> The futex uval isn't messed up, it just still hold the value of the previous
>> owners tid (not the expected owner we're stealing from).  I believe now that
>> this is proper behavior.
>>
>>> After we finish the requeue loop, we unlock the HBs.
>>>
>>>
>>> So far so good?
>> Yup.
>>
>>>
>>> Now, normally the waking thread will find itself owner and will check
>>> the futex variable and fix her up -- while holding the HB lock.
>> Correcto.
>>
>>> However, in case the outer lock gets contended again, we can get
>>> interrupted between requeue and wakeup/fixup and observe this messed up
>>> futex value, which is causing this WARN to trigger.
>> This is where I was mistaken.  I had seen the WARN_ON(pid !=
>> pi_state->owner->pid) in lookup_pi_state() while working on the previous 2
>> patches I sent to the list.  The one which updates the lock_ptr of the futex_q
>> to match that of the pi_state should fix this.  What happened before was we
>> would grab the wrong hb lock so while we were fixing up the pi_state and uval
>> in the woken thread, a contending thread would read those value while holding
>> the correct hb lock.  That race is fixed with the "[PATCH 1/2] Update woken
>> requeued futex_q lock_ptr" patch.
>>
>>> So where do we deadlock, after this all goes down? Do we perhaps lookup
>>> the wrong pi_state using that wrong TID?
>>>
>> We only deadlocked while I was (wrongly) trying to update pi_state owner from
>> the requeue thread.  Deadlocks don't occur in my testing with only patches 1
>> and 2.
>>
>> [PATCH 1/2] Update woken requeued futex_q lock_ptr" patch
>> [PATCH 2/2][RT] Avoid deadlock in rt_mutex_start_proxy_lock()
>>
>>
>>> Since its only the outer futex's value that matters, right? Can't we pin
>>> that using get_user_pages() before we take the HB lock and go into the
>>> requeue loop? That way we're sure to be able to change it without
>>> faulting.
>> I now don't believe we have to do this.  In fact, futex_lock_pi() exhibits a
>> similar "race window" (simplified below):
>>
>>         /*
>>          * Block on the PI mutex:
>>          */
>>         ret = rt_mutex_timed_lock(&q.pi_state->pi_mutex, to, 1);
>>
>>         [RACE WINDOW ]   (not really, see below)
>>
>>         spin_lock(q.lock_ptr);
>>         /*
>>          * Fixup the pi_state owner and possibly acquire the lock if we
>>          * haven't already.
>>          */
>>         res = fixup_owner(uaddr, fshared, &q, !ret);
>>
>> Note that the rt_mutex is acquire while the q.lock_ptr (hb->lock) is not held
>> (since we can sleep).  This is FINE and not a race.  Lets look at what happens
>> if another task tries to get the lock during that time:
>>
>> futex_lock_pi
>> 	futex_lock_pi_atomic
>> 		lookup_pi_state
>> 			
>> At this point we have the pi_state.  It's owner field will point to the
>> previous owner, not the task that is currently acquiring it.  But the rt_mutex
>> itself knows who owns it, so proper boosting should still occur.  Once the new
>> owner complete the pi_state update, the pi_state will be removed from the old
>> owner pi_state_list and added to its pi_state_list.  Since the futex uval
>> shows it's owned in both cases, the new contender is still forced into the
>> kernel to block on the rt_mutex.  Since we update the uval, then the
>> pi_state->owner, we are sure to be able to access the rt_mutex via the old
>> uval so long as we hold the hb->lock.
>>
>> So, I think we're fine with respect to the pi_state ownership!  In fact I
>> finally managed to catch the lock steal in the requeue loop in my tracing, and
>> everything worked fine.  Going to go rerun a bunch more tests and see if I hit
>> any other issues, if I do, I suspect they are unrelated to this.
>>
>> Thanks for the help in thinking this through.
>>
>> -- 
>> Darren Hart
>> IBM Linux Technology Center
>> Real-Time Linux Team
>>


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/