Date:	Thu, 06 Aug 2009 18:37:40 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Darren Hart <dvhltc@...ibm.com>
Cc:	"lkml," <linux-kernel@...r.kernel.org>,
	linux-rt-users <linux-rt-users@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ingo Molnar <mingo@...e.hu>, John Kacur <jkacur@...hat.com>,
	Dinakar Guniguntala <dino@...ibm.com>,
	John Stultz <johnstul@...ux.vnet.ibm.com>
Subject: Re: [RFC][PATCH] fixup pi_state in futex_requeue on lock steal

On Wed, 2009-08-05 at 17:01 -0700, Darren Hart wrote:
> NOT FOR INCLUSION
> 
> Fix up the uval and pi_state owner in futex_requeue(requeue_pi=1) in the event
> of a lock steal or owner death.  I had hoped to leave it up to the new owner to
> fix up the userspace value, since we can't really handle a fault here
> gracefully.  This should be safe, as the lock is contended and should force all
> userspace attempts to lock or unlock into the kernel, where they'll block on the
> hb lock.  However, when I don't update the uaddr, I hit the WARN_ON(pid !=
> pi_state->owner->pid) as expected, and the userspace testcase deadlocks.
> 
> I need to try to better understand what's hanging userspace.  In the
> meantime I thought I'd share what I'm working with at the moment.  This is a
> complete HACK and is ugly, non-modular, etc.  However, it currently works.  It
> would explode in a most impressive fashion should we happen to fault.  So the
> remaining questions are:
> 
> o Why does userspace deadlock if we leave the uval updating to the new owner
>   waking up in futex_wait_requeue_pi()?
> 
> o If we have to handle a fault in futex_requeue(), how can we best clean up
>   the proxy lock acquisition and get things back into a sane state?  We
>   faulted, so determinism is out the window anyway; we just need to recover
>   gracefully.


Do you have a trace of the thing going down?

Tglx and I usually use the sched_switch tracer and a few trace_printk()s
sprinkled around; the typical one goes in sys_futex, printing the futex
cmd and args.
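
Roughly like so -- a sketch only, the exact spot in kernel/futex.c and
the fields printed are whatever happens to be convenient:

        /* in do_futex(), near the top -- debug only */
        int cmd = op & FUTEX_CMD_MASK;

        trace_printk("pid=%d cmd=%d uaddr=%p val=%u uaddr2=%p\n",
                     current->pid, cmd, uaddr, val, uaddr2);

with the sched_switch tracer enabled alongside:

        echo sched_switch > /sys/kernel/debug/tracing/current_tracer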

OK, so run me through this one more time.

A condvar has two futexes, an inner and an outer. The inner futex is
always locked and the waiting threads are stacked on that.

Then on signal/broadcast, we lock the outer lock and requeue all the
blocked tasks from the inner to the outer, then we release the outer
lock and let them rip.
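
For reference, the userspace side of that scheme looks roughly like
this -- a sketch, not glibc code; the struct layout and names are made
up, only the futex ops are real (and need headers that define the
*_REQUEUE_PI ops):

        #include <linux/futex.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <limits.h>

        /* made-up condvar layout: waiters block on 'inner' and get
         * requeued to the PI lock 'outer' */
        struct cv {
                unsigned int inner;     /* wait futex, kept "locked" */
                unsigned int outer;     /* PI mutex futex */
        };

        static long futex(unsigned int *uaddr, int op, unsigned int val,
                          void *timeout, unsigned int *uaddr2,
                          unsigned int val3)
        {
                return syscall(SYS_futex, uaddr, op, val, timeout,
                               uaddr2, val3);
        }

        /* waiter: block on inner, wake as owner of (or blocked on) outer */
        static void cv_wait(struct cv *cv)
        {
                unsigned int seq = cv->inner;

                /* real code drops the outer lock before this */
                futex(&cv->inner, FUTEX_WAIT_REQUEUE_PI, seq, NULL,
                      &cv->outer, 0);
        }

        /* broadcast: wake one waiter, requeue the rest onto outer;
         * nr_requeue travels in the timeout slot of the syscall */
        static void cv_broadcast(struct cv *cv)
        {
                futex(&cv->inner, FUTEX_CMP_REQUEUE_PI, 1,
                      (void *)(long)INT_MAX, &cv->outer, cv->inner);
        }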

Since we're seeing lock steals, I'm thinking the outer lock isn't taken
when we're doing the requeue?

Anyway, during the requeue we lock-steal because the owner isn't running
yet and we encounter a higher-prio task in the requeue loop?

This leaves the outer lock's futex field messed up because it points to
the wrong TID.
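
(Reminder: per linux/futex.h the futex word is the owner TID plus state
bits,

        #define FUTEX_WAITERS           0x80000000
        #define FUTEX_OWNER_DIED        0x40000000
        #define FUTEX_TID_MASK          0x3fffffff

so after a steal the word still reads old_tid | FUTEX_WAITERS while
pi_state->owner already points at the stealing task -- which is exactly
what WARN_ON(pid != pi_state->owner->pid) trips over.)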

After we finish the requeue loop, we unlock the HBs.


So far so good?


Now, normally the waking thread will find itself to be the owner, check
the futex variable, and fix it up -- while holding the HB lock.
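
That fixup is basically the following (a sketch; helper names
approximate kernel/futex.c, this is not the actual
fixup_pi_state_owner() code):

        static int fixup_uval_sketch(u32 __user *uaddr,
                                     struct task_struct *new_owner)
        {
                u32 uval, newval, curval;

                if (get_futex_value_locked(&uval, uaddr))
                        return -EFAULT; /* faulting is the hard case */

                /* keep the state bits, swap in the new owner's TID */
                newval = (uval & ~FUTEX_TID_MASK) | task_pid_vnr(new_owner);

                curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
                if (curval != uval)
                        return -EAGAIN; /* raced, caller retries */

                return 0;
        }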

However, if the outer lock gets contended again, we can get interrupted
between the requeue and the wakeup/fixup and observe this messed-up
futex value, which is what causes the WARN to trigger.


So where do we deadlock after all this goes down? Do we perhaps look up
the wrong pi_state using that wrong TID?



It's only the outer futex's value that matters, right? So can't we pin
it using get_user_pages() before we take the HB lock and go into the
requeue loop? That way we're sure to be able to change it without
faulting.
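
Something like the below -- completely untested, the usual caveats, but
the get_user_pages() arguments should match the current prototype:

        struct page *page;
        int ret;

        down_read(&current->mm->mmap_sem);
        ret = get_user_pages(current, current->mm,
                             (unsigned long)uaddr2 & PAGE_MASK,
                             1, 1 /* write */, 0 /* force */, &page, NULL);
        up_read(&current->mm->mmap_sem);
        if (ret < 1)
                return -EFAULT;

        /* ... double_lock_hb(), requeue loop, fix up the futex word ... */

        put_page(page);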


