Date: Thu, 12 Jan 2017 10:39:44 +0100
From: Torvald Riegel <triegel@...hat.com>
To: LKML <linux-kernel@...r.kernel.org>
Cc: GLIBC Devel <libc-alpha@...rceware.org>, "Zijlstra, Peter" <peter.zijlstra@...el.com>, Thomas Gleixner <tglx@...utronix.de>, Rich Felker <dalias@...c.org>
Subject: Re: Robust futexes: lost wakeups and design flaws in the glibc/kernel synchronization scheme

On Sat, 2016-12-24 at 17:01 +0100, Torvald Riegel wrote:
> TLDR:
> (1) The Linux kernel fails to wake futex waiters when a userspace thread
> that releases a mutex is killed between modifying the futex word and the
> subsequent FUTEX_WAKE syscall.
> (2) POSIX makes certain requirements regarding when destruction of a
> mutex is safe that are violated by the current glibc/kernel
> synchronization scheme, which can lead to corruption of userspace
> memory; robust FUTEX_UNLOCK_PI must be changed slightly, and the non-PI
> robust mutexes must either use a FUTEX_UNLOCK-like syscall to release
> the mutex, try to detect this by interpreting (current position in)
> userspace code, or we must add a new futex recovery syscall if we want
> to preserve the userspace-only mutex-release fastpath.
>
> [Please keep me in CC because I'm not subscribed to LKML.]
>
>
> === Robust mutexes have bugs, in both glibc and the kernel
>
> I've been reviewing the implementation of robust mutexes in both glibc
> and the kernel support code recently because there are several bug
> reports, for example:
> https://bugzilla.redhat.com/show_bug.cgi?id=1401665
> https://sourceware.org/bugzilla/show_bug.cgi?id=19402
> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>
> This review revealed a bunch of bugs. I have committed/proposed patches
> that fix all glibc-only bugs that I am aware of:
> https://sourceware.org/ml/libc-alpha/2016-12/msg00587.html
> https://sourceware.org/ml/libc-alpha/2016-12/msg00862.html
> https://sourceware.org/ml/libc-alpha/2016-12/msg00863.html
> https://sourceware.org/ml/libc-alpha/2016-12/msg00947.html
> These fixes are meant to be easy to backport, so they don't contain as
> much cleanup and explanation of the glibc code as would be ideal.
>
> I plan to follow these up with more cleanup and added documentation,
> which should also describe the glibc/kernel synchronization scheme in
> more detail. I believe I have found all the bugs in robust mutexes
> (when including the kernel-side bugs described in this email), but I'll
> make another review pass when preparing this cleanup+documentation
> patch.
>
>
> === Background
>
> A brief reminder of how glibc and the kernel attempt to synchronize with
> each other:
> Each thread maintains a list of robust mutexes that it has acquired, and
> a single pointer to a futex word that it is about to acquire or release
> (called op_pending). The list is a concurrent list (to some extent)
> because userspace can be interrupted by the post-crash cleanup code run
> by the kernel (ie, this is similar to how one would synchronize with a
> signal handler).
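
[For reference, the kernel-visible shape of this per-thread state is the
robust_list_head structure that each thread registers with the kernel via
the set_robust_list syscall; its list_op_pending field is the op_pending
pointer referred to above. Slightly condensed from the kernel's UAPI
<linux/futex.h>, with descriptive comments added:]

  struct robust_list {
          struct robust_list *next;
  };

  struct robust_list_head {
          /* List of entries for all robust mutexes this thread currently
             holds; points back to itself when empty.  */
          struct robust_list list;
          /* Offset from a list entry to the futex word of that mutex, so
             userspace can lay out its mutex type freely.  */
          long futex_offset;
          /* The "op_pending" pointer: the one futex word this thread is
             currently in the middle of acquiring or releasing, so the
             kernel can finish cleanup if the thread dies mid-operation.  */
          struct robust_list *list_op_pending;
  };
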
>
> To release a mutex without using FUTEX_UNLOCK_PI, userspace basically
> does the following:
> (L1) atomic_store (&op_pending, &mutex->futex, memory_order_relaxed);
> (L2) <"thread-safe" dequeue of mutex->futex from robust mutex list>
> (L3) fw = atomic_exchange(&mutex->futex, 0, memory_order_release);
> (L4) if (fw & FUTEX_WAITERS) futex_wake(&mutex->futex, 1);
> (L5) atomic_store (&op_pending, NULL, memory_order_relaxed);
> (I'm not showing the signal fences required; for simplicity, assume that
> there is a signal fence after each line.)
>
> In case of a robust PI mutex, L3 and L4 are done by the kernel in
> FUTEX_UNLOCK_PI.
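
[Written out as self-contained C11, with the signal fences included, the
fastpath above looks roughly like the sketch below. The identifiers
(futex_word, op_pending, robust_mutex_release_fastpath) are illustrative
stand-ins, not glibc's actual internal names, and the robust-list dequeue
is elided:]

  #include <stdatomic.h>
  #include <stdint.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/futex.h>        /* FUTEX_WAKE, FUTEX_WAITERS */

  typedef _Atomic uint32_t futex_word;

  /* Illustrative stand-in for the thread's list_op_pending field.  */
  static __thread _Atomic(futex_word *) op_pending;

  static void
  robust_mutex_release_fastpath (futex_word *futex)
  {
    uint32_t fw;

    /* (L1) Announce which futex word we are about to release, so the
       kernel can clean up if we are killed mid-release.  */
    atomic_store_explicit (&op_pending, futex, memory_order_relaxed);
    atomic_signal_fence (memory_order_seq_cst);
    /* (L2) Dequeue the mutex from the per-thread robust list (elided).  */
    atomic_signal_fence (memory_order_seq_cst);
    /* (L3) The release itself; from here on, other threads may acquire,
       release, and even destroy the mutex.  */
    fw = atomic_exchange_explicit (futex, 0, memory_order_release);
    atomic_signal_fence (memory_order_seq_cst);
    /* (L4) Wake one waiter if FUTEX_WAITERS was set.  Being killed between
       (L3) and (L4) is the lost-wakeup case discussed below.  */
    if (fw & FUTEX_WAITERS)
      syscall (SYS_futex, futex, FUTEX_WAKE, 1, NULL, NULL, 0);
    atomic_signal_fence (memory_order_seq_cst);
    /* (L5) Withdraw the announcement.  Until this store, the kernel may
       still write to *futex on this thread's behalf, which is what breaks
       the destruction requirements discussed below.  */
    atomic_store_explicit (&op_pending, NULL, memory_order_relaxed);
  }
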
>
>
> === Lost wakeups caused by the kernel
>
> If a thread happens to be killed between L3 and L4 and FUTEX_WAITERS is
> set, then userspace will have set the futex word to 0 but not yet have
> run FUTEX_WAKE.
> op_pending still points to the futex word, but handle_futex_death in
> kernel/futex.c only calls futex_wake if the TID bits in the futex word
> are equal to the TID of the killed thread. The result of that is a lost
> wake-up for one of the threads that set or shared FUTEX_WAITERS.
>
> This can be fixed by running futex_wake conservatively, even if the
> value of *op_pending does not match the TID of the killed thread. This
> can lead to spurious wakeups (eg, if the thread was killed between L4
> and L5), but userspace code has to, in the general case, be prepared for
> spurious futex wake-ups anyway if using POSIX mutexes:
> https://lkml.org/lkml/2014/11/27/472
> Also, the same spurious wake-ups can happen under slightly different
> executions without a crash (eg, if FUTEX_WAITERS is set spuriously due
> to timeouts on another thread that set it).
>
>
> === POSIX's mutex destruction requirements are violated
>
> In a nutshell, the mutex destruction requirements mean that there must
> be a single atomic action that releases a mutex; after a mutex is
> released, the thread releasing the mutex must not access the mutex's
> memory anymore (but it is still allowed to run a FUTEX_WAKE on it); see
> https://lkml.org/lkml/2014/11/27/472 for additional background.
>
> Currently, the kernel can potentially modify the mutex's memory until L5
> has executed. This means that getting killed while releasing a robust
> mutex is not guaranteed to be a single atomic action.
>
> For example, assume that Thread1 has acquired the mutex and releases it,
> but is killed right before L5; if the program is built in such a way
> that only Thread2 will acquire the mutex after Thread1 has released it,
> then POSIX allows Thread2 to acquire the mutex, release it, destroy it,
> and reuse its memory for something else (these are the destruction
> requirements). So, Thread2 does that and reuses the memory for another
> object that is not a mutex, which happens to put the same value as the
> TID of the killed thread at the position of the futex word. Then, the
> kernel sees that *op_pending equals the TID, and sets *op_pending to 0,
> thus corrupting the other object.
> (Note that there is no similar problem when acquiring the mutex because
> just the existence of a thread attempting to acquire the mutex disallows
> attempting to concurrently destroy the mutex.)
>
> For FUTEX_UNLOCK_PI, the fix is relatively straightforward: The kernel
> must clear op_pending in the syscall if op_pending is equal to the
> address of the unlocked futex. This ensures that unlock is a single
> atomic action; the store by the kernel is harmless at least for glibc
> (and it would certainly be possible to get the same effect without a
> userspace-visible change).
>
> If the userspace fastpath is used (ie, L3), then there is no simple fix.
> Never using the fastpath but only something like FUTEX_UNLOCK_PI slows
> down lock release, which can be bad for performance.
>
> When using the fastpath, and to be able to solve this in the kernel when
> cleaning up after the thread is killed, the kernel would need to be able
> to detect whether the thread was killed before the critical atomic
> action in L3; the kernel cannot use the value of the futex word to
> determine this because it is not guaranteed to be a futex word of a
> robust mutex anymore. Thus, the instruction pointer of the userspace
> thread would have to be taken into account, and userspace would have to
> enforce code generation in a way that makes it possible for the kernel
> to easily detect whether the mutex release sequence is after L3 or not.
> On the userspace side, the release sequence starting with L1 is not
> exactly trivial, and I'd prefer to not have to maintain assembly for
> that (OTOH, maybe some of the machinery of restartable sequences could
> be reused here). Therefore, that's not an approach I'd prefer.
>
> Alternatively, it should be possible to let threads that want to acquire
> a mutex with a dead owner request kernel-side recovery of this mutex.
> Basically, the mutex-releasing thread would first set the futex word to
> an intermediate state that signals that it intends to release the mutex
> (eg, setting a special bit but preserving the bits for the TID); it
> would then dequeue the mutex from its robust futex list; and then set
> the futex word to 0 to finalize release.
> Threads that see the special bit in the futex word signaling the intent
> to release cannot simply acquire the mutex in userspace. They would
> have to call a new futex operation in the kernel that then either blocks
> until woken by a normal futex_wake, or recovers the mutex when the
> thread that intended to release it was killed in the meantime.
> The advantage of this approach would be that the interface is simpler
> for userspace (eg, no funny interpretation of instruction pointers or
> such). However, the kernel-side synchronization is probably more
> complex, and there are a couple of details to be solved, such as avoiding
> an ABA on the TID of the thread that intended to release the mutex.
> If there is interest in this approach, I can work on the details of the
> general synchronization scheme (though I'd like someone else to take
> care of the kernel-side implementation).
>
> Thoughts?

Ping.
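
[To make the recovery-based alternative in the quoted mail a bit more
concrete, the sketch below shows roughly how the userspace side could
look. FUTEX_RELEASE_PENDING (the "intent to release" bit, which would
have to be carved out of the bits currently used for the TID) and
FUTEX_WAIT_RECOVER (the proposed new futex operation) are purely
hypothetical placeholders, the other identifiers are illustrative, and
the ABA issue on the releasing thread's TID mentioned above is left
unaddressed:]

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/futex.h>        /* FUTEX_WAKE, FUTEX_WAITERS */

  typedef _Atomic uint32_t futex_word;

  #define FUTEX_RELEASE_PENDING 0x20000000u /* hypothetical flag bit */
  #define FUTEX_WAIT_RECOVER    1000        /* hypothetical futex operation */

  /* Release: announce the intent first, then dequeue, then perform the
     single atomic action that actually releases the mutex.  */
  static void
  release (futex_word *futex)
  {
    /* Step 1: set the special bit, preserving the owner TID (and
       FUTEX_WAITERS) so a killed releaser can still be identified.  */
    atomic_fetch_or_explicit (futex, FUTEX_RELEASE_PENDING,
                              memory_order_relaxed);
    /* Step 2: dequeue the mutex from the per-thread robust list (elided). */
    /* Step 3: finalize the release.  */
    uint32_t fw = atomic_exchange_explicit (futex, 0, memory_order_release);
    if (fw & FUTEX_WAITERS)
      syscall (SYS_futex, futex, FUTEX_WAKE, 1, NULL, NULL, 0);
  }

  /* Acquire attempt: if the release-pending bit is visible, the mutex must
     not be taken in userspace; ask the kernel to either block until the
     normal wake arrives or recover the mutex if the releaser died.  */
  static bool
  try_acquire (futex_word *futex, uint32_t tid)
  {
    uint32_t expected = 0;
    if (atomic_compare_exchange_strong_explicit (futex, &expected, tid,
                                                 memory_order_acquire,
                                                 memory_order_relaxed))
      return true;                              /* uncontended fast path */
    if (expected & FUTEX_RELEASE_PENDING)
      syscall (SYS_futex, futex, FUTEX_WAIT_RECOVER, expected, NULL, NULL, 0);
    return false;                               /* caller retries */
  }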