Date: Thu, 12 Jan 2017 10:39:44 +0100
From: Torvald Riegel <triegel@...hat.com>
To: LKML <linux-kernel@...r.kernel.org>
Cc: GLIBC Devel <libc-alpha@...rceware.org>, "Zijlstra, Peter" <peter.zijlstra@...el.com>, Thomas Gleixner <tglx@...utronix.de>, Rich Felker <dalias@...c.org>
Subject: Re: Robust futexes: lost wakeups and design flaws in the glibc/kernel synchronization scheme

On Sat, 2016-12-24 at 17:01 +0100, Torvald Riegel wrote:
> TLDR:
> (1) The Linux kernel fails to wake futex waiters when a userspace thread
> that releases a mutex is killed between modifying the futex word and the
> subsequent FUTEX_WAKE syscall.
> (2) POSIX makes certain requirements regarding when destruction of a
> mutex is safe that are violated by the current glibc/kernel
> synchronization scheme, which can lead to corruption of userspace
> memory; robust FUTEX_UNLOCK_PI must be changed slightly, and the non-PI
> robust mutexes must either use a FUTEX_UNLOCK-like syscall to release
> the mutex, try to detect this by interpreting (current position in)
> userspace code, or we must add a new futex recovery syscall if we want
> to preserve the userspace-only mutex-release fastpath.
>
> [Please keep me in CC because I'm not subscribed to LKML.]
>
>
> === Robust mutexes have bugs, in both glibc and the kernel
>
> I've been reviewing the implementation of robust mutexes in both glibc
> and the kernel support code recently because there are several bug
> reports, for example:
> https://bugzilla.redhat.com/show_bug.cgi?id=1401665
> https://sourceware.org/bugzilla/show_bug.cgi?id=19402
> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>
> This review revealed a bunch of bugs. I have committed/proposed patches
> that fix all glibc-only bugs that I am aware of:
> https://sourceware.org/ml/libc-alpha/2016-12/msg00587.html
> https://sourceware.org/ml/libc-alpha/2016-12/msg00862.html
> https://sourceware.org/ml/libc-alpha/2016-12/msg00863.html
> https://sourceware.org/ml/libc-alpha/2016-12/msg00947.html
> These fixes are meant to be easy to backport, so they don't contain as
> much cleanup and explanation of the glibc code as would be ideal.
>
> I plan to follow these up with more cleanup and added documentation,
> which should also describe the glibc/kernel synchronization scheme in
> more detail. I believe I have found all the bugs in robust mutexes
> (when including the kernel-side bugs described in this email), but I'll
> make another review pass when preparing this cleanup+documentation
> patch.
>
>
> === Background
>
> A brief reminder of how glibc and the kernel attempt to synchronize with
> each other:
> Each thread maintains a list of robust mutexes that it has acquired, and
> a single pointer to a futex word that it is about to acquire or release
> (called op_pending). The list is a concurrent list (to some extent)
> because userspace can be interrupted by the post-crash cleanup code run
> by the kernel (ie, this is similar to how one would synchronize with a
> signal handler).
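
[For reference, the kernel-visible shape of this per-thread state is the
robust_list_head structure that each thread registers with the kernel via
the set_robust_list syscall; its list_op_pending field is the op_pending
pointer referred to above. Slightly condensed from the kernel's UAPI
<linux/futex.h>, with descriptive comments added:]

  struct robust_list {
          struct robust_list *next;
  };

  struct robust_list_head {
          /* List of entries for all robust mutexes this thread currently
             holds; points back to itself when empty.  */
          struct robust_list list;
          /* Offset from a list entry to the futex word of that mutex, so
             userspace can lay out its mutex type freely.  */
          long futex_offset;
          /* The "op_pending" pointer: the one futex word this thread is
             currently in the middle of acquiring or releasing, so the
             kernel can finish cleanup if the thread dies mid-operation.  */
          struct robust_list *list_op_pending;
  };
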
>
> To release a mutex without using FUTEX_UNLOCK_PI, userspace basically
> does the following:
> (L1) atomic_store (&op_pending, &mutex->futex, memory_order_relaxed);
> (L2) <"thread-safe" dequeue of mutex->futex from robust mutex list>
> (L3) fw = atomic_exchange(&mutex->futex, 0, memory_order_release);
> (L4) if (fw & FUTEX_WAITERS) futex_wake(&mutex->futex, 1);
> (L5) atomic_store (&op_pending, NULL, memory_order_relaxed);
> (I'm not showing the signal fences required; for simplicity, assume that
> there is a signal fence after each line.)
>
> In case of a robust PI mutex, L3 and L4 are done by the kernel in
> FUTEX_UNLOCK_PI.
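
[Written out as self-contained C11, with the signal fences included, the
fastpath above looks roughly like the sketch below. The identifiers
(futex_word, op_pending, robust_mutex_release_fastpath) are illustrative
stand-ins, not glibc's actual internal names, and the robust-list dequeue
is elided:]

  #include <stdatomic.h>
  #include <stdint.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/futex.h>        /* FUTEX_WAKE, FUTEX_WAITERS */

  typedef _Atomic uint32_t futex_word;

  /* Illustrative stand-in for the thread's list_op_pending field.  */
  static __thread _Atomic(futex_word *) op_pending;

  static void
  robust_mutex_release_fastpath (futex_word *futex)
  {
    uint32_t fw;

    /* (L1) Announce which futex word we are about to release, so the
       kernel can clean up if we are killed mid-release.  */
    atomic_store_explicit (&op_pending, futex, memory_order_relaxed);
    atomic_signal_fence (memory_order_seq_cst);
    /* (L2) Dequeue the mutex from the per-thread robust list (elided).  */
    atomic_signal_fence (memory_order_seq_cst);
    /* (L3) The release itself; from here on, other threads may acquire,
       release, and even destroy the mutex.  */
    fw = atomic_exchange_explicit (futex, 0, memory_order_release);
    atomic_signal_fence (memory_order_seq_cst);
    /* (L4) Wake one waiter if FUTEX_WAITERS was set.  Being killed between
       (L3) and (L4) is the lost-wakeup case discussed below.  */
    if (fw & FUTEX_WAITERS)
      syscall (SYS_futex, futex, FUTEX_WAKE, 1, NULL, NULL, 0);
    atomic_signal_fence (memory_order_seq_cst);
    /* (L5) Withdraw the announcement.  Until this store, the kernel may
       still write to *futex on this thread's behalf, which is what breaks
       the destruction requirements discussed below.  */
    atomic_store_explicit (&op_pending, NULL, memory_order_relaxed);
  }
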
>
>
> === Lost wakeups caused by the kernel
>
> If a thread happens to be killed between L3 and L4 and FUTEX_WAITERS is
> set, then userspace will have set the futex word to 0 but not yet have
> run FUTEX_WAKE.
> op_pending still points to the futex word, but handle_futex_death in
> kernel/futex.c only calls futex_wake if the TID bits in the futex word
> are equal to the TID of the killed thread. The result of that is a lost
> wake-up for one of the threads that set or shared FUTEX_WAITERS.
>
> This can be fixed by running futex_wake conservatively, even if the
> value of *op_pending does not match the TID of the killed thread. This
> can lead to spurious wakeups (eg, if the thread was killed between L4
> and L5), but userspace code has to, in the general case, be prepared for
> spurious futex wake-ups anyway if using POSIX mutexes:
> https://lkml.org/lkml/2014/11/27/472
> Also, the same spurious wake-ups can happen under slightly different
> executions without a crash (eg, if FUTEX_WAITERS is set spuriously due
> to timeouts on another thread that set it).
>
>
> === POSIX's mutex destruction requirements are violated
>
> In a nutshell, the mutex destruction requirements mean that there must
> be a single atomic action that releases a mutex; after a mutex is
> released, the thread releasing the mutex must not access the mutex's
> memory anymore (but it is still allowed to run a FUTEX_WAKE on it); see
> https://lkml.org/lkml/2014/11/27/472 for additional background.
>
> Currently, the kernel can potentially modify the mutex's memory until L5
> has executed. This means that getting killed while releasing a robust
> mutex is not guaranteed to be a single atomic action.
>
> For example, assume that Thread1 has acquired the mutex and releases it,
> but is killed right before L5; if the program is built in such a way
> that only Thread2 will acquire the mutex after Thread1 has released it,
> then POSIX allows Thread2 to acquire the mutex, release it, destroy it,
> and reuse its memory for something else (these are the destruction
> requirements). So, Thread2 does that and reuses the memory for another
> object that is not a mutex, which happens to put the same value as the
> TID of the killed thread at the position of the futex word. Then, the
> kernel sees that *op_pending equals the TID, and sets *op_pending to 0,
> thus corrupting the other object.
> (Note that there is no similar problem when acquiring the mutex because
> just the existence of a thread attempting to acquire the mutex disallows
> attempting to concurrently destroy the mutex.)
>
> For FUTEX_UNLOCK_PI, the fix is relatively straightforward: The kernel
> must clear op_pending in the syscall if op_pending is equal to the
> address of the unlocked futex. This ensures that unlock is a single
> atomic action; the store by the kernel is harmless at least for glibc
> (and it would certainly be possible to get the same effect without a
> userspace-visible change).
>
> If the userspace fastpath is used (ie, L3), then there is no simple fix.
> Never using the fastpath but only something like FUTEX_UNLOCK_PI slows
> down lock release, which can be bad for performance.
>
> When using the fastpath, and to be able to solve this in the kernel when
> cleaning up after the thread is killed, the kernel would need to be able
> to detect whether the thread was killed before the critical atomic
> action in L3; the kernel cannot use the value of the futex word to
> determine this because it is not guaranteed to be a futex word of a
> robust mutex anymore. Thus, the instruction pointer of the userspace
> thread would have to be taken into account, and userspace would have to
> enforce code generation in a way that makes it possible for the kernel
> to easily detect whether the mutex release sequence is after L3 or not.
> On the userspace side, the release sequence starting with L1 is not
> exactly trivial, and I'd prefer to not have to maintain assembly for
> that (OTOH, maybe some of the machinery of restartable sequences could
> be reused here). Therefore, that's not an approach I'd prefer.
>
> Alternatively, it should be possible to let threads that want to acquire
> a mutex with a dead owner request kernel-side recovery of this mutex.
> Basically, the mutex-releasing thread would first set the futex word to
> an intermediate state that signals that it intends to release the mutex
> (eg, setting a special bit but preserving the bits for the TID); it
> would then dequeue the mutex from its robust futex list; and then set
> the futex word to 0 to finalize release.
> Threads that see the special bit in the futex word signaling the intent
> to release cannot simply acquire the mutex in userspace. They would
> have to call a new futex operation in the kernel that then either blocks
> until woken by a normal futex_wake, or recovers the mutex when the
> thread that intended to release it was killed in the meantime.
> The advantage of this approach would be that the interface is simpler
> for userspace (eg, no funny interpretation of instruction pointers or
> such). However, the kernel-side synchronization is probably more
> complex, and there are a couple of details to be solved, such as avoiding
> an ABA on the TID of the thread that intended to release the mutex.
> If there is interest in this approach, I can work on the details of the
> general synchronization scheme (though I'd like someone else to take
> care of the kernel-side implementation).
>
> Thoughts?

Ping.
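
[To make the recovery-based alternative in the quoted mail a bit more
concrete, the sketch below shows roughly how the userspace side could
look. FUTEX_RELEASE_PENDING (the "intent to release" bit, which would
have to be carved out of the bits currently used for the TID) and
FUTEX_WAIT_RECOVER (the proposed new futex operation) are purely
hypothetical placeholders, the other identifiers are illustrative, and
the ABA issue on the releasing thread's TID mentioned above is left
unaddressed:]

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/futex.h>        /* FUTEX_WAKE, FUTEX_WAITERS */

  typedef _Atomic uint32_t futex_word;

  #define FUTEX_RELEASE_PENDING 0x20000000u /* hypothetical flag bit */
  #define FUTEX_WAIT_RECOVER    1000        /* hypothetical futex operation */

  /* Release: announce the intent first, then dequeue, then perform the
     single atomic action that actually releases the mutex.  */
  static void
  release (futex_word *futex)
  {
    /* Step 1: set the special bit, preserving the owner TID (and
       FUTEX_WAITERS) so a killed releaser can still be identified.  */
    atomic_fetch_or_explicit (futex, FUTEX_RELEASE_PENDING,
                              memory_order_relaxed);
    /* Step 2: dequeue the mutex from the per-thread robust list (elided). */
    /* Step 3: finalize the release.  */
    uint32_t fw = atomic_exchange_explicit (futex, 0, memory_order_release);
    if (fw & FUTEX_WAITERS)
      syscall (SYS_futex, futex, FUTEX_WAKE, 1, NULL, NULL, 0);
  }

  /* Acquire attempt: if the release-pending bit is visible, the mutex must
     not be taken in userspace; ask the kernel to either block until the
     normal wake arrives or recover the mutex if the releaser died.  */
  static bool
  try_acquire (futex_word *futex, uint32_t tid)
  {
    uint32_t expected = 0;
    if (atomic_compare_exchange_strong_explicit (futex, &expected, tid,
                                                 memory_order_acquire,
                                                 memory_order_relaxed))
      return true;                              /* uncontended fast path */
    if (expected & FUTEX_RELEASE_PENDING)
      syscall (SYS_futex, futex, FUTEX_WAIT_RECOVER, expected, NULL, NULL, 0);
    return false;                               /* caller retries */
  }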