lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:	Thu, 07 May 2009 15:40:14 -0700
From:	Darren Hart <dvhltc@...ibm.com>
To:	"lkml, " <linux-kernel@...r.kernel.org>
CC:	Thomas Gleixner <tglx@...utronix.de>,
	Sripathi Kodi <sripathik@...ibm.com>,
	Peter Zijlstra <peterz@...radead.org>,
	John Stultz <johnstul@...ibm.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Dinakar Guniguntala <dino@...ibm.com>,
	Ulrich Drepper <drepper@...hat.com>,
	Eric Dumazet <dada1@...mosbay.com>,
	Ingo Molnar <mingo@...e.hu>, Jakub Jelinek <jakub@...hat.com>
Subject: [tip PATCH] futex: add requeue-pi documentation

From: Darren Hart <dvhltc@...ibm.com>

Add Documentation/futex-requeue-pi.txt describing the motivation for the
newly added FUTEX_*REQUEUE_PI op codes and their implementation.

Signed-off-by: Darren Hart <dvhltc@...ibm.com>
Cc: Thomas Gleixner <tglx@...utronix.de>
Cc: Sripathi Kodi <sripathik@...ibm.com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: John Stultz <johnstul@...ibm.com>
Cc: Steven Rostedt <rostedt@...dmis.org>
Cc: Dinakar Guniguntala <dino@...ibm.com>
Cc: Ulrich Drepper <drepper@...hat.com>
Cc: Eric Dumazet <dada1@...mosbay.com>
Cc: Ingo Molnar <mingo@...e.hu>
Cc: Jakub Jelinek <jakub@...hat.com>
---

 Documentation/futex-requeue-pi.txt |  124 ++++++++++++++++++++++++++++++++++++
 1 files changed, 124 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/futex-requeue-pi.txt


diff --git a/Documentation/futex-requeue-pi.txt b/Documentation/futex-requeue-pi.txt
new file mode 100644
index 0000000..7933394
--- /dev/null
+++ b/Documentation/futex-requeue-pi.txt
@@ -0,0 +1,124 @@
+Futex Requeue PI
+----------------
+
+Requeueing of tasks from a non-PI futex to a PI futex requires special handling
+in order to ensure the underlying rt_mutex is never left without an owner if it
+has waiters; doing so would break the PI boosting logic [see
+rt-mutex-desgin.txt]  For the purposes of brevity, this action will be referred
+to as "requeue_pi" throughout this document.  Priority inheritance is
+abbreviated throughout as "PI".
+
+Motivation
+----------
+
+Without requeue_pi, the glibc implementation of pthread_cond_broadcast() must
+resort to waking all the tasks waiting on a pthread_condvar and letting them
+try to sort out which task gets to run first in classic thundering-herd
+formation.  An ideal implementation would wake the highest-priority waiter, and
+leave the rest to the natural wakeup inherent in unlocking the mutex associated
+with the condvar.
+
+Consider the simplified glibc calls:
+
+/* caller must lock mutex */
+pthread_cond_wait(cond, mutex)
+{
+	lock(cond->__data.__lock);
+	unlock(mutex);
+	do {
+	   unlock(cond->__data.__lock);
+	   futex_wait(cond->__data.__futex);
+	   lock(cond->__data.__lock);
+	} while(...)
+	unlock(cond->__data.__lock);
+	lock(mutex);
+}
+
+pthread_cond_broadcast(cond)
+{
+	lock(cond->__data.__lock);
+	unlock(cond->__data.__lock);
+	futex_requeue(cond->data.__futex, cond->mutex);
+}
+
+Once pthread_cond_broadcast() requeues the tasks, the cond->mutex has waiters.
+Note that pthread_cond_wait() attempts to lock the mutex only after it has
+returned to user space.  This will leave the underlying rt_mutex with waiters,
+and no owner, breaking the previously mentioned PI-boosting algorithms.
+
+In order to support PI-aware pthread_condvar's, the kernel needs to be able to
+requeue tasks to PI futexes.  This support implies that upon a successful
+futex_wait system call, the caller would return to user space already holding
+the PI futex.  The glibc implementation would be modified as follows:
+
+
+/* caller must lock mutex */
+pthread_cond_wait_pi(cond, mutex)
+{
+	lock(cond->__data.__lock);
+	unlock(mutex);
+	do {
+	   unlock(cond->__data.__lock);
+	   futex_wait_requeue_pi(cond->__data.__futex);
+	   lock(cond->__data.__lock);
+	} while(...)
+	unlock(cond->__data.__lock);
+        /* the kernel acquired the the mutex for us */
+}
+
+pthread_cond_broadcast_pi(cond)
+{
+	lock(cond->__data.__lock);
+	unlock(cond->__data.__lock);
+	futex_requeue_pi(cond->data.__futex, cond->mutex);
+}
+
+The actual glibc implementation will likely test for PI and make the
+necessary changes inside the existing calls rather than creating new calls
+for the PI cases.  Similar changes are needed for pthread_cond_timedwait()
+and pthread_cond_signal().
+
+Implementation
+--------------
+
+In order to ensure the rt_mutex has an owner if it has waiters, it is necessary
+for both the requeue code, as well as the waiting code, to be able to acquire
+the rt_mutex before returning to user space.  The requeue code cannot simply
+wake the waiter and leave it to acquire the rt_mutex as it would open a race
+window between the requeue call returning to user space and the waiter waking
+and starting to run.  This is especially true in the uncontended case.
+
+The solution involves two new rt_mutex helper routines,
+rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which allow the
+requeue code to acquire an uncontended rt_mutex on behalf of the waiter and to
+enqueue the waiter on a contended rt_mutex.  Two new system calls provide the
+kernel<->user interface to requeue_pi: FUTEX_WAIT_REQUEUE_PI and
+FUTEX_REQUEUE_CMP_PI.
+
+FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() and
+pthread_cond_timedwait()) to block on the initial futex and wait to be requeued
+to a PI-aware futex.  The implementation is the result of a high-speed
+collision between futex_wait() and futex_lock_pi(), with some extra logic to
+check for the additional wake-up scenarios.
+
+FUTEX_REQUEUE_CMP_PI is called by the waker (pthread_cond_broadcast() and
+pthread_cond_signal()) to requeue and possibly wake the waiting tasks.
+Internally, this system call is still handled by futex_requeue (by passing
+requeue_pi=1).  Before requeueing, futex_requeue() attempts to acquire the
+requeue target PI futex on behalf of the top waiter.  If it can, this waiter is
+woken.  futex_requeue() then proceeds to requeue the remaining
+nr_wake+nr_requeue tasks to the PI futex, calling rt_mutex_start_proxy_lock()
+prior to each requeue to prepare the task as a waiter on the underlying
+rt_mutex.  It is possible that the lock can be acquired at this stage as well,
+if so, the next waiter is woken to finish the acquisition of the lock.
+FUTEX_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but their sum is
+all that really matters.  futex_requeue() will wake or requeue up to nr_wake +
+nr_requeue tasks.  It will wake only as many tasks as it can acquire the lock
+for, which in the majority of cases should be 0 as good programming practice
+dictates that the caller of either pthread_cond_broadcast() or
+pthread_cond_signal() acquire the mutex prior to making the call.
+FUTEX_REQUEUE_PI requires that nr_wake=1.  nr_requeue should be INT_MAX for
+broadcast and 0 for signal.
+
+
+
-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ