Message-ID: <49DBC68F.3080508@us.ibm.com>
Date: Tue, 07 Apr 2009 14:33:03 -0700
From: Vernon Mauery <vernux@...ibm.com>
To: Darren Hart <dvhltc@...ibm.com>
CC: Thomas Gleixner <tglx@...utronix.de>, linux-kernel@...r.kernel.org,
Sripathi Kodi <sripathik@...ibm.com>,
Peter Zijlstra <peterz@...radead.org>,
John Stultz <johnstul@...ibm.com>,
Steven Rostedt <rostedt@...dmis.org>,
Dinakar Guniguntala <dino@...ibm.com>,
Ulrich Drepper <drepper@...hat.com>,
Eric Dumazet <dada1@...mosbay.com>,
Ingo Molnar <mingo@...e.hu>, Jakub Jelinek <jakub@...hat.com>,
niv@...ux.vnet.ibm.com
Subject: Re: [tip PATCH v7 0/9] RFC: futex: requeue pi implementation

Darren Hart wrote:
> Thomas Gleixner wrote:
>> Darren,
>>
>> On Fri, 3 Apr 2009, Darren Hart wrote:
[snip]
>> Darren, could you please polish the initial design notes - especially
>> point out the subtle differences between requeue and requeue_pi - and
>> send them into the thread? That might help Uli and Jakub and we
>> definitly want to have that info preserved in Documentation/ as well.
>
> I'll be away most of the day so I've whipped something up quickly in
> order to get discussion started. It is probably a little heavy on the
> glibc details and a little light on the kernel details for inclusion in
> Documentation/. Please review for high-level content and let me know
> what you would like to see more or less of. I'll pick this back up
> tomorrow to incorporate any suggestions and flesh out the kernel details
> a bit more.
>
Darren,
A gentle review of your documentation follows.
> Futex Requeue PI
> ----------------
>
> Requeueing of tasks from a non-PI futex to a PI futex requires special
> handling in order to ensure the underlying rt_mutex is never left without an
> owner if it has waiters; doing so would break the PI boosting logic [see
> rt-mutex-design.txt]. For brevity, this action will be referred
> to as "requeue_pi" throughout this document. Priority inheritance is
> abbreviated throughout as "PI".
>
> Motivation
> ----------
>
> Without requeue_pi, the current implementation of pthread_cond_broadcast()
> must resort to waking all the tasks waiting on a pthread_condvar and letting
> them try and sort out which task gets to run first in classic thundering-herd
^try to sort out^
> formation. An ideal implementation would wake the highest priority waiter,
^highest-priority^
> and leave the rest to the natural wakeup inherent in unlocking the mutex
> associated with the condvar.
>
> Consider the simplified glibc calls:
>
> /* caller must lock mutex */
> pthread_cond_wait(cond, mutex)
> {
>         lock(cond->__data.__lock);
>         unlock(mutex);
>         do {
>                 unlock(cond->__data.__lock);
>                 futex_wait(cond->__data.__futex);
>                 lock(cond->__data.__lock);
>         } while(...)
>         unlock(cond->__data.__lock);
>         lock(mutex);
> }
>
> pthread_cond_broadcast(cond)
> {
>         lock(cond->__data.__lock);
>         unlock(cond->__data.__lock);
>         futex_requeue(cond->__data.__futex, cond->mutex);
> }
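
As an aside, if this ends up in Documentation/, it might help readers less
familiar with the futex interface to note that the futex_wait() and
futex_requeue() wrappers above boil down to plain futex(2) calls. A rough
sketch (error handling and the private-futex flag omitted; glibc's real
wrappers differ in detail):

        #include <linux/futex.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <limits.h>

        static int futex_wait(int *uaddr, int val)
        {
                /* Sleep until woken, provided *uaddr still equals val. */
                return syscall(SYS_futex, uaddr, FUTEX_WAIT, val,
                               NULL, NULL, 0);
        }

        static int futex_requeue(int *uaddr, int *uaddr2, int val)
        {
                /* Wake one waiter on uaddr and move the rest over to
                 * uaddr2, provided *uaddr still equals val. */
                return syscall(SYS_futex, uaddr, FUTEX_CMP_REQUEUE,
                               1, INT_MAX, uaddr2, val);
        }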
>
> Once pthread_cond_broadcast() requeues the tasks, the cond->mutex has waiters.
> Note that pthread_cond_wait() attempts to lock the mutex only after it has
> returned to userspace. This will leave the underlying rt_mutex with waiters,
> and no owner, breaking the previously mentioned PI boosting algorithms.
^PI-boosting^
>
> In order to support PI-aware pthread_condvars, the kernel needs to be able to
> requeue tasks to PI futexes. This support implies that upon a successful
> futex_wait system call, the caller would return to userspace already holding
> the PI futex. As such the glibc implemenation would be modified as follows:
^implementation^
>
>
> /* caller must lock mutex */
> pthread_cond_wait_pi(cond, mutex)
> {
>         lock(cond->__data.__lock);
>         unlock(mutex);
>         do {
>                 unlock(cond->__data.__lock);
>                 futex_wait_requeue_pi(cond->__data.__futex);
>                 lock(cond->__data.__lock);
>         } while(...)
>         unlock(cond->__data.__lock);
>         /* the kernel acquired the mutex for us */
> }
>
> pthread_cond_broadcast_pi(cond)
> {
>         lock(cond->__data.__lock);
>         unlock(cond->__data.__lock);
>         futex_requeue_pi(cond->__data.__futex, cond->mutex);
> }
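
If I'm reading the patch set right, the PI wrappers would map onto the new
op codes in much the same way. A sketch based on my reading of v7 (the
argument layout is my guess from the usual futex(2) multiplexing, so treat
it as illustration only):

        /* Wait on the cond futex uaddr; if requeued, finish acquiring
         * the PI futex uaddr2 before returning to userspace. */
        static int futex_wait_requeue_pi(int *uaddr, int val, int *uaddr2)
        {
                return syscall(SYS_futex, uaddr, FUTEX_WAIT_REQUEUE_PI,
                               val, NULL, uaddr2, 0);
        }

        /* Wake one waiter and requeue up to nr_requeue others onto the
         * PI futex uaddr2 (CMP variant: val is the expected value of
         * *uaddr). */
        static int futex_requeue_pi(int *uaddr, int val, int *uaddr2,
                                    int nr_requeue)
        {
                return syscall(SYS_futex, uaddr, FUTEX_CMP_REQUEUE_PI,
                               1, nr_requeue, uaddr2, val);
        }

Note that the waiter side has to pass the target PI futex in as well, which
the simplified pseudo-code above glosses over.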
>
> The actual glibc implemenation will likely test for PI and make the
^implementation^
> necessary changes inside the existing calls rather than creating new calls
> for the PI cases. Similar changes are needed for pthread_cond_timedwait()
> and pthread_cond_signal().
>
> Implementation
> --------------
>
> In order to ensure the rt_mutex has an owner if it has waiters, it is necessary
> for the both the requeue code as well as the waiting code to be able to acquire
> the rt_mutex before returning to userspace (the requeue code cannot simply wake
I would suggest not using a parenthetical expression here because of the length
of the enclosed sentence. Also, you have a missing right parenthesis.
> the waiter and leave it to acquire the rt_mutex as it would open a race window
> between the requeue call returning to userspace and the waiter waking and
> starting to run. This is especially true in the uncontended case.
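
Maybe a concrete interleaving would make the window easier to see. My own
illustration, not taken from the patches:

        requeue path                        top waiter
        ------------                        ----------
        wake top waiter
        return to userspace
          <window: rt_mutex can have
           waiters but no owner to boost>
                                            finally scheduled,
                                            acquires the rt_mutex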
>
> To accomplish this, rt_mutex lock acquistion has to be able to be accomlished
^acquisition^ ^accomplished^
> vicariously by the requeue code on behalf of the waiter. This is accomplished
Lots of accomplishing going on here^^^. And one is missing its 'p'. Maybe
rewritten as:
To accomplish this, the rt_mutex lock must be acquired vicariously by the
requeue code on behalf of the waiter. This is done by two new rt_mutex
helper routines: rt_mutex_start_proxy_lock(), called by the requeue code;
and rt_mutex_finish_proxy_lock(), called by the waiter upon wakeup.
> via two new rt_mutex helper routines: rt_mutex_start_proxy_lock() (called by
> the requeue code) and rt_mutex_finish_proxy_lock(), called by the waiter upon
> wakeup.
I also changed the punctuation there to remove the parenthetical expression
modifying the first function, where only a comma was used on the second. Too
many commas and modifications made me think that maybe a semicolon would work
well.
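
While we're at it, spelling out the helper prototypes in the Documentation/
version might make the division of labor clearer. From my reading of the
v7 patches (do double-check me), roughly:

        /* Called by the requeue path to enqueue 'task' as a waiter on
         * 'lock', or to take the lock on its behalf if it turns out to
         * be uncontended. */
        int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
                                      struct rt_mutex_waiter *waiter,
                                      struct task_struct *task,
                                      int detect_deadlock);

        /* Called by the waiter after wakeup to complete (or clean up
         * after) the acquisition started on its behalf above. */
        int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
                                       struct hrtimer_sleeper *to,
                                       struct rt_mutex_waiter *waiter,
                                       int detect_deadlock);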
>
> Two new system calls provide the kernel<->userspace interface to requeue_pi:
> FUTEX_WAIT_REQUEUE_PI and FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI).
>
> FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() and
> pthread_cond_timedwait()) to block on the initial futex and wait to be requeued
> to a PI aware futex. The implemenation is basically the result of a high speed
^PI-aware^ ^implementation^ ^high-speed^
> collision between futex_wait() and futex_lock_pi(), with some extra logic to
> check for the additional wake-up scenarios.
>
> FUTEX_REQUEUE_PI is called by the waker (pthread_cond_broadcast() and
> pthread_cond_signal()) to requeue and possibly wake the waiting tasks.
> Internally, this system call is still handled by futex_requeue (by passing
> requeue_pi=1). Before requeueing, futex_requeue() attempts to acquire the
> requeue target PI futex on behalf of the top waiter, if it can, this waiter is
^. If^
> woken. It then proceeds to requeue the remaining nr_wake+nr_requeue tasks to
> the PI futex. It calls rt_mutex_start_proxy_lock() prior to each requeue to
> prepare the task as a waiter on the underlying rt_mutex. It is possible that
> the lock can be acquired at this stage as well, if so, the next waiter is woken
^. If^
> to finish the acquisition of the lock. FUTEX_REQUEUE_PI accepts nr_wake and
> nr_requeue as arguments, but their sum is all that really matters. It will
Lots of 'It' by this point. Maybe replacing 'it' with a proper noun a couple
of times would help make sure we are all on the same page.
> wake or requeue up to nr_wake + nr_requeue tasks. It will wake only as many
> tasks as it can acquire the lock for, which in the majority of cases should be
> 0 as good programming practice dictates that the caller of either
^0, as^
> pthread_cond_broadcast() or pthread_cond_signal() acquire the mutex prior to
> making the call. FUTEX_REQUEUE_PI requires that nr_wake=1. nr_requeue should
> be INT_MAX for broadcast and 0 for signal.
One final point is that generally numbers less than ten are spelled out, but
since this is a technical document where we see things like nr_wake=1, I figured
it would look more consistent to let those 0s slide.
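
And to make the nr_wake/nr_requeue point concrete, using the sketch from
earlier (cond_futex, cond_val, and mutex_futex standing in for the real
glibc internals):

        /* pthread_cond_broadcast_pi(): wake one, requeue the rest. */
        futex_requeue_pi(&cond_futex, cond_val, &mutex_futex, INT_MAX);

        /* pthread_cond_signal_pi(): wake one, requeue none. */
        futex_requeue_pi(&cond_futex, cond_val, &mutex_futex, 0);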
Keep in mind that I am not really an editor, so take my changes with a grain
of salt. (I am only *married* to an editor, and if she gets wind that I am
pretending to be one on the side, I will be in big trouble. :)
--Vernon