linux-kernel - Re: [tip PATCH v6 8/8] RFC: futex: add requeue

Message-ID: <49D15667.9050307@us.ibm.com>
Date:	Mon, 30 Mar 2009 16:31:51 -0700
From:	Darren Hart <dvhltc@...ibm.com>
To:	Eric Dumazet <dada1@...mosbay.com>
CC:	linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
	Sripathi Kodi <sripathik@...ibm.com>,
	Peter Zijlstra <peterz@...radead.org>,
	John Stultz <johnstul@...ibm.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Dinakar Guniguntala <dino@...ibm.com>,
	Ulrich Drepper <drepper@...hat.com>,
	Ingo Molnar <mingo@...e.hu>, Jakub Jelinek <jakub@...hat.com>
Subject: Re: [tip PATCH v6 8/8] RFC: futex: add requeue_pi calls

Darren Hart wrote:
> Eric Dumazet wrote:
> 
> Two more nice catches, thanks.  Corrected patch below.

If anyone is still wanting to pull these from git, you can grab them 
from my -dev branch.  Note: I pop and push branches to this branch, 
whereas the versioned branches will remain constant.

http://git.kernel.org/?p=linux/kernel/git/dvhart/linux-2.6-tip-hacks.git;a=shortlog;h=requeue-pi-dev

Thanks,

Darren

> 
>>> +static long futex_lock_pi_restart(struct restart_block *restart)
>>> +{
>>> +    u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
>>> +    ktime_t t, *tp = NULL;
>>> +    int fshared = restart->futex.flags & FLAGS_SHARED;
>>> +
>>> +    if (restart->futex.flags | FLAGS_HAS_TIMEOUT) {
>>
>> if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
>>
>>> +        t.tv64 = restart->futex.time;
>>> +        tp = &t;
>>> +    }
>>> +    restart->fn = do_no_restart_syscall;
>>> +
>>
>>
>> Strange your compiler did not complain...
> 
> Well, the comparison with an "|" is still valid - just happens to always
> be true :-)  I didn't get any errors - perhaps I should be compiling
> with some additional options?
> 
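A small userspace sketch of why the "|" form compiled cleanly yet was
wrong. FLAGS_HAS_TIMEOUT is 0x04 in the patch; the values used here for
FLAGS_SHARED and FLAGS_CLOCKRT, and both helper names, are assumptions
made only for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* FLAGS_HAS_TIMEOUT matches the patch; the other two values are assumed. */
#define FLAGS_SHARED       0x01u
#define FLAGS_CLOCKRT      0x02u
#define FLAGS_HAS_TIMEOUT  0x04u

/* Buggy form: OR-ing in a non-zero constant can never produce zero. */
static bool has_timeout_buggy(uint32_t flags)
{
    return (flags | FLAGS_HAS_TIMEOUT) != 0;
}

/* Intended form: AND keeps only the bit we are asking about. */
static bool has_timeout_fixed(uint32_t flags)
{
    return (flags & FLAGS_HAS_TIMEOUT) != 0;
}
```

With flags == 0 the buggy helper still reports a timeout, which is the
case the restart handler has to get right.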
> 
> RFC: futex: add requeue_pi calls
> 
> From: Darren Hart <dvhltc@...ibm.com>
> 
> PI Futexes and their underlying rt_mutex cannot be left ownerless if there are
> pending waiters as this will break the PI boosting logic, so the standard
> requeue commands aren't sufficient.  The new commands properly manage pi futex
> ownership by ensuring a futex with waiters has an owner at all times.  This
> will allow glibc to properly handle pi mutexes with pthread_condvars.
> 
> The approach taken here is to create two new futex op codes:
> 
> FUTEX_WAIT_REQUEUE_PI:
> Tasks will use this op code to wait on a futex (such as a non-pi waitqueue)
> and wake after they have been requeued to a pi futex.  Prior to returning to
> userspace, they will acquire this pi futex (and the underlying rt_mutex).
> 
> futex_wait_requeue_pi() is the result of a high speed collision between
> futex_wait() and futex_lock_pi() (with the first part of futex_lock_pi() being
> done by futex_proxy_trylock_atomic() on behalf of the top_waiter).
> 
> FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI):
> This call must be used to wake tasks waiting with FUTEX_WAIT_REQUEUE_PI,
> regardless of how many tasks the caller intends to wake or requeue.
> pthread_cond_broadcast() should call this with nr_wake=1 and
> nr_requeue=INT_MAX.  pthread_cond_signal() should call this with nr_wake=1 and
> nr_requeue=0.  The reason being we need both callers to get the benefit of the
> futex_proxy_trylock_atomic() routine.  futex_requeue() also enqueues the
> top_waiter on the rt_mutex via rt_mutex_start_proxy_lock().
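The calling convention described above could be modelled roughly as
follows. This is only a sketch: the SKETCH_ constants copy the op code
value from this patch's futex.h hunk, and the struct and helper names are
hypothetical, not glibc or kernel API. No syscall is made.

```c
#include <limits.h>

/* Value taken from this patch's futex.h hunk; the name is local to the sketch. */
#define SKETCH_FUTEX_REQUEUE_PI  12

/* Hypothetical container for the arguments a caller would pass. */
struct requeue_pi_args {
    int op;          /* futex op code */
    int nr_wake;     /* must be 1 for requeue_pi */
    int nr_requeue;  /* 0..INT_MAX */
};

/* pthread_cond_signal(): wake (or hand the lock to) one waiter, requeue none. */
static struct requeue_pi_args cond_signal_args(void)
{
    struct requeue_pi_args a = { SKETCH_FUTEX_REQUEUE_PI, 1, 0 };
    return a;
}

/* pthread_cond_broadcast(): one waiter gets the lock attempt, the rest are requeued. */
static struct requeue_pi_args cond_broadcast_args(void)
{
    struct requeue_pi_args a = { SKETCH_FUTEX_REQUEUE_PI, 1, INT_MAX };
    return a;
}

/* Mirrors the nr_wake != 1 check at the top of futex_requeue() in the patch. */
static int requeue_pi_args_valid(const struct requeue_pi_args *a)
{
    return a->nr_wake == 1 && a->nr_requeue >= 0;
}
```

Both callers keep nr_wake at 1 so that each goes through the
futex_proxy_trylock_atomic() path; only nr_requeue differs.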
> 
> Changelog:
> V7pre: -Corrected FLAGS_HAS_TIMEOUT flag detection logic per Eric Dumazet
> V6: -Moved non requeue_pi related fixes/changes into separate patches
>    -Make use of new double_unlock_hb()
>    -Futex key management updates
>    -Removed unnecessary futex_requeue_pi_cleanup() routine
>    -Return -EINVAL if futex_wake is called with q.rt_waiter != NULL
>    -Rewrote futex_wait_requeue_pi() wakeup logic
>    -Rewrote requeue/wakeup loop
>    -Renamed futex_requeue_pi_init() to futex_proxy_trylock_atomic()
>    -Handle third party owner, removed -EMORON :-(
>    -Comment updates
> V5: -Update futex_requeue to allow for nr_requeue == 0
>    -Whitespace cleanup
>    -Added task_count var to futex_requeue to avoid confusion between
>     ret, res, and ret used to count wakes and requeues
> V4: -Cleanups to pass checkpatch.pl
>    -Added missing goto out; in futex_wait_requeue_pi()
>    -Moved rt_mutex_handle_wakeup to the rt_mutex_enqueue_task patch as they
>     are a functional pair.
>    -Fixed several error exit paths that failed to unqueue the futex_q, which
>     not only would leave the futex_q on the hb, but would have caused an exit
>     race with the waiter since they weren't synchronized on the hb lock.  Thanks
>     Sripathi for catching this.
>    -Fix pi_state handling in futex_requeue
>    -Several other minor fixes to futex_requeue_pi
>    -add requeue_futex function and force the requeue in requeue_pi even for the
>     task we wake in the requeue loop
>    -refill the pi state cache at the beginning of futex_requeue for requeue_pi
>    -have futex_requeue_pi_init ensure it stores off the pi_state for use in
>     futex_requeue
>    - Delayed starting the hrtimer until after TASK_INTERRUPTIBLE is set
>    - Fixed NULL pointer bug when futex_wait_requeue_pi() has no timer and
>      receives a signal after waking on uaddr2.  Added has_timeout to the
>      restart->futex structure.
> V3: -Added FUTEX_CMP_REQUEUE_PI op
>    -Put fshared support back.  So long as it is encoded in the op code, we
>     assume both the uaddr's are either private or shared, but not mixed.
>    -Fixed access to expected value of uaddr2 in futex_wait_requeue_pi()
> V2: -Added rt_mutex enqueueing to futex_requeue_pi_init
>    -Updated fault handling and exit logic
> V1: -Initial version
> 
> Signed-off-by: Darren Hart <dvhltc@...ibm.com>
> Cc: Thomas Gleixner <tglx@...utronix.de>
> Cc: Sripathi Kodi <sripathik@...ibm.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: John Stultz <johnstul@...ibm.com>
> Cc: Steven Rostedt <rostedt@...dmis.org>
> Cc: Dinakar Guniguntala <dino@...ibm.com>
> Cc: Ulrich Drepper <drepper@...hat.com>
> Cc: Eric Dumazet <dada1@...mosbay.com>
> Cc: Ingo Molnar <mingo@...e.hu>
> Cc: Jakub Jelinek <jakub@...hat.com>
> ---
> 
> include/linux/futex.h       |    8 +
> include/linux/thread_info.h |    3 
> kernel/futex.c              |  533 +++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 524 insertions(+), 20 deletions(-)
> 
> 
> diff --git a/include/linux/futex.h b/include/linux/futex.h
> index 3bf5bb5..b05519c 100644
> --- a/include/linux/futex.h
> +++ b/include/linux/futex.h
> @@ -23,6 +23,9 @@ union ktime;
> #define FUTEX_TRYLOCK_PI    8
> #define FUTEX_WAIT_BITSET    9
> #define FUTEX_WAKE_BITSET    10
> +#define FUTEX_WAIT_REQUEUE_PI    11
> +#define FUTEX_REQUEUE_PI    12
> +#define FUTEX_CMP_REQUEUE_PI    13
> 
> #define FUTEX_PRIVATE_FLAG    128
> #define FUTEX_CLOCK_REALTIME    256
> @@ -38,6 +41,11 @@ union ktime;
> #define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG)
> #define FUTEX_WAIT_BITSET_PRIVATE    (FUTEX_WAIT_BITS | FUTEX_PRIVATE_FLAG)
> #define FUTEX_WAKE_BITSET_PRIVATE    (FUTEX_WAKE_BITS | FUTEX_PRIVATE_FLAG)
> +#define FUTEX_WAIT_REQUEUE_PI_PRIVATE    (FUTEX_WAIT_REQUEUE_PI | \
> +                     FUTEX_PRIVATE_FLAG)
> +#define FUTEX_REQUEUE_PI_PRIVATE    (FUTEX_REQUEUE_PI | FUTEX_PRIVATE_FLAG)
> +#define FUTEX_CMP_REQUEUE_PI_PRIVATE    (FUTEX_CMP_REQUEUE_PI | \
> +                     FUTEX_PRIVATE_FLAG)
> 
> /*
>  * Support for robust futexes: the kernel cleans up held futexes at
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index e6b820f..a8cc4e1 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -21,13 +21,14 @@ struct restart_block {
>         struct {
>             unsigned long arg0, arg1, arg2, arg3;
>         };
> -        /* For futex_wait */
> +        /* For futex_wait and futex_wait_requeue_pi */
>         struct {
>             u32 *uaddr;
>             u32 val;
>             u32 flags;
>             u32 bitset;
>             u64 time;
> +            u32 *uaddr2;
>         } futex;
>         /* For nanosleep */
>         struct {
> diff --git a/kernel/futex.c b/kernel/futex.c
> index a9c7da1..115ec52 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -19,6 +19,10 @@
>  *  PRIVATE futexes by Eric Dumazet
>  *  Copyright (C) 2007 Eric Dumazet <dada1@...mosbay.com>
>  *
> + *  Requeue-PI support by Darren Hart <dvhltc@...ibm.com>
> + *  Copyright (C) IBM Corporation, 2009
> + *  Thanks to Thomas Gleixner for conceptual design and careful reviews.
> + *
>  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
>  *  enough at me, Linus for the original (flawed) idea, Matthew
>  *  Kirkwood for proof-of-concept implementation.
> @@ -109,6 +113,9 @@ struct futex_q {
>     struct futex_pi_state *pi_state;
>     struct task_struct *task;
> 
> +    /* rt_waiter storage for requeue_pi: */
> +    struct rt_mutex_waiter *rt_waiter;
> +
>     /* Bitset for the optional bitmasked wakeup */
>     u32 bitset;
> };
> @@ -829,7 +836,7 @@ static int futex_wake(u32 __user *uaddr, int fshared, int nr_wake, u32 bitset)
> 
>     plist_for_each_entry_safe(this, next, head, list) {
>         if (match_futex (&this->key, &key)) {
> -            if (this->pi_state) {
> +            if (this->pi_state || this->rt_waiter) {
>                 ret = -EINVAL;
>                 break;
>             }
> @@ -970,20 +977,116 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
>     q->key = *key2;
> }
> 
> -/*
> - * Requeue all waiters hashed on one physical page to another
> - * physical page.
> +/**
> + * futex_proxy_trylock_atomic() - Attempt an atomic lock for the top waiter
> + * @pifutex:    the user address of the to futex
> + * @hb1:    the from futex hash bucket, must be locked by the caller
> + * @hb2:    the to futex hash bucket, must be locked by the caller
> + * @key1:    the from futex key
> + * @key2:    the to futex key
> + * @ps:    address to store the pi_state pointer
> + *
> + * Try and get the lock on behalf of the top waiter if we can do it atomically.
> + * Wake the top waiter if we succeed.  hb1 and hb2 must be held by the caller.
> + *
> + * Faults occur for two primary reasons at this point:
> + * 1) The address isn't mapped
> + * 2) The address isn't writeable
> + *
> + * We return EFAULT on either of these cases and rely on the caller to handle
> + * them.
> + *
> + * Returns:
> + *  0 - failed to acquire the lock atomically
> + *  1 - acquired the lock
> + * <0 - error
> + */
> +static int futex_proxy_trylock_atomic(u32 __user *pifutex,
> +                 struct futex_hash_bucket *hb1,
> +                 struct futex_hash_bucket *hb2,
> +                 union futex_key *key1, union futex_key *key2,
> +                 struct futex_pi_state **ps)
> +{
> +    struct futex_q *top_waiter;
> +    u32 curval;
> +    int ret;
> +
> +    if (get_futex_value_locked(&curval, pifutex))
> +        return -EFAULT;
> +
> +    top_waiter = futex_top_waiter(hb1, key1);
> +
> +    /* There are no waiters, nothing for us to do. */
> +    if (!top_waiter)
> +        return 0;
> +
> +    /*
> +     * Either take the lock for top_waiter or set the FUTEX_WAITERS bit.
> +     * The pi_state is returned in ps in contended cases.
> +     */
> +    ret = futex_lock_pi_atomic(pifutex, hb2, key2, ps, top_waiter->task);
> +    if (ret == 1) {
> +        /*
> +         * Set the top_waiter key for the requeue target futex so the
> +         * waiter can detect the wakeup on the right futex, but remove
> +         * it from the hb so it can detect atomic lock acquisition.
> +         */
> +        drop_futex_key_refs(&top_waiter->key);
> +        get_futex_key_refs(key2);
> +        top_waiter->key = *key2;
> +        WARN_ON(plist_node_empty(&top_waiter->list));
> +        plist_del(&top_waiter->list, &top_waiter->list.plist);
> +        /*
> +         * FIXME: wake_futex() wakes first, then nulls the lock_ptr,
> +         * and uses a memory barrier.  Do we need to?
> +         */
> +        top_waiter->lock_ptr = NULL;
> +        wake_up(&top_waiter->waiter);
> +    }
> +
> +    return ret;
> +}
> +
> +/**
> + * futex_requeue() - Requeue waiters from uaddr1 to uaddr2
> + * @uaddr1:    source futex user address
> + * @uaddr2:    target futex user address
> + * @nr_wake:    number of waiters to wake (must be 1 for requeue_pi)
> + * @nr_requeue:    number of waiters to requeue (0-INT_MAX)
> + * @requeue_pi:    if we are attempting to requeue from a non-pi futex to a
> + *         pi futex (pi to pi requeue is not supported)
> + *
> + * Requeue waiters on uaddr1 to uaddr2. In the requeue_pi case, try to acquire
> + * uaddr2 atomically on behalf of the top waiter.
> + *
> + * Returns:
> + * >=0: on success, the number of tasks requeued or woken
> + *  <0: on error
>  */
> static int futex_requeue(u32 __user *uaddr1, int fshared, u32 __user *uaddr2,
> -             int nr_wake, int nr_requeue, u32 *cmpval)
> +             int nr_wake, int nr_requeue, u32 *cmpval,
> +             int requeue_pi)
> {
>     union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT;
> +    int drop_count = 0, task_count = 0, ret;
> +    struct futex_pi_state *pi_state = NULL;
>     struct futex_hash_bucket *hb1, *hb2;
>     struct plist_head *head1;
>     struct futex_q *this, *next;
> -    int ret, drop_count = 0;
> +    u32 curval2;
> +
> +    if (requeue_pi) {
> +        if (refill_pi_state_cache())
> +            return -ENOMEM;
> +        if (nr_wake != 1)
> +            return -EINVAL;
> +    }
> 
> retry:
> +    if (pi_state != NULL) {
> +        free_pi_state(pi_state);
> +        pi_state = NULL;
> +    }
> +
>     ret = get_futex_key(uaddr1, fshared, &key1);
>     if (unlikely(ret != 0))
>         goto out;
> @@ -1022,19 +1125,92 @@ retry_private:
>         }
>     }
> 
> +    if (requeue_pi) {
> +        /* Attempt to acquire uaddr2 and wake the top_waiter. */
> +        ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
> +                         &key2, &pi_state);
> +
> +        /*
> +         * At this point the top_waiter has either taken uaddr2 or is
> +         * waiting on it.  If the former, then the pi_state will not
> +         * exist yet, look it up one more time to ensure we have a
> +         * reference to it.
> +         */
> +        if (ret == 1 && !pi_state) {
> +            task_count++;
> +            ret = get_futex_value_locked(&curval2, uaddr2);
> +            if (!ret)
> +                ret = lookup_pi_state(curval2, hb2, &key2,
> +                              &pi_state);
> +        }
> +
> +        switch (ret) {
> +        case 0:
> +            break;
> +        case -EFAULT:
> +            double_unlock_hb(hb1, hb2);
> +            put_futex_key(fshared, &key2);
> +            put_futex_key(fshared, &key1);
> +            ret = get_user(curval2, uaddr2);
> +            if (!ret)
> +                goto retry;
> +            goto out;
> +        case -EAGAIN:
> +            /* The owner was exiting, try again. */
> +            double_unlock_hb(hb1, hb2);
> +            put_futex_key(fshared, &key2);
> +            put_futex_key(fshared, &key1);
> +            cond_resched();
> +            goto retry;
> +        default:
> +            goto out_unlock;
> +        }
> +    }
> +
>     head1 = &hb1->chain;
>     plist_for_each_entry_safe(this, next, head1, list) {
> -        if (!match_futex (&this->key, &key1))
> +        if (task_count - nr_wake >= nr_requeue)
> +            break;
> +
> +        if (!match_futex(&this->key, &key1))
>             continue;
> -        if (++ret <= nr_wake) {
> +
> +        /* This can go after we're satisfied with testing. */
> +        if (!requeue_pi)
> +            WARN_ON(this->rt_waiter);
> +
> +        /*
> +         * Wake nr_wake waiters.  For requeue_pi, if we acquired the
> +         * lock, we already woke the top_waiter.  If not, it will be
> +         * woken by futex_unlock_pi().
> +         */
> +        if (++task_count <= nr_wake && !requeue_pi) {
>             wake_futex(this);
> -        } else {
> -            requeue_futex(this, hb1, hb2, &key2);
> -            drop_count++;
> +            continue;
> +        }
> 
> -            if (ret - nr_wake >= nr_requeue)
> -                break;
> +        /*
> +         * Requeue nr_requeue waiters and possibly one more in the case
> +         * of requeue_pi if we couldn't acquire the lock atomically.
> +         */
> +        if (requeue_pi) {
> +            /* This can go after we're satisfied with testing. */
> +            WARN_ON(!this->rt_waiter);
> +
> +            /* Prepare the waiter to take the rt_mutex. */
> +            atomic_inc(&pi_state->refcount);
> +            this->pi_state = pi_state;
> +            ret = rt_mutex_start_proxy_lock(&pi_state->pi_mutex,
> +                            this->rt_waiter,
> +                            this->task, 1);
> +            if (ret) {
> +                this->pi_state = NULL;
> +                free_pi_state(pi_state);
> +                goto out_unlock;
> +            }
>         }
> +        requeue_futex(this, hb1, hb2, &key2);
> +        drop_count++;
>     }
> 
> out_unlock:
> @@ -1049,7 +1225,9 @@ out_put_keys:
> out_put_key1:
>     put_futex_key(fshared, &key1);
> out:
> -    return ret;
> +    if (pi_state != NULL)
> +        free_pi_state(pi_state);
> +    return ret ? ret : task_count;
> }
> 
> /* The key must be already stored in q->key. */
> @@ -1272,6 +1450,8 @@ handle_fault:
> #define FLAGS_HAS_TIMEOUT    0x04
> 
> static long futex_wait_restart(struct restart_block *restart);
> +static long futex_wait_requeue_pi_restart(struct restart_block *restart);
> +static long futex_lock_pi_restart(struct restart_block *restart);
> 
> /**
>  * finish_futex_lock_pi() - Post lock pi_state and corner case management
> @@ -1419,6 +1599,7 @@ static int futex_wait(u32 __user *uaddr, int fshared,
> 
>     q.pi_state = NULL;
>     q.bitset = bitset;
> +    q.rt_waiter = NULL;
> 
>     if (abs_time) {
>         unsigned long slack;
> @@ -1575,6 +1756,7 @@ static int futex_lock_pi(u32 __user *uaddr, int fshared,
>     }
> 
>     q.pi_state = NULL;
> +    q.rt_waiter = NULL;
> retry:
>     q.key = FUTEX_KEY_INIT;
>     ret = get_futex_key(uaddr, fshared, &q.key);
> @@ -1670,6 +1852,20 @@ uaddr_faulted:
>     goto retry;
> }
> 
> +static long futex_lock_pi_restart(struct restart_block *restart)
> +{
> +    u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
> +    ktime_t t, *tp = NULL;
> +    int fshared = restart->futex.flags & FLAGS_SHARED;
> +
> +    if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
> +        t.tv64 = restart->futex.time;
> +        tp = &t;
> +    }
> +    restart->fn = do_no_restart_syscall;
> +
> +    return (long)futex_lock_pi(uaddr, fshared, restart->futex.val, tp, 0);
> +}
> 
> /*
>  * Userspace attempted a TID -> 0 atomic transition, and failed.
> @@ -1772,6 +1968,290 @@ pi_faulted:
>     return ret;
> }
> 
> +/**
> + * futex_wait_requeue_pi() - Wait on uaddr and take uaddr2
> + * @uaddr:    the futex we initially wait on (non-pi)
> + * @fshared:    whether the futexes are shared (1) or not (0).  They must be
> + *         the same type, no requeueing from private to shared, etc.
> + * @val:    the expected value of uaddr
> + * @abs_time:    absolute timeout
> + * @bitset:    32 bit wakeup bitset set by userspace, defaults to all.
> + * @clockrt:    whether to use CLOCK_REALTIME (1) or CLOCK_MONOTONIC (0)
> + * @uaddr2:    the pi futex we will take prior to returning to user-space
> + *
> + * The caller will wait on uaddr and will be requeued by futex_requeue() to
> + * uaddr2 which must be PI aware.  Normal wakeup will wake on uaddr2 and
> + * complete the acquisition of the rt_mutex prior to returning to userspace.
> + * This ensures the rt_mutex maintains an owner when it has waiters; without
> + * one, the pi logic wouldn't know which task to boost/deboost, if there was a
> + * need to.
> + *
> + * We call schedule in futex_wait_queue_me() when we enqueue and return there
> + * via the following:
> + * 1) wakeup on uaddr2 after an atomic lock acquisition by futex_requeue()
> + * 2) wakeup on uaddr2 after a requeue and subsequent unlock
> + * 3) signal (before or after requeue)
> + * 4) timeout (before or after requeue)
> + *
> + * If 3, we setup a restart_block with futex_wait_requeue_pi() as the function.
> + *
> + * If 2, we may then block on trying to take the rt_mutex and return via:
> + * 5) successful lock
> + * 6) signal
> + * 7) timeout
> + * 8) other lock acquisition failure
> + *
> + * If 6, we setup a restart_block with futex_lock_pi() as the function.
> + *
> + * If 4 or 7, we cleanup and return with -ETIMEDOUT.
> + *
> + * Returns:
> + *  0 - On success
> + * <0 - On error
> + */
> +static int futex_wait_requeue_pi(u32 __user *uaddr, int fshared,
> +                 u32 val, ktime_t *abs_time, u32 bitset,
> +                 int clockrt, u32 __user *uaddr2)
> +{
> +    struct hrtimer_sleeper timeout, *to = NULL;
> +    struct rt_mutex_waiter rt_waiter;
> +    struct restart_block *restart;
> +    struct futex_hash_bucket *hb;
> +    struct rt_mutex *pi_mutex;
> +    union futex_key key2;
> +    struct futex_q q;
> +    u32 uval;
> +    int ret;
> +
> +    if (!bitset)
> +        return -EINVAL;
> +
> +    if (abs_time) {
> +        to = &timeout;
> +        hrtimer_init_on_stack(&to->timer, clockrt ? CLOCK_REALTIME :
> +                      CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
> +        hrtimer_init_sleeper(to, current);
> +        hrtimer_set_expires_range_ns(&to->timer, *abs_time,
> +                         current->timer_slack_ns);
> +    }
> +
> +    /*
> +     * The waiter is allocated on our stack, manipulated by the requeue
> +     * code while we sleep on uaddr.
> +     */
> +    debug_rt_mutex_init_waiter(&rt_waiter);
> +    rt_waiter.task = NULL;
> +
> +    q.pi_state = NULL;
> +    q.bitset = bitset;
> +    q.rt_waiter = &rt_waiter;
> +
> +retry:
> +    q.key = FUTEX_KEY_INIT;
> +    ret = get_futex_key(uaddr, fshared, &q.key);
> +    if (unlikely(ret != 0))
> +        goto out;
> +
> +    key2 = FUTEX_KEY_INIT;
> +    ret = get_futex_key(uaddr2, fshared, &key2);
> +    if (unlikely(ret != 0)) {
> +        put_futex_key(fshared, &q.key);
> +        goto out;
> +    }
> +
> +    hb = queue_lock(&q);
> +
> +    /*
> +     * Access the page AFTER the hash-bucket is locked.
> +     * Order is important:
> +     *
> +     *   Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
> +     *   Userspace waker:  if (cond(var)) { var = new; futex_wake(&var); }
> +     *
> +     * The basic logical guarantee of a futex is that it blocks ONLY
> +     * if cond(var) is known to be true at the time of blocking, for
> +     * any cond.  If we queued after testing *uaddr, that would open
> +     * a race condition where we could block indefinitely with
> +     * cond(var) false, which would violate the guarantee.
> +     *
> +     * A consequence is that futex_wait() can return zero and absorb
> +     * a wakeup when *uaddr != val on entry to the syscall.  This is
> +     * rare, but normal.
> +     */
> +    ret = get_futex_value_locked(&uval, uaddr);
> +
> +    if (unlikely(ret)) {
> +        queue_unlock(&q, hb);
> +        put_futex_key(fshared, &q.key);
> +        put_futex_key(fshared, &key2);
> +
> +        ret = get_user(uval, uaddr);
> +        if (!ret)
> +            goto retry;
> +        goto out;
> +    }
> +
> +    /* Only actually queue if *uaddr contained val.  */
> +    ret = -EWOULDBLOCK;
> +    if (uval != val) {
> +        queue_unlock(&q, hb);
> +        put_futex_key(fshared, &q.key);
> +        put_futex_key(fshared, &key2);
> +        goto out;
> +    }
> +
> +    /* Queue the futex_q, drop the hb lock, wait for wakeup. */
> +    futex_wait_queue_me(hb, &q, to);
> +
> +    /*
> +     * Ensure the requeue is atomic to avoid races while we process the
> +     * wakeup.  We only need to hold hb->lock to ensure atomicity as the
> +     * wakeup code can't change q.key from uaddr to uaddr2 if we hold that
> +     * lock. It can't be requeued from uaddr2 to something else since we
> +     * don't support a PI aware source futex for requeue.
> +     */
> +    spin_lock(&hb->lock);
> +    if (!match_futex(&q.key, &key2)) {
> +        WARN_ON(q.lock_ptr && (&hb->lock != q.lock_ptr));
> +        /*
> +         * We were not requeued, handle wakeup from futex1 (uaddr).  We
> +         * cannot have been unqueued and already hold the lock, no need
> +         * to call unqueue_me, just do it directly.
> +         */
> +        plist_del(&q.list, &q.list.plist);
> +        drop_futex_key_refs(&q.key);
> +
> +        ret = -ETIMEDOUT;
> +        if (to && !to->task) {
> +            spin_unlock(&hb->lock);
> +            goto out_put_keys;
> +        }
> +
> +        /*
> +         * We expect signal_pending(current), but another thread may
> +         * have handled it for us already.
> +         */
> +        ret = -ERESTARTSYS;
> +        if (!abs_time) {
> +            spin_unlock(&hb->lock);
> +            goto out_put_keys;
> +        }
> +
> +        restart = &current_thread_info()->restart_block;
> +        restart->fn = futex_wait_requeue_pi_restart;
> +        restart->futex.uaddr = (u32 *)uaddr;
> +        restart->futex.val = val;
> +        restart->futex.time = abs_time->tv64;
> +        restart->futex.bitset = bitset;
> +        restart->futex.flags = 0;
> +        restart->futex.uaddr2 = (u32 *)uaddr2;
> +        restart->futex.flags = FLAGS_HAS_TIMEOUT;
> +
> +        if (fshared)
> +            restart->futex.flags |= FLAGS_SHARED;
> +        if (clockrt)
> +            restart->futex.flags |= FLAGS_CLOCKRT;
> +
> +        ret = -ERESTART_RESTARTBLOCK;
> +
> +        spin_unlock(&hb->lock);
> +        goto out_put_keys;
> +    }
> +    spin_unlock(&hb->lock);
> +
> +    ret = 0;
> +    /*
> +     * Check if the waker acquired the second futex for us. If the lock_ptr
> +     * is NULL, but our key is key2, then the requeue target futex was
> +     * uncontended and the waker gave it to us.  This is safe without a lock
> +     * as futex_requeue() will not release the hb lock until after it's
> +     * nulled the lock_ptr and removed us from the hb.
> +     */
> +    if (!q.lock_ptr)
> +        goto out_put_keys;
> +
> +    /*
> +     * At this point we have been requeued.  We have been woken up by
> +     * futex_unlock_pi(), a timeout, or a signal, but not futex_requeue().
> +     * futex_unlock_pi() will not destroy the lock_ptr nor the pi_state.
> +     */
> +    WARN_ON(!q.pi_state);
> +    pi_mutex = &q.pi_state->pi_mutex;
> +    ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
> +    debug_rt_mutex_free_waiter(&rt_waiter);
> +
> +    spin_lock(q.lock_ptr);
> +    ret = finish_futex_lock_pi(uaddr, fshared, &q, ret);
> +
> +    /* Unqueue and drop the lock. */
> +    unqueue_me_pi(&q);
> +
> +    /*
> +     * If fixup_pi_state_owner() faulted and was unable to handle the
> +     * fault, unlock it and return the fault to userspace.
> +     */
> +    if (ret == -EFAULT) {
> +        if (rt_mutex_owner(pi_mutex) == current)
> +            rt_mutex_unlock(pi_mutex);
> +    } else if (ret == -EINTR) {
> +        if (get_user(uval, uaddr2)) {
> +            ret = -EFAULT;
> +            goto out_put_keys;
> +        }
> +
> +        /*
> +         * We've already been requeued, so restart by calling
> +         * futex_lock_pi() directly, rather than returning to this
> +         * function.
> +         */
> +        restart = &current_thread_info()->restart_block;
> +        restart->fn = futex_lock_pi_restart;
> +        restart->futex.uaddr = (u32 *)uaddr2;
> +        restart->futex.val = uval;
> +        restart->futex.flags = 0;
> +        if (abs_time) {
> +            restart->futex.flags |= FLAGS_HAS_TIMEOUT;
> +            restart->futex.time = abs_time->tv64;
> +        }
> +
> +        if (fshared)
> +            restart->futex.flags |= FLAGS_SHARED;
> +        if (clockrt)
> +            restart->futex.flags |= FLAGS_CLOCKRT;
> +        ret = -ERESTART_RESTARTBLOCK;
> +    }
> +
> +out_put_keys:
> +    put_futex_key(fshared, &q.key);
> +    put_futex_key(fshared, &key2);
> +
> +out:
> +    if (to) {
> +        hrtimer_cancel(&to->timer);
> +        destroy_hrtimer_on_stack(&to->timer);
> +    }
> +    return ret;
> +}
> +
> +static long futex_wait_requeue_pi_restart(struct restart_block *restart)
> +{
> +    u32 __user *uaddr = (u32 __user *)restart->futex.uaddr;
> +    u32 __user *uaddr2 = (u32 __user *)restart->futex.uaddr2;
> +    int fshared = restart->futex.flags & FLAGS_SHARED;
> +    int clockrt = restart->futex.flags & FLAGS_CLOCKRT;
> +    ktime_t t, *tp = NULL;
> +
> +    if (restart->futex.flags & FLAGS_HAS_TIMEOUT) {
> +        t.tv64 = restart->futex.time;
> +        tp = &t;
> +    }
> +    restart->fn = do_no_restart_syscall;
> +
> +    return (long)futex_wait_requeue_pi(uaddr, fshared, restart->futex.val,
> +                       tp, restart->futex.bitset, clockrt,
> +                       uaddr2);
> +}
> +
> /*
>  * Support for robust futexes: the kernel cleans up held futexes at
>  * thread exit time.
> @@ -1994,7 +2474,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
>         fshared = 1;
> 
>     clockrt = op & FUTEX_CLOCK_REALTIME;
> -    if (clockrt && cmd != FUTEX_WAIT_BITSET)
> +    if (clockrt && cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI)
>         return -ENOSYS;
> 
>     switch (cmd) {
> @@ -2009,10 +2489,11 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
>         ret = futex_wake(uaddr, fshared, val, val3);
>         break;
>     case FUTEX_REQUEUE:
> -        ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL);
> +        ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL, 0);
>         break;
>     case FUTEX_CMP_REQUEUE:
> -        ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3);
> +        ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3,
> +                    0);
>         break;
>     case FUTEX_WAKE_OP:
>         ret = futex_wake_op(uaddr, fshared, uaddr2, val, val2, val3);
> @@ -2029,6 +2510,18 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
>         if (futex_cmpxchg_enabled)
>             ret = futex_lock_pi(uaddr, fshared, 0, timeout, 1);
>         break;
> +    case FUTEX_WAIT_REQUEUE_PI:
> +        val3 = FUTEX_BITSET_MATCH_ANY;
> +        ret = futex_wait_requeue_pi(uaddr, fshared, val, timeout, val3,
> +                        clockrt, uaddr2);
> +        break;
> +    case FUTEX_REQUEUE_PI:
> +        ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL, 1);
> +        break;
> +    case FUTEX_CMP_REQUEUE_PI:
> +        ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3,
> +                    1);
> +        break;
>     default:
>         ret = -ENOSYS;
>     }
> @@ -2046,7 +2539,8 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
>     int cmd = op & FUTEX_CMD_MASK;
> 
>     if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
> -              cmd == FUTEX_WAIT_BITSET)) {
> +              cmd == FUTEX_WAIT_BITSET ||
> +              cmd == FUTEX_WAIT_REQUEUE_PI)) {
>         if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
>             return -EFAULT;
>         if (!timespec_valid(&ts))
> @@ -2058,10 +2552,11 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
>         tp = &t;
>     }
>     /*
> -     * requeue parameter in 'utime' if cmd == FUTEX_REQUEUE.
> +     * requeue parameter in 'utime' if cmd == FUTEX_*_REQUEUE_*.
>      * number of waiters to wake in 'utime' if cmd == FUTEX_WAKE_OP.
>      */
>     if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE ||
> +        cmd == FUTEX_REQUEUE_PI || cmd == FUTEX_CMP_REQUEUE_PI ||
>         cmd == FUTEX_WAKE_OP)
>         val2 = (u32) (unsigned long) utime;
> 
> 
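For readers following the wakeup handling in futex_wait_requeue_pi(), the
decision made right after futex_wait_queue_me() returns can be modelled in
plain userspace C. This is a rough sketch of the control flow in the patch,
not kernel code; the enum, struct, and function names are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical labels for the paths taken after the waiter wakes up. */
enum wake_outcome {
    WAKE_TIMED_OUT,          /* not requeued, timer fired: -ETIMEDOUT */
    WAKE_RESTARTSYS,         /* not requeued, no timeout: -ERESTARTSYS */
    WAKE_RESTART_WAIT,       /* not requeued, timeout pending: restart block */
    WAKE_LOCK_HANDED_OVER,   /* requeue side took uaddr2 for us: return 0 */
    WAKE_FINISH_PROXY_LOCK   /* requeued: go on to rt_mutex_finish_proxy_lock() */
};

/* Snapshot of the state the function inspects (field names are assumed). */
struct wake_state {
    bool requeued;       /* q.key matches key2 */
    bool lock_ptr_null;  /* q.lock_ptr == NULL */
    bool has_timeout;    /* abs_time != NULL */
    bool timer_fired;    /* to && !to->task */
};

static enum wake_outcome classify_wakeup(const struct wake_state *s)
{
    if (!s->requeued) {
        /* Still on uaddr: only a timeout or a signal can have woken us. */
        if (s->has_timeout && s->timer_fired)
            return WAKE_TIMED_OUT;
        if (!s->has_timeout)
            return WAKE_RESTARTSYS;
        return WAKE_RESTART_WAIT;
    }
    /* Requeued to uaddr2: either the lock was handed over or we must finish. */
    if (s->lock_ptr_null)
        return WAKE_LOCK_HANDED_OVER;
    return WAKE_FINISH_PROXY_LOCK;
}
```

The last case is where the signal and timeout outcomes of the rt_mutex
acquisition (cases 5 through 8 in the function comment) are then handled.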
> 


-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team