linux-kernel - Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <52BA4D4C.6070709@hp.com>
Date:	Tue, 24 Dec 2013 22:13:16 -0500
From:	Waiman Long <waiman.long@...com>
To:	Davidlohr Bueso <davidlohr@...com>
CC:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...hat.com>,
	Darren Hart <dvhart@...ux.intel.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Paul McKenney <paulmck@...ux.vnet.ibm.com>,
	Mike Galbraith <efault@....de>, Jeff Mahoney <jeffm@...e.com>,
	Jason Low <jason.low2@...com>, Tom Vaden <tom.vaden@...com>,
	"Norton, Scott J" <scott.norton@...com>,
	"Chandramouleeswaran, Aswin" <aswin@...com>,
	Ingo Molnar <mingo@...nel.org>
Subject: Re: [PATCH v3 4/4] futex: Avoid taking hb lock if nothing to wakeup

On 12/20/2013 08:36 PM, Davidlohr Bueso wrote:
> On Thu, 2013-12-19 at 15:14 -0800, Linus Torvalds wrote:
>> On Thu, Dec 19, 2013 at 10:45 AM, Davidlohr Bueso<davidlohr@...com>  wrote:
>>> - increment the counter at queue_lock() as we always end up calling
>>>    queue_me() which adds the element to the list. Upon any error,
>>>    queue_unlock() is called for housekeeping, for which we decrement
>>>    to mach the increment done in queue_lock().
>>>
>>> - decrement the counter at __unqueue_me() to reflect when an element is
>>>    removed from the queue for wakeup related purposes.
>> I still hate this whole separate counter thing. It seems really annoying.
>>
>> If re-ordering things didn't work out, then why can't just the counter
>> we *already* have in the spinlock itself work as the counter? Your
>> counter update logic seems to basically match when you take the
>> spinlock anyway.
> So the following has passed all testing, just like the atomics variant.
> Thoughts?
>
> Thanks,
> Davidlohr
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index fcc6850..c8c7ce5 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -73,19 +73,22 @@
>    * Basic futex operation and ordering guarantees:
>    *
>    * The waiter reads the futex value in user space and calls
> - * futex_wait(). This function computes the hash bucket and acquires
> - * the hash bucket lock. After that it reads the futex user space value
> - * again and verifies that the data has not changed. If it has not
> - * changed it enqueues itself into the hash bucket, releases the hash
> + * futex_wait(). It computes the hash bucket and acquires the hash
> + * bucket lock. After that it reads the futex user space value again
> + * and verifies that the data has not changed. If it has not changed
> + * it enqueues itself into the hash bucket, releases the hash
>    * bucket lock and schedules.
>    *
>    * The waker side modifies the user space value of the futex and calls
> - * futex_wake(). This functions computes the hash bucket and acquires
> - * the hash bucket lock. Then it looks for waiters on that futex in the
> - * hash bucket and wakes them.
> + * futex_wake(). It computes the hash bucket and acquires the hash
> + * bucket lock. Then it looks for waiters on that futex in the hash
> + * bucket and wakes them.
>    *
> - * Note that the spin_lock serializes waiters and wakers, so that the
> - * following scenario is avoided:
> + * In scenarios where wakeups are called and no tasks are blocked on a futex,
> + * taking the hb spinlock can be avoided and simply return. In order for this
> + * optimization to work, ordering guarantees must exist so that the waiter
> + * being added to the list is acknowledged when the list is concurrently being
> + * checked by the waker, avoiding scenarios like the following:
>    *
>    * CPU 0                               CPU 1
>    * val = *futex;
> @@ -106,24 +109,50 @@
>    * This would cause the waiter on CPU 0 to wait forever because it
>    * missed the transition of the user space value from val to newval
>    * and the waker did not find the waiter in the hash bucket queue.
> - * The spinlock serializes that:
> + *
> + * The correct serialization ensures that a waiter either observes
> + * the changed user space value before blocking or is woken by a
> + * concurrent waker:
>    *
>    * CPU 0                               CPU 1
>    * val = *futex;
>    * sys_futex(WAIT, futex, val);
>    *   futex_wait(futex, val);
> - *   lock(hash_bucket(futex));
> - *   uval = *futex;
> - *                                     *futex = newval;
> - *                                     sys_futex(WAKE, futex);
> - *                                       futex_wake(futex);
> - *                                       lock(hash_bucket(futex));
> + *
> + *   waiters++;
> + *   mb(); (A)<-- paired with -.
> + *                              |
> + *   lock(hash_bucket(futex));  |
> + *                              |
> + *   uval = *futex;             |
> + *                              |        *futex = newval;
> + *                              |        sys_futex(WAKE, futex);
> + *                              |          futex_wake(futex);
> + *                              |
> + *                              `------->    mb(); (B)

Checking the state of the spinlock counter isn't the same as 
incrementing a waiter count. So your pseudo code here is misleading. See 
further explanation below.

>    *   if (uval == val)
> - *      queue();
> + *     queue();
>    *     unlock(hash_bucket(futex));
> - *     schedule();                       if (!queue_empty())
> - *                                         wake_waiters(futex);
> - *                                       unlock(hash_bucket(futex));
> + *     schedule();                         if (waiters)
> + *                                           lock(hash_bucket(futex));
> + *                                           wake_waiters(futex);
> + *                                           unlock(hash_bucket(futex));
> + *
> + * Where (A) orders the waiters increment and the futex value read -- this
> + * is guaranteed by the head counter in the hb spinlock; and where (B)
> + * orders the write to futex and the waiters read.
> + *
> + * This yields the following case (where X:=waiters, Y:=futex):
> + *
> + *	X = Y = 0
> + *
> + *	w[X]=1		w[Y]=1
> + *	MB		MB
> + *	r[Y]=y		r[X]=x
> + *
> + * Which guarantees that x==0&&  y==0 is impossible; which translates back into
> + * the guarantee that we cannot both miss the futex variable change and the
> + * enqueue.
>    */
>
>   int __read_mostly futex_cmpxchg_enabled;
> @@ -211,6 +240,35 @@ static unsigned long __read_mostly futex_hashsize;
>
>   static struct futex_hash_bucket *futex_queues;
>
> +static inline void futex_get_mm(union futex_key *key)
> +{
> +	atomic_inc(&key->private.mm->mm_count);
> +#ifdef CONFIG_SMP
> +	/*
> +	 * Ensure futex_get_mm() implies a full barrier such that
> +	 * get_futex_key() implies a full barrier. This is relied upon
> +	 * as full barrier (B), see the ordering comment above.
> +	 */
> +	smp_mb__after_atomic_inc();
> +#endif
> +}
> +
> +static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
> +{
> +#ifdef CONFIG_SMP
> +	/*
> +	 * If the hash bucket is locked then we know the ticket counter
> +	 * is non-zero and thus there is at least one waiter in the queue.
> +	 */
> +	if (spin_is_locked(&hb->lock))
> +		return true;
> +	smp_rmb(); /* Make sure we check the lock state first */
> +	return !plist_head_empty(&hb->chain);
> +#else
> +	return true;
> +#endif
> +}

The ticket spinlock counter is a cyclic counter that can cycle through 0 
periodically. So the zero-ness of the counter has no relation to whether 
it is locked or not.  Your comment above is not correct. What 
spin_is_locked() can tell you is whether one or more tasks are trying to 
get into the critical section which can be a waiter (most likely) or a 
waker. Coupled with checking if the list is empty, that could be a 
cheaper alternative to using a separate atomic counter, but it is also 
slightly less reliable and has a higher chance of false positive.

-Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/