linux-kernel - Re: futex: clarification needed with drop_futex_key

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160329095039.GE3408@twins.programming.kicks-ass.net>
Date:	Tue, 29 Mar 2016 11:50:39 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Mike Galbraith <umgwanakikbuti@...il.com>
Cc:	Petros Koutoupis <petros@...roskoutoupis.com>,
	linux-kernel@...r.kernel.org,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Davidlohr Bueso <dave@...olabs.net>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	catalin.marinas@....com, "H. Peter Anvin" <hpa@...or.com>
Subject: Re: futex: clarification needed with drop_futex_key_refs and memory
 barriers

On Sun, Mar 27, 2016 at 08:25:48AM +0200, Mike Galbraith wrote:
> (futex ordering pop-flare)
> 
> On Sat, 2016-03-26 at 10:56 -0500, Petros Koutoupis wrote:
> > I stumbled on an interesting scenario which I am unable to fully explain and I
> > was hoping to get some other opinions on why this would or wouldn't work.
> > 
> > In recent testing on a 48-core Haswell arch server, our multi-threaded user space
> > application was utilizing 60% to 100% more CPU than on our smaller 24-core servers
> > (running an identical load). After spending a considerable amount of time analyzing
> > stack dumps and straces it became immediately apparent that those exact threads
> > operating with the higher CPU utilization were off in futex land.

perf should be able to tell you that pretty quickly, no?

A question about your workload; are you typically 'stuck' in
futex_wait() or some of the requeue stuff (which is more complex). Let
me assume regular futex_wait() for now.

> > Shortly afterward I stumbled on commit 76835b0ebf8a7fe85beb03c75121419a7dec52f0
> > (futex: Ensure get_futex_key_refs() always implies a barrier) which addressed the
> > handling of private futexes and preventing a race condition by completing the
> > function with a memory barrier. Now, I fully understand why this patch was implemented:
> > to have a memory barrier before checking the "waiters." It makes sense. 

Right; well, a barrier isn't _before_ something, its between two things.
And I think we're ordering the user-space store to the futex value
against the waiters load, but the code could do with a comment here.

And this should be totally irrelevant for x86 because a CPL change
already implies a full barrier.

> > What doesn't
> > make sense (so far) is when I apply the same patch to the drop counterpart,
> > drop_futex_key_refs(), and the problem goes away. See the change and my notes below.
> > 
> > 
> > --- linux/kernel/futex.c.orig   2016-03-25 19:45:08.169563263 -0500
> > +++ linux/kernel/futex.c        2016-03-25 19:46:06.901562211 -0500
> > @@ -438,11 +438,13 @@ static void drop_futex_key_refs(union fu
> > 
> >         switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
> >         case FUT_OFF_INODE:
> > -               iput(key->shared.inode);
> > +               iput(key->shared.inode); /* implies smp_mb(); (B) */
> >                 break;
> >         case FUT_OFF_MMSHARED:
> > -               mmdrop(key->private.mm);
> > +               mmdrop(key->private.mm); /* implies smp_mb(); (B) */
> >                 break;
> > +       default:
> > +               smp_mb(); /* explicit smp_mb(); (B) */
> >         }
> >  }

So the patch makes sense from a symmetry POV, but I'm having a hard time
explaining why this would make any difference to your workload.

The best I can come up with is that the explicit MFENCE (as opposed to a
LOCK prefixed instruction) causes a store-buffer flush and hence the
waker more often sees !waiters and therefore tries to acquire the bucket
lock less.

But this is not a correctness (nor ordering) issue; but purely an
architectural side-effect. Furthermore; some proposed changes:

  http://marc.info/?l=linux-kernel&m=145400059704564&w=2

might change this side-effect.


In any case; the below (completely irrelevant patch for you) is
something I would propose. It gives hb_waiter_dec() RELEASE like
semantics and ensures it cannot creep into the lock sections its
typically behind. Although strictly speaking I think it being inside
that lock region is sufficient.

It also re-orders the increment in requeue to happen before we add to
the list (as is the proper order) and removes a totally bogus comment;
spin_lock() does _NOT_ imply a full barrier.

---
 kernel/futex.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 5d6ce6413ef1..615f277f313b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -360,6 +360,7 @@ static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
 static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
 {
 #ifdef CONFIG_SMP
+	smp_mb__before_atomic();
 	atomic_dec(&hb->waiters);
 #endif
 }
@@ -1442,8 +1443,8 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
 	if (likely(&hb1->chain != &hb2->chain)) {
 		plist_del(&q->list, &hb1->chain);
 		hb_waiters_dec(hb1);
-		plist_add(&q->list, &hb2->chain);
 		hb_waiters_inc(hb2);
+		plist_add(&q->list, &hb2->chain);
 		q->lock_ptr = &hb2->lock;
 	}
 	get_futex_key_refs(key2);
@@ -1864,7 +1865,7 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 
 	q->lock_ptr = &hb->lock;
 
-	spin_lock(&hb->lock); /* implies MB (A) */
+	spin_lock(&hb->lock);
 	return hb;
 }