[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1860555101.1890107.1396990941681.JavaMail.zimbra@redhat.com>
Date: Tue, 8 Apr 2014 17:02:21 -0400 (EDT)
From: Jan Stancek <jstancek@...hat.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Peter Zijlstra <peterz@...radead.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
Davidlohr Bueso <davidlohr@...com>,
Ingo Molnar <mingo@...nel.org>,
Larry Woodman <lwoodman@...hat.com>
Subject: Re: [PATCH] futex: avoid race between requeue and wake
----- Original Message -----
> From: "Linus Torvalds" <torvalds@...ux-foundation.org>
> To: "Peter Zijlstra" <peterz@...radead.org>
> Cc: "Jan Stancek" <jstancek@...hat.com>, "Linux Kernel Mailing List" <linux-kernel@...r.kernel.org>, "Srikar
> Dronamraju" <srikar@...ux.vnet.ibm.com>, "Davidlohr Bueso" <davidlohr@...com>, "Ingo Molnar" <mingo@...nel.org>,
> "Larry Woodman" <lwoodman@...hat.com>
> Sent: Tuesday, 8 April, 2014 8:53:47 PM
> Subject: Re: [PATCH] futex: avoid race between requeue and wake
>
> On Tue, Apr 8, 2014 at 11:13 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> > On Tue, Apr 08, 2014 at 10:33:55AM -0700, Linus Torvalds wrote:
> >> Hmm. And let's think about that powerpc race, where "spin_is_locked()"
> >> can be false when somebody is waiting to get the lock..
> >
> > Right; so in the original discussion I never really got to why that is a
> > problem. A pending waiter cannot modify the wait list, so either we see
> > the current holder, or we see a stable list.
>
> It's not just the list vs lock, it's also the ordering wrt the actual
> futex value accesses. I have to say, the "waiters" approach makes
> things a lot simpler, because then the main ordering issues are just
> between "waiters and futex value".
>
> The spinlock+waitqueue thing is rather more complex, because the
> spinlock is gotten before reading the futex value (on the waiting
> side), but *after* writing the futex value (on the waking side). While
> the waitqueue is then modified *after* reading the futex value (on
> both the waiting and waking side).
>
> Now, on x86, my argument for using the "spinlock + wait queue"
> combination was that getting the spinlock actually has the exact same
> semantics as the atomic waiters update (because it really *is* an
> atomic add), and while the spin-unlock destroys it, we can largely
> ignore that because by that time we'd have been guaranteed to see the
> wait-queue change too.
>
> But even on x86, I was a *bit* worried about the ordering of the
> "spin_is_locked()" read and the waitqueue value read. I thought about
> it, and the smp_rmb() we put in place should have been safe, but I
> have to admit that I was a bit nervous. It gets really hard to think
> about these things.
>
> So while I still think it should all work, I have to admit that I
> didn't really mind horribly much going back to the explicit "waiters"
> model that had a lot simpler ordering: just between the "waiters"
> accesses and the "futex value" accesses, and without that combination
> of "before and after" thing.
>
> Which is also why I think I'd prefer to make the requeueing case have
> the *same* rules, ie "waiters" needs to be incremented *before*
> getting the spinlock.
>
> A simple patch that adds a "hb_waiters_inc(&hb2);" to the top of
> double_lock_hb(), and a "hb_waiters_decc(&hb2);" to double_unlock_hb()
> would seem to be the simplest solution. But that adds some unnecessary
> atomics (in particular, to the futex_wake_op() case too, which doesn't
> need it at all, it's just futex_requeue() that needs it, afaik).
>
> But that alternative patch might be worth testing, just to verify that
> "yes, doing the hb_waiters_inc before getting the lock really does fix
> Jan's case *without* the use spin_is_locked()". Jan? Can you try that?
I ran reproducer with following change on s390x system, where this
can be reproduced usually within seconds:
diff --git a/kernel/futex.c b/kernel/futex.c
index 67dacaf..9150ffd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1095,6 +1095,7 @@ static int unlock_futex_pi(u32 __user *uaddr, u32 uval)
static inline void
double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
{
+ hb_waiters_inc(hb2);
if (hb1 <= hb2) {
spin_lock(&hb1->lock);
if (hb1 < hb2)
@@ -1111,6 +1112,7 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
spin_unlock(&hb1->lock);
if (hb1 != hb2)
spin_unlock(&hb2->lock);
+ hb_waiters_dec(hb2);
}
/*
Reproducer is running without failures over an hour now and
made ~1.4 million iterations.
Regards,
Jan
>
> Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists