linux-kernel - Re: [PATCH] futex: avoid race between requeue and wake

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CA+55aFz+yMdXfsb-t3JzDhj6=M=-ggjDDGAQSbtfEcNhgyoopA@mail.gmail.com>
Date:	Tue, 8 Apr 2014 11:53:47 -0700
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Jan Stancek <jstancek@...hat.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
	Davidlohr Bueso <davidlohr@...com>,
	Ingo Molnar <mingo@...nel.org>,
	Larry Woodman <lwoodman@...hat.com>
Subject: Re: [PATCH] futex: avoid race between requeue and wake

On Tue, Apr 8, 2014 at 11:13 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> On Tue, Apr 08, 2014 at 10:33:55AM -0700, Linus Torvalds wrote:
>> Hmm. And let's think about that powerpc race, where "spin_is_locked()"
>> can be false when somebody is waiting to get the lock..
>
> Right; so in the original discussion I never really got to why that is a
> problem. A pending waiter cannot modify the wait list, so either we see
> the current holder, or we see a stable list.

It's not just the list vs lock, it's also the ordering wrt the actual
futex value accesses. I have to say, the "waiters" approach makes
things a lot simpler, because then the main ordering issues are just
between "waiters and futex value".

The spinlock+waitqueue thing is rather more complex, because the
spinlock is gotten before reading the futex value (on the waiting
side), but *after* writing the futex value (on the waking side). While
the waitqueue is then modified *after* reading the futex value (on
both the waiting and waking side).

Now, on x86, my argument for using the "spinlock + wait queue"
combination was that getting the spinlock actually has the exact same
semantics as the atomic waiters update (because it really *is* an
atomic add), and while the spin-unlock destroys it, we can largely
ignore that because by that time we'd have been guaranteed to see the
wait-queue change too.

But even on x86, I was a *bit* worried about the ordering of the
"spin_is_locked()" read and the waitqueue value read.  I thought about
it, and the smp_rmb() we put in place should have been safe, but I
have to admit that I was a bit nervous. It gets really hard to think
about these things.

So while I still think it should all work, I have to admit that I
didn't really mind horribly much going back to the explicit "waiters"
model that had a lot simpler ordering: just between the "waiters"
accesses and the "futex value" accesses, and without that combination
of "before and after" thing.

Which is also why I think I'd prefer to make the requeueing case have
the *same* rules, ie "waiters" needs to be incremented *before*
getting the spinlock.

A simple patch that adds a "hb_waiters_inc(&hb2);" to the top of
double_lock_hb(), and a "hb_waiters_decc(&hb2);" to double_unlock_hb()
would seem to be the simplest solution. But that adds some unnecessary
atomics (in particular, to the futex_wake_op() case too, which doesn't
need it at all, it's just futex_requeue() that needs it, afaik).

But that alternative patch might be worth testing, just to verify that
"yes, doing the hb_waiters_inc before getting the lock really does fix
Jan's case *without* the use spin_is_locked()". Jan? Can you try that?

                 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/