[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+55aFzP60Ucn0snFchHNw2tOnS6HYp-6n+f+9nxrbTPw8fTfQ@mail.gmail.com>
Date: Thu, 20 Mar 2014 09:41:31 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Davidlohr Bueso <davidlohr@...com>
Cc: Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...nel.org>,
LKML <linux-kernel@...r.kernel.org>,
ppc-dev <linuxppc-dev@...ts.ozlabs.org>,
Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Paul Mackerras <paulus@...ba.org>,
Paul McKenney <paulmck@...ux.vnet.ibm.com>
Subject: Re: Tasks stuck in futex code (in 3.14-rc6)
On Wed, Mar 19, 2014 at 10:56 PM, Davidlohr Bueso <davidlohr@...com> wrote:
>
> This problem suggests that we missed a wakeup for a task that was adding
> itself to the queue in a wait path. And the only place that can happen
> is with the hb spinlock check for any pending waiters.
Ok, so thinking about hb_waiters_pending():
- spin_is_locked() test
- read barrier
- plist_head_empty() test
seems to be broken after all.
The race is against futex_wait() that does
- futex_wait_setup():
- queue_lock()
- get_futex_value_locked()
- futex_wait_queue_me()
- queue_me()
- plist_add()
right?
It strikes me that the "spin_is_locked()" test has no barriers wrt the
writing of the new futex value on the wake path. And the read barrier
obviously does nothing wrt the write either. Or am I missing
something? So the write that actually released the futex might be
almost arbitrarily delayed on the waking side. So the waiting side may
not see the new value, even though the waker assumes it does due to
the ordering of it doing the write first.
So maybe we need a memory barrier in hb_waiters_pending() just to make
sure that the new value is written.
But at that point, I suspect that Davidlohrs original patch that just
had explicit waiting counts is just as well. The whole point with the
head empty test was to emulate that "do we have waiters" without
having to actually add the atomics, but a memory barrier is really no
worse.
The attached is a TOTALLY UNTESTED interdiff that adds back Davidlohrs
atomic count. It may be terminally broken, I literally didn't test it
at all.
Davidlohr, mind checking and correcting this?
Linus
View attachment "patch.diff" of type "text/plain" (3576 bytes)
Powered by blists - more mailing lists