[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <4AB71F95.1030008@us.ibm.com>
Date: Sun, 20 Sep 2009 23:39:17 -0700
From: Darren Hart <dvhltc@...ibm.com>
To: Thomas Gleixner <tglx@...utronix.de>
CC: "lkml, " <linux-kernel@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
Steven Rostedt <rostedt@...dmis.org>,
Ingo Molnar <mingo@...e.hu>,
Eric Dumazet <eric.dumazet@...il.com>,
Dinakar Guniguntala <dino@...ibm.com>,
John Stultz <johnstul@...ibm.com>
Subject: Re: futex: wakeup race and futex_q woken state definition
Thomas Gleixner wrote:
> On Thu, 17 Sep 2009, Darren Hart wrote:
>>> /*
>>> * !plist_node_empty() is safe here without any lock.
>>> * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
>>> */
>>> if (likely(!plist_node_empty(&q->list))) {
>>>
>>> If we move set_current_state() before the queue_me() this check is
>>> still an optimization to avoid the schedule call in case we have been
>>> woken up already. But the comment is still wrong as the wakeup code
>>> has changed:
>>>
>>> The old version did:
>>>
>>> plist_del(&q->list);
>>> wake_up_all(&q->waiters);
>>> q->lock_ptr = NULL;
>>>
>>> Today we do:
>>>
>>> p = q->task;
>>> get_task_struct(p);
>>> plist_del(&q->list);
>>> q->lock_ptr = NULL;
>>> wake_up_state(p);
>>> put_task_struct(p);
>>>
>>> We changed this because it makes no sense to use a waitqueue for a
>>> single task.
>> Right.
>>
>>
>> However, my bigger concern still remains. If the above is only an
>> optimization, we appear to have a race with wakeup where we can see a
>> non-empty list here and decide to schedule and have the wakeup code remove us
>> from the list, hiding it from all future futex related wakeups (signal and
>> timeout would still work).
>
> No.
>
> Sleeper does:
>
> set_current_state(TASK_INTERRUPTIBLE);
>
> if (!plist_empty())
> schedule();
>
> So when the list removal happened before set_current_state() we don't
> schedule. If the wakeup happens _after_ set_current_state() then the
> wake_up_state() call will bring us back to running.
>
>> We have also been seeing a race with the requeue_pi code with a JVM benchmark
>> where the apparent owner of the pi mutex remains blocked on the condvar - this
>> can be explained by the race I'm suspecting. Also, futex_requeue_pi() is
>> using futex_wait_queue_me() which expects the waker to remove the futex_q from
>> the list, which isn't how things work for PI mutexes. In an experiment, I
>> moved the spin_unlock() out of queueme() and right before the call to
>> schedule() to narrow the race window, and the hang we were experiencing
>> appears to have gone away.
>
> The correct thing to do is to move set_current_state() before queue_me().
>
Ah yes, you are correct of course. Since PI futexes do not use
plist_node_empty() to test for wakeup, the setting of TASK_ITNERRUPTIBLE
after the queue_me() sets the stage to call schedule() with the wrong
task state and lose the task forever. I have included this in my
current patch queue. We are running our tests to confirm the fix and
I'll submit the series for inclusion by tomorrow.
Thanks,
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists