linux-kernel - Re: futex: wakeup race and futex

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <4AB71F95.1030008@us.ibm.com>
Date:	Sun, 20 Sep 2009 23:39:17 -0700
From:	Darren Hart <dvhltc@...ibm.com>
To:	Thomas Gleixner <tglx@...utronix.de>
CC:	"lkml, " <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ingo Molnar <mingo@...e.hu>,
	Eric Dumazet <eric.dumazet@...il.com>,
	Dinakar Guniguntala <dino@...ibm.com>,
	John Stultz <johnstul@...ibm.com>
Subject: Re: futex: wakeup race and futex_q woken state definition

Thomas Gleixner wrote:
> On Thu, 17 Sep 2009, Darren Hart wrote:
>>> 	/*
>>> 	 * !plist_node_empty() is safe here without any lock.
>>> 	 * q.lock_ptr != 0 is not safe, because of ordering against wakeup.
>>> 	 */
>>> 	if (likely(!plist_node_empty(&q->list))) {
>>>
>>> If we move set_current_state() before the queue_me() this check is
>>> still an optimization to avoid the schedule call in case we have been
>>> woken up already. But the comment is still wrong as the wakeup code
>>> has changed:
>>>
>>> The old version did:
>>>
>>>      plist_del(&q->list);
>>>      wake_up_all(&q->waiters);
>>>      q->lock_ptr = NULL;
>>>
>>> Today we do:
>>>
>>>      p = q->task;
>>>      get_task_struct(p);
>>>      plist_del(&q->list);
>>>      q->lock_ptr = NULL;
>>>      wake_up_state(p);
>>>      put_task_struct(p);
>>>
>>> We changed this because it makes no sense to use a waitqueue for a
>>> single task.
>> Right.
>>
>>
>> However, my bigger concern still remains.  If the above is only an
>> optimization, we appear to have a race with wakeup where we can see a
>> non-empty list here and decide to schedule and have the wakeup code remove us
>> from the list, hiding it from all future futex related wakeups (signal and
>> timeout would still work).
> 
> No.
> 
> Sleeper does:
> 
> 	set_current_state(TASK_INTERRUPTIBLE);
> 	
> 	if (!plist_empty())
> 	   schedule();
> 
> So when the list removal happened before set_current_state() we don't
> schedule. If the wakeup happens _after_ set_current_state() then the
> wake_up_state() call will bring us back to running.
> 
>> We have also been seeing a race with the requeue_pi code with a JVM benchmark
>> where the apparent owner of the pi mutex remains blocked on the condvar - this
>> can be explained by the race I'm suspecting.  Also, futex_requeue_pi() is
>> using futex_wait_queue_me() which expects the waker to remove the futex_q from
>> the list, which isn't how things work for PI mutexes.  In an experiment, I
>> moved the spin_unlock() out of queueme() and right before the call to
>> schedule() to narrow the race window, and the hang we were experiencing
>> appears to have gone away.
> 
> The correct thing to do is to move set_current_state() before queue_me().
> 

Ah yes, you are correct of course.  Since PI futexes do not use 
plist_node_empty() to test for wakeup, the setting of TASK_ITNERRUPTIBLE 
after the queue_me() sets the stage to call schedule() with the wrong 
task state and lose the task forever.  I have included this in my 
current patch queue.  We are running our tests to confirm the fix and 
I'll submit the series for inclusion by tomorrow.

Thanks,

-- 
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/