linux-kernel - TASK_DEAD && ttwu() again (Was: ensure arch_ptrace/ptrace

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20130123191946.GA19101@redhat.com>
Date:	Wed, 23 Jan 2013 20:19:46 +0100
From:	Oleg Nesterov <oleg@...hat.com>
To:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Yasunori Goto <y-goto@...fujitsu.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Ingo Molnar <mingo@...e.hu>
Cc:	Dan Carpenter <dan.carpenter@...cle.com>,
	Michael Davidson <md@...gle.com>,
	Suleiman Souhlal <suleiman@...gle.com>,
	Julien Tinnes <jln@...gle.com>,
	Aaron Durbin <adurbin@...gle.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Tejun Heo <tj@...nel.org>,
	Roland McGrath <roland@...k.frob.com>,
	Alexander Gordeev <agordeev@...hat.com>
Subject: TASK_DEAD && ttwu() again (Was: ensure arch_ptrace/ptrace_request
	can never race with SIGKILL)

To avoid the confusion, this is not connected to ptrace_freeze_traced()
changes...

With or without these changes, there is another problem: a spurious
wakeup from try_to_wake_up(TASK_NORMAL) which doesn't necessarily see
the "right" task->state.

As for ptrace_stop() this is purely theoretical, but I thought that
perhaps it makes sense to extract the "mb + unlock_wait(pi_lock)" code
from do_exit() into the generic helper,
set_current_state_sync_because_we_cant_tolerate_a_wrong_wakeup().
But when I look at this code again I am not sure it is right.

Let me remind the problem. To oversimplify, we have

	try_to_wake_up(task, state)
	{
		lock(task->pi_lock);

		if (task->state & state)
			task->state = TASK_RUNNING;

		unlock(task->pi_lock);
	}

And this means that a task doing

	current->state = STATE_1;
	// no schedule() in between
	current->state = STATE_2;
	schedule();

can be actually woken up by try_to_wake_up(STATE_1) even if it already
sleeps in STATE_2.

Usually this is fine, any wait_event-like code should be careful. But
sometimes we can't afford the false wakeup, that is why do_wait() roughly
does

	do_exit()
	{
		// down_read(mmap_sem) can do this without schedule()
		current->state = TASK_UNINTERRUPTIBLE;
		current->state = TASK_RUNNING;

		mb();

		spin_unlock_wait(current->pi_lock);

		current->state = TASK_DEAD;
		schedule();
	}

This should ensure that any subsequent (after unlock_wait) try_to_wake_up()
can't see state == UNINTERRUPTIBLE and I think this works.

But. Somehow we missed the fact (I think) that we also need to serialize
unlock_wait() and "state = TASK_DEAD". The code above can be reordered,

	do_exit()
	{
		// down_read(mmap_sem) can do this without schedule()
		current->state = TASK_UNINTERRUPTIBLE;
		current->state = TASK_RUNNING;

		mb();

		current->state = TASK_DEAD;
		// !!! ttwu() can change ->state here !!!
		spin_unlock_wait(current->pi_lock);

		schedule();
	}

and we have the same problem again. So _I think_ that we we need another
mb() after unlock_wait() ?

And, afaics, in theory we can't simply move the current mb() down, after
unlock_wait().  (again, only in theory, if nothing else we should have
the implicit barrrers after we played with ->state in the past).

Or perhaps we should modify ttwu_do_wakeup() to not blindly set RUNNING,
say, cmpxchg(old_state, RUNNING). But this is not simple/nice.

Or I missed something?

Oleg.

--- x/kernel/exit.c
+++ x/kernel/exit.c
@@ -869,8 +869,15 @@ void do_exit(long code)
 	 * To avoid it, we have to wait for releasing tsk->pi_lock which
 	 * is held by try_to_wake_up()
 	 */
+
 	smp_mb();
 	raw_spin_unlock_wait(&tsk->pi_lock);
+	/*
+	 * The first mb() ensures that after that try_to_wake_up() must see
+	 * state == TASK_RUNNING. We need another one to ensure that we set
+	 * TASK_DEAD only after ->pi_lock is really unlocked.
+	 */
+	smp_mb();

 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/