linux-kernel - Re: [PATCH 3/3] introduce task_rcu

Message-ID: <20160518191045.GP3193@twins.programming.kicks-ass.net>
Date:	Wed, 18 May 2016 21:10:45 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Oleg Nesterov <oleg@...hat.com>
Cc:	Kirill Tkhai <ktkhai@...allels.com>, linux-kernel@...r.kernel.org,
	Ingo Molnar <mingo@...hat.com>,
	Vladimir Davydov <vdavydov@...allels.com>,
	Kirill Tkhai <tkhai@...dex.ru>,
	Christoph Lameter <cl@...ux.com>
Subject: Re: [PATCH 3/3] introduce task_rcu_dereference()

On Wed, May 18, 2016 at 08:23:18PM +0200, Oleg Nesterov wrote:
> IOW. We can never know if we have a garbage in "sighand" or the real value,
> this task_struct can be freed/reallocated when we do probe_slab_address().
> 
> And this is fine. We re-check that "task == *ptask" after that. Now we have
> two different cases:
> 
> 	1. This is actually the same task/task_struct. In this case
>            sighand != NULL tells us it is still alive.
> 
>         2. This is another task which got the same memory for task_struct.
>            We can't know this of course, and we can not trust sighand != NULL.
> 
> 	   In this case we actually return a random value, but this is correct.
> 
> 	   If we return NULL - we can pretend that we actually noticed that
> 	   *ptask was updated when the previous task has exited. Or pretend
> 	   that probe_slab_address(&sighand) reads NULL.
> 
> 	   If we return the new task (because sighand is not NULL for any
> 	   reason) - this is fine too. This (new) task can't go away before
> 	   another gp pass.
> 
> 	   And please note again the "We could even eliminate the false positive"
> 	   comment above (hmm, it should probably say false negative). We could
> 	   re-read task->sighand once again to avoid the falsely NULL.
> 
> 	   But this case is very unlikely so I think we do not really care.
> 
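
To make the two cases above concrete, here is a minimal userspace sketch of the "load the pointer, speculatively read sighand, re-check the pointer" sequence. It is an illustration under stated assumptions, not the kernel code: demo_task, demo_sighand and demo_task_deref() are hypothetical names, C11 atomics stand in for rcu_dereference(), smp_rmb() and READ_ONCE(), and the objects are assumed to stay mapped for the whole call, which is the guarantee the probe plus RCU gives in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for sighand_struct and task_struct. */
struct demo_sighand {
	int unused;
};

struct demo_task {
	_Atomic(struct demo_sighand *) sighand;	/* NULL once "released" */
};

/*
 * Userspace analogue of the lookup discussed above. Assumption: the
 * memory behind *ptask is never unmapped while we run, so the
 * speculative read of ->sighand cannot fault.
 */
static struct demo_task *demo_task_deref(struct demo_task *_Atomic *ptask)
{
	struct demo_sighand *sighand;
	struct demo_task *task;

	assert(ptask != NULL);
retry:
	/* Stands in for rcu_dereference(*ptask). */
	task = atomic_load_explicit(ptask, memory_order_acquire);
	if (!task)
		return NULL;

	/* Speculative read; the value may belong to a reused object. */
	sighand = atomic_load_explicit(&task->sighand, memory_order_relaxed);

	/* Stands in for smp_rmb(): order the read above before the re-check. */
	atomic_thread_fence(memory_order_acquire);
	if (task != atomic_load_explicit(ptask, memory_order_relaxed))
		goto retry;

	/* Case 1 or case 2: either answer is acceptable to the caller. */
	return sighand ? task : NULL;
}
```

With a single thread the retry branch is never taken; it only matters when another thread republishes *ptask between the two loads.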

Ah right, let's stick that in.. :-)

OK, something like so then?

---
 include/linux/sched.h |  3 ++
 kernel/exit.c         | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   | 29 +++++---------------
 3 files changed, 86 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1b43b45a22b9..7f90002e9344 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2134,6 +2134,9 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
+struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+struct task_struct *try_get_task_struct(struct task_struct **ptask);
+
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void task_cputime(struct task_struct *t,
 			 cputime_t *utime, cputime_t *stime);
diff --git a/kernel/exit.c b/kernel/exit.c
index fd90195667e1..58d7e05821ae 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -211,6 +211,82 @@ void release_task(struct task_struct *p)
 }
 
 /*
+ * Note that if this function returns a valid task_struct pointer (!NULL)
+ * task->usage must remain >0 for the duration of the RCU critical section.
+ */
+struct task_struct *task_rcu_dereference(struct task_struct **ptask)
+{
+	struct sighand_struct *sighand;
+	struct task_struct *task;
+
+	/*
+	 * We need to verify that release_task() was not called and thus
+	 * delayed_put_task_struct() can't run and drop the last reference
+	 * before rcu_read_unlock(). We check task->sighand != NULL,
+	 * but we can read the already freed and reused memory.
+	 */
+retry:
+	task = rcu_dereference(*ptask);
+	if (!task)
+		return NULL;
+
+	probe_kernel_address(&task->sighand, sighand);
+
+	/*
+	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
+	 * was already freed we can not miss the preceding update of this
+	 * pointer.
+	 */
+	smp_rmb();
+	if (unlikely(task != READ_ONCE(*ptask)))
+		goto retry;
+
+	/*
+	 * We've re-checked that "task == *ptask", now we have two different
+	 * cases:
+	 *
+	 * 1. This is actually the same task/task_struct. In this case
+	 *    sighand != NULL tells us it is still alive.
+	 *
+	 * 2. This is another task which got the same memory for task_struct.
+	 *    We can't know this of course, and we can not trust
+	 *    sighand != NULL.
+	 *
+	 *    In this case we actually return a random value, but this is
+	 *    correct.
+	 *
+	 *    If we return NULL - we can pretend that we actually noticed that
+	 *    *ptask was updated when the previous task has exited. Or pretend
+	 *    that probe_kernel_address(&task->sighand) reads NULL.
+	 *
+	 *    If we return the new task (because sighand is not NULL for any
+	 *    reason) - this is fine too. This (new) task can't go away before
+	 *    another gp pass.
+	 *
+	 *    And note: we could even eliminate the false negative by re-reading
+	 *    task->sighand once more to avoid the falsely NULL result. But this
+	 *    case is very unlikely, so we don't care.
+	 */
+	if (!sighand)
+		return NULL;
+
+	return task;
+}
+
+struct task_struct *try_get_task_struct(struct task_struct **ptask)
+{
+	struct task_struct *task;
+
+	rcu_read_lock();
+	task = task_rcu_dereference(ptask);
+	if (task)
+		get_task_struct(task);
+	rcu_read_unlock();
+
+	return task;
+}
+
+/*
  * Determine if a process group is "orphaned", according to the POSIX
  * definition in 2.2.2.52.  Orphaned process groups are not to be affected
  * by terminal-generated stop signals.  Newly orphaned process groups are
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 218f8e83db73..1d3a410c481b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1374,30 +1374,15 @@ static void task_numa_compare(struct task_numa_env *env,
 	int dist = env->dist;
 	bool assigned = false;
 
-	rcu_read_lock();
-
-	raw_spin_lock_irq(&dst_rq->lock);
-	cur = dst_rq->curr;
-	/*
-	 * No need to move the exiting task or idle task.
-	 */
-	if ((cur->flags & PF_EXITING) || is_idle_task(cur))
-		cur = NULL;
-	else {
-		/*
-		 * The task_struct must be protected here to protect the
-		 * p->numa_faults access in the task_weight since the
-		 * numa_faults could already be freed in the following path:
-		 * finish_task_switch()
-		 *     --> put_task_struct()
-		 *         --> __put_task_struct()
-		 *             --> task_numa_free()
-		 */
-		get_task_struct(cur);
+	cur = try_get_task_struct(&dst_rq->curr);
+	if (cur) {
+		if ((cur->flags & PF_EXITING) || is_idle_task(cur)) {
+			put_task_struct(cur);
+			cur = NULL;
+		}
 	}
 
-	raw_spin_unlock_irq(&dst_rq->lock);
-
+	rcu_read_lock();
 	/*
 	 * Because we have preemption enabled we can get migrated around and
 	 * end try selecting ourselves (current == env->p) as a swap candidate.