Message-ID: <878sr6t21a.fsf_-_@x220.int.ebiederm.org>
Date:   Mon, 02 Sep 2019 23:52:01 -0500
From:   ebiederm@...ssion.com (Eric W. Biederman)
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Oleg Nesterov <oleg@...hat.com>,
        Russell King - ARM Linux admin <linux@...linux.org.uk>,
        Peter Zijlstra <peterz@...radead.org>,
        Chris Metcalf <cmetcalf@...hip.com>,
        Christoph Lameter <cl@...ux.com>,
        Kirill Tkhai <tkhai@...dex.ru>, Mike Galbraith <efault@....de>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Davidlohr Bueso <dave@...olabs.net>
Subject: [PATCH 2/3] task: RCU protect tasks on the runqueue


In the ordinary case today the rcu grace period for a task comes when
the task is reaped, well after the task has left the runqueue.  This
change guarantees that the rcu grace period always happens after a
task has left the runqueue.  As that ordering already holds in the
common case today, I do not expect any code correctness problems with
this change.  At most I anticipate timing challenges.

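To illustrate what the guarantee buys (purely illustrative, not part
of this patch): with rq->curr published via rcu_assign_pointer()
below, a reader can inspect the task currently running on a cpu under
rcu_read_lock() without taking a reference.  The helper name and
pr_info() message here are hypothetical:

	static void peek_rq_curr(struct rq *rq)
	{
		struct task_struct *p;

		rcu_read_lock();
		/*
		 * Safe: the task cannot be freed until a grace period
		 * elapses, and that grace period now begins only after
		 * the task has left the runqueue.
		 */
		p = rcu_dereference(rq->curr);
		if (p)
			pr_info("running pid %d\n", p->pid);
		rcu_read_unlock();
	}
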
The only code that will now run later consists of the functions
perf_event_delayed_put and trace_sched_process_free.  The function
perf_event_delayed_put is, in the final analysis, just a WARN_ON for
cases that I assume should never happen, so I don't see any problem
with delaying it.

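For reference, both functions run from the task's rcu callback.  A
rough sketch of that existing callback in kernel/exit.c (paraphrased,
not part of this diff):

	static void delayed_put_task_struct(struct rcu_head *rhp)
	{
		struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

		/*
		 * With this change these run only after the grace
		 * period, i.e. after the task has left the runqueue.
		 */
		perf_event_delayed_put(tsk);
		trace_sched_process_free(tsk);
		put_task_struct(tsk);
	}
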
The function trace_sched_process_free is a trace point and thus user
space visible.  The strangest dependencies can happen, but short of
the bizarre it appears to me that trace_sched_process_free now gets a
slightly more accurate picture of when a task struct is freed, as it
is now guaranteed that the process is no longer on the runqueue.

Resources for a process are freed in release_task or in
__put_task_struct when the reference count goes to 0, both of which
happen at effectively the same time as before.  The rcu grace period
is just potentially happening a little bit later in the timeline.

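With this patch the scheduler's reference drains through
tsk->rcu_users rather than tsk->usage directly.  A minimal sketch of
that helper, introduced earlier in this series and shown here only
for context:

	void put_task_struct_rcu_user(struct task_struct *task)
	{
		/*
		 * The last rcu user (reaper or scheduler) defers the
		 * final put_task_struct() past an rcu grace period via
		 * the delayed_put_task_struct() callback sketched above.
		 */
		if (refcount_dec_and_test(&task->rcu_users))
			call_rcu(&task->rcu, delayed_put_task_struct);
	}
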
In the common case of a process being reaped after it leaves the
runqueue, everything will happen exactly as before.

In the case where a task self reaps, we are pretty much guaranteed
that the rcu grace period is delayed.  So we should get quite a bit
of coverage of this worst case in a normal threaded workload, and I
expect any issues to turn up quickly or not at all.

I have lightly tested this change and everything appears to work
fine.

Inspired-by: Linus Torvalds <torvalds@...ux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@...hat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@...ssion.com>
---
 kernel/fork.c       | 11 +++++++----
 kernel/sched/core.c |  7 ++++---
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9f04741d5c70..7a74ade4e7d6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,10 +900,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	if (orig->cpus_ptr == &orig->cpus_mask)
 		tsk->cpus_ptr = &tsk->cpus_mask;
 
-	/* One for the user space visible state that goes away when reaped. */
-	refcount_set(&tsk->rcu_users, 1);
-	/* One for the rcu users, and one for the scheduler */
-	refcount_set(&tsk->usage, 2);
+	/*
+	 * One for the user space visible state that goes away when reaped.
+	 * One for the scheduler.
+	 */
+	refcount_set(&tsk->rcu_users, 2);
+	/* One for the rcu users */
+	refcount_set(&tsk->usage, 1);
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	tsk->btrace_seq = 0;
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..802958407369 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		/* Task is done with its stack. */
 		put_task_stack(prev);
 
-		put_task_struct(prev);
+		put_task_struct_rcu_user(prev);
 	}
 
 	tick_nohz_task_switch();
@@ -3857,7 +3857,7 @@ static void __sched notrace __schedule(bool preempt)
 
 	if (likely(prev != next)) {
 		rq->nr_switches++;
-		rq->curr = next;
+		rcu_assign_pointer(rq->curr, next);
 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
@@ -5863,7 +5863,8 @@ void init_idle(struct task_struct *idle, int cpu)
 	__set_task_cpu(idle, cpu);
 	rcu_read_unlock();
 
-	rq->curr = rq->idle = idle;
+	rq->idle = idle;
+	rcu_assign_pointer(rq->curr, idle);
 	idle->on_rq = TASK_ON_RQ_QUEUED;
 #ifdef CONFIG_SMP
 	idle->on_cpu = 1;
-- 
2.21.0.dirty
