Message-ID: <878sr6t21a.fsf_-_@x220.int.ebiederm.org>
Date: Mon, 02 Sep 2019 23:52:01 -0500
From: ebiederm@...ssion.com (Eric W. Biederman)
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Oleg Nesterov <oleg@...hat.com>,
Russell King - ARM Linux admin <linux@...linux.org.uk>,
Peter Zijlstra <peterz@...radead.org>,
Chris Metcalf <cmetcalf@...hip.com>,
Christoph Lameter <cl@...ux.com>,
Kirill Tkhai <tkhai@...dex.ru>, Mike Galbraith <efault@....de>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...nel.org>,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
Davidlohr Bueso <dave@...olabs.net>
Subject: [PATCH 2/3] task: RCU protect tasks on the runqueue

In the ordinary case today the RCU grace period of a task comes when
the task is reaped, well after the task has left the runqueue.  This
change guarantees that the RCU grace period always happens after a
task has left the runqueue.  As this is something that usually
happens today, I do not expect any code correctness problems with
this change.  At most I anticipate timing challenges.
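
For illustration only (not part of this patch), the kind of lockless
reader this ordering is meant to make safe looks roughly like the
following, where inspect_task() is a made-up stand-in for whatever
the reader wants to do with the task:

	struct task_struct *p;

	rcu_read_lock();
	p = rcu_dereference(rq->curr);
	if (p)
		inspect_task(p);  /* p cannot be freed before rcu_read_unlock() */
	rcu_read_unlock();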

The only code that will now run later is in the functions
perf_event_delayed_put and trace_sched_process_free.  The function
perf_event_delayed_put is, in the final analysis, just a WARN_ON for
cases that I assume should never happen, so I don't see any problem
with delaying it.
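
For context, the path that ends up running these two functions after
the grace period is the rcu_users machinery introduced earlier in the
series; a rough sketch of how I expect it to fit together (not taken
from this patch, details may differ) is:

	static void delayed_put_task_struct(struct rcu_head *rhp)
	{
		struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

		perf_event_delayed_put(tsk);
		trace_sched_process_free(tsk);
		put_task_struct(tsk);	/* drops tsk->usage, may free tsk */
	}

	void put_task_struct_rcu_user(struct task_struct *task)
	{
		if (refcount_dec_and_test(&task->rcu_users))
			call_rcu(&task->rcu, delayed_put_task_struct);
	}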

The function trace_sched_process_free is a tracepoint and thus user
space visible.  User space can grow the strangest dependencies, but
short of the bizarre it appears to me that trace_sched_process_free
gets a slightly more accurate picture of when a task_struct is freed,
as it is now guaranteed that the task is no longer on the runqueue.

Resources for a process are freed in release_task, or in
__put_task_struct when the reference count drops to 0.  Both of these
still happen at effectively the same time as before; only the RCU
grace period potentially happens a little bit later in the timeline.
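
As a sketch of my reading of the intended reference counting after
this change (the actual drop sites live outside this patch and may
differ in detail):

	/*
	 * tsk->rcu_users == 2:  one reference dropped when the task is
	 *                       reaped, one dropped via
	 *                       put_task_struct_rcu_user() in
	 *                       finish_task_switch() when the dead task is
	 *                       switched away for the last time; the final
	 *                       drop queues the RCU callback.
	 *
	 * tsk->usage == 1:      held on behalf of the rcu_users references
	 *                       and dropped by put_task_struct() from the
	 *                       RCU callback, which frees the task_struct.
	 */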

In the common case of a process being reaped after it leaves the
runqueue, everything will happen exactly as before.

In the case where a task self-reaps, we are pretty much guaranteed
that the RCU grace period is delayed, so a normal threaded workload
should give quite a bit of coverage of this worst case for the
change.  I therefore expect any issues to turn up quickly or not at
all.

I have lightly tested this change and everything appears to work
fine.

Inspired-by: Linus Torvalds <torvalds@...ux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@...hat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@...ssion.com>
---
kernel/fork.c | 11 +++++++----
kernel/sched/core.c | 7 ++++---
2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9f04741d5c70..7a74ade4e7d6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,10 +900,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
if (orig->cpus_ptr == &orig->cpus_mask)
tsk->cpus_ptr = &tsk->cpus_mask;
- /* One for the user space visible state that goes away when reaped. */
- refcount_set(&tsk->rcu_users, 1);
- /* One for the rcu users, and one for the scheduler */
- refcount_set(&tsk->usage, 2);
+ /*
+ * One for the user space visible state that goes away when reaped.
+ * One for the scheduler.
+ */
+ refcount_set(&tsk->rcu_users, 2);
+ /* One for the rcu users */
+ refcount_set(&tsk->usage, 1);
#ifdef CONFIG_BLK_DEV_IO_TRACE
tsk->btrace_seq = 0;
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..802958407369 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
/* Task is done with its stack. */
put_task_stack(prev);
- put_task_struct(prev);
+ put_task_struct_rcu_user(prev);
}
tick_nohz_task_switch();
@@ -3857,7 +3857,7 @@ static void __sched notrace __schedule(bool preempt)
if (likely(prev != next)) {
rq->nr_switches++;
- rq->curr = next;
+ rcu_assign_pointer(rq->curr, next);
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
@@ -5863,7 +5863,8 @@ void init_idle(struct task_struct *idle, int cpu)
__set_task_cpu(idle, cpu);
rcu_read_unlock();
- rq->curr = rq->idle = idle;
+ rq->idle = idle;
+ rcu_assign_pointer(rq->curr, idle);
idle->on_rq = TASK_ON_RQ_QUEUED;
#ifdef CONFIG_SMP
idle->on_cpu = 1;
--
2.21.0.dirty