linux-kernel - Re: [BUG] sched: leaf_cfs_rq

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160316152245.GY6344@twins.programming.kicks-ass.net>
Date:	Wed, 16 Mar 2016 16:22:45 +0100
From:	Peter Zijlstra <peterz@...radead.org>
To:	Tejun Heo <tj@...nel.org>
Cc:	Kazuki Yamaguchi <k@....jp>,
	Niklas Cassel <niklas.cassel@...s.com>,
	linux-kernel@...r.kernel.org
Subject: Re: [BUG] sched: leaf_cfs_rq_list use after free

On Wed, Mar 16, 2016 at 07:24:14AM -0700, Tejun Heo wrote:
> So, the problem here is that cpu is using css_offline to tear down the
> css.  perf shouldn't have the same problem as it destroys its css from
> css_free.  The distinctions among different callbacks evolved over
> time and we need to update the documentation but here are the rules.
> 
> css_alloc()
> 
> 	This one is obvious.  Alloc and init.  The css will become
> 	visible during css iteration sometime between alloc and
> 	online.
> 
> css_online()
> 
> 	The css is now guaranteed to be visible for css_for_each*()
> 	iterations.  This distinction exists because some controllers
> 	need to propagate state changes downwards requiring a new css
> 	to become visible before it inherits the current state from
> 	the parent.  Conversely, there's no reason to use this
> 	callback if there's no such requirement.
> 
> 	Ex: Freezer which propagates the target state downwards and
> 	needs a new child to inherit the current state while
> 	iteratable.

So it looks like sched uses css_online() for no particular reason
either, I've moved all that into css_alloc().

> 
> css_offline()
> 
> 	The css is going down and draining usages by refusing new
> 	css_tryget_online()'s but still guaranteed to be visible
> 	during css iterations.  Controllers which explicitly needs to
> 	put, say, cache references or need to perform cleanup
> 	operations while the css is iteratable need to use this
> 	method; otherwise, no reason to bother with it.
> 
> 	Ex: blkcg pins csses as part of lookup cache which can prevent
> 	a css from being drained and released, so blkcg uses this call
> 	back to disable caching for the css.
> 
> css_released()
> 
> 	The css is drained and not visible during iteration and will
> 	be freed after a RCU grace period.  Used by controllers which
> 	cache pointers to csses being drained.
> 
> 	Ex: memcg needs to iterate csses which are being drained and
> 	its custom iterator implementation caches RCU protected
> 	pointers to csses.  memcg uses this callback to shoot down
> 	those cached pointers.
> 
> css_free()
> 
> 	The css is drained, not visible to iterations, not used by
> 	anyone, and will be freed immediately.

None of that speaks of where Zombies live, am I to infer that Zombies
pass css_offline() but stall css_released() ?

I don't particularly care about iterating css bits, but I do need my
parent group to still exist, this is now also guaranteed for
css_release(), right? The above documentation also doesn't mention this;
in particular I require that css_release() for any group is not called
before the css_release() of any child group.

The reason I'm interested in css_release() is that there is an RCU grace
period between that and css_free(), so I can kill my call_rcu() from
css_free().

> > So something needs to fundamentally ensure that ->css changes before we
> > go offline the thing.
> 
> I could be mistaken but AFAICS there doesn't seem to be anything
> requiring bothering with the more specialized exit methods.  Given
> that no css iteration is used and everything is lock protected, the
> patch at the end of this messages should do and seems to work fine
> here.  Am I missing something?

Some additional cleanups maybe.. but yes, that seems to work.

Thanks!

---
Subject: sched: Fix/cleanup cgroup teardown/init

The cpu controller hasn't kept up with the various changes in the whole
cgroup initialization / destruction sequence, and commit 2e91fa7f6d45
("cgroup: keep zombies associated with their original cgroups") caused
it to explode.

The reason for this is that zombies do not inhibit css_offline() from
being called, but do stall css_released(). Now we tear down the cfs_rq
structures on css_offline() but zombies can run after that, leading to
use-after-free issues.

The solution is to move the tear-down to css_released(), which
guarantees nobody (including no zombies) is still using our cgroup.

Furthermore, a few simple cleanups are possible too. There doesn't
appear to be any point to us using css_online() (anymore?) so fold that
in css_alloc().

And since cgroup code guarantees an RCU grace period between
css_released() and css_free() we can forgo using call_rcu() and free the
stuff immediately.

Cc: stable@...r.kernel.org
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Suggested-by: Tejun Heo <tj@...nel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
---
 kernel/sched/core.c | 35 ++++++++++++++---------------------
 1 file changed, 14 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 04a04e0378fe..bc5076fd0167 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7538,7 +7538,7 @@ void set_curr_task(int cpu, struct task_struct *p)
 /* task_group_lock serializes the addition/removal of task groups */
 static DEFINE_SPINLOCK(task_group_lock);
 
-static void free_sched_group(struct task_group *tg)
+static void sched_free_group(struct task_group *tg)
 {
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
@@ -7564,7 +7564,7 @@ struct task_group *sched_create_group(struct task_group *parent)
 	return tg;
 
 err:
-	free_sched_group(tg);
+	sched_free_group(tg);
 	return ERR_PTR(-ENOMEM);
 }
 
@@ -7584,17 +7584,16 @@ void sched_online_group(struct task_group *tg, struct task_group *parent)
 }
 
 /* rcu callback to free various structures associated with a task group */
-static void free_sched_group_rcu(struct rcu_head *rhp)
+static void sched_free_group_rcu(struct rcu_head *rhp)
 {
 	/* now it should be safe to free those cfs_rqs */
-	free_sched_group(container_of(rhp, struct task_group, rcu));
+	sched_free_group(container_of(rhp, struct task_group, rcu));
 }
 
-/* Destroy runqueue etc associated with a task group */
 void sched_destroy_group(struct task_group *tg)
 {
 	/* wait for possible concurrent references to cfs_rqs complete */
-	call_rcu(&tg->rcu, free_sched_group_rcu);
+	call_rcu(&tg->rcu, sched_free_group_rcu);
 }
 
 void sched_offline_group(struct task_group *tg)
@@ -8053,31 +8052,26 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
 
+	sched_online_group(tg, parent);
+
 	return &tg->css;
 }
 
-static int cpu_cgroup_css_online(struct cgroup_subsys_state *css)
+static void cpu_cgroup_css_released(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
-	struct task_group *parent = css_tg(css->parent);
 
-	if (parent)
-		sched_online_group(tg, parent);
-	return 0;
+	sched_offline_group(tg);
 }
 
 static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 
-	sched_destroy_group(tg);
-}
-
-static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
-{
-	struct task_group *tg = css_tg(css);
-
-	sched_offline_group(tg);
+	/*
+	 * Relies on the RCU grace period between css_released() and this.
+	 */
+	sched_free_group(tg);
 }
 
 static void cpu_cgroup_fork(struct task_struct *task)
@@ -8437,9 +8431,8 @@ static struct cftype cpu_files[] = {
 
 struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
+	.css_released	= cpu_cgroup_css_released,
 	.css_free	= cpu_cgroup_css_free,
-	.css_online	= cpu_cgroup_css_online,
-	.css_offline	= cpu_cgroup_css_offline,
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,