linux-kernel - Re: [BUG] sched: leaf_cfs_rq

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160316142414.GA6980@mtj.duckdns.org>
Date:	Wed, 16 Mar 2016 07:24:14 -0700
From:	Tejun Heo <tj@...nel.org>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Kazuki Yamaguchi <k@....jp>,
	Niklas Cassel <niklas.cassel@...s.com>,
	linux-kernel@...r.kernel.org
Subject: Re: [BUG] sched: leaf_cfs_rq_list use after free

Hello, Peter.

Sorry about the delay.

On Mon, Mar 14, 2016 at 01:09:03PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 14, 2016 at 12:20:57PM +0100, Peter Zijlstra wrote:
> > So I would suggest TJ to revert that patch and queue it for stable.
> > 
> > It it clearly borken, because cgroup_exit() is called from preemptible
> > context, so _obviously_ we can (and clearly will) schedule after that,
> > which is somewhat hard if the cgroup we're supposedly belonging to has
> > been torn to shreds in the meantime.
> 
> And I think it might be fundamentally broken, because it leaves ->css
> set to whatever cgroup we had, while simultaneously allowing that css to
> go away.

So, the problem here is that cpu is using css_offline to tear down the
css.  perf shouldn't have the same problem as it destroys its css from
css_free.  The distinctions among different callbacks evolved over
time and we need to update the documentation but here are the rules.

css_alloc()

	This one is obvious.  Alloc and init.  The css will become
	visible during css iteration sometime between alloc and
	online.

css_online()

	The css is now guaranteed to be visible for css_for_each*()
	iterations.  This distinction exists because some controllers
	need to propagate state changes downwards requiring a new css
	to become visible before it inherits the current state from
	the parent.  Conversely, there's no reason to use this
	callback if there's no such requirement.

	Ex: Freezer which propagates the target state downwards and
	needs a new child to inherit the current state while
	iteratable.

css_offline()

	The css is going down and draining usages by refusing new
	css_tryget_online()'s but still guaranteed to be visible
	during css iterations.  Controllers which explicitly needs to
	put, say, cache references or need to perform cleanup
	operations while the css is iteratable need to use this
	method; otherwise, no reason to bother with it.

	Ex: blkcg pins csses as part of lookup cache which can prevent
	a css from being drained and released, so blkcg uses this call
	back to disable caching for the css.

css_released()

	The css is drained and not visible during iteration and will
	be freed after a RCU grace period.  Used by controllers which
	cache pointers to csses being drained.

	Ex: memcg needs to iterate csses which are being drained and
	its custom iterator implementation caches RCU protected
	pointers to csses.  memcg uses this callback to shoot down
	those cached pointers.

css_free()

	The css is drained, not visible to iterations, not used by
	anyone, and will be freed immediately.

So, controllers which don't have persistent states or synchronization
requirements around state propagation have no reason to bother with
all the rest.  css_alloc() and css_free() are the right callbacks to
init and tear down csses.  If a controller has specific needs to
propagate states, only the part of operations which are affected
should be in the respective specialized init/exit methods.

> This means that anything trying to use this pointer; and there's quite a
> lot of that; is in for a nasty surprise.
> 
> So you really need to change the ->css, either when the task starts
> dying (like it used to), or otherwise right before the cgroup goes
> offline.

So, the rule is that the css should stay serviceable until everyone
depending on it is gone.

> The argument used was that people want to see resources consumed by
> Zombies, which is all fine and dandy, but when you destroy the cgroup
> you cannot see that anyway.
> 
> So something needs to fundamentally ensure that ->css changes before we
> go offline the thing.

I could be mistaken but AFAICS there doesn't seem to be anything
requiring bothering with the more specialized exit methods.  Given
that no css iteration is used and everything is lock protected, the
patch at the end of this messages should do and seems to work fine
here.  Am I missing something?

Thanks.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0f5abc6..98d019d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8332,14 +8332,8 @@ static void cpu_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct task_group *tg = css_tg(css);
 
-	sched_destroy_group(tg);
-}
-
-static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
-{
-	struct task_group *tg = css_tg(css);
-
 	sched_offline_group(tg);
+	sched_destroy_group(tg);
 }
 
 static void cpu_cgroup_fork(struct task_struct *task)
@@ -8701,7 +8695,6 @@ struct cgroup_subsys cpu_cgrp_subsys = {
 	.css_alloc	= cpu_cgroup_css_alloc,
 	.css_free	= cpu_cgroup_css_free,
 	.css_online	= cpu_cgroup_css_online,
-	.css_offline	= cpu_cgroup_css_offline,
 	.fork		= cpu_cgroup_fork,
 	.can_attach	= cpu_cgroup_can_attach,
 	.attach		= cpu_cgroup_attach,