linux-kernel - Re: [PATCH] sched, cgroup: Use exit hook to avoid use-after-free crash

Message-ID: <20101225175525.GA3393@balbir.in.ibm.com>
Date:	Sat, 25 Dec 2010 23:25:25 +0530
From:	Balbir Singh <balbir@...ux.vnet.ibm.com>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Mike Galbraith <efault@....de>,
	Miklos Vajna <vmiklos@...galware.org>,
	shenghui <crosslonelyover@...il.com>,
	kernel-janitors@...r.kernel.org, linux-kernel@...r.kernel.org,
	mingo@...e.hu, Greg KH <greg@...ah.com>,
	Paul Turner <pjt@...gle.com>,
	Yong Zhang <yong.zhang0@...il.com>,
	Li Zefan <lizf@...fujitsu.com>,
	Paul Menage <menage@...gle.com>,
	Srivatsa Vaddagiri <vatsa@...ibm.com>
Subject: Re: [PATCH] sched, cgroup: Use exit hook to avoid use-after-free
 crash

* Peter Zijlstra <peterz@...radead.org> [2010-12-24 16:59:13]:

> On Fri, 2010-12-24 at 13:16 +0100, Mike Galbraith wrote:
> > On Fri, 2010-12-24 at 11:54 +0100, Peter Zijlstra wrote:
> 
> > > Right, so the cgroup core is supposed to already emit -EBUSY when there
> > > are associated tasks with the cgroup, that _should_ be sufficient, the
> > > pre_destroy() method is to frob some extra constraints or somesuch.
> > > 
> > > Our problem looks to be that a task (afaict usually current) changes
> > > cgroups without us getting notified of it. On destruction the task is
> > > still enqueued in the cfs_rq being destroyed but is not actually part of
> > > that cgroup according to the task->css bits.
> > 
> > Could it be an exiting task?  We're still preemptible, and iirc, you run
> > a CONFIG_PREEMPT kernel.  (grasp at all straws;)
> > 
> > cgroup_exit:
> >         /* Reassign the task to the init_css_set. */
> >         task_lock(tsk);
> >         cg = tsk->cgroups;
> >         tsk->cgroups = &init_css_set;
> >         task_unlock(tsk);
> >         if (cg)
> >                 put_css_set_taskexit(cg);
> > 
> 
> This straw appears true:
> 
> $ grep -e cpu_cgroup\\\|f491447c log9
> 
> ...
> 
> kworker/-1196    0d..2. 1601180us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /system/systemd-modules-load.service
> kworker/-1196    0d..2. 1601186us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /system/systemd-modules-load.service
> kworker/-1196    0d..2. 1601188us : __dequeue_entity: f491447c from f492a480, 1 left
> kworker/-1196    0d..2. 1601188us : pick_next_task_fair: picked: f491447c, modprobe/1210
> kworker/-1196    0d..2. 1601192us : __print_runqueue:      curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /system/systemd-modules-load.service
> modprobe-1210    0d..5. 1601802us : __print_runqueue:      curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
> modprobe-1210    0d..5. 1601807us : __print_runqueue:      curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
> modprobe-1210    0d..2. 1601817us : __print_runqueue:      curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
> modprobe-1210    0d..2. 1601819us : __enqueue_entity: f491447c to f492a480, 1 tasks
> modprobe-1210    0d..2. 1601826us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
> modprobe-1210    0d..2. 1601832us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
> modprobe-1210    0d..2. 1601839us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
> kworker/-1196    0d..2. 1601848us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
> kworker/-1196    0d..2. 1601854us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
> kworker/-1196    0d..2. 1601860us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
> kworker/-1196    0d..2. 1601865us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
> kworker/-1196    0d..2. 1601871us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
> kworker/-1196    0d..2. 1601872us : __dequeue_entity: f491447c from f492a480, 1 left
> kworker/-1196    0d..2. 1601873us : pick_next_task_fair: picked: f491447c, modprobe/1210
> kworker/-1196    0d..2. 1601876us : __print_runqueue:      curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
> modprobe-1210    0d..7. 1601895us : __print_runqueue:      curr: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
> modprobe-1210    0d..7. 1601900us : __print_runqueue:      curr: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
> modprobe-1210    0d..2. 1601909us : __print_runqueue:      curr: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
> modprobe-1210    0d..2. 1601911us : __enqueue_entity: f491447c to f492a480, 1 tasks
> modprobe-1210    0d..2. 1601918us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
> modprobe-1210    0d..2. 1601924us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
> modprobe-1210    0d..2. 1601931us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
> kworker/-1196    0d..2. 1602071us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
> kworker/-1196    0d..2. 1602080us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
> kworker/-1196    0d..2. 1602089us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
> kworker/-1196    0d..2. 1602097us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
> kworker/-1196    0d..2. 1602105us : __print_runqueue:      se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
> kworker/-1196    0d..2. 1602107us : __dequeue_entity: f491447c from f492a480, 1 left
> kworker/-1196    0d..2. 1602108us : pick_next_task_fair: picked: f491447c, modprobe/1210
> kworker/-1196    0d..2. 1602114us : __print_runqueue:      curr: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
> modprobe-1210    0d..3. 1602128us : __print_runqueue:      curr: f491447c, comm: modprobe/1210, state: 80, load: 1024, cgroup: /
> 
> 
> So cgroup moves a task without calling cgroup_subsys::attach() which is
> odd, but it does have an ::exit method, sadly it calls that _before_
> re-assigning the task, which means we have to jump through some hoops.
> 
> The below seems to fix the problem for me..
> 
> ---
> Subject: sched, cgroup: Use exit hook to avoid use-after-free crash
> 
> By not notifying the controller of the on-exit move back to
> init_css_set, we fail to move the task out of the previous cgroup's
> cfs_rq. This leads to an opportunity for a cgroup-destroy to come in and
> free the cgroup (there are no active tasks left in it after all) to
> which the not-quite dead task is still enqueued.
> 
> Cc: stable@...nel.org
> Reported-by: Miklos Vajna <vmiklos@...galware.org>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@...llo.nl>
> ---
>  kernel/sched.c |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 7e401f8..572625c 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -611,6 +611,9 @@ static inline struct task_group *task_group(struct task_struct *p)
>  	struct task_group *tg;
>  	struct cgroup_subsys_state *css;
>  
> +	if (p->flags & PF_EXITING)
> +		return &root_task_group;
> +
>  	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
>  			lockdep_is_held(&task_rq(p)->lock));
>  	tg = container_of(css, struct task_group, css);
> @@ -8887,6 +8890,12 @@ cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
>  	}
>  }
>  
> +static void
> +cpu_cgroup_exit(struct cgroup_subsys *ss, struct task_struct *task)
> +{
> +	sched_move_task(task);
> +}
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
>  				u64 shareval)
> @@ -8959,6 +8968,7 @@ struct cgroup_subsys cpu_cgroup_subsys = {
>  	.destroy	= cpu_cgroup_destroy,
>  	.can_attach	= cpu_cgroup_can_attach,
>  	.attach		= cpu_cgroup_attach,
> +	.exit		= cpu_cgroup_exit,
>  	.populate	= cpu_cgroup_populate,
>  	.subsys_id	= cpu_cgroup_subsys_id,
>  	.early_init	= 1,
> 
>

Very good catch!

Looks very reasonable and correct to me.
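
For readers following the changelog, the race can be sketched as a toy
userspace model. This is only an illustration under stated assumptions:
every toy_* name below is hypothetical and none of it is kernel code. A
"freed" flag stands in for the real free so the stale reference can be
observed safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins (hypothetical); not the kernel's real structures. */
struct toy_rq {
	int nr_queued;	/* entities currently enqueued here */
	bool freed;	/* set instead of calling free(), so the bug is observable */
};

struct toy_group {
	struct toy_rq rq;
	int nr_members;	/* tasks the "cgroup core" believes are in this group */
};

struct toy_task {
	struct toy_group *member_of;	/* models what tsk->cgroups says */
	struct toy_rq *queued_on;	/* where the entity actually sits */
};

static void toy_enqueue(struct toy_task *t, struct toy_rq *rq)
{
	t->queued_on = rq;
	rq->nr_queued++;
}

static void toy_dequeue(struct toy_task *t)
{
	assert(t->queued_on != NULL);
	t->queued_on->nr_queued--;
	t->queued_on = NULL;
}

/* Normal attach path: membership and runqueue change together. */
static void toy_attach(struct toy_task *t, struct toy_group *g)
{
	t->member_of = g;
	g->nr_members++;
	toy_enqueue(t, &g->rq);
}

/*
 * Exit path: membership is reassigned to the root group. Only when
 * use_exit_hook is set does the "scheduler" get to requeue the task
 * first, which is the role the patch gives to the exit callback.
 */
static void toy_exit(struct toy_task *t, struct toy_group *root,
		     bool use_exit_hook)
{
	if (use_exit_hook) {
		toy_dequeue(t);
		toy_enqueue(t, &root->rq);
	}
	t->member_of->nr_members--;
	t->member_of = root;
	root->nr_members++;
}

/* Destroy checks membership only, so a stale enqueue goes unnoticed. */
static int toy_destroy(struct toy_group *g)
{
	if (g->nr_members > 0)
		return -1;	/* models the -EBUSY case */
	g->rq.freed = true;
	return 0;
}

/* Returns true if the task ends up enqueued on a freed runqueue. */
bool toy_scenario(bool use_exit_hook)
{
	struct toy_group root = {0}, child = {0};
	struct toy_task task = {0};
	int rc;

	toy_attach(&task, &child);
	toy_exit(&task, &root, use_exit_hook);
	rc = toy_destroy(&child);
	assert(rc == 0);	/* no members left, so destroy succeeds */
	(void)rc;
	return task.queued_on->freed;
}
```

In this model toy_scenario(false) leaves the task on the destroyed
group's runqueue, while toy_scenario(true) finds it on the root group's
runqueue, which matches the behaviour the changelog describes.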

Acked-by: Balbir Singh <balbir@...ux.vnet.ibm.com>
 
 

-- 
	Three Cheers,
	Balbir