linux-kernel - Re: [git] CFS-devel, latest code

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20070925144830.GA5286@linux.vnet.ibm.com>
Date:	Tue, 25 Sep 2007 20:18:30 +0530
From:	Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>
To:	Ingo Molnar <mingo@...e.hu>
Cc:	Mike Galbraith <efault@....de>, linux-kernel@...r.kernel.org,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Dhaval Giani <dhaval@...ux.vnet.ibm.com>,
	Dmitry Adamushko <dmitry.adamushko@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [git] CFS-devel, latest code

On Tue, Sep 25, 2007 at 01:33:06PM +0200, Ingo Molnar wrote:
> > hm. perhaps this fixup in kernel/sched.c:set_task_cpu():
> > 
> >         p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
> > 
> > needs to become properly group-hierarchy aware?

You seem to have hit the nerve for this problem. The two patches I sent:

	http://lkml.org/lkml/2007/9/25/117
	http://lkml.org/lkml/2007/9/25/168

partly help, but we can do better.

> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -1039,7 +1039,8 @@ void set_task_cpu(struct task_struct *p,
>  {
>  	int old_cpu = task_cpu(p);
>  	struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu);
> -	u64 clock_offset;
> +	struct sched_entity *se;
> +	u64 clock_offset, voffset;
> 
>  	clock_offset = old_rq->clock - new_rq->clock;
> 
> @@ -1051,7 +1052,11 @@ void set_task_cpu(struct task_struct *p,
>  	if (p->se.block_start)
>  		p->se.block_start -= clock_offset;
>  #endif
> -	p->se.vruntime -= old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;
> +
> +	se = &p->se;
> +	voffset = old_rq->cfs.min_vruntime - new_rq->cfs.min_vruntime;

This one feels wrong, although I can't express my reaction correctly ..

> +	for_each_sched_entity(se)
> +		se->vruntime -= voffset;

Note that parent entities for a task is per-cpu. So if a task A
belonging to userid guest hops from CPU0 to CPU1, then it gets a new parent 
entity as well, which is different from its parent entity on CPU0.

Before:
	taskA->se.parent = guest's tg->se[0]

After:
	taskA->se.parent = guest's tg->se[1]

So walking up the entity hierarchy and fixing up (parent)se->vruntime will do
little good after the task has moved to a new cpu.

IMO, we need to be doing this :

	- For dequeue of higher level sched entities, simulate as if
	  they are going to "sleep" 
	- For enqueue of higher level entities, simulate as if they are
	  "waking up". This will cause enqueue_entity() to reset their
	  vruntime (to existing value for cfs_rq->min_vruntime) when they 
	  "wakeup".

If we don't do this, then lets say a group had only one task (A) and it
moves from CPU0 to CPU1. Then on CPU1, when group level entity for task
A is enqueued, it will have a very low vruntime (since it was never
running) and this will give task A unlimited cpu time, until its group
entity catches up with all the "sleep" time.

Let me try a fix for this next ..

-- 
Regards,
vatsa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/