linux-kernel - Re: [RFC] sched: Limit idle_balance() when it is being used too frequently


Message-ID: <1374076741.7412.35.camel@j-VirtualBox>
Date:	Wed, 17 Jul 2013 08:59:01 -0700
From:	Jason Low <jason.low2@...com>
To:	Peter Zijlstra <peterz@...radead.org>
Cc:	Ingo Molnar <mingo@...hat.com>,
	LKML <linux-kernel@...r.kernel.org>,
	Mike Galbraith <efault@....de>,
	Thomas Gleixner <tglx@...utronix.de>,
	Paul Turner <pjt@...gle.com>, Alex Shi <alex.shi@...el.com>,
	Preeti U Murthy <preeti@...ux.vnet.ibm.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Morten Rasmussen <morten.rasmussen@....com>,
	Namhyung Kim <namhyung@...nel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Kees Cook <keescook@...omium.org>,
	Mel Gorman <mgorman@...e.de>, Rik van Riel <riel@...hat.com>,
	aswin@...com, scott.norton@...com, chegu_vinod@...com
Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too
 frequently

Hi Peter,

On Wed, 2013-07-17 at 11:39 +0200, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 01:11:41AM -0700, Jason Low wrote:
> > For the more complex model, are you suggesting that each completion time
> > is the time it takes to complete 1 iteration of the for_each_domain()
> > loop? 
> 
> Per sd, yes? So higher domains (or lower depending on how you model the thing
> in your head) have bigger CPU spans, and thus take longer to complete. Imagine
> the top domain of a 4096 cpu system, it would go look at all cpus to see if it
> could find a task.
> 
> > Based on some of the data I collected, a single iteration of the
> > for_each_domain() loop is almost always significantly lower than the
> > approximate CPU idle time, even in workloads where idle_balance is
> > lowering performance. The bigger issue is that it takes so many of these
> > attempts before idle_balance actually "worked" and pulled a task.
> 
> I'm confused, so:
> 
>   schedule()
>     if (!rq->nr_running)
>       idle_balance()
>         for_each_domain(sd)
>           load_balance(sd)
> 
> is the entire thing, there's no other loop in there.

So if we have the following:

for_each_domain(sd)
	before = sched_clock_cpu(this_cpu)
	load_balance(sd)
	after = sched_clock_cpu(this_cpu)
	idle_balance_completion_time = after - before

At this point, "idle_balance_completion_time" is usually a very small
value, a lot smaller than the avg CPU idle time. However, the vast
majority of the time, load_balance() returns 0, i.e. no task was pulled.
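
To make the two observations above concrete (each attempt is cheap, but
most attempts pull nothing), here is a minimal user-space sketch of the
bookkeeping. It is not kernel code: struct lb_stats and the lb_stats_*
helpers are hypothetical names used only for illustration, and the
timestamps are assumed to come from something like the before/after
values in the loop above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-domain counters; these are not real kernel fields. */
struct lb_stats {
	uint64_t attempts;  /* load_balance() calls observed */
	uint64_t failures;  /* calls that returned 0 (nothing pulled) */
	uint64_t total_ns;  /* sum of (after - before) over all calls */
};

/* Record one timed load_balance() call on a domain. */
void lb_stats_record(struct lb_stats *s, uint64_t before_ns,
		     uint64_t after_ns, int pulled)
{
	assert(after_ns >= before_ns);
	s->attempts++;
	s->total_ns += after_ns - before_ns;
	if (!pulled)
		s->failures++;
}

/* Mean cost of a single attempt, in ns (0 if nothing was recorded). */
uint64_t lb_stats_avg_ns(const struct lb_stats *s)
{
	return s->attempts ? s->total_ns / s->attempts : 0;
}

/* Mean time spent per successful pull, in ns (0 if no pull yet). */
uint64_t lb_stats_ns_per_pull(const struct lb_stats *s)
{
	uint64_t pulls = s->attempts - s->failures;

	return pulls ? s->total_ns / pulls : 0;
}
```

With a high failure rate, the per-pull figure can be many times the
per-attempt figure even though each individual attempt looks cheap
compared to the avg idle time.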

> > I initially was thinking about each "completion time" of an idle balance
> > as the sum total of the times of all iterations to complete until a task
> > is successfully pulled within each domain.
> 
> So you're saying that normally idle_balance() won't find a task to pull? And we
> need many times going newidle before we do get something?

Yes, a while ago, I collected some data on the rate at which
idle_balance() fails to pull a task, and it was very high.

> Wouldn't this mean that there simply weren't enough tasks to keep all cpus busy?

If I remember correctly, in a lot of those load_balance attempts when
the machine is under a high Java load, there was no "imbalance" between
the groups in each sched_domain.

> If there were tasks we could've pulled, we might need to look at why they
> weren't and maybe fix that. Now it could be that it thinks this cpu, even with
> the (little) idle time it has is sufficiently loaded and we'll get a 'local'
> wakeup soon enough. That's perfectly fine.
> 
> What we should avoid is spending more time looking for tasks than we have idle,
> since that reduces the total time we can spend doing useful work. So that is I
> think the critical cut-off point.

Do you think it's worth trying to treat each newidle balance attempt as
the total of all load_balance attempts until one is able to move a task,
and then skip balancing within the domain if the CPU's avg idle time is
less than the avg time spent on such a newidle balance?
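
A rough user-space sketch of that idea, to make the question concrete.
Everything here is an assumption for illustration: struct newidle_cost,
its fields, both helper names, and the 1/8 weighting of the running
average are made up and are not existing scheduler code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-domain state; names are illustrative only. */
struct newidle_cost {
	uint64_t pending_ns; /* time spent since the last successful pull */
	uint64_t avg_ns;     /* running avg cost of one successful pull */
};

/* Fold one timed load_balance() attempt into the estimate. */
void newidle_cost_update(struct newidle_cost *c, uint64_t elapsed_ns,
			 int pulled)
{
	c->pending_ns += elapsed_ns;
	if (!pulled)
		return; /* keep accumulating until a task is moved */

	/* A pull succeeded: pending_ns is the cost of the whole "attempt". */
	if (c->avg_ns == 0)
		c->avg_ns = c->pending_ns;
	else if (c->pending_ns >= c->avg_ns)
		c->avg_ns += (c->pending_ns - c->avg_ns) / 8;
	else
		c->avg_ns -= (c->avg_ns - c->pending_ns) / 8;
	c->pending_ns = 0;
}

/* Skip balancing in this domain if the CPU's avg idle time is too short. */
int newidle_should_skip(const struct newidle_cost *c, uint64_t avg_idle_ns)
{
	/* With no estimate yet (avg_ns == 0), never skip. */
	return c->avg_ns != 0 && avg_idle_ns < c->avg_ns;
}
```

One caveat with this sketch: once a domain is skipped, its estimate is
no longer updated, so a real version would probably need some way to
decay avg_ns over time.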

Thanks,
Jason
