Date:	Sat, 05 Jan 2013 09:13:10 +0100
From:	Mike Galbraith <bitbucket@...ine.de>
To:	Preeti U Murthy <preeti@...ux.vnet.ibm.com>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	"svaidy@...ux.vnet.ibm.com" <svaidy@...ux.vnet.ibm.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Viresh Kumar <viresh.kumar@...aro.org>,
	Amit Kucheria <amit.kucheria@...aro.org>,
	Morten Rasmussen <Morten.Rasmussen@....com>,
	Paul McKenney <paul.mckenney@...aro.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Arjan van de Ven <arjan@...ux.intel.com>,
	Ingo Molnar <mingo@...nel.org>, Paul Turner <pjt@...gle.com>,
	Venki Pallipadi <venki@...gle.com>,
	Robin Randhawa <robin.randhawa@....com>,
	Lists linaro-dev <linaro-dev@...ts.linaro.org>,
	Matthew Garrett <mjg59@...f.ucam.org>,
	Alex Shi <alex.shi@...el.com>, srikar@...ux.vnet.ibm.com
Subject: Re: sched: Consequences of integrating the Per Entity Load Tracking
 Metric into the Load Balancer

On Thu, 2013-01-03 at 16:08 +0530, Preeti U Murthy wrote:

> Subject: [PATCH] sched: Merge select_idle_sibling with the behaviour of SD_BALANCE_WAKE
> 
> The job of select_idle_sibling() is to place the woken task in the
> vicinity of the waking cpu or on the previous cpu, depending on what
> wake_affine() says, and only within an idle group. If no idle group is
> found, the fallback cpu is the waking cpu or the previous cpu
> accordingly.
> 
> As a result, the runqueue of the waking cpu or the previous cpu gets
> overloaded when the system is committed, which is a latency hit to
> these tasks.
> 
> What is required is that newly woken tasks be placed close to the
> waking cpu or the previous cpu, whichever is better: the former to
> avoid a latency hit, the latter to avoid cache coldness. This is
> achieved by letting wake_affine() decide which cache domain the task
> should be placed in.
> 
> Once that is decided, instead of searching for a completely idle group,
> let us search for the idlest group. This will still return a completely
> idle group if one exists, falling back to what select_idle_sibling()
> was doing. But where that fails, find_idlest_group() continues the
> search for a relatively more idle group.
> 
> The argument could be that we wish to avoid migrating the newly woken
> task to any other group unless that group is completely idle. But here,
> to begin with, we choose a sched domain within which a migration would
> be less harmful. We enable the SD_BALANCE_WAKE flag on the SMT and MC
> domains to cooperate with this.
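
For reference, the quoted approach boils down to roughly the toy model
below: standalone C with invented types and numbers, not the actual
kernel code (the real find_idlest_group() in kernel/sched/fair.c weighs
far more state).  The point is only that the idlest-group walk subsumes
the completely-idle case.

#include <stdio.h>

struct group { int nr_cpus; int nr_running; };

/* Pick the group with the fewest running tasks per cpu, comparing by
 * cross multiplication to avoid floating point.  A completely idle
 * group (nr_running == 0) always wins, matching what
 * select_idle_sibling() searched for; on a committed system we still
 * get the least loaded group instead of punting to a possibly
 * overloaded waking/previous cpu. */
static int pick_idlest_group(const struct group *g, int n)
{
        int best = 0;
        for (int i = 1; i < n; i++)
                if (g[i].nr_running * g[best].nr_cpus <
                    g[best].nr_running * g[i].nr_cpus)
                        best = i;
        return best;
}

int main(void)
{
        /* a committed system: no completely idle group anywhere */
        struct group pkg[] = { {2, 3}, {2, 1}, {2, 4} };
        printf("idlest group: %d\n", pick_idlest_group(pkg, 3)); /* -> 1 */
        return 0;
}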

What if..

Fast movers currently suffer from traversing the large package, mostly
because the traversal order walks 1:1 buddies hand in hand across the
whole package endlessly.  With only one buddy pair running, it's
horrific.

Even if you change the order to be friendlier, perturbation induces
bouncing.  More spots to bounce to equals more bouncing.  Ergo, I
cross-coupled cpu pairs to eliminate that.  If buddies are perturbed,
having one and only one buddy cpu pulls them back together, so it can't
induce a bounce fest, only correct one.  That worked well, but had the
downside that some loads really REALLY want maximum spread, so they
suffer when you remove migration options as I did.  There's also the
in_interrupt() consideration I'm not so sure of; in that case, going
the extra mile to find an idle hole to plug _may_ be worth some extra
cost.. dunno.
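
Toy sketch of that cross coupling, with a made-up topology and none of
the real scheduler plumbing; the one-buddy invariant is the whole
point:

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

static int buddy_of[NR_CPUS];   /* cross-coupled pairs: 0<->1, 2<->3, ... */
static bool cpu_idle[NR_CPUS];

/* Exactly one statically assigned buddy per cpu: a perturbed pair has
 * one place to re-converge, instead of a whole package to bounce
 * around in. */
static void init_buddies(void)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                buddy_of[cpu] = cpu ^ 1;
}

/* One and only one migration target: the buddy if idle, else stay
 * put.  No traversal, hence no bounce fest, only correction. */
static int buddy_wake_cpu(int prev_cpu)
{
        int buddy = buddy_of[prev_cpu];
        return cpu_idle[buddy] ? buddy : prev_cpu;
}

int main(void)
{
        init_buddies();
        cpu_idle[3] = true;
        printf("wake from cpu 2 -> cpu %d\n", buddy_wake_cpu(2)); /* 3 */
        printf("wake from cpu 4 -> cpu %d\n", buddy_wake_cpu(4)); /* 4 */
        return 0;
}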

So wrt integration, what if a buddy cpu were made the FIRST choice of
generic wake balancing, vs the ONLY choice as in my
select_idle_sibling() variant?  If the buddy cpu is available, cool,
perturbed pairs find each other and pair back up; if not, and you were
here too recently, you stay with prev_cpu, avoiding the bounce and the
high-frequency traversal.  All tasks can try the cheap buddy cpu first,
and all can try the full domain as well, just not at insane rates.  The
heavier the short-term load average (or some such, with instant decay
on longish idle a la the idle balance throttle, so you ramp up well),
the longer the 'forget eating a full balance' interval becomes, with
the cutoff at some point also covering the cheap but not free
cross-coupled buddy cpu.  Looking for an idle cpu at hefty load is a
waste of cycles at best, and plugging micro-holes does nothing good
even if you find one, so forget wake balance entirely at some cutoff
and let periodic balancing do its thing in peace.
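
Rough shape of that hybrid, again standalone C with every threshold
invented for illustration (BALANCE_INTERVAL, LOAD_CUTOFF and the load
scaling are placeholders, not measured values):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS          8
#define BALANCE_INTERVAL 4      /* invented: min ticks between full walks */
#define LOAD_CUTOFF      90     /* invented: load % above which we give up */

static int buddy_of[NR_CPUS];
static bool cpu_idle[NR_CPUS];
static unsigned long last_full_walk[NR_CPUS];

/* Stand-in for the full idlest-group walk. */
static int full_domain_search(int prev_cpu)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                if (cpu_idle[cpu])
                        return cpu;
        return prev_cpu;
}

static int hybrid_wake_cpu(int prev_cpu, int load_pct, unsigned long now)
{
        int buddy = buddy_of[prev_cpu];

        /* Past the cutoff, hunting idle cpus is a waste of cycles;
         * leave it to periodic balancing. */
        if (load_pct >= LOAD_CUTOFF)
                return prev_cpu;

        /* FIRST choice: the cheap cross-coupled buddy. */
        if (cpu_idle[buddy])
                return buddy;

        /* Full-domain walk allowed, but not at insane rates: the
         * heavier the load, the longer the forget-it interval. */
        if (now - last_full_walk[prev_cpu] >=
            BALANCE_INTERVAL + (unsigned long)load_pct / 10) {
                last_full_walk[prev_cpu] = now;
                return full_domain_search(prev_cpu);
        }
        return prev_cpu;
}

int main(void)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                buddy_of[cpu] = cpu ^ 1;
        cpu_idle[5] = true;
        /* buddy of cpu 0 is busy and a walk is due -> finds cpu 5 */
        printf("wake from cpu 0 -> cpu %d\n", hybrid_wake_cpu(0, 40, 100));
        return 0;
}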

Hrmph, that's too many words, but basically, I think whacking
select_idle_sibling() integration into wake balance makes loads of
sense; it just needs a bit more so it doesn't end up merely moving the
problems to a different spot.

I still have a 2.6-rt problem I need to find time to squabble with, but
maybe I'll soonish see if what you did plus what I did combined works
out on that 4x10 core box where the current code is _so_ unbelievably
horrible.  Heck, it can't get any worse, and the restricted wake
balance alone kinda sorta worked.

-Mike

