linux-kernel - Re: top-down balance purpose discussion -- resend

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <52E21677.9090907@linaro.org>
Date:	Fri, 24 Jan 2014 15:29:59 +0800
From:	Alex Shi <alex.shi@...aro.org>
To:	Peter Zijlstra <peterz@...radead.org>
CC:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Ingo Molnar <mingo@...nel.org>, Mike Galbraith <efault@....de>,
	Daniel Lezcano <daniel.lezcano@...aro.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Morten Rasmussen <morten.rasmussen@....com>,
	Amit Kucheria <amit.kucheria@...aro.org>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: top-down balance purpose discussion -- resend


Any more comments for this idea? :)

On 01/22/2014 03:40 PM, Alex Shi wrote:
> On 01/21/2014 10:57 PM, Peter Zijlstra wrote:
>> On Tue, Jan 21, 2014 at 10:04:26PM +0800, Alex Shi wrote:
>>>
>>> Current scheduler load balance is bottom-up mode, each CPU need
>>> initiate the balance by self.
>>>
>>> 1, Like in a integrate computer system, it has smt/core/cpu/numa, 4
>>> level scheduler domains. If there is just 2 tasks in whole system that
>>> both running on cpu0. Current load balance need to pull task to another
>>> smt in smt domain, then pull task to another core, then pull task to
>>> another cpu, finally pull task to another numa. Totally it is need 4
>>> times task moving to get system balance.
>>
>> Except the idle load balancer, and esp. the newidle can totally by-pass
>> this.
>>
>> If you do the packing right in the newidle pass, you'd get there in 1
>> step.
> 
> It give me a huge pressure to argue with you a great experts. I am
> waiting and very appreciate for any comments and corrections. :)
> 
> Yes, a newidle will kindly relief this. but it can not eliminate it. If
> a newidle happens on another numa group. It just needs 1 step. But if it
> happens on another smt group, it still needs 4 steps. So generally, we
> still need one more steps before well balance.
> 
> In this example, if a newidle is in the same smallest group, maybe we
> should wakeup a remotest cpu in system/llc to avoid extra task moving in
> near future for best performance.
> And for power saving, maybe we'd better kick the task to smallest group,
> then let the remote cpu group idle.
> But for current newidle, it's impossible to do this because newidle is
> also bottom-up mode.
>>
>>> Generally, the task moving complexity is
>>> 	O(nm log n),  n := nr_cpus, m := nr_tasks
>>>
>>> There is a excellent summary and explanation for this in
>>> kernel/sched/fair.c:4605
>>
>> Which is a perfectly fine scheme for a busy system.
>>
>>> Another weakness of current LB is that every cpu need to get the other
>>> cpus' load info repeatedly and try to figure out busiest sched
>>> group/queue on every sched domain level. But it just waste time, since
>>> it may not conduct a task moving. One of reasons is that cpu can only
>>> pull task, not pushing.
>>
>> This doesn't make sense.. and in fact, we do a limited amount of 3rd
>> party movements.
> 
> Yes, but the 3rd party movements is too limited, just for task pinned.
>>
>> Whatever you do, you have to repeat the information gathering anyhow,
>> because it constantly changes.
>>
> 
> Yes, it is good to collection the load info once for once balance. but
> if the balance cpu is busiest cpu, current balance still keep collecting
> every group load info from bottom to up, and then do nothing on this
> imbalance system. This is bad.
> 
>> Trying to serialize that doesn't make any kind of sense. The only thing
>> you want is that the system converges.
> 
> Sorry, would you like to give a bit more details of 'serialize' is no sense?
>>
>> Skipped the rest because it seems build on a fundament I don't agree
>> with. That 4 move thing is just silly for an idle system, and we
>> shouldn't do that.
>>
>> I also very much do not want a single CPU balancing the entire system,
>> that's the anti-thesis of scalable.
> 
> Sorry. IMHO, single cpu is possible to handle 1000 cpu balancing. And it
> is far more scalable than every cpu do balance in system, since there is
> only one cpu need to pick other cpu load info.
> 
> BTW, there is no organize among all cpus' balancing currently. That's a
> a bit mess. Like if 2 cpus in a small cpu group just do balance for
> whole system at the same time, then both of them think self group is
> light and want more load. then they have the chance to over pull load to
> self group. That is bad. And single balancing has no such problem.
> 


-- 
Thanks
    Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/