linux-kernel - Re: [PATCH V6] sched/fair: Remove group imbalance from calculate

Message-ID: <5ddf061e-26a2-7151-adff-7ae339c848ac@arm.com>
Date:   Fri, 28 Jul 2017 13:16:24 +0100
From:   Dietmar Eggemann <dietmar.eggemann@....com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Jeffrey Hugo <jhugo@...eaurora.org>,
        Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
        Austin Christ <austinwc@...eaurora.org>,
        Tyler Baicar <tbaicar@...eaurora.org>,
        Timur Tabi <timur@...eaurora.org>
Subject: Re: [PATCH V6] sched/fair: Remove group imbalance from
 calculate_imbalance()

On 26/07/17 15:54, Peter Zijlstra wrote:
> On Tue, Jul 18, 2017 at 08:48:53PM +0100, Dietmar Eggemann wrote:
>> Hi Jeffrey,
>>
>> On 13/07/17 20:55, Jeffrey Hugo wrote:

[...]

>>> Since the group imbalance path in calculate_imbalance() is at best a NOP
>>> but otherwise harmful, remove it.
> 
> Hurm.. so fix_small_imbalance() itself is a pile of dog poo... it used
> to make sense a long time ago, but smp-nice and then cgroups made a
> complete joke of things.
> 
>> IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
>> level sched groups.
> 
> That'd be a NUMA box?

I don't think it's NUMA. The SD levels are MC and DIE, with the number
of DIE sg's >> 2.

[...]

>> but here the prefer_sibling handling (group overloaded) eclipses 'group
>> imbalance' the moment one of the cfs tasks can go to cpu2 so the if
>> condition you got rid of is a nop.
>>
>> I wonder if it is fair to say that your fix helps multi-cluster
>> (especially with n > 2) systems without SMT and with your first patch
>> [1] for this specific, cpu affinity restricted test cases.
> 
> I tried on an IVB-EP with all the HT siblings unplugged, could not
> reproduce either. Still at n=2 though. Let me fire up an EX, that'll get
> me n=4.
> 
> So this is 4 * 18 * 2 = 144 cpus:

Impressive ;-)

> 
> # for ((i=72; i<144; i++)) ; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
> # taskset -pc 0,18 $$
> # while :; do :; done & while :; do :; done &
> 
> So I'm taking SMT out, affine to first and second MC group, start 2
> loops.
> 
> Using another console I see them both using 100%.
> 
> If I then start a 3rd loop, I see 100% 50%,50%. I then kill the 100%.
> Then instantly they balance and I get 2x100% back.

Yeah, could reproduce on IVB-EP (2x10x2).

> Anything else I need to reproduce? (other than maybe a slightly less
> insane machine :-)

I guess what Jeff is trying to avoid is that 'busiest->load_per_task',
lowered to 'sds->avg_load' in the case of an imbalanced busiest sg:

  if (busiest->group_type == group_imbalanced)
    busiest->load_per_task = min(busiest->load_per_task, sds->avg_load);

is so low that fix_small_imbalance() won't be called later, and
'env->imbalance' stays so low that load-balancing one 50% task to the
now idle cpu won't happen.

  if (env->imbalance < busiest->load_per_task)
    fix_small_imbalance(env, sds);

Having a large number of otherwise idle DIE sg's helps to keep
'sds->avg_load' low in comparison to 'busiest->load_per_task'.
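
To put rough numbers on this, here is a toy model of the two checks
quoted above. It is only a sketch, not the kernel code: all toy_* names
are made up for illustration, every sched group is assumed to have the
same capacity, and the imbalance is approximated as the smaller of
"busiest above the domain average" and "local below the domain average".

```c
#include <assert.h>

/* Hypothetical load of one always-running nice-0 task. */
#define TOY_TASK_LOAD 1024UL

static unsigned long toy_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Domain-wide average load, assuming equal-capacity sched groups. */
static unsigned long toy_avg_load(unsigned long total_load,
				  unsigned int nr_groups)
{
	assert(nr_groups > 0);
	return total_load / nr_groups;
}

/*
 * Simplified stand-in for 'env->imbalance': the smaller of what the
 * busiest group has above the average and what the local group has
 * below it (equal capacities assumed, so no capacity scaling).
 */
static unsigned long toy_imbalance(unsigned long busiest_avg,
				   unsigned long local_avg,
				   unsigned long domain_avg)
{
	assert(busiest_avg >= domain_avg && domain_avg >= local_avg);
	return toy_min(busiest_avg - domain_avg, domain_avg - local_avg);
}

/*
 * Returns 1 if the 'imbalance < load_per_task' check would route us to
 * the small-imbalance path, 0 otherwise. With group_imbalanced set,
 * load_per_task is first clamped to the domain average, mirroring the
 * min() in the snippet above.
 */
static int toy_small_imbalance_path(unsigned long imbalance,
				    unsigned long load_per_task,
				    int group_imbalanced,
				    unsigned long domain_avg)
{
	if (group_imbalanced)
		load_per_task = toy_min(load_per_task, domain_avg);
	return imbalance < load_per_task;
}
```

With two such tasks in the busiest sg (total load 2048), an idle local
sg and 8 DIE sg's, the model gives a domain average of 256 and an
imbalance of 256. With the clamp, load_per_task also becomes 256, so the
check fails and the small-imbalance path is skipped; without the clamp,
256 < 1024 and it is taken. With only 2 sg's the average is 1024 and the
clamp changes nothing.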

> Because I have the feeling that while this patch cures things for you,
> you're fighting symptoms.

Unfortunately, I don't have a machine available with n >> 2 (on DIE or
NUMA) ...