Message-ID: <9a282390-1c81-0e77-9567-116c8777f7b5@arm.com>
Date:   Thu, 9 Jul 2020 15:34:50 +0200
From:   Dietmar Eggemann <dietmar.eggemann@....com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Valentin Schneider <valentin.schneider@....com>
Subject: Re: [PATCH] sched/fair: handle case of task_h_load() returning 0

On 08/07/2020 11:47, Vincent Guittot wrote:
> On Wed, 8 Jul 2020 at 11:45, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>>
>> On 02/07/2020 16:42, Vincent Guittot wrote:
>>> task_h_load() can return 0 in some situations like running stress-ng
>>> mmapfork, which forks thousands of threads, in a sched group on a 224 cores
>>> system. The load balance doesn't handle this correctly because
>>
>> I guess the issue here is that 'cfs_rq->h_load' in
>>
>> task_h_load() {
>>     struct cfs_rq *cfs_rq = task_cfs_rq(p);
>>     ...
>>     return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>>                     cfs_rq_load_avg(cfs_rq) + 1);
>> }
>>
>> is still ~0 (or at least pretty small) compared to se.avg.load_avg being
>> 1024 and cfs_rq_load_avg(cfs_rq) n*1024 in these lb occurrences.
>>
>>> env->imbalance never decreases and it will stop pulling tasks only after
>>> reaching loop_max, which can be equal to the number of running tasks of
>>> the cfs. Make sure that imbalance will be decreased by at least 1.

Looks like it's bounded by sched_nr_migrate (32 on my E5-2690 v2).

env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);

[...]

>> I assume that this is related to the LKP mail
> 
> I have found this problem while studying the regression raised in the
> email below but it doesn't fix it. At least, it's not enough
> 
>>
>> https://lkml.kernel.org/r/20200421004749.GC26573@shao2-debian ?

I see. It also happens with other workloads but it's most visible
at the beginning of a workload (fork).

Still on E5-2690 v2 (2*2*10, 40 CPUs):

In the task group, cfs_rq->h_load is ~1024/40 = 25, so task_h_load()
returns 0 while cfs_rq->avg.load_avg is about 40 times higher than the
individual task load (1024).
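
To make the truncation explicit, here is a minimal userspace sketch of the
division in task_h_load(). The function name task_h_load_sketch() and the
plain integer parameters are assumptions for illustration only; the real
function reads these values from the task and its cfs_rq.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace sketch of the division quoted earlier in this thread:
 *   se_load_avg * h_load / (cfs_rq_load_avg + 1)
 * task_h_load_sketch() is a hypothetical name, not a kernel function.
 */
static uint64_t task_h_load_sketch(uint64_t se_load_avg, uint64_t h_load,
                                   uint64_t cfs_rq_load_avg)
{
    /* The "+ 1" in the divisor must not wrap around to zero. */
    assert(cfs_rq_load_avg < UINT64_MAX);

    /* Integer division truncates, so any quotient below 1 becomes 0. */
    return se_load_avg * h_load / (cfs_rq_load_avg + 1);
}
```

With a task load of 1024, h_load of 25 and a cfs_rq load of 40 * 1024 = 40960,
the dividend is 25600 and the divisor 40961, so the result is 0.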

One instance of 20 loops without any progress (without your patch):

With loop='loop/loop_break/loop_max'
and load='p->se.avg.load_avg/cfs_rq->h_load/cfs_rq->avg.load_avg'

Jul  9 10:41:18 e105613-lin kernel: [73.068844] [stress-ng-mmapf 2907] SMT CPU37->CPU17 imb=8 loop=1/32/32 load=1023/23/43006
Jul  9 10:41:18 e105613-lin kernel: [73.068873] [stress-ng-mmapf 3501] SMT CPU37->CPU17 imb=8 loop=2/32/32 load=1022/23/41983
Jul  9 10:41:18 e105613-lin kernel: [73.068890] [stress-ng-mmapf 2602] SMT CPU37->CPU17 imb=8 loop=3/32/32 load=1023/23/40960
...
Jul  9 10:41:18 e105613-lin kernel: [73.069136] [stress-ng-mmapf 2520] SMT CPU37->CPU17 imb=8 loop=18/32/32 load=1023/23/25613
Jul  9 10:41:18 e105613-lin kernel: [73.069144] [stress-ng-mmapf 3107] SMT CPU37->CPU17 imb=8 loop=19/32/32 load=1021/23/24589
Jul  9 10:41:18 e105613-lin kernel: [73.069149] [stress-ng-mmapf 2672] SMT CPU37->CPU17 imb=8 loop=20/32/32 load=1024/23/23566
...
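
The effect on the pull loop can be sketched in userspace as below. This is
a simplified model, not the kernel code: it assumes every candidate task
reports the same load, it ignores all other checks in the loop, and it
assumes the fix takes the form of clamping the accounted load to at least 1.
pull_loops_sketch() and clamp_to_one are hypothetical names.

```c
#include <assert.h>

/*
 * Userspace sketch of the pull loop for the case where every candidate
 * task reports the same h_load. Returns the number of loop iterations.
 * pull_loops_sketch() and clamp_to_one are hypothetical names.
 */
static unsigned int pull_loops_sketch(long imbalance, unsigned long task_load,
                                      unsigned int nr_migrate,
                                      unsigned int nr_running,
                                      int clamp_to_one)
{
    /* Same bound as env.loop_max = min(sysctl_sched_nr_migrate, nr_running). */
    unsigned int loop_max = nr_migrate < nr_running ? nr_migrate : nr_running;
    unsigned int loop = 0;

    assert(imbalance > 0);

    while (loop < loop_max) {
        unsigned long load = task_load;

        loop++;

        /* Assumed form of the fix: never account less than 1 per task. */
        if (clamp_to_one && load < 1)
            load = 1;

        imbalance -= (long)load;
        if (imbalance <= 0)
            break;  /* imbalance resolved, stop pulling */
    }

    return loop;
}
```

With imb=8, a task load of 0 and a bound of 32 as in the log above, the
sketch runs all 32 iterations without the clamp and stops after 8 with it.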

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@....com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@....com>