Message-ID: <9a282390-1c81-0e77-9567-116c8777f7b5@arm.com>
Date: Thu, 9 Jul 2020 15:34:50 +0200
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
linux-kernel <linux-kernel@...r.kernel.org>,
Valentin Schneider <valentin.schneider@....com>
Subject: Re: [PATCH] sched/fair: handle case of task_h_load() returning 0

On 08/07/2020 11:47, Vincent Guittot wrote:
> On Wed, 8 Jul 2020 at 11:45, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>>
>> On 02/07/2020 16:42, Vincent Guittot wrote:
>>> task_h_load() can return 0 in some situations like running stress-ng
>>> mmapfork, which forks thousands of threads, in a sched group on a 224 cores
>>> system. The load balance doesn't handle this correctly because
>>
>> I guess the issue here is that 'cfs_rq->h_load' in
>>
>> task_h_load() {
>> struct cfs_rq *cfs_rq = task_cfs_rq(p);
>> ...
>> return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>> cfs_rq_load_avg(cfs_rq) + 1);
>> }
>>
>> is still ~0 (or at least pretty small) compared to se.avg.load_avg being
>> 1024 and cfs_rq_load_avg(cfs_rq) n*1024 in these lb occurrences.
>>
>>> env->imbalance never decreases and it will stop pulling tasks only after
>>> reaching loop_max, which can be equal to the number of running tasks of
>>> the cfs. Make sure that imbalance will be decreased by at least 1.

Looks like it's bounded by sched_nr_migrate (32 on my E5-2690 v2):

    env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);

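To illustrate the bound above and why a task load of 0 stalls the pull loop, here is a minimal userspace sketch. It is not the kernel's load-balance code: loop_max_sketch() and pull_loop_sketch() are hypothetical names, loop_break handling and all migration checks are left out, and every task is assumed to report the same load.

```c
/* Hypothetical helper mirroring the loop_max assignment quoted above. */
static unsigned int loop_max_sketch(unsigned int nr_migrate,
				    unsigned int nr_running)
{
	return nr_migrate < nr_running ? nr_migrate : nr_running;
}

/*
 * Simplified model of the pull loop (an assumption, not kernel code).
 * Every candidate task reports the same load. min_load is the floor
 * applied to that load: 0 models no floor, 1 models "imbalance is
 * decreased by at least 1". Returns the number of tasks visited.
 */
static unsigned int pull_loop_sketch(long imbalance, unsigned int loop_max,
				     unsigned long task_load,
				     unsigned long min_load)
{
	unsigned int loop = 0;

	while (loop < loop_max && imbalance > 0) {
		unsigned long load = task_load > min_load ? task_load : min_load;

		loop++;
		imbalance -= (long)load;	/* no progress when load == 0 */
	}
	return loop;
}
```

With an imbalance of 8 and loop_max of 32, pull_loop_sketch(8, 32, 0, 0) visits all 32 tasks, while pull_loop_sketch(8, 32, 0, 1) stops after 8.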
[...]
>> I assume that this is related to the LKP mail
>
> I have found this problem while studying the regression raised in the
> email below but it doesn't fix it. At least, it's not enough
>
>>
>> https://lkml.kernel.org/r/20200421004749.GC26573@shao2-debian ?

I see. It also happens with other workloads, but it's most visible
at the beginning of a workload (fork).

Still on E5-2690 v2 (2*2*10, 40 CPUs):

In the taskgroup, cfs_rq->h_load is ~1024/40 = 25, so task_h_load()
returns 0 when cfs_rq->avg.load_avg is 40 times higher than the
individual task load (1024).
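
The arithmetic can be checked in userspace with a minimal sketch. div64_ul() is reimplemented here as plain 64-bit division, and task_h_load_sketch() is a hypothetical stand-in that takes the three inputs directly instead of reading them from a task. The inputs in the usage note are the round numbers from the paragraph above.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's div64_ul(): truncating division. */
static uint64_t div64_ul(uint64_t dividend, uint64_t divisor)
{
	return dividend / divisor;
}

/*
 * Same arithmetic as the task_h_load() return statement quoted earlier.
 * The function name and the flat parameter list are assumptions made
 * for illustration only.
 */
static uint64_t task_h_load_sketch(uint64_t se_load_avg, uint64_t h_load,
				   uint64_t cfs_rq_load_avg)
{
	return div64_ul(se_load_avg * h_load, cfs_rq_load_avg + 1);
}
```

task_h_load_sketch(1024, 25, 40960) evaluates 25600 / 40961, which truncates to 0.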

Here is one instance of 20 loop iterations without any progress (that's
without your patch), with loop='loop/loop_break/loop_max'
and load='p->se.avg.load_avg/cfs_rq->h_load/cfs_rq->avg.load_avg':

Jul 9 10:41:18 e105613-lin kernel: [73.068844] [stress-ng-mmapf 2907] SMT CPU37->CPU17 imb=8 loop=1/32/32 load=1023/23/43006
Jul 9 10:41:18 e105613-lin kernel: [73.068873] [stress-ng-mmapf 3501] SMT CPU37->CPU17 imb=8 loop=2/32/32 load=1022/23/41983
Jul 9 10:41:18 e105613-lin kernel: [73.068890] [stress-ng-mmapf 2602] SMT CPU37->CPU17 imb=8 loop=3/32/32 load=1023/23/40960
...
Jul 9 10:41:18 e105613-lin kernel: [73.069136] [stress-ng-mmapf 2520] SMT CPU37->CPU17 imb=8 loop=18/32/32 load=1023/23/25613
Jul 9 10:41:18 e105613-lin kernel: [73.069144] [stress-ng-mmapf 3107] SMT CPU37->CPU17 imb=8 loop=19/32/32 load=1021/23/24589
Jul 9 10:41:18 e105613-lin kernel: [73.069149] [stress-ng-mmapf 2672] SMT CPU37->CPU17 imb=8 loop=20/32/32 load=1024/23/23566
...
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@....com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@....com>