Message-ID: <CAKfTPtBeRXCEWB3dTC8uOqbQ5xaZqQTAeG1EVGEk+pJcYz00sw@mail.gmail.com>
Date: Wed, 8 Jul 2020 11:47:59 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
linux-kernel <linux-kernel@...r.kernel.org>,
Valentin Schneider <valentin.schneider@....com>
Subject: Re: [PATCH] sched/fair: handle case of task_h_load() returning 0
On Wed, 8 Jul 2020 at 11:45, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>
> On 02/07/2020 16:42, Vincent Guittot wrote:
> > task_h_load() can return 0 in some situations, like running stress-ng
> > mmapfork, which forks thousands of threads, in a sched group on a
> > 224-core system. The load balancer doesn't handle this correctly because
>
> I guess the issue here is that 'cfs_rq->h_load' in
>
> task_h_load() {
>         struct cfs_rq *cfs_rq = task_cfs_rq(p);
>         ...
>         return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
>                         cfs_rq_load_avg(cfs_rq) + 1);
> }
>
> is still ~0 (or at least pretty small) compared to se.avg.load_avg being
> 1024 and cfs_rq_load_avg(cfs_rq) being n*1024 in these load-balance
> occurrences.
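For illustration, here is a standalone userspace sketch of that arithmetic
(toy values of my own, not the kernel code): once cfs_rq->h_load is tiny
compared to cfs_rq_load_avg(), the integer division truncates to 0.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* invented values: one task contributing ~1024 load in a group
         * whose hierarchical load h_load has become tiny, while the
         * cfs_rq carries n*1024 load for n runnable tasks */
        uint64_t load_avg     = 1024;           /* p->se.avg.load_avg */
        uint64_t h_load       = 2;              /* cfs_rq->h_load */
        uint64_t cfs_load_avg = 4096 * 1024;    /* cfs_rq_load_avg() */

        /* same arithmetic as the div64_ul() in task_h_load() */
        uint64_t th_load = (load_avg * h_load) / (cfs_load_avg + 1);

        printf("task_h_load() ~ %llu\n", (unsigned long long)th_load);
        return 0;       /* prints 0: the task's load is invisible to lb */
}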
>
> > env->imbalance never decreases, and it will stop pulling tasks only after
> > reaching loop_max, which can be equal to the number of running tasks on
> > the cfs_rq. Make sure that imbalance is decreased by at least 1.
> >
> > Misfit task is the other feature that doesn't handle such a situation
> > correctly, although the problem is probably harder to hit there because
> > of the smaller number of CPUs and running tasks on heterogeneous
> > systems.
> >
> > We can't simply ensure that task_h_load() returns at least one, because
> > that would imply handling the underrun in other places.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
> > ---
> > kernel/sched/fair.c | 18 +++++++++++++++++-
> > 1 file changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6fab1d17c575..62747c24aa9e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> >                  return;
> >          }
> >
> > -        rq->misfit_task_load = task_h_load(p);
> > +        /*
> > +         * Make sure that misfit_task_load will not be zero even if
> > +         * task_h_load() returns 0. misfit_task_load is only used to
> > +         * select the rq with the highest load, so adding 1 will not
> > +         * change the result of the comparison.
> > +         */
> > +        rq->misfit_task_load = task_h_load(p) + 1;
> >  }
> >
> > #else /* CONFIG_SMP */
> > @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
> >                      env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
> >                          goto next;
> >
> > +                /*
> > +                 * Depending on the number of CPUs and tasks and the
> > +                 * cgroup hierarchy, task_h_load() can return a zero
> > +                 * value. Make sure that env->imbalance decreases;
> > +                 * otherwise detach_tasks() will stop only after
> > +                 * detaching up to loop_max tasks.
> > +                 */
> > +                if (!load)
> > +                        load = 1;
> > +
> >                  env->imbalance -= load;
> >                  break;
>
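To make the failure mode concrete, here is a throwaway userspace model of
that loop (my own toy, not the kernel code; values invented): with
task_h_load() stuck at 0, the loop only gives up at loop_max; with the
fallback, it stops as soon as the imbalance is consumed.

#include <stdio.h>

/* toy model of the detach_tasks() loop: count iterations until either
 * the imbalance is consumed or loop_max is reached */
static int detach_loop(long imbalance, int loop_max, long task_h_load,
                       int with_fix)
{
        int loops = 0;

        while (imbalance > 0 && loops < loop_max) {
                long load = task_h_load;

                if (with_fix && !load)
                        load = 1;       /* fallback added by the patch */

                imbalance -= load;
                loops++;
        }
        return loops;
}

int main(void)
{
        /* task_h_load() truncated to 0, thousands of runnable tasks */
        printf("without fix: %d iterations\n", detach_loop(8, 10000, 0, 0));
        printf("with fix:    %d iterations\n", detach_loop(8, 10000, 0, 1));
        return 0;       /* prints 10000 without the fix, 8 with it */
}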
> I assume that this is related to the LKP mail
>
> https://lkml.kernel.org/r/20200421004749.GC26573@shao2-debian ?

I found this problem while studying the regression raised in that email,
but this patch doesn't fix it. At least, it's not enough on its own.
>
> I ran the test (5.8.0-rc4 w/o vs. w/ your patch) on an 'Intel(R) Xeon(R)
> CPU E5-2690 v2 @ 3.00GHz' (2*2*10, 40 CPUs).
> I can't see changes of the magnitude shown in the email above (they used
> a 96-CPU system, though).
> I used only the scheduler stressor mmapfork in taskgroup /A/B/C:
>
> stress-ng --timeout 1 --times --verify --metrics-brief --sequential 40 --class scheduler --exclude (all except mmapfork)
>
> 5.8.0-rc4-custom-dieegg01-stress-ng-base
>
> stress-ng: info: [3720]  stressor  bogo ops  real time  usr time  sys time  bogo ops/s   bogo ops/s
> stress-ng: info: [3720]                        (secs)    (secs)    (secs)  (real time) (usr+sys time)
> stress-ng: info: [3720]  mmapfork        40      1.98     12.53     71.12        20.21        0.48
> stress-ng: info: [5201]  mmapfork        40      2.50     13.10     98.61        16.01        0.36
> stress-ng: info: [6682]  mmapfork        40      2.58     14.80     98.63        15.88        0.36
> stress-ng: info: [8195]  mmapfork        40      1.79     12.57     61.61        22.31        0.54
> stress-ng: info: [9679]  mmapfork        40      2.20     12.17     82.66        18.20        0.42
> stress-ng: info: [11164] mmapfork        40      2.61     15.09    102.86        16.86        0.37
> stress-ng: info: [12773] mmapfork        40      1.89     12.32     65.09        21.15        0.52
> stress-ng: info: [3883]  mmapfork        40      2.14     12.90     76.73        18.68        0.45
> stress-ng: info: [6845]  mmapfork        40      2.25     11.83     84.06        17.80        0.42
> stress-ng: info: [8326]  mmapfork        40      1.76     12.93     56.65        22.70        0.57
>
> Mean (bogo ops/s, real time): 18.98 (σ: 2.369)
>
> 5.8.0-rc4-custom-dieegg01-stress-ng
>
> stress-ng: info: [3895]  mmapfork        40      2.40     13.56     92.83        16.67        0.38
> stress-ng: info: [5379]  mmapfork        40      2.08     13.65     74.11        19.23        0.46
> stress-ng: info: [6860]  mmapfork        40      2.15     13.72     80.24        18.62        0.43
> stress-ng: info: [8341]  mmapfork        40      2.37     13.74     90.93        16.85        0.38
> stress-ng: info: [9822]  mmapfork        40      2.10     12.48     83.85        19.09        0.42
> stress-ng: info: [13816] mmapfork        40      2.05     12.13     77.64        19.49        0.45
> stress-ng: info: [15297] mmapfork        40      2.53     13.16    100.26        15.84        0.35
> stress-ng: info: [16780] mmapfork        40      2.00     12.10     71.25        20.02        0.48
> stress-ng: info: [18262] mmapfork        40      1.73     12.24     57.69        23.09        0.57
> stress-ng: info: [19743] mmapfork        40      1.78     12.51     57.89        22.48        0.57
>
> Mean (bogo ops/s, real time): 19.14 (σ: 2.239)