Message-ID: <20230718161829.ws3vn3ufnod6kpxh@airbuntu>
Date: Tue, 18 Jul 2023 17:18:29 +0100
From: Qais Yousef <qyousef@...alina.io>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched/fair: Fix impossible migrate_util scenario in
load balance
On 07/18/23 14:48, Vincent Guittot wrote:
> On Sunday 16 Jul 2023 at 02:41:25 (+0100), Qais Yousef wrote:
> > We've seen cases while running geekbench where an idle little core never
> > pulls a task from a bigger overloaded cluster for 100s of ms, and
> > sometimes for over a second.
> >
> > It turned out that the load balancer identifies this as a migrate_util
> > type since the local group (little cluster) has spare capacity and
> > will try to pull a task. But the little cluster's capacity is very small
> > nowadays (around 200 or less), and if two busy tasks are stuck on a mid
> > core, which has a capacity of over 700, the util of each task will be in
> > the 350+ range, which is always bigger than the spare capacity of the
> > little group with a single idle core.
> >
> > When trying to detach_tasks() we bail out then because of the comparison
> > of:
> >
> > if (util > env->imbalance)
> > goto next;
> >
> > In calculate_imbalance() we convert a migrate_util into migrate_task
> > type if the CPU trying to do the pull is idle. But we only do this if
> > env->imbalance is 0, which I can't understand. AFAICT env->imbalance
> > contains the local group's spare capacity. If it is 0, this means it's
> > fully busy.
> >
> > Removing this condition fixes the problem, but since I can't fully
> > understand why it checks for 0, I'm sending this as an RFC. It could be
> > a typo that was meant to check for
> > and meant to check for
> >
> > env->imbalance != 0
> >
> > instead?
> >
> > Signed-off-by: Qais Yousef (Google) <qyousef@...alina.io>
> > ---
> > kernel/sched/fair.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..682d9d6a8691 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10288,7 +10288,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > * waiting task in this overloaded busiest group. Let's
> > * try to pull it.
> > */
> > - if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
> > + if (env->idle != CPU_NOT_IDLE) {
>
> With this change you completely skip migrate_util for the idle and newly
> idle cases, and this would be too aggressive.
Yeah I didn't have great confidence in it to be honest.
Could you help me understand the meaning of env->imbalance == 0 though? At this
stage its value is
env->imbalance = max(local->group_capacity, local->group_util) - local->group_util;
which AFAICT calculates the _spare_ capacity, right? So when we check
env->imbalance == 0, we're saying: this_cpu is (idle OR newly idle) AND the
local group is fully utilized? Why must it be fully utilized to do the pull?
It's counter-intuitive to me. I'm probably misinterpreting something but can't
see it.
>
> We can do something similar to migrate_load in detach_tasks():
>
> ---
> kernel/sched/fair.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d3df5b1642a6..64111ac7e137 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8834,7 +8834,13 @@ static int detach_tasks(struct lb_env *env)
> case migrate_util:
> util = task_util_est(p);
>
> - if (util > env->imbalance)
> + /*
> + * Make sure that we don't migrate too much utilization.
> + * Nevertheless, let's relax the constraint if the
> + * scheduler fails to find a good waiting task to
> + * migrate.
> + */
> + if (shr_bound(util, env->sd->nr_balance_failed) > env->imbalance)
> goto next;
Thanks! This looks better, but I still see a 100 or 200 ms delay sometimes.
I'm still debugging it, but I _think_ it's a combination of two things:
1. nr_balance_failed doesn't increment as fast - I see a lot of 0s with
occasional 1s and less frequent 2s
2. something might wake up briefly on that cpu in between load balances,
   and given how small the littles are, this pushes the nr_balance_failed
   required to tip the scale even higher
Thanks
--
Qais Yousef
>
> env->imbalance -= util;
> --
>
>
>
> > env->migration_type = migrate_task;
> > env->imbalance = 1;
> > }
> > --
> > 2.25.1
> >