Message-ID: <20180907125554.GA24106@hirez.programming.kicks-ass.net>
Date: Fri, 7 Sep 2018 14:55:54 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: mingo@...nel.org, linux-kernel@...r.kernel.org,
dietmar.eggemann@....com, jhugo@...eaurora.org
Subject: Re: [PATCH] sched/fair: fix load_balance redo for null imbalance
On Fri, Sep 07, 2018 at 02:35:51PM +0200, Vincent Guittot wrote:
> > On Friday 07 Sep 2018 at 13:37:49 (+0200), Peter Zijlstra wrote:
> > On Fri, Sep 07, 2018 at 09:51:04AM +0200, Vincent Guittot wrote:
> > > It can happen that load_balance() finds a busiest group and then a busiest rq,
> > > but the calculated imbalance is in fact null.
> >
> > Cute. Does that happen often?
>
> I have a use case with RT tasks that reproduces the problem regularly.
> It happens at least when we have CPUs with different capacity, either because
> of heterogeneous CPUs or because of RT/DL reducing the available capacity for cfs.
> I have put the call path that triggers the problem below, and according to the
> comment it seems that we can reach a similar state when playing with priority.
>
> >
> > > If the calculated imbalance is null, it's useless to try to find a busiest
> > > rq as no task will be migrated and we can return immediately.
> > >
> > > This situation can happen on a heterogeneous system or an SMP system when RT
> > > tasks are reducing the capacity of some CPUs.
> >
> > Is it the result of one of those "force_balance" conditions in
> > find_busiest_group() ? Should we not fix that to then return NULL
> > instead?
>
> The use case is:
> We have a newly_idle load balance that is triggered when an RT task becomes idle
> (but I think that I have seen it with the idle load balance too).
>
> We trigger:
> 	if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> 	    busiest->group_no_capacity)
> 		goto force_balance;
>
> In calculate_imbalance we use the path
> 	/*
> 	 * Avg load of busiest sg can be less and avg load of local sg can
> 	 * be greater than avg load across all sgs of sd because avg load
> 	 * factors in sg capacity and sgs with smaller group_type are
> 	 * skipped when updating the busiest sg:
> 	 */
> 	if (busiest->avg_load <= sds->avg_load ||
> 	    local->avg_load >= sds->avg_load) {
> 		env->imbalance = 0;
> 		return fix_small_imbalance(env, sds);
> 	}
>
> but fix_small_imbalance() finally decides to return without modifying imbalance,
> like here:
> 	if (busiest->avg_load + scaled_busy_load_per_task >=
> 	    local->avg_load + (scaled_busy_load_per_task * imbn)) {
> 		env->imbalance = busiest->load_per_task;
> 		return;
> 	}
That one actually does modify imbalance :-) But I get your point.
> Besides this patch, I'm preparing another patch in fix_small_imbalance() to
> ensure 1 task per CPU in a similar situation, but according to the comment above,
> we can reach this situation because of task priorities.
Didn't we all hate fix_small_imbalance() ?
Anyway, I think I'd prefer something like the below; it might be nicer to
thread the return value through calculate_imbalance() and
fix_small_imbalance(), but looking at them, that's not going to be
particularly nicer.
Do you agree with this? If so, I'll stick your original Changelog on it
and pretend this is what you sent me :-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..0596a29f3d2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8269,7 +8269,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
 	calculate_imbalance(env, &sds);
-	return sds.busiest;
+	return env->imbalance ? sds.busiest : NULL;
 
 out_balanced:
 	env->imbalance = 0;
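
For reference, and not part of the patch: the caller side in load_balance()
(kernel/sched/fair.c, roughly paraphrased from memory for this version) already
handles a NULL return, so the zero-imbalance case just takes the existing
out_balanced exit instead of going on to find_busiest_queue():

	group = find_busiest_group(&env);
	if (!group) {
		/* with the change above, a zero imbalance ends up here */
		schedstat_inc(sd->lb_nobusyg[idle]);
		goto out_balanced;
	}

	busiest = find_busiest_queue(&env, group);

That is why returning NULL from find_busiest_group() is enough to short-circuit
the rest of the balance attempt for this case.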