Message-ID: <CAKfTPtA6s82qXWOSdd6eNu__z_HZe9U_F0+RcBJj=PVKT7go7A@mail.gmail.com>
Date:   Thu, 20 Jul 2023 14:31:09 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Qais Yousef <qyousef@...alina.io>
Cc:     Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched/fair: Fix impossible migrate_util scenario in
 load balance

On Tue, 18 Jul 2023 at 19:25, Qais Yousef <qyousef@...alina.io> wrote:
>
> On 07/18/23 18:31, Vincent Guittot wrote:
> > On Tue, 18 Jul 2023 at 18:18, Qais Yousef <qyousef@...alina.io> wrote:
> > >
> > > On 07/18/23 14:48, Vincent Guittot wrote:
> > > > Le dimanche 16 juil. 2023 à 02:41:25 (+0100), Qais Yousef a écrit :
> > > > > We've seen cases, while running Geekbench, where an idle little core never
> > > > > pulls a task from a bigger overloaded cluster for hundreds of milliseconds,
> > > > > sometimes over a second.
> > > > >
> > > > > It turned out that load balancing identifies this as a migrate_util
> > > > > type since the local group (little cluster) has spare capacity and
> > > > > will try to pull a task. But the little cluster capacity is very small
> > > > > nowadays (around 200 or less), and if two busy tasks are stuck on a mid
> > > > > core, which has a capacity of over 700, the util of each task will be
> > > > > in the 350+ range, which is always bigger than the spare capacity of
> > > > > the little group with its single idle core.
> > > > >
> > > > > When trying to detach_tasks(), we then bail out because of this
> > > > > comparison:
> > > > >
> > > > >     if (util > env->imbalance)
> > > > >             goto next;
> > > > >
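As a minimal, standalone sketch (not kernel code; the capacity and util
numbers are assumed from the scenario described above), this is why the
comparison always fails for the little group:

#include <stdio.h>

int main(void)
{
	/* Assumed numbers from the report: two busy tasks sharing a mid
	 * core of capacity ~700, little group spare capacity <= ~200. */
	unsigned long task_util = 700 / 2;	/* ~350 per task */
	unsigned long imbalance = 200;		/* env->imbalance, i.e. spare capacity */

	if (task_util > imbalance)
		printf("util %lu > imbalance %lu -> goto next, nothing is pulled\n",
		       task_util, imbalance);

	return 0;
}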
> > > > > In calculate_imbalance() we convert a migrate_util into a migrate_task
> > > > > type if the CPU trying to do the pull is idle. But we only do this if
> > > > > env->imbalance is 0, which I can't understand. AFAICT env->imbalance
> > > > > contains the local group's spare capacity. If it is 0, this means the
> > > > > group is fully busy.
> > > > >
> > > > > Removing this condition fixes the problem, but since I can't fully
> > > > > understand why it checks for 0, I'm sending this as an RFC. Could it be
> > > > > a typo that was meant to check for
> > > > >
> > > > >     env->imbalance != 0
> > > > >
> > > > > instead?
> > > > >
> > > > > Signed-off-by: Qais Yousef (Google) <qyousef@...alina.io>
> > > > > ---
> > > > >  kernel/sched/fair.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index a80a73909dc2..682d9d6a8691 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -10288,7 +10288,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > > > >                      * waiting task in this overloaded busiest group. Let's
> > > > >                      * try to pull it.
> > > > >                      */
> > > > > -                   if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) {
> > > > > +                   if (env->idle != CPU_NOT_IDLE) {
> > > >
> > > > With this change you completely skip migrate_util for the idle and newly
> > > > idle cases, which would be too aggressive.
> > >
> > > Yeah I didn't have great confidence in it to be honest.
> > >
> > > Could you help me understand the meaning of env->imbalance == 0 though? At this
> > > stage its value is
> > >
> > >         env->imbalance = max(local->group_capacity, local->group_util) - local->group_util;
> > >
> > > which AFAICT is calculating the _spare_ capacity, right? So when we check
> > > env->imbalance == 0 we are saying: this_cpu is (idle OR newly idle) AND the
> > > local group is fully utilized? Why must it be fully utilized to do the pull?
> > > It's counterintuitive to me. I'm probably misinterpreting something but can't see
> >
> > This is a special case. We have some situations where group_util is
> > higher than capacity, because of tasks newly migrated to this group for
> > example, so the spare capacity is zero but one CPU is idle or newly
> > idle. In this case we try to pull a task, at the risk that this group
> > becomes overloaded. That's why we do not try to pull a task every
> > time.
> > But that might be a good choice all the time.
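As a minimal sketch of that special case (standalone, not kernel code; the
capacity/util numbers are assumed), the migrate_util imbalance computed in
calculate_imbalance() drops to zero as soon as group_util reaches
group_capacity, even while a CPU in the group sits idle:

#include <stdio.h>

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

int main(void)
{
	/* Assumed: util briefly exceeds capacity right after tasks were
	 * migrated into the local group, while one of its CPUs is idle. */
	unsigned long group_capacity = 1024;
	unsigned long group_util = 1100;

	/* Same formula as the env->imbalance expression quoted above. */
	unsigned long imbalance = max_ul(group_capacity, group_util) - group_util;

	printf("imbalance = %lu\n", imbalance);	/* prints 0: no spare capacity */
	return 0;
}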
>
> So for misfit, I do see that a bigger CPU will pull the task quickly as soon
> as a bigger CPU gets idle.
>
> This scenario is the opposite. Maybe the exception in my case is that the
> little CPU has spare capacity, as it's mostly idle all the time. It's just
> that unlucky circumstances at wake-up ended up putting two tasks on the
> bigger core.

I was trying to reproduce the behavior but was failing until I realized
that this code path is used when the 2 groups do not share their cache.
Which topology do you use? I thought that DynamIQ, with a cache shared
between all 8 CPUs, was the norm for arm64 embedded devices now.

Also, when you say "the little cluster capacity is very small nowadays
(around 200 or less)", is that the capacity of 1 core or of the whole cluster?

>
> Specifically, at the start of some of the sub-tests, there's a good chance that
> we have simultaneous wake-ups, and there's a limitation/race in EAS because of
> the gap between select_task_rq_fair() and enqueue_task_fair(). If two tasks wake
> up simultaneously, select_task_rq_fair() could be called twice before either
> enqueue_task_fair(), and end up selecting the same CPU for both tasks, not
> realizing one of them is just waiting to be enqueued. IOW, EAS will not take
> into account the updated util of one of the CPUs because of the (short) delay
> in enqueueing it.
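A toy illustration of that race (standalone, not EAS code; the CPU indices
and util values are assumed): both wake-ups read the same stale per-CPU util
snapshot before either task is enqueued, so both pick the same CPU:

#include <stdio.h>

int main(void)
{
	/* Assumed snapshot: cpu0 (little) looks cheapest to both wake-ups. */
	unsigned long cpu_util[2] = { 100, 400 };

	/* select_task_rq_fair()-like choice for task A ... */
	int cpu_a = cpu_util[0] <= cpu_util[1] ? 0 : 1;
	/* ... task A is not enqueued yet, so the snapshot is unchanged
	 * when task B makes the same choice. */
	int cpu_b = cpu_util[0] <= cpu_util[1] ? 0 : 1;

	printf("task A -> cpu%d, task B -> cpu%d (same CPU)\n", cpu_a, cpu_b);
	return 0;
}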
>
> This should be fixed (the wake up race), but this is a different story and
> a bit trickier.
>
> The risks of always pulling are:
>
>         1. Risk of force-migrating the prev task if it wakes up shortly after
>            the pull. Which is no worse IMHO than a misfit task moving almost
>            immediately to a bigger core.
>
>         2. Not sure whether we're being smart enough about which task to pull.
>            I can envisage other scenarios where one of the two tasks is the
>            better one to pull. In Geekbench both tasks are equal. But maybe in
>            other use cases one of them is less impactful. For example if one
>            of them has a low uclamp_max but the other doesn't. But this case
>            is an unsupported feature at the moment. My plan (hope) is to treat
>            these uclamp_max cases as misfit migration. Which I think is the
>            better path in general for treating special cases. So for
>            migrate_util this behavior might be sensible all the time. We are
>            working too hard, so let's use all of our resources and make use of
>            all idle CPUs. If prev_task wakes up, there's no harm; I doubt
>            cache hotness is a problem given the two tasks are busy all the
>            time thrashing L1 anyway.
>
>         3. Not sure what will happen in cases where we have nr_running >
>            nr_cpus and some tasks happen to sleep for brief periods of time.
>            Two tasks stuck on a little core is worse than two tasks stuck on
>            a mid or big core. But maybe migrate_util will pull one back again
>            given how little capacity the littles have?
>
> Cheers
>
> --
> Qais Yousef
>
> >
> > > it.
> > >
> > > >
> > > > We can do something similar to migrate_load in detach_tasks():
> > > >
> > > > ---
> > > >  kernel/sched/fair.c | 8 +++++++-
> > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index d3df5b1642a6..64111ac7e137 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -8834,7 +8834,13 @@ static int detach_tasks(struct lb_env *env)
> > > >               case migrate_util:
> > > >                       util = task_util_est(p);
> > > >
> > > > -                     if (util > env->imbalance)
> > > > +                     /*
> > > > +                      * Make sure that we don't migrate too much utilization.
> > > > +                      * Nevertheless, let's relax the constraint if the
> > > > +                      * scheduler fails to find a good waiting task to
> > > > +                      * migrate.
> > > > +                      */
> > > > +                     if (shr_bound(util, env->sd->nr_balance_failed) > env->imbalance)
> > > >                               goto next;
> > >
> > > Thanks! This looks better, but I still see a 100 or 200 ms delay sometimes.
> > > Still debugging it, but I _think_ it's a combination of two things:
> > >
> > >         1. nr_balance_failed doesn't increment fast enough - I see a lot of
> > >            0s with occasional 1s and less frequent 2s (see the sketch after
> > >            this list)
> > >         2. something might wake up briefly on that cpu in between load
> > >            balances, and given how small the littles are, that raises the
> > >            nr_balance_failed required to tip the scale even higher
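For reference, a standalone sketch (not kernel code; the util/imbalance
values are assumed from the scenario above, and shr_bound() is a simplified
stand-in for the kernel helper) of how the relaxed check behaves as
nr_balance_failed grows:

#include <stdio.h>

/* Simplified stand-in for the kernel's shr_bound(): a right shift bounded
 * so it can never exceed the width of the value. */
static unsigned long shr_bound(unsigned long val, unsigned int shift)
{
	return val >> (shift < 63 ? shift : 63);
}

int main(void)
{
	unsigned long util = 350, imbalance = 200;	/* assumed numbers */
	unsigned int failed;

	for (failed = 0; failed < 3; failed++)
		printf("nr_balance_failed=%u: shr_bound(util)=%lu -> %s\n",
		       failed, shr_bound(util, failed),
		       shr_bound(util, failed) > imbalance ? "goto next" : "can pull");
	return 0;
}

With these numbers the first balance attempt still skips the task, and a
pull only happens once nr_balance_failed reaches 1, which is consistent
with the mostly-zero counts observed in the list above.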
> > >
> > >
> > > Thanks
> > >
> > > --
> > > Qais Yousef
> > >
> > > >
> > > >                       env->imbalance -= util;
> > > > --
> > > >
> > > >
> > > >
> > > > >                             env->migration_type = migrate_task;
> > > > >                             env->imbalance = 1;
> > > > >                     }
> > > > > --
> > > > > 2.25.1
> > > > >
