Message-ID: <20200106145225.GB3466@techsingularity.net>
Date:   Mon, 6 Jan 2020 14:52:25 +0000
From:   Mel Gorman <mgorman@...hsingularity.net>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Phil Auld <pauld@...hat.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
        Quentin Perret <quentin.perret@....com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <Morten.Rasmussen@....com>,
        Hillf Danton <hdanton@...a.com>,
        Parth Shah <parth@...ux.ibm.com>,
        Rik van Riel <riel@...riel.com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance
 between SD_NUMA domains v2

Sorry, I sent out v3 before seeing this email; my mail only synchronises
periodically.

On Mon, Jan 06, 2020 at 02:55:00PM +0100, Vincent Guittot wrote:
> > -                       return;
> > -               }
> > +               } else {
> >
> > -               /*
> > -                * If there is no overload, we just want to even the number of
> > -                * idle cpus.
> > -                */
> > -               env->migration_type = migrate_task;
> > -               env->imbalance = max_t(long, 0, (local->idle_cpus -
> > +                       /*
> > +                        * If there is no overload, we just want to even the number of
> > +                        * idle cpus.
> > +                        */
> > +                       env->migration_type = migrate_task;
> > +                       env->imbalance = max_t(long, 0, (local->idle_cpus -
> >                                                  busiest->idle_cpus) >> 1);
> > +               }
> > +
> > +               /* Consider allowing a small imbalance between NUMA groups */
> > +               if (env->sd->flags & SD_NUMA) {
> > +                       long imbalance_adj, imbalance_max;
> > +
> > +                       /*
> > +                        * imbalance_adj is the allowable degree of imbalance
> > +                        * to exist between two NUMA domains. imbalance_pct
> > +                        * is used to estimate the number of active tasks
> > +                        * needed before memory bandwidth may be as important
> > +                        * as memory locality.
> > +                        */
> > +                       imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1;
> 
> This looks weird to me because you use imbalance_pct, which is
> meaningful only when comparing ratios, to define a number that will then
> be compared to a number of tasks without taking into account the weight
> of the node. So whatever the node size, 32 or 128 CPUs, the
> imbalance_adj will be the same: 3 with the default imbalance_pct at
> NUMA level, which is 125 AFAICT
> 

The intent in this version was to cover only the low utilisation case,
regardless of the NUMA node size. There were too many corner cases
where the tradeoff of local memory latency versus local memory bandwidth
could not be quantified. See Srikar's report as an example, but it's a
general problem. The use of imbalance_pct was simply to find the smallest
number of running tasks where (imbalance_pct - 100) would correspond to
one running task, and to limit the patch to the low utilisation case
only. It could simply be hard-coded, but that would ignore cases where
an architecture overrides imbalance_pct. I'm open to suggestions on how
we could identify the point where imbalances can be ignored without
hitting other corner cases.
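
To make the arithmetic concrete, here is a small standalone sketch
(plain C, not the kernel code itself; the only real input is the
default NUMA-level imbalance_pct of 125, everything else is
illustrative):

	#include <stdio.h>

	int main(void)
	{
		int imbalance_pct = 125;	/* default at SD_NUMA level */

		/*
		 * (imbalance_pct - 100) is 25, i.e. a 25% imbalance.
		 * 100 / 25 = 4 running tasks is the smallest count where
		 * 25% corresponds to one whole task, so allow an
		 * imbalance of up to 4 - 1 = 3 tasks.
		 */
		long imbalance_adj = (100 / (imbalance_pct - 100)) - 1;

		printf("imbalance_adj = %ld\n", imbalance_adj);	/* prints 3 */
		return 0;
	}

With an architecture override of, say, imbalance_pct = 110, the same
expression yields (100 / 10) - 1 = 9, so the cutoff tracks the override
instead of being hard-coded.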

> > +
> > +                       /*
> > +                        * Allow small imbalances when the busiest group has
> > +                        * low utilisation.
> > +                        */
> > +                       imbalance_max = imbalance_adj << 1;
> 
> Why do you add this shift?
> 

For very low utilisation, there is no balancing between nodes. For
utilisation slightly above that, there is limited balancing. Beyond
that, the load balancing behaviour is unchanged, as I believe we cannot
determine whether memory latency or memory bandwidth is more important
for arbitrary workloads.
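
As a rough sketch of those three bands (illustrative C only, not the
patch itself; the halving step for the middle band is an assumption,
while the cutoffs follow imbalance_adj and imbalance_max from the hunk
quoted above):

	/*
	 * nr_running is the busiest group's running task count and
	 * imbalance the value computed earlier; imbalance_adj is 3
	 * with the default imbalance_pct of 125.
	 */
	long numa_allowed_imbalance(long imbalance, long nr_running,
				    long imbalance_adj)
	{
		long imbalance_max = imbalance_adj << 1;

		if (nr_running <= imbalance_adj)
			return 0;		/* very low utilisation: no balancing */
		if (nr_running <= imbalance_max)
			return imbalance >> 1;	/* limited balancing */
		return imbalance;		/* behaviour unchanged */
	}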

-- 
Mel Gorman
SUSE Labs
