linux-kernel - Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD


Message-ID: <20191221112503.GN3178@techsingularity.net>
Date:   Sat, 21 Dec 2019 11:25:03 +0000
From:   Mel Gorman <mgorman@...hsingularity.net>
To:     Valentin Schneider <valentin.schneider@....com>
Cc:     Vincent Guittot <vincent.guittot@...aro.org>,
        Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>, pauld@...hat.com,
        srikar@...ux.vnet.ibm.com, quentin.perret@....com,
        dietmar.eggemann@....com, Morten.Rasmussen@....com,
        hdanton@...a.com, parth@...ux.ibm.com, riel@...riel.com,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance
 between SD_NUMA domains v2

On Fri, Dec 20, 2019 at 12:40:02PM +0000, Valentin Schneider wrote:
> > +			 * it is not exactly half as imbalance_adj must be
> > +			 * accounted for or the two domains do not converge
> > +			 * as equally balanced if the number of busy tasks is
> > +			 * roughly the size of one NUMA domain.
> > +			 */
> > +			imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
> > +			if (env->imbalance <= imbalance_adj &&
> 
> I'm confused now, have we set env->imbalance to anything at this point? AIUI
> Vincent's suggestion was to hinge this purely on the weight vs idle_cpus /
> nr_running, IOW not use imbalance:
> 
> if (sd->flags & SD_NUMA) {
> 	imbalance_adj = max(2U, (busiest->group_weight *
> 				 (env->sd->imbalance_pct - 100) / 100) >> 1);
> 	imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
>
> 	if (busiest->idle_cpus >= imbalance_max) {
> 		env->imbalance = 0;
> 		return;
> 	}
> }
>

Ok, I tried this just in case and it does not work out when the number
of CPUs per NUMA node gets very large. It allows very large imbalances
between the number of CPUs calculated based on imbalance_pct and based on
group_weight >> 1, and it caused some very large regressions on a modern
AMD machine. The version I had that calculated an imbalance first and then
decided whether to ignore it has still been consistently the best approach
across multiple machines and workloads.

-- 
Mel Gorman
SUSE Labs