lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 19 Dec 2019 15:58:16 +0000
From:   Mel Gorman <mgorman@...hsingularity.net>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>, pauld@...hat.com,
        valentin.schneider@....com, srikar@...ux.vnet.ibm.com,
        quentin.perret@....com, dietmar.eggemann@....com,
        Morten.Rasmussen@....com, hdanton@...a.com, parth@...ux.ibm.com,
        riel@...riel.com, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance
 between SD_NUMA domains

On Thu, Dec 19, 2019 at 04:41:17PM +0100, Vincent Guittot wrote:
> > I don't understand. If I move SD_NUMA checks above the imbalance
> > calculation, how do I know whether the imbalance should be ignored?
> 
> You are only clearing env->imbalance before returning if the condition
> between sum_nr_running with weight doesn't  match so you don't care about
> what will be the value of env->imbalance in the other case so you can have 
> 
> 		if ((env->sd->flags & SD_NUMA) &&
> 			( allow some degrees of imbalance )) {
> 				env->imbalance = 0
> 				return;
> 		}
> 
> 		if (busiest->group_weight == 1 || sds->prefer_sibling) {
> 			unsigned int nr_diff = busiest->sum_nr_running;
> 			/*
> 			 * When prefer sibling, evenly spread running tasks on
> 			 * groups.
> 			 */
> 			env->migration_type = migrate_task;
> 			lsub_positive(&nr_diff, local->sum_nr_running);
> 			env->imbalance = nr_diff >> 1;
> 			return;
> 		}
> 
> 		/*
> 		 * If there is no overload, we just want to even the number of
> 		 * idle cpus.
> 		 */
> 		env->migration_type = migrate_task;
> 		env->imbalance = max_t(long, 0, (local->idle_cpus -
> 						 busiest->idle_cpus) >> 1);
> 		return;
> 	}
> 	

Ok, that's clear. In an earlier version, I was not just resetting it, I
was adjusting the estimated imbalance but it was fragile. I can move the
check as you suggest.

> > > > @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > > >  		env->migration_type = migrate_task;
> > > >  		env->imbalance = max_t(long, 0, (local->idle_cpus -
> > > >  						 busiest->idle_cpus) >> 1);
> > > > +
> > > > +out_spare:
> > > > +		/*
> > > > +		 * Whether balancing the number of running tasks or the number
> > > > +		 * of idle CPUs, consider allowing some degree of imbalance if
> > > > +		 * migrating between NUMA domains.
> > > > +		 */
> > > > +		if (env->sd->flags & SD_NUMA) {
> > > > +			unsigned int imbalance_adj, imbalance_max;
> > > > +
> > > > +			/*
> > > > +			 * imbalance_adj is the allowable degree of imbalance
> > > > +			 * to exist between two NUMA domains. It's calculated
> > > > +			 * relative to imbalance_pct with a minimum of two
> > > > +			 * tasks or idle CPUs.
> > > > +			 */
> > > > +			imbalance_adj = (busiest->group_weight *
> > > > +				(env->sd->imbalance_pct - 100) / 100) >> 1;
> > > > +			imbalance_adj = max(imbalance_adj, 2U);
> > > > +
> > > > +			/*
> > > > +			 * Ignore imbalance unless busiest sd is close to 50%
> > > > +			 * utilisation. At that point balancing for memory
> > > > +			 * bandwidth and potentially avoiding unnecessary use
> > > > +			 * of HT siblings is as relevant as memory locality.
> > > > +			 */
> > > > +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> > > > +			if (env->imbalance <= imbalance_adj &&
> > > > +			    busiest->sum_nr_running < imbalance_max) {i
> > > 
> > > Shouldn't you consider the number of busiest->idle_cpus instead of the busiest->sum_nr_running ?
> > > 
> > 
> > Why? CPU affinity could have stacked multiple tasks on one CPU where
> > as I'm looking for a proxy hint on the amount of bandwidth required.
> > sum_nr_running does not give me an accurate estimate but it's better than
> > idle cpus.
> 
> Because even if you have multiple tasks on one CPU, only one will run at a
> time on the CPU and others will wait so the bandwidth is effectively link to
> the number of running CPUs more than number of runnable tasks
> 

Ok, I can try that. There is the corner case where tasks can be stacked on
the one CPU without affinities being involved but it's rare. That might
change again in the future if unbound workqueues get special cased to
allow a wakee to stack on top of the workqueues CPU as it's essentially
sync but it's not likely to make that much of a difference.

-- 
Mel Gorman
SUSE Labs

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ