linux-kernel - Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191219154117.GB19614@linaro.org>
Date:   Thu, 19 Dec 2019 16:41:17 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Mel Gorman <mgorman@...hsingularity.net>
Cc:     Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>, pauld@...hat.com,
        valentin.schneider@....com, srikar@...ux.vnet.ibm.com,
        quentin.perret@....com, dietmar.eggemann@....com,
        Morten.Rasmussen@....com, hdanton@...a.com, parth@...ux.ibm.com,
        riel@...riel.com, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance
 between SD_NUMA domains

Le Thursday 19 Dec 2019 à 15:18:24 (+0000), Mel Gorman a écrit :
> On Thu, Dec 19, 2019 at 03:45:39PM +0100, Vincent Guittot wrote:
> > Hi Mel,
> > 
> > Thanks for looking at this NUMA locality vs spreading tasks point.
> > 
> 
> No problem.
> 
> > > @@ -8680,7 +8676,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > >  			env->migration_type = migrate_task;
> > >  			lsub_positive(&nr_diff, local->sum_nr_running);
> > >  			env->imbalance = nr_diff >> 1;
> > > -			return;
> > > +			goto out_spare;
> > 
> > Why are you doing this only for prefer_sibling case ? That's probably the default case of most of numa system but you should also consider others case too.
> > 
> 
> It's the common case for NUMA machines I'm aware of and from the
> perspective of allowing a slight imbalance when there are spare CPUs, I
> felt it was the same whether we were considering idle CPUs or the number
> of tasks running.
> 
> The prefer_sibling case applies to the children and the corner case is
> that balancing NUMA domains takes into account whether the MC domain
> prefers siblings which is a bit odd. I believe, but don't know, that the
> reasoning may have been to spread load for memory bandwidth usage.
> 
> > So you should probably add your
> > 
> > > +                * Whether balancing the number of running tasks or the number
> > > +                * of idle CPUs, consider allowing some degree of imbalance if
> > > +                * migrating between NUMA domains.
> > > +                */
> > > +               if (env->sd->flags & SD_NUMA) {
> > > +                       unsigned int imbalance_adj, imbalance_max;
> > 
> > ...
> > 
> > > +               }
> > 
> > before the prefer_sibling case :
> > 
> > 		if (busiest->group_weight == 1 || sds->prefer_sibling) {
> > 			unsigned int nr_diff = busiest->sum_nr_running;
> > 			/*
> > 			 * When prefer sibling, evenly spread running tasks on
> > 			 * groups.
> > 			 */
> > 
> 
> I don't understand. If I move SD_NUMA checks above the imbalance
> calculation, how do I know whether the imbalance should be ignored?

You are only clearing env->imbalance before returning if the condition
between sum_nr_running with weight doesn't  match so you don't care about
what will be the value of env->imbalance in the other case so you can have 

		if ((env->sd->flags & SD_NUMA) &&
			( allow some degrees of imbalance )) {
				env->imbalance = 0
				return;
		}

		if (busiest->group_weight == 1 || sds->prefer_sibling) {
			unsigned int nr_diff = busiest->sum_nr_running;
			/*
			 * When prefer sibling, evenly spread running tasks on
			 * groups.
			 */
			env->migration_type = migrate_task;
			lsub_positive(&nr_diff, local->sum_nr_running);
			env->imbalance = nr_diff >> 1;
			return;
		}

		/*
		 * If there is no overload, we just want to even the number of
		 * idle cpus.
		 */
		env->migration_type = migrate_task;
		env->imbalance = max_t(long, 0, (local->idle_cpus -
						 busiest->idle_cpus) >> 1);
		return;
	}
	
		
> 
> > 
> > >
> > >  		}
> > >  
> > >  		/*
> > > @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > >  		env->migration_type = migrate_task;
> > >  		env->imbalance = max_t(long, 0, (local->idle_cpus -
> > >  						 busiest->idle_cpus) >> 1);
> > > +
> > > +out_spare:
> > > +		/*
> > > +		 * Whether balancing the number of running tasks or the number
> > > +		 * of idle CPUs, consider allowing some degree of imbalance if
> > > +		 * migrating between NUMA domains.
> > > +		 */
> > > +		if (env->sd->flags & SD_NUMA) {
> > > +			unsigned int imbalance_adj, imbalance_max;
> > > +
> > > +			/*
> > > +			 * imbalance_adj is the allowable degree of imbalance
> > > +			 * to exist between two NUMA domains. It's calculated
> > > +			 * relative to imbalance_pct with a minimum of two
> > > +			 * tasks or idle CPUs.
> > > +			 */
> > > +			imbalance_adj = (busiest->group_weight *
> > > +				(env->sd->imbalance_pct - 100) / 100) >> 1;
> > > +			imbalance_adj = max(imbalance_adj, 2U);
> > > +
> > > +			/*
> > > +			 * Ignore imbalance unless busiest sd is close to 50%
> > > +			 * utilisation. At that point balancing for memory
> > > +			 * bandwidth and potentially avoiding unnecessary use
> > > +			 * of HT siblings is as relevant as memory locality.
> > > +			 */
> > > +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> > > +			if (env->imbalance <= imbalance_adj &&
> > > +			    busiest->sum_nr_running < imbalance_max) {i
> > 
> > Shouldn't you consider the number of busiest->idle_cpus instead of the busiest->sum_nr_running ?
> > 
> 
> Why? CPU affinity could have stacked multiple tasks on one CPU where
> as I'm looking for a proxy hint on the amount of bandwidth required.
> sum_nr_running does not give me an accurate estimate but it's better than
> idle cpus.

Because even if you have multiple tasks on one CPU, only one will run at a
time on the CPU and others will wait so the bandwidth is effectively link to
the number of running CPUs more than number of runnable tasks

> 
> > and you could simplify by 
> > 
> > 
> > 	if ((env->sd->flags & SD_NUMA) &&
> > 		((100 * busiest->group_weight) <= (env->sd->imbalance_pct * (busiest->idle_cpus << 1)))) {
> > 			env->imbalance = 0;
> > 			return;
> > 	}
> > 
> > And otherwise it will continue with the current path
> > 
> 
> I ended up doing something similar to this in v2 but it's a bit more
> expanded so I can put in comments on why the comparisons are the way
> they are. The multiplications are in the slow path.
> 
> > Also I'm a bit worry about using a 50% threshold that look a bit like a
> > heuristic which can change depending of platform and the UCs that run of the
> > system.
> > 
> 
> UCs?

Use Cases

> 
> And yes, it's a heuristic. In this case, I'm as concerned about memory
> bandwidth availability as I am about improper locality due to agressive
> balancing. We do not know the available memory bandwidth and we do not
> know how much bandwidth the tasks required so 50% was as good as threshold
> as any. I do not know of any way that can cheaply measure either bandwidth
> usage (PMUs are not cheap) or available bandwidth (theoretical bandwidth !=
> actual bandwidth).
> 
> In an earlier version that I never posted, I had no cutoff at all and
> NAS took a roughly 30% performance penalty across all computational
> kernels. Debug tracing led me to this cutoff and running a battery
> of workloads led me to believe that it was a reasonable cutoff. It's
> important.
> 
> > In fact i was hoping that we could use the numa_preferred_nid ?
> 
> Unfortunately not. For some tasks, they are not long-lived enough for NUMA
> balancing to make a decision. For longer-lived tasks, if load balancing is
> spreading the load across nodes and wakeups are pulling tasks together,
> NUMA balancing will get a mix of remote/local samples and will be unable
> to pick a node properly.
> 
> In the netperf figures I put in the changelog, I pointed out that NUMA
> balancing sampled roughly 50% of accesses were remote. With the patch,
> 100% of the samples are local.
> 
> > During the
> > detach of tasks, we don't detach the task if busiest has spare capacity and
> > preferred_nid of the task is busiest.
> > 
> 
> Sure, but again if load balancing and waker/wakees are fighting each
> other, NUMA balancing gets caught in the crossfire and cannot make a
> sensible decision.
> 
> -- 
> Mel Gorman
> SUSE Labs