linux-kernel - Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD

Message-ID: <9d7e8fe1-d9d7-90df-0f30-cf82b82e7f1f@amd.com>
Date:   Fri, 4 Feb 2022 20:37:53 +0530
From:   "Nayak, KPrateek (K Prateek)" <kprateek.nayak@....com>
To:     Mel Gorman <mgorman@...hsingularity.net>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Valentin Schneider <valentin.schneider@....com>,
        Aubrey Li <aubrey.li@...ux.intel.com>,
        Barry Song <song.bao.hua@...ilicon.com>,
        Mike Galbraith <efault@....de>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
        Gautham Shenoy <gautham.shenoy@....com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when
 SD_NUMA spans multiple LLCs

Hello Mel,

On 2/3/2022 8:16 PM, Mel Gorman wrote:
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
> 
> Zen* has multiple LLCs per node with local memory channels and due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> allows an imbalance to exist up to the point where LLCs should be balanced
> between nodes.
> 
> On a Zen3 machine running STREAM parallelised with OMP to have on instance
> per LLC the results and without binding, the results are
> 
>                             5.17.0-rc0             5.17.0-rc0
>                                vanilla       sched-numaimb-v5
> MB/sec copy-16    162596.94 (   0.00%)   501967.12 ( 208.72%)
> MB/sec scale-16   136901.28 (   0.00%)   376531.50 ( 175.04%)
> MB/sec add-16     157300.70 (   0.00%)   569997.42 ( 262.36%)
> MB/sec triad-16   151446.88 (   0.00%)   553204.54 ( 265.28%)
> 
> STREAM can use directives to force the spread if the OpenMP is new
> enough but that doesn't help if an application uses threads and
> it's not known in advance how many threads will be created.
> 
> Coremark is a CPU and cache intensive benchmark parallelised with
> threads. When running with 1 thread per core, the vanilla kernel
> allows threads to contend on cache. With the patch;
> 
>                                5.17.0-rc0             5.17.0-rc0
>                                   vanilla       sched-numaimb-v5
> Min       Score-16   368239.36 (   0.00%)   400876.92 (   8.86%)
> Hmean     Score-16   388607.33 (   0.00%)   441447.30 *  13.60%*
> Max       Score-16   408945.69 (   0.00%)   478826.87 (  17.09%)
> Stddev    Score-16    15247.04 (   0.00%)    34061.76 (-123.40%)
> CoeffVar  Score-16        3.92 (   0.00%)        7.67 ( -95.82%)
> 
> It can also make a big difference for semi-realistic workloads
> like specjbb which can execute arbitrary numbers of threads without
> advance knowledge of how they should be placed
> 
>                                5.17.0-rc0             5.17.0-rc0
>                                   vanilla       sched-numaimb-v5
> Hmean     tput-1      71631.55 (   0.00%)    70383.46 (  -1.74%)
> Hmean     tput-8     582758.78 (   0.00%)   607290.89 *   4.21%*
> Hmean     tput-16   1020372.75 (   0.00%)  1031257.25 (   1.07%)
> Hmean     tput-24   1416430.67 (   0.00%)  1587576.33 *  12.08%*
> Hmean     tput-32   1687702.72 (   0.00%)  1724207.51 (   2.16%)
> Hmean     tput-40   1798094.90 (   0.00%)  1983053.56 *  10.29%*
> Hmean     tput-48   1972731.77 (   0.00%)  2157461.70 (   9.36%)
> Hmean     tput-56   2386872.38 (   0.00%)  2193237.42 (  -8.11%)
> Hmean     tput-64   2536954.17 (   0.00%)  2588741.08 (   2.04%)
> Hmean     tput-72   2585071.36 (   0.00%)  2654776.36 (   2.70%)
> Hmean     tput-80   2960523.94 (   0.00%)  2894657.12 (  -2.22%)
> Hmean     tput-88   3061408.57 (   0.00%)  2903167.72 (  -5.17%)
> Hmean     tput-96   3052394.82 (   0.00%)  2872605.46 (  -5.89%)
> Hmean     tput-104  2997814.76 (   0.00%)  3013660.26 (   0.53%)
> Hmean     tput-112  2955353.29 (   0.00%)  3029122.16 (   2.50%)
> Hmean     tput-120  2889770.71 (   0.00%)  2957739.88 (   2.35%)
> Hmean     tput-128  2871713.84 (   0.00%)  2912410.18 (   1.42%)
> 
> In general, the standard deviation figures also are a lot more
> stable.
> 
> Similarly, for embarassingly parallel problems like NPB-ep, there are
> improvements due to better spreading across LLC when the machine is not
> fully utilised.
> 
>                               vanilla    sched-numaimb-v5r12
> Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
> Amean     ep.D       31.86 (   0.00%)       26.26 *  17.58%*
> Stddev    ep.D        0.07 (   0.00%)        0.18 (-157.54%)
> CoeffVar  ep.D        0.22 (   0.00%)        0.69 (-212.46%)
> Max       ep.D       31.93 (   0.00%)       26.46 (  17.13%)
> 
> Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 22 +++++++-------
>  kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
>  3 files changed, 66 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8054641c0a7b..56cffe42abbc 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>  	unsigned int busy_factor;	/* less balancing by factor if busy */
>  	unsigned int imbalance_pct;	/* No balance until over watermark */
>  	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
> +	unsigned int imb_numa_nr;	/* Nr running tasks that allows a NUMA imbalance */
>  
>  	int nohz_idle;			/* NOHZ IDLE status */
>  	int flags;			/* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4592ccf82c34..86abf97a8df6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>  
>  	int src_cpu, src_nid;
>  	int dst_cpu, dst_nid;
> +	int imb_numa_nr;
>  
>  	struct numa_stats src_stats, dst_stats;
>  
> @@ -1503,7 +1504,7 @@ struct task_numa_env {
>  static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static inline long adjust_numa_imbalance(int imbalance,
> -					int dst_running, int dst_weight);
> +					int dst_running, int imb_numa_nr);
>  
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  		dst_running = env->dst_stats.nr_running + 1;
>  		imbalance = max(0, dst_running - src_running);
>  		imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -							env->dst_stats.weight);
> +						  env->imb_numa_nr);
>  
>  		/* Use idle CPU if there is no imbalance */
>  		if (!imbalance) {
> @@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p)
>  	 */
>  	rcu_read_lock();
>  	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -	if (sd)
> +	if (sd) {
>  		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +		env.imb_numa_nr = sd->imb_numa_nr;
> +	}
>  	rcu_read_unlock();
>  
>  	/*
> @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
>   * This is an approximation as the number of running tasks may not be
>   * related to the number of busy CPUs due to sched_setaffinity.
>   */
> -static inline bool
> -allow_numa_imbalance(unsigned int running, unsigned int weight)
> +static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
>  {
> -	return (running < (weight >> 2));
> +	return running < imb_numa_nr;
>  }
>  
>  /*
> @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  			 * allowed. If there is a real need of migration,
>  			 * periodic load balance will take care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
> +			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))

Could you please clarify why we are adding 1 to local_sgs.sum_nr_running when deciding whether to allow the imbalance?
allow_numa_imbalance() allows the imbalance based on the following inequality:

	running < imb_numa_nr

Consider a Zen3 CPU with 8 LLCs in the sched group of the NUMA domain.
Assume the group is running 7 tasks and we are finding the idlest group for the 8th task:

	sd->imb_numa_nr = 8
	local_sgs.sum_nr_running = 7

In this case, local_sgs.sum_nr_running + 1 is equal to sd->imb_numa_nr. If we allowed the NUMA imbalance
and placed the task in the same group, each task could be given its own LLC.
However, allow_numa_imbalance() returns false for the above case, which can lead to the task being placed
in a different NUMA group.
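
To make the boundary case concrete, here is a minimal userspace sketch, not kernel code. It copies the comparison from the patched allow_numa_imbalance() and assumes the caller always passes sum_nr_running + 1, as the find_idlest_group() hunk above does. tasks_kept_local() is a hypothetical helper introduced only for illustration, and it ignores every other placement decision the scheduler makes.

```c
#include <assert.h>
#include <stdbool.h>

/* Same comparison as the patched allow_numa_imbalance() quoted above. */
static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
{
	return running < imb_numa_nr;
}

/*
 * Hypothetical helper (not in the kernel): count how many tasks can be
 * placed one after another in the local group before the check fails,
 * assuming the caller passes sum_nr_running + 1 each time.
 */
static int tasks_kept_local(int imb_numa_nr)
{
	int sum_nr_running = 0;

	assert(imb_numa_nr >= 0);
	while (allow_numa_imbalance(sum_nr_running + 1, imb_numa_nr))
		sum_nr_running++;

	return sum_nr_running;
}
```

Under these assumptions, tasks_kept_local(8) stops at 7: the check for the 8th task compares 8 < 8 and fails, even though an eighth LLC would still be free.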

In the case of Gautham's suggested fix (https://lore.kernel.org/lkml/YcHs37STv71n4erJ@BLR-5CG11610CF.amd.com/),
the v4 patch in question (https://lore.kernel.org/lkml/20211210093307.31701-3-mgorman@techsingularity.net/)
used the inequality "<=" to allow the NUMA imbalance, where we needed to consider the additional load the CPU
had to bear. However, that doesn't seem to be the case here.

>  				return NULL;
>  		}
>  
> @@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  #define NUMA_IMBALANCE_MIN 2
>  
>  static inline long adjust_numa_imbalance(int imbalance,
> -				int dst_running, int dst_weight)
> +				int dst_running, int imb_numa_nr)
>  {
> -	if (!allow_numa_imbalance(dst_running, dst_weight))
> +	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
>  		return imbalance;
>  
>  	/*
> @@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		/* Consider allowing a small imbalance between NUMA groups */
>  		if (env->sd->flags & SD_NUMA) {
>  			env->imbalance = adjust_numa_imbalance(env->imbalance,
> -				local->sum_nr_running + 1, local->group_weight);
> +				local->sum_nr_running + 1, env->sd->imb_numa_nr);
>  		}
>  
>  		return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..e6cd55951304 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  		}
>  	}
>  
> +	/*
> +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> +	 * imbalanced.
> +	 */
> +	for_each_cpu(i, cpu_map) {
> +		unsigned int imb = 0;
> +		unsigned int imb_span = 1;
> +
> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +			struct sched_domain *child = sd->child;
> +
> +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +				struct sched_domain *top, *top_p;
> +				unsigned int nr_llcs;
> +
> +				/*
> +				 * For a single LLC per node, allow an
> +				 * imbalance up to 25% of the node. This is an
> +				 * arbitrary cutoff based on SMT-2 to balance
> +				 * between memory bandwidth and avoiding
> +				 * premature sharing of HT resources and SMT-4
> +				 * or SMT-8 *may* benefit from a different
> +				 * cutoff.
> +				 *
> +				 * For multiple LLCs, allow an imbalance
> +				 * until multiple tasks would share an LLC
> +				 * on one node while LLCs on another node
> +				 * remain idle.
> +				 */

To add to my point above, the comment here says:
"allow an imbalance until multiple tasks would share an LLC on one node"
whereas, in the case I highlighted above, we see balancing kick in while
one LLC is possibly still unaccounted for.
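
For reference, the value 8 in my example comes from the nr_llcs calculation in the hunk quoted just below. A rough userspace sketch of that calculation for the first domain above the LLC follows. The function name and the span values mentioned afterwards are assumptions for illustration only (128 CPUs in the node, 16 CPUs per LLC), not figures taken from the patch.

```c
#include <assert.h>

/*
 * Mirrors the imb calculation in the quoted build_sched_domains() hunk for
 * the first domain above the LLC. sd_span and child_span stand in for
 * sd->span_weight and child->span_weight. The function name is hypothetical.
 */
static unsigned int llc_parent_imb_numa_nr(unsigned int sd_span,
					   unsigned int child_span)
{
	unsigned int nr_llcs;

	assert(child_span > 0 && sd_span >= child_span);
	nr_llcs = sd_span / child_span;

	/* Single LLC: 25% of the node. Multiple LLCs: one task per LLC. */
	return nr_llcs == 1 ? sd_span >> 2 : nr_llcs;
}
```

With the assumed spans, llc_parent_imb_numa_nr(128, 16) gives 8, i.e. one task per LLC, which is the imb_numa_nr that the "running < imb_numa_nr" check is then compared against.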

> +				nr_llcs = sd->span_weight / child->span_weight;
> +				if (nr_llcs == 1)
> +					imb = sd->span_weight >> 2;
> +				else
> +					imb = nr_llcs;
> +				sd->imb_numa_nr = imb;
> +
> +				/* Set span based on the first NUMA domain. */
> +				top = sd;
> +				top_p = top->parent;
> +				while (top_p && !(top_p->flags & SD_NUMA)) {
> +					top = top->parent;
> +					top_p = top->parent;
> +				}
> +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
> +			} else {
> +				int factor = max(1U, (sd->span_weight / imb_span));
> +
> +				sd->imb_numa_nr = imb * factor;
> +			}
> +		}
> +	}
> +
>  	/* Calculate CPU capacity for physical packages and nodes */
>  	for (i = nr_cpumask_bits-1; i >= 0; i--) {
>  		if (!cpumask_test_cpu(i, cpu_map))

Please correct me if I'm wrong.

Thanks and Regards
Prateek