linux-kernel - Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD

Message-ID: <YbcEE/mgIAhWuS+A@BLR-5CG11610CF.amd.com>
Date:   Mon, 13 Dec 2021 13:58:03 +0530
From:   "Gautham R. Shenoy" <gautham.shenoy@....com>
To:     Mel Gorman <mgorman@...hsingularity.net>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Valentin Schneider <valentin.schneider@....com>,
        Aubrey Li <aubrey.li@...ux.intel.com>,
        Barry Song <song.bao.hua@...ilicon.com>,
        Mike Galbraith <efault@....de>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when
 SD_NUMA spans multiple LLCs

Hello Mel,

On Fri, Dec 10, 2021 at 09:33:07AM +0000, Mel Gorman wrote:
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
> 
> Zen* has multiple LLCs per node with local memory channels and due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> adjusts the imbalance on multi-LLC machines to allow an imbalance up to
> the point where LLCs should be balanced between nodes.
> 
> On a Zen3 machine running STREAM parallelised with OMP to have on instance
> per LLC the results and without binding, the results are
> 
>                             5.16.0-rc1             5.16.0-rc1
>                                vanilla       sched-numaimb-v4
> MB/sec copy-16    166712.18 (   0.00%)   651540.22 ( 290.82%)
> MB/sec scale-16   140109.66 (   0.00%)   382254.74 ( 172.83%)
> MB/sec add-16     160791.18 (   0.00%)   623073.98 ( 287.51%)
> MB/sec triad-16   160043.84 (   0.00%)   633964.52 ( 296.12%)


Could you please share the size of the STREAM array? These numbers
are higher than what I am observing.

> 
> STREAM can use directives to force the spread if the OpenMP is new
> enough but that doesn't help if an application uses threads and
> it's not known in advance how many threads will be created.
> 
> Coremark is a CPU and cache intensive benchmark parallelised with
> threads. When running with 1 thread per instance, the vanilla kernel
> allows threads to contend on cache. With the patch;
> 
>                                5.16.0-rc1             5.16.0-rc1
>                                   vanilla    sched-numaimb-v4r24
> Min       Score-16   367816.09 (   0.00%)   384015.36 (   4.40%)
> Hmean     Score-16   389627.78 (   0.00%)   431907.14 *  10.85%*
> Max       Score-16   416178.96 (   0.00%)   480120.03 (  15.36%)
> Stddev    Score-16    17361.82 (   0.00%)    32505.34 ( -87.22%)
> CoeffVar  Score-16        4.45 (   0.00%)        7.49 ( -68.30%)
> 
> It can also make a big difference for semi-realistic workloads
> like specjbb which can execute arbitrary numbers of threads without
> advance knowledge of how they should be placed
> 
>                                5.16.0-rc1             5.16.0-rc1
>                                   vanilla       sched-numaimb-v4
> Hmean     tput-1      73743.05 (   0.00%)    70258.27 *  -4.73%*
> Hmean     tput-8     563036.51 (   0.00%)   591187.39 (   5.00%)
> Hmean     tput-16   1016590.61 (   0.00%)  1032311.78 (   1.55%)
> Hmean     tput-24   1418558.41 (   0.00%)  1424005.80 (   0.38%)
> Hmean     tput-32   1608794.22 (   0.00%)  1907855.80 *  18.59%*
> Hmean     tput-40   1761338.13 (   0.00%)  2108162.23 *  19.69%*
> Hmean     tput-48   2290646.54 (   0.00%)  2214383.47 (  -3.33%)
> Hmean     tput-56   2463345.12 (   0.00%)  2780216.58 *  12.86%*
> Hmean     tput-64   2650213.53 (   0.00%)  2598196.66 (  -1.96%)
> Hmean     tput-72   2497253.28 (   0.00%)  2998882.47 *  20.09%*
> Hmean     tput-80   2820786.72 (   0.00%)  2951655.27 (   4.64%)
> Hmean     tput-88   2813541.68 (   0.00%)  3045450.86 *   8.24%*
> Hmean     tput-96   2604158.67 (   0.00%)  3035311.91 *  16.56%*
> Hmean     tput-104  2713810.62 (   0.00%)  2984270.04 (   9.97%)
> Hmean     tput-112  2558425.37 (   0.00%)  2894737.46 *  13.15%*
> Hmean     tput-120  2611434.93 (   0.00%)  2781661.01 (   6.52%)
> Hmean     tput-128  2706103.22 (   0.00%)  2811447.85 (   3.89%)


> 
> Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 36 +++++++++++++++++----------------
>  kernel/sched/topology.c        | 37 ++++++++++++++++++++++++++++++++++
>  3 files changed, 57 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index c07bfa2d80f2..54f5207154d3 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>  	unsigned int busy_factor;	/* less balancing by factor if busy */
>  	unsigned int imbalance_pct;	/* No balance until over watermark */
>  	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
> +	unsigned int imb_numa_nr;	/* Nr imbalanced tasks allowed between nodes */
>  
>  	int nohz_idle;			/* NOHZ IDLE status */
>  	int flags;			/* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a969affca76..972ba586b113 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>  
>  	int src_cpu, src_nid;
>  	int dst_cpu, dst_nid;
> +	int imb_numa_nr;
>  
>  	struct numa_stats src_stats, dst_stats;
>  
> @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static unsigned long cpu_util(int cpu);
>  static inline long adjust_numa_imbalance(int imbalance,
> -					int dst_running, int dst_weight);
> +					int dst_running, int dst_weight,
> +					int imb_numa_nr);
>  
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1885,7 +1887,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  		dst_running = env->dst_stats.nr_running + 1;
>  		imbalance = max(0, dst_running - src_running);
>  		imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -							env->dst_stats.weight);
> +						  env->dst_stats.weight,
> +						  env->imb_numa_nr);
>  
>  		/* Use idle CPU if there is no imbalance */
>  		if (!imbalance) {
> @@ -1950,8 +1953,10 @@ static int task_numa_migrate(struct task_struct *p)
>  	 */
>  	rcu_read_lock();
>  	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -	if (sd)
> +	if (sd) {
>  		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +		env.imb_numa_nr = sd->imb_numa_nr;
> +	}
>  	rcu_read_unlock();
>  
>  	/*
> @@ -9186,12 +9191,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  				return idlest;
>  #endif
>  			/*
> -			 * Otherwise, keep the task on this node to stay close
> -			 * its wakeup source and improve locality. If there is
> -			 * a real need of migration, periodic load balance will
> -			 * take care of it.
> +			 * Otherwise, keep the task on this node to stay local
> +			 * to its wakeup source if the number of running tasks
> +			 * are below the allowed imbalance. If there is a real
> +			 * need of migration, periodic load balance will take
> +			 * care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> +			if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
>  				return NULL;
>  		}
>  
> @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	}
>  }
>  
> -#define NUMA_IMBALANCE_MIN 2
> -
>  static inline long adjust_numa_imbalance(int imbalance,
> -				int dst_running, int dst_weight)
> +				int dst_running, int dst_weight,
> +				int imb_numa_nr)
>  {
>  	if (!allow_numa_imbalance(dst_running, dst_weight))
>  		return imbalance;
>

If 4 * dst_running >= dst_weight, we return imbalance here. The
dst_weight here corresponds to the span of the domain, while
dst_running is the nr_running in busiest.

On Zen3, at the topmost NUMA domain, dst_weight = 256 across all the
Nodes Per Socket (NPS) configurations (NPS = 1/2/4). There are two
groups, where each group is a socket. So, unless there are at least
64 tasks running in one of the sockets, we would not return imbalance
here and would go to the next step.
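
A minimal userspace sketch of that first check, assuming
allow_numa_imbalance() in 5.16-rc1 compares dst_running against a
quarter of the domain span (dst_weight >> 2). The helper
numa_imbalance_cutoff() is hypothetical and only exists to make the
64-task figure explicit:

```c
#include <stdbool.h>

/*
 * Sketch of the existing check (assumed form): an imbalance is only
 * tolerated while the busiest group runs fewer tasks than a quarter
 * of the CPUs spanned by the domain.
 */
static bool allow_numa_imbalance(int dst_running, int dst_weight)
{
	return dst_running < (dst_weight >> 2);
}

/* Hypothetical helper: smallest nr_running at which the check fails. */
static int numa_imbalance_cutoff(int dst_weight)
{
	return dst_weight >> 2;
}
```

With dst_weight = 256, the cutoff is 64: at 63 running tasks the
imbalance may still be tolerated, at 64 it is returned unchanged.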


> -	/*
> -	 * Allow a small imbalance based on a simple pair of communicating
> -	 * tasks that remain local when the destination is lightly loaded.
> -	 */
> -	if (imbalance <= NUMA_IMBALANCE_MIN)
> +	if (imbalance <= imb_numa_nr)

In NPS=1 mode, imb_numa_nr would be 4. Since NUMA domains don't have
PREFER_SIBLING, we would be balancing the number of idle CPUs. We will
end up doing the rebalance as long as the difference between the idle
CPUs is at least 8.

In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain. So here, we
will not rebalance unless the difference between the idle CPUs is 16.

In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the
threshold is now bumped up to 32.
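
The interaction between the two checks can be sketched in userspace as
follows. This assumes the patched adjust_numa_imbalance() shown in the
hunk, and assumes that without PREFER_SIBLING the load balancer
derives the imbalance as half the difference in idle CPUs, floored at
zero. The helper idle_cpu_imbalance() is a hypothetical name for that
second step:

```c
/* Sketch of the patched helper, with types simplified for userspace. */
static long adjust_numa_imbalance(int imbalance, int dst_running,
				  int dst_weight, int imb_numa_nr)
{
	/* Busy enough (assumed allow_numa_imbalance() form): no tolerance. */
	if (dst_running >= (dst_weight >> 2))
		return imbalance;

	/* Lightly loaded: tolerate an imbalance of up to imb_numa_nr. */
	if (imbalance <= imb_numa_nr)
		return 0;

	return imbalance;
}

/* Hypothetical helper: half the idle CPU difference, floored at 0. */
static long idle_cpu_imbalance(int local_idle, int busiest_idle)
{
	int diff = local_idle - busiest_idle;

	return diff > 0 ? diff >> 1 : 0;
}
```

Under these assumptions the idle CPU difference has to go past twice
imb_numa_nr before a nonzero imbalance is reported, which is where the
figures of roughly 8, 16 and 32 for NPS=1/2/4 come from. The exact
boundary depends on the `<=` comparison.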

>  		return 0;



>  
>  	return imbalance;
> @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		/* Consider allowing a small imbalance between NUMA groups */
>  		if (env->sd->flags & SD_NUMA) {
>  			env->imbalance = adjust_numa_imbalance(env->imbalance,
> -				busiest->sum_nr_running, env->sd->span_weight);
> +				busiest->sum_nr_running, env->sd->span_weight,
> +				env->sd->imb_numa_nr);
>  		}
>  
>  		return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..bacec575ade2 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  		}
>  	}
>  
> +	/*
> +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> +	 * imbalanced.
> +	 */
> +	for_each_cpu(i, cpu_map) {
> +		unsigned int imb = 0;
> +		unsigned int imb_span = 1;
> +
> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +			struct sched_domain *child = sd->child;
> +
> +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +				struct sched_domain *top = sd;


We don't seem to be using top anywhere that sd could not be used
instead, since we already have the variables imb and imb_span to
record top->imb_numa_nr and top->span_weight.


> +				unsigned int llc_sq;
> +
> +				/*
> +				 * nr_llcs = (top->span_weight / llc_weight);
> +				 * imb = (child_weight / nr_llcs) >> 2

child here is the LLC. So can we use imb = (llc_weight / nr_llcs) >> 2 ?

> +				 *
> +				 * is equivalent to
> +				 *
> +				 * imb = (llc_weight^2 / top->span_weight) >> 2
> +				 *
> +				 */
> +				llc_sq = child->span_weight * child->span_weight;
> +
> +				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> +				imb_span = sd->span_weight;

On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256.
   with NPS=1
      top = DIE.
      top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128.

   with NPS=2
      top = NODE.
      top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64.

   with NPS=4      
      top = NODE.
      top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32.

On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64.
   with NPS=1
      top = DIE.
      top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128.

   with NPS=2
      top = NODE.
      top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64.

   with NPS=4      
      top = NODE.
      top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32.
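
The arithmetic above can be reproduced with a small userspace sketch.
Both function names are hypothetical: llc_imb() follows the
llc_weight^2 form used in the patch, and llc_imb_by_count() follows
the nr_llcs form from the comment, with the same floor of 2 applied to
both:

```c
/* imb = max(2U, (llc_weight^2 / top_span) >> 2), as in the hunk above. */
static unsigned int llc_imb(unsigned int llc_weight, unsigned int top_span)
{
	unsigned int llc_sq = llc_weight * llc_weight;
	unsigned int imb = (llc_sq / top_span) >> 2;

	return imb > 2U ? imb : 2U;
}

/* Same quantity via nr_llcs = top_span / llc_weight (integer division). */
static unsigned int llc_imb_by_count(unsigned int llc_weight,
				     unsigned int top_span)
{
	unsigned int nr_llcs = top_span / llc_weight;
	unsigned int imb = (llc_weight / nr_llcs) >> 2;

	return imb > 2U ? imb : 2U;
}
```

For every Zen2 and Zen3 configuration listed above, both forms hit the
floor of 2. A made-up topology with llc_weight = 64 and
top->span_weight = 128 would give 8 with either form, so the floor is
not inherent to the formula.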


> +
> +				sd->imb_numa_nr = imb;
> +			} else {
> +				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> +			}

On Zen3,
   with NPS=1
        sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4.

   with NPS=2
        sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4.
        sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8.

   with NPS=4
        sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8.
        sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/32) = 16.


For Zen2, since the imb_span and imb values are the same as the
corresponding NPS=x values on Zen3, the imb_numa_nr values are the
same as well since the corresponding sd->span_weight is the same.
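
To make the propagation up the hierarchy concrete, here is a userspace
sketch of the loop from the patch for a single CPU. struct dom is a
hypothetical, stripped-down stand-in for struct sched_domain, with
shares_llc standing in for SD_SHARE_PKG_RESOURCES and the array order
standing in for the child/parent links:

```c
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for struct sched_domain. */
struct dom {
	unsigned int span_weight;
	int shares_llc;		/* stands in for SD_SHARE_PKG_RESOURCES */
	unsigned int imb_numa_nr;
};

/* Walk one CPU's domains bottom-up; doms[0] is the lowest level. */
static void compute_imb_numa_nr(struct dom *doms, size_t n)
{
	unsigned int imb = 0, imb_span = 1;
	size_t i;

	for (i = 0; i < n; i++) {
		struct dom *sd = &doms[i];
		struct dom *child = i ? &doms[i - 1] : NULL;

		if (!sd->shares_llc && child && child->shares_llc) {
			/* First domain above the LLC: the "top" domain. */
			unsigned int llc_sq = child->span_weight *
					      child->span_weight;

			imb = (llc_sq / sd->span_weight) >> 2;
			if (imb < 2U)
				imb = 2U;
			imb_span = sd->span_weight;
			sd->imb_numa_nr = imb;
		} else {
			/* Scale by how many "top" domains fit in this one. */
			sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
		}
	}
}
```

Feeding it an assumed Zen3 NPS=4 chain with span weights 2 (SMT),
16 (MC), 32 (NODE), 128 (NUMA) and 256 (NUMA), where only the first two
levels share the LLC, yields imb_numa_nr = 2, 8 and 16 for the last
three levels, matching the values listed above.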


If we look at the highest NUMA domain, there are two groups in all the
NPS configurations. There are the same number of LLCs in each of these
groups across the different NPS configurations (nr_llcs=8 on Zen3, 16
on Zen2). However, the imb_numa_nr at this domain varies with the NPS
value, since we compute the imb_numa_nr value relative to the number
of "top" domains that can fit within this NUMA domain, and the size of
the "top" domain varies with the NPS value. This shows up in the
benchmark results.



The numbers with stream, tbench and YCSB +
Mongodb are as follows:


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Test:     tip/sched/core                 mel-v3                    mel-v4
 Copy:    113716.62 (0.00 pct)     218961.59 (92.55 pct)     217130.07 (90.93 pct)
Scale:    110996.89 (0.00 pct)     216674.73 (95.20 pct)     220765.94 (98.89 pct)
  Add:    124504.19 (0.00 pct)     253461.32 (103.57 pct)    260273.88 (109.04 pct)
Triad:    122890.43 (0.00 pct)     247552.00 (101.44 pct)    252615.62 (105.56 pct)


NPS=2
Test:     tip/sched/core                 mel-v3                     mel-v4
 Copy:    58217.00 (0.00 pct)      204630.34 (251.49 pct)     191312.73 (228.62 pct)
Scale:    55004.76 (0.00 pct)      212142.88 (285.68 pct)     175499.15 (219.06 pct)
  Add:    63269.04 (0.00 pct)      254752.56 (302.64 pct)     203571.50 (221.75 pct)
Triad:    62178.25 (0.00 pct)      247290.80 (297.71 pct)     198988.70 (220.02 pct)

NPS=4
Test:     tip/sched/core                 mel-v3                     mel-v4
 Copy:    37986.66 (0.00 pct)      254183.87 (569.13 pct)     48748.87 (28.33 pct)
Scale:    35471.22 (0.00 pct)      237804.76 (570.41 pct)     48317.82 (36.21 pct)
  Add:    39303.25 (0.00 pct)      292285.20 (643.66 pct)     54259.59 (38.05 pct)
Triad:    39319.85 (0.00 pct)      285284.30 (625.54 pct)     54503.98 (38.61 pct)


We can see that with the v4 patch, for NPS=2 and NPS=4, the gains
start diminishing since the thresholds are higher than with NPS=1.
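
For reference when reading the tables, the pct columns can be read as
the relative change against the tip/sched/core column. A minimal
sketch, assuming that interpretation (pct_change() is a hypothetical
helper name):

```c
/* Relative change of a result against the baseline, in percent. */
static double pct_change(double baseline, double value)
{
	return (value - baseline) / baseline * 100.0;
}
```

For example, the NPS=4 Copy result with mel-v4 is
pct_change(37986.66, 48748.87), which comes out at about 28.33.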

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Test:     tip/sched/core                 mel-v3                    mel-v4
 Copy:    137362.66 (0.00 pct)     236661.65 (72.28 pct)     241148.65 (75.55 pct)
Scale:    126742.24 (0.00 pct)     214568.17 (69.29 pct)     226416.41 (78.64 pct)
  Add:    148236.33 (0.00 pct)     257114.42 (73.44 pct)     272030.50 (83.51 pct)
Triad:    146913.25 (0.00 pct)     241880.88 (64.64 pct)     259873.61 (76.88 pct)

NPS=2
Test:     tip/sched/core                 mel-v3                    mel-v4
 Copy:    107143.94 (0.00 pct)     244922.66 (128.59 pct)    198299.91 (85.07 pct)
Scale:    102004.90 (0.00 pct)     218738.55 (114.43 pct)    177890.23 (74.39 pct)
  Add:    117760.23 (0.00 pct)     270516.24 (129.71 pct)    211458.30 (79.56 pct)
Triad:    115927.92 (0.00 pct)     255985.20 (120.81 pct)    197812.60 (70.63 pct)


NPS=4
Test:     tip/sched/core                 mel-v3                    mel-v4
 Copy:    111653.17 (0.00 pct)     253912.17 (127.41 pct)     48898.34 (-56.20 pct)
Scale:    105289.35 (0.00 pct)     223710.85 (112.47 pct)     48426.03 (-54.00 pct)
  Add:    120927.64 (0.00 pct)     277701.20 (129.64 pct)     54425.48 (-54.99 pct)
Triad:    117659.97 (0.00 pct)     259473.84 (120.52 pct)     54622.82 (-53.57 pct)

With -DNTIMES=100, each of the Copy, Scale, Add and Triad kernels runs
for a longer duration. So the test takes a longer time (6-10 seconds),
giving the load-balancer sufficient time to place the tasks and
balance them. In this configuration we see that v4 shows some
degradation on NPS=4. This is due to the imb_numa_nr being higher
compared to v3.

While Stream benefits from spreading, it is only fair to also look at
the gains that we make with benchmarks that prefer the tasks to be
co-located instead of spread out. I chose tbench and YCSB+Mongodb as
representatives of these. The numbers are as follows:


~~~~~~~~~~~~~~~~~~~~~~~~
tbench
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Clients:     tip/sched/core                 mel-v3                  mel-v4
    1        633.25 (0.00 pct)        619.18 (-2.22 pct)      632.96 (-0.04 pct)
    2        1152.54 (0.00 pct)       1189.91 (3.24 pct)      1184.84 (2.80 pct)
    4        1946.53 (0.00 pct)       2177.45 (11.86 pct)     1979.62 (1.69 pct)
    8        3554.65 (0.00 pct)       3565.16 (0.29 pct)      3678.13 (3.47 pct)
   16        6222.00 (0.00 pct)       6484.89 (4.22 pct)      6256.02 (0.54 pct)
   32        11707.57 (0.00 pct)      12185.93 (4.08 pct)     12006.63 (2.55 pct)
   64        18433.50 (0.00 pct)      19537.03 (5.98 pct)     19088.57 (3.55 pct)
  128        27400.07 (0.00 pct)      31771.53 (15.95 pct)    27265.00 (-0.49 pct)
  256        33195.27 (0.00 pct)      24478.67 (-26.25 pct)   34065.60 (2.62 pct)
  512        41633.10 (0.00 pct)      54833.20 (31.70 pct)    46724.00 (12.22 pct)
 1024        53877.23 (0.00 pct)      56363.37 (4.61 pct)     44813.10 (-16.82 pct)


NPS=2
Clients:     tip/sched/core                 mel-v3                  mel-v4
    1        629.76 (0.00 pct)        620.94 (-1.40 pct)      629.22 (-0.08 pct)
    2        1177.01 (0.00 pct)       1203.27 (2.23 pct)      1169.12 (-0.66 pct)
    4        1990.97 (0.00 pct)       2228.18 (11.91 pct)     1888.39 (-5.15 pct)
    8        3535.45 (0.00 pct)       3620.76 (2.41 pct)      3662.72 (3.59 pct)
   16        6309.02 (0.00 pct)       6548.66 (3.79 pct)      6508.67 (3.16 pct)
   32        12038.73 (0.00 pct)      12145.97 (0.89 pct)     11411.50 (-5.21 pct)
   64        18599.67 (0.00 pct)      19448.87 (4.56 pct)     17146.07 (-7.81 pct)
  128        27861.57 (0.00 pct)      30630.53 (9.93 pct)     28217.30 (1.27 pct)
  256        28215.80 (0.00 pct)      26864.67 (-4.78 pct)    29330.47 (3.95 pct)
  512        44239.67 (0.00 pct)      52822.47 (19.40 pct)    42652.63 (-3.58 pct)
 1024        54403.53 (0.00 pct)      53905.57 (-0.91 pct)    48490.30 (-10.86 pct)



NPS=4
Clients:     tip/sched/core                 mel-v3                  mel-v4
    1        622.68 (0.00 pct)        617.87 (-0.77 pct)      667.38 (7.17 pct)
    2        1160.74 (0.00 pct)       1182.40 (1.86 pct)      1294.12 (11.49 pct)
    4        1961.29 (0.00 pct)       2172.41 (10.76 pct)     2477.76 (26.33 pct)
    8        3664.25 (0.00 pct)       3450.80 (-5.82 pct)     4067.42 (11.00 pct)
   16        6495.53 (0.00 pct)       5873.41 (-9.57 pct)     6931.66 (6.71 pct)
   32        11833.27 (0.00 pct)      12010.43 (1.49 pct)     12710.60 (7.41 pct)
   64        17723.50 (0.00 pct)      18416.23 (3.90 pct)     18793.47 (6.03 pct)
  128        27724.83 (0.00 pct)      27894.50 (0.61 pct)     27948.60 (0.80 pct)
  256        31351.70 (0.00 pct)      23944.43 (-23.62 pct)   35430.17 (13.00 pct)
  512        43383.43 (0.00 pct)      49830.63 (14.86 pct)    43877.83 (1.13 pct)
 1024        46974.27 (0.00 pct)      53583.83 (14.07 pct)    50563.23 (7.64 pct)


With NPS=4, with v4, we see no regressions with tbench compared to
tip/sched/core and there is a considerable improvement in most
cases. So, the higher imb_numa_nr helps pack the tasks, which is
beneficial to tbench.



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
YCSB + Mongodb.

4 client instances, 256 threads per client instance.  These threads
have a very low utilization. The overall system utilization was in the
range of 16-20%.

YCSB workload type : A
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
               tip/sched/core   mel-v3        mel-v4
Throughput     351611.0        314981.33     329026.33
                               (-10.42 pct)  (-6.42 pct)



NPS=4
               tip/sched/core   mel-v3        mel-v4
Throughput     315808.0         316600.67     331093.67
                                (0.25 pct)    (4.84 pct)

Since at NPS=4, the imb_numa_nr=8 and 16 respectively at the lower and
higher NUMA domains, the task spreading happens more reluctantly
compared to v3 where the imb_numa_nr was 1 in both the domains.


--
Thanks and Regards
gautham.