Date:   Wed, 15 Dec 2021 17:22:30 +0530
From:   "Gautham R. Shenoy" <gautham.shenoy@....com>
To:     Mel Gorman <mgorman@...hsingularity.net>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Valentin Schneider <valentin.schneider@....com>,
        Aubrey Li <aubrey.li@...ux.intel.com>,
        Barry Song <song.bao.hua@...ilicon.com>,
        Mike Galbraith <efault@....de>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when
 SD_NUMA spans multiple LLCs

Hello Mel,


On Mon, Dec 13, 2021 at 08:17:37PM +0530, Gautham R. Shenoy wrote:

> 
> Thanks for the patch. I will queue this one for tonight.
>

Getting the numbers took a bit longer than I expected.

> 
> > 
> > 
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index bacec575ade2..1fa3e977521d 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >  
> >  			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> >  			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > -				struct sched_domain *top = sd;
> > +				struct sched_domain *top, *top_p;
> >  				unsigned int llc_sq;
> >  
> >  				/*
> > -				 * nr_llcs = (top->span_weight / llc_weight);
> > -				 * imb = (child_weight / nr_llcs) >> 2
> > +				 * nr_llcs = (sd->span_weight / llc_weight);
> > +				 * imb = (llc_weight / nr_llcs) >> 2
> >  				 *
> >  				 * is equivalent to
> >  				 *
> > -				 * imb = (llc_weight^2 / top->span_weight) >> 2
> > +				 * imb = (llc_weight^2 / sd->span_weight) >> 2
> >  				 *
> >  				 */
> >  				llc_sq = child->span_weight * child->span_weight;
> >  
> > -				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> > -				imb_span = sd->span_weight;
> > -
> > +				imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
> >  				sd->imb_numa_nr = imb;
> > +
> > +				/*
> > +				 * Set span based on top domain that places
> > +				 * tasks in sibling domains.
> > +				 */
> > +				top = sd;
> > +				top_p = top->parent;
> > +				while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
> > +					top = top->parent;
> > +					top_p = top->parent;
> > +				}
> > +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
> >  			} else {
> > -				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> > +				int factor = max(1U, (sd->span_weight / imb_span));
> > +


So for the first NUMA domain, sd->imb_numa_nr will be imb, which turns out
to be 2 for Zen2 and Zen3 processors across all Nodes Per Socket (NPS) settings.

On a 2 Socket Zen3:

NPS=1
   child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
   top_p = NUMA, imb_span = 256.

   NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2

NPS=2
   child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
   top_p = NUMA, imb_span = 128.

   NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
   NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4

NPS=4:
   child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
   top_p = NUMA, imb_span = 128.

   NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
   NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4

Again, we will be load balancing more aggressively across the two
sockets in NPS=1 mode than in NPS=2/4.
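
For completeness, the arithmetic above can be redone outside the kernel.
The following is only a standalone userspace sketch (not kernel code; the
topology weights are hard-coded from the NPS modes listed above) that
reproduces the imb/imb_numa_nr values quoted:

/*
 * Standalone userspace sketch: redo the imb/imb_numa_nr arithmetic
 * from the hunk above for the 2-socket Zen3 topologies in this mail.
 * llc_weight is the MC (LLC) domain weight, node_weight is the span of
 * the first domain without SD_SHARE_PKG_RESOURCES (DIE for NPS=1, NODE
 * for NPS=2/4), imb_span corresponds to top_p->span_weight in the
 * patch, and numa_weights are the spans of the NUMA domains above it.
 */
#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

static void show(const char *nps, unsigned int llc_weight,
		 unsigned int node_weight, unsigned int imb_span,
		 const unsigned int *numa_weights, int nr_numa)
{
	unsigned int llc_sq = llc_weight * llc_weight;
	unsigned int imb = MAX(2U, (llc_sq / node_weight) >> 2);
	int i;

	printf("%s: imb=%u (at the first !SD_SHARE_PKG_RESOURCES domain)\n",
	       nps, imb);
	for (i = 0; i < nr_numa; i++)
		printf("   NUMA span=%3u -> imb_numa_nr=%u\n",
		       numa_weights[i],
		       imb * MAX(1U, numa_weights[i] / imb_span));
}

int main(void)
{
	const unsigned int numa_nps1[]  = { 256 };
	const unsigned int numa_nps24[] = { 128, 256 };

	show("NPS=1", 16, 128, 256, numa_nps1,  1);	/* sd = DIE,  spans 128 CPUs */
	show("NPS=2", 16,  64, 128, numa_nps24, 2);	/* sd = NODE, spans  64 CPUs */
	show("NPS=4", 16,  32, 128, numa_nps24, 2);	/* sd = NODE, spans  32 CPUs */

	return 0;
}

Running this should print imb_numa_nr = 2 for the single NUMA domain in
NPS=1, and (2, 4) for the two NUMA domains in NPS=2/4, matching the
numbers worked out above.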

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
Zen3, 64C128T per socket, 2 sockets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Test:	 tip-core	mel-v3		mel-v4		mel-v4.1
 Copy:	 113666.77      214885.89 	212162.63 	226065.79
 	 (0.00%)	(89.04%)	(86.65%)	(98.88%)
	 
Scale:	 110962.00	215475.05	220002.56	227610.64
	 (0.00%)	(94.18%)	(98.26%)	(105.12%)
	 
  Add:	 124440.44	250338.11	258569.94	271368.51
  	 (0.00%)	(101.17%)	(107.78%)	(118.07%)
	 
Triad:	 122836.42 	244993.16	251075.19	265465.57
	 (0.00%)	(99.44%)	(104.39%)	(116.11%)


NPS=2
Test:	 tip-core	mel-v3		mel-v4		mel-v4.1
 Copy:	 58193.29	203242.78	191171.22	212785.96
 	 (0.00%)	(249.25%) 	(228.51%)	(265.65%)
	 
Scale:	 54980.48	209056.28 	175497.10 	223452.99
	 (0.00%)	(280.23%)	(219.19%)	(306.42%)
	 
  Add:	 63244.24	249610.19	203569.35	270487.22
  	 (0.00%)	(294.67%)	(221.87%)	(327.68%)
	 
Triad:	 62166.77 	242688.72	198971.26	262757.16
	 (0.00%)	(290.38%)	(220.06%)	(322.66%)


NPS=4
Test:	 tip-core	mel-v3		mel-v4		mel-v4.1
 Copy:	 37762.14	253412.82	48748.60	164765.83
 	 (0.00%)	(571.07%)	(29.09%)	(336.32%)
	 
Scale:	 33879.66	237736.93	48317.66	159641.67
	 (0.00%)	(601.70%)	(42.61%)	(371.20%)
	 
  Add:	 38398.90	292186.56	54259.56	184583.70
  	 (0.00%)	(660.92%)	(41.30%)	(380.70%)
	 
Triad:	 37942.38	285124.76	54503.74	181250.80
	 (0.00%)	(651.46%)	(43.64%)	(377.70%)
	 


So while in NPS=1 and NPS=2 we are able to recover the performance
that was obtained with v3, in the NPS=4 case we are not able to
recover as much.

However, this is not only because in v4 the imb_numa_nr thresholds for
NPS=4 were (1,1) for the two NUMA domains while in v4.1 they were
(2,4), but also because v3 used the imb_numa_nr thresholds in
allow_numa_imbalance(), whereas v4 and v4.1 use them in
adjust_numa_imbalance().

The distinction is that in v3, we will trigger load balance as soon as
the number of tasks in the busiest group in the NUMA domain is greater
than or equal to imb_numa_nr at that domain.

In v4 and v4.1, we will trigger the load-balance if one of the
following holds (sketched in code after the list):

 - the number of tasks in the busiest group is greater than 1/4th the
   CPUs in the NUMA domain.

OR

- the difference between the idle CPUs in the busiest and this group
  is greater than imb_numa_nr.
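
To make that distinction concrete, here is a rough sketch of the two
policies as simple predicates. These are hypothetical helpers that only
mirror the description above, not the actual allow_numa_imbalance() /
adjust_numa_imbalance() implementations:

#include <stdbool.h>
#include <stdio.h>

/*
 * v3-style check (used from allow_numa_imbalance() in v3, as described
 * above): spread once the busiest group has imb_numa_nr runnable tasks.
 */
static bool v3_should_spread(unsigned int busiest_nr_running,
			     unsigned int imb_numa_nr)
{
	return busiest_nr_running >= imb_numa_nr;
}

/*
 * v4/v4.1-style check (used from adjust_numa_imbalance(), as described
 * above): spread if the busiest group occupies more than a quarter of
 * the CPUs in the NUMA domain, or if the idle-CPU difference between
 * the busiest group and this group exceeds imb_numa_nr.
 */
static bool v4_should_spread(unsigned int busiest_nr_running,
			     unsigned int numa_domain_cpus,
			     unsigned int idle_cpu_diff,
			     unsigned int imb_numa_nr)
{
	if (busiest_nr_running > numa_domain_cpus / 4)
		return true;

	return idle_cpu_diff > imb_numa_nr;
}

int main(void)
{
	/*
	 * Made-up illustration: 4 tasks stacked on one group of a
	 * 256-CPU NUMA domain with imb_numa_nr = 4 and an idle-CPU
	 * difference of 3.  v3 already spreads here, v4/v4.1 do not.
	 */
	printf("v3 spreads: %d\n", v3_should_spread(4, 4));
	printf("v4 spreads: %d\n", v4_should_spread(4, 256, 3, 4));

	return 0;
}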


If we retain the (2,4) thresholds from v4.1 but use them in
allow_numa_imbalance() as in v3, we get:

NPS=4
Test:	 mel-v4.2
 Copy:	 225860.12 (498.11%)
Scale:	 227869.07 (572.58%)
  Add:	 278365.58 (624.93%)
Triad:	 264315.44 (596.62%)

It shouldn't have made this much of a difference, considering that
with 16 stream tasks we should eventually have hit "imbalance >
imb_numa_nr" in adjust_numa_imbalance(). But these are the numbers!


The trend is similar with -DNTIMES=100. Even in this case, v4.2
recovers the stream performance in the NPS=4 case.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100
Zen3, 64C128T per socket, 2 sockets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NPS=1
Test:	tip-core   mel-v3     mel-v4	mel-v4.1
 Copy:	137281.89  235886.13  240588.80  249005.99
 	(0.00 %)   (71.82%)   (75.25%)	 (81.38%)

Scale:	126559.48  213667.75  226411.39  228523.78
	(0.00 %)   (68.82%)   (78.89%)   (80.56%)

  Add:	148144.23  255780.73  272027.93  272872.49
	(0.00 %)   (72.65%)   (83.62%)	 (84.19%)

Triad:	146829.78  240400.29  259854.84  265640.70
	(0.00 %)   (63.72%)   (76.97%)	 (80.91%)


NPS=2
Test:	tip-core   mel-v3     mel-v4	mel-v4.1
 Copy:	 105873.29 243513.40  198299.43 258266.99
 	 (0.00%)   (130.00%)  (87.29%)	(143.93%)
	 
Scale:	 100510.54 217361.59  177890.00 231509.69
	 (0.00%)   (116.25%)  (76.98%)	(130.33%)
	 
  Add:	 115932.61 268488.03  211436.57	288396.07
  	 (0.00%)   (131.58%)  (82.37%)	(148.76%)
	 
Triad:	 113349.09 253865.68  197810.83	272219.89
	 (0.00%)   (123.96%)  (74.51%)	(140.16%)


NPS=4
Test:	tip-core   mel-v3     mel-v4	mel-v4.1   mel-v4.2
 Copy:	106798.80  251503.78  48898.24	171769.82  266788.03
 	(0.00%)	   (135.49%)  (-54.21%)	(60.83%)   (149.80%)
	
Scale:	101377.71  221135.30  48425.90	160748.21  232480.83
	(0.00%)	   (118.13%)  (-52.23%)	(58.56%)   (129.32%)
	
  Add:	116131.56  275008.74  54425.29  186387.91  290787.31
  	(0.00%)	   (136.80%)  (-53.13%)	(60.49%)   (150.39%)
	
Triad:	113443.70  256027.20  54622.68  180936.47  277456.83
	(0.00%)	   (125.68%)  (-51.85%)	(59.49%)   (144.57%)



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tbench
Zen3, 64C128T per socket, 2 sockets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NPS=1
======
Clients: tip-core   mel-v3    mel-v4    mel-v4.1
    1	 633.19     619.16    632.94    619.27
    	 (0.00%)    (-2.21%)  (-0.03%)	(-2.19%)
	 
    2	 1152.48    1189.88   1184.82   1189.19
    	 (0.00%)    (3.24%)   (2.80%)	(3.18%)
	 
    4	 1946.46    2177.40   1979.56	2196.09
    	 (0.00%)    (11.86%)  (1.70%)	(12.82%)
	 
    8	 3553.29    3564.50   3678.07	3668.77
    	 (0.00%)    (0.31%)   (3.51%)	(3.24%)
	 
   16	 6217.03    6484.58   6249.29	6534.73
   	 (0.00%)    (4.30%)   (0.51%)	(5.11%)
	 
   32	 11702.59   12185.77  12005.99	11917.57
   	 (0.00%)    (4.12%)   (2.59%)	(1.83%)
	 
   64	 18394.56   19535.11  19080.19	19500.55
   	 (0.00%)    (6.20%)   (3.72%)	(6.01%)
	 
  128	 27231.02   31759.92  27200.52	30358.99
  	 (0.00%)    (16.63%)  (-0.11%)	(11.48%)
	 
  256	 33166.10   24474.30  31639.98	24788.12
  	 (0.00%)    (-26.20%) (-4.60%)	(-25.26%)
	 
  512	 41605.44   54823.57  46684.48	54559.02
  	 (0.00%)    (31.77%)  (12.20%)	(31.13%)
	 
 1024	 53650.54   56329.39  44422.99	56320.66
 	 (0.00%)    (4.99%)   (-17.19%)	(4.97%) 


We see that v4.1 performs better than v4 in most cases, except at
256 clients, where the spreading strategy seems to hurt: we see
degradation in both v3 and v4.1 there. This is true for the NPS=2 and
NPS=4 cases as well (see below).

NPS=2
=====
Clients: tip-core   mel-v3    mel-v4    mel-v4.1
    1	 629.76	    620.91    629.11	631.95
    	 (0.00%)    (-1.40%)  (-0.10%)	(0.34%)
	 
    2	 1176.96    1203.12   1169.09	1186.74
    	 (0.00%)    (2.22%)   (-0.66%)	(0.83%)
	 
    4	 1990.97    2228.04   1888.19	1995.21
    	 (0.00%)    (11.90%)  (-5.16%)	(0.21%)
	 
    8	 3534.57    3617.16   3660.30	3548.09
    	 (0.00%)    (2.33%)   (3.55%)	(0.38%)
	 
   16	 6294.71    6547.80   6504.13	6470.34
   	 (0.00%)    (4.02%)   (3.32%)	(2.79%)
	 
   32	 12035.73   12143.03  11396.26	11860.91
   	 (0.00%)    (0.89%)   (-5.31%)	(-1.45%)
	 
   64	 18583.39   19439.12  17126.47	18799.54
   	 (0.00%)    (4.60%)   (-7.83%)	(1.16%)
	 
  128	 27811.89   30562.84  28090.29	27468.94
  	 (0.00%)    (9.89%)   (1.00%)	(-1.23%)
	 
  256	 28148.95   26488.57  29117.13	23628.29
  	 (0.00%)    (-5.89%)  (3.43%)	(-16.05%)
	 
  512	 43934.15   52796.38  42603.49	41725.75
  	 (0.00%)    (20.17%)  (-3.02%)	(-5.02%)
	 
 1024	 54391.65   53891.83  48419.09	43913.40
 	 (0.00%)    (-0.91%)  (-10.98%)	(-19.26%)

In this case, v4.1 performs as well as v4 up to 64 clients. Beyond
that we see degradation, which is significant in the 1024-client case.

NPS=4
=====
Clients: tip-core   mel-v3    mel-v4    mel-v4.1    mel-v4.2
    1	 622.65	    617.83    667.34	644.76	    617.58
    	 (0.00%)    (-0.77%)  (7.17%)	(3.55%)	    (-0.81%)
	 
    2	 1160.62    1182.30   1294.08	1193.88	    1182.55
    	 (0.00%)    (1.86%)   (11.49%)	(2.86%)	    (1.88%)
	 
    4	 1961.14    2171.91   2477.71	1929.56	    2116.01
    	 (0.00%)    (10.74%)  (26.34%)	(-1.61%)    (7.89%)
	 
    8	 3662.94    3447.98   4067.40	3627.43	    3580.32
    	 (0.00%)    (-5.86%)  (11.04%)	(-0.96%)    (-2.25%)
	 
   16	 6490.92    5871.93   6924.32	6660.13	    6413.34
   	 (0.00%)    (-9.53%)  (6.67%)	(2.60%)	    (-1.19%)
	 
   32	 11831.81   12004.30  12709.06	12187.78    11767.46
   	 (0.00%)    (1.45%)   (7.41%)	(3.00%)	    (-0.54%)
	 
   64	 17717.36   18406.79  18785.41	18820.33    18197.86
   	 (0.00%)    (3.89%)   (6.02%)	(6.22%)	    (2.71%)
	 
  128	 27723.35   27777.34  27939.63	27399.64    24310.93
  	 (0.00%)    (0.19%)   (0.78%)	(-1.16%)    (-12.30%)
	 
  256	 30919.69   23937.03  35412.26	26780.37    24642.24
  	 (0.00%)    (-22.58%) (14.52%)	(-13.38%)   (-20.30%)
	 
  512	 43366.03   49570.65  43830.84	43654.42    41031.90
  	 (0.00%)    (14.30%)  (1.07%)	(0.66%)	    (-5.38%)
	 
 1024	 46960.83   53576.16  50557.19	43743.07    40884.98
 	 (0.00%)    (14.08%)  (7.65%)	(-6.85%)    (-12.93%)


In the NPS=4 case, v4 clearly provides the best results.

v4.1 does better than v4.2 since it is able to hold off spreading for
a longer period.

--
Thanks and Regards
gautham.
