Message-ID: <53d53475-2720-cfbb-c567-563e900144ee@amd.com>
Date:   Sat, 18 Feb 2023 01:05:32 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     Chen Yu <yu.c.chen@...el.com>, Abel Wu <wuyun.abel@...edance.com>
Cc:     Mel Gorman <mgorman@...hsingularity.net>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Tim Chen <tim.c.chen@...el.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        Yicong Yang <yangyicong@...ilicon.com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        Honglei Wang <wanghonglei@...ichuxing.com>,
        Len Brown <len.brown@...el.com>,
        Chen Yu <yu.chen.surf@...il.com>,
        Tianchen Ding <dtcccc@...ux.alibaba.com>,
        Joel Fernandes <joel@...lfernandes.org>,
        Josh Don <joshdon@...gle.com>, Hillf Danton <hdanton@...a.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 0/2] sched/fair: Wake short task on current CPU

Hello Chenyu and Abel,

I'll leave the detailed results from testing on a dual socket Zen3 system
(2 x 64C/128T) below.

tl;dr

o Most benchmark results see small wins or are comparable to tip.
o SpecJBB Max-jOPS sees a small hit but Critical-jOPS improves.
o ycsb-mongodb sees a small uplift in NPS1 mode.
o Numbers for Netperf runs are pending; I'll share them in the
  coming week.
o Abel's suggestion on top of v5 seems promising, but I notice a
  few regressions on larger workloads.

Detailed Results:

NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255
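
The CPU numbering in these listings follows a regular pattern, which the
rough sketch below generates for any NPS mode. It assumes that cores are
numbered contiguously across sockets and that each core's SMT sibling is
numbered exactly "total cores" higher; numa_layout() and its parameters
are illustrative names, not taken from any tool.

```python
def numa_layout(sockets=2, cores_per_socket=64, nps=1):
    """Return {node: [(first_core, last_core), (first_smt, last_smt)]}.

    Assumption: SMT siblings are numbered total_cores above their core.
    """
    total_cores = sockets * cores_per_socket
    cores_per_node = cores_per_socket // nps
    layout = {}
    for node in range(sockets * nps):
        first = node * cores_per_node
        last = first + cores_per_node - 1
        layout[node] = [(first, last),
                        (first + total_cores, last + total_cores)]
    return layout


if __name__ == "__main__":
    for nps in (1, 2, 4):
        print(f"NPS{nps}:")
        for node, ((c0, c1), (s0, s1)) in numa_layout(nps=nps).items():
            print(f"    Node {node}: {c0}-{c1}, {s0}-{s1}")
```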

Benchmark Results:

Kernel versions:
- tip:          6.2.0-rc6 tip sched/core
- sis_short: 	6.2.0-rc6 tip sched/core + this series

When the testing started, the tip was at:
commit 4d627628d758 "cpuidle: Fix poll_idle() noinstr annotation"

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test:			tip			sis_short
 1-groups:	   4.38 (0.00 pct)	   4.49 (-2.51 pct)
 2-groups:	   5.12 (0.00 pct)	   5.20 (-1.56 pct)
 4-groups:	   4.21 (0.00 pct)	   4.24 (-0.71 pct)
 8-groups:	   4.68 (0.00 pct)	   4.73 (-1.06 pct)
16-groups:	   6.13 (0.00 pct)	   6.35 (-3.58 pct)

o NPS2

Test:			tip			sis_short
 1-groups:	   4.51 (0.00 pct)	   4.36 (3.32 pct)
 2-groups:	   4.31 (0.00 pct)	   4.35 (0.92 pct)
 4-groups:	   4.17 (0.00 pct)	   4.08 (2.15 pct)
 8-groups:	   4.58 (0.00 pct)	   4.49 (1.96 pct)
16-groups:	   5.74 (0.00 pct)	   5.93 (-3.31 pct)

o NPS4

Test:			tip			sis_short
 1-groups:	   4.47 (0.00 pct)	   4.51 (-0.89 pct)
 2-groups:	   4.97 (0.00 pct)	   5.04 (-1.40 pct)
 4-groups:	   4.26 (0.00 pct)	   4.28 (-0.46 pct)
 8-groups:	   5.46 (0.00 pct)	   5.56 (-1.83 pct)
16-groups:	   6.38 (0.00 pct)	   6.10 (4.38 pct)
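
For readers checking the "pct" columns: the hackbench numbers above are
consistent with a signed change relative to tip in which a positive value
means improvement (hackbench reports time, so lower is better). The sketch
below is an assumption about how the column is derived, not the script
actually used; pct_change() is a hypothetical helper. A few entries differ
by one in the last digit, which looks like truncation rather than rounding.

```python
def pct_change(tip, patched, lower_is_better=False):
    """Signed percentage change versus tip; positive means improvement."""
    delta = (tip - patched) if lower_is_better else (patched - tip)
    return 100.0 * delta / tip


if __name__ == "__main__":
    # hackbench NPS1, 1-groups: tip 4.38 vs sis_short 4.49 (time)
    print(f"{pct_change(4.38, 4.49, lower_is_better=True):.2f} pct")
    # hackbench NPS2, 4-groups: tip 4.17 vs sis_short 4.08 (time)
    print(f"{pct_change(4.17, 4.08, lower_is_better=True):.2f} pct")
```

For throughput-style benchmarks, where higher is better, the same helper
would be called with lower_is_better=False.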

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers:	tip			sis_short
  1:	  36.00 (0.00 pct)	  27.00 (25.00 pct)
  2:	  37.00 (0.00 pct)	  32.00 (13.51 pct)
  4:	  41.00 (0.00 pct)	  34.00 (17.07 pct)
  8:	  46.00 (0.00 pct)	  43.00 (6.52 pct)
 16:	  66.00 (0.00 pct)	  66.00 (0.00 pct)
 32:	 111.00 (0.00 pct)	 108.00 (2.70 pct)
 64:	 207.00 (0.00 pct)	 206.00 (0.48 pct)
128:	 483.00 (0.00 pct)	 481.00 (0.41 pct)
256:	 46272.00 (0.00 pct)	 45120.00 (2.48 pct)
512:	 76160.00 (0.00 pct)	 77696.00 (-2.01 pct)

o NPS2

#workers:	tip			sis_short
  1:	  33.00 (0.00 pct)	  31.00 (6.06 pct)
  2:	  35.00 (0.00 pct)	  31.00 (11.42 pct)
  4:	  38.00 (0.00 pct)	  38.00 (0.00 pct)
  8:	  51.00 (0.00 pct)	  47.00 (7.84 pct)
 16:	  64.00 (0.00 pct)	  67.00 (-4.68 pct)
 32:	 118.00 (0.00 pct)	 116.00 (1.69 pct)
 64:	 214.00 (0.00 pct)	 217.00 (-1.40 pct)
128:	 497.00 (0.00 pct)	 504.00 (-1.40 pct)
256:	 45632.00 (0.00 pct)	 44352.00 (2.80 pct)
512:	 81024.00 (0.00 pct)	 78464.00 (3.15 pct)

o NPS4

#workers:	tip			sis_short
  1:	  33.00 (0.00 pct)	  32.00 (3.03 pct)
  2:	  40.00 (0.00 pct)	  32.00 (20.00 pct)
  4:	  42.00 (0.00 pct)	  38.00 (9.52 pct)
  8:	  64.00 (0.00 pct)	  65.00 (-1.56 pct)
 16:	  73.00 (0.00 pct)	  69.00 (5.47 pct)
 32:	 112.00 (0.00 pct)	 112.00 (0.00 pct)
 64:	 215.00 (0.00 pct)	 207.00 (3.72 pct)
128:	 615.00 (0.00 pct)	 593.00 (3.73 pct)
256:	 46144.00 (0.00 pct)	 45376.00 (1.66 pct)
512:	 78208.00 (0.00 pct)	 77696.00 (0.65 pct)


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients:	tip			sis_short
    1	 536.78 (0.00 pct)	 537.38 (0.11 pct)
    2	 1050.74 (0.00 pct)	 1058.74 (0.76 pct)
    4	 1993.47 (0.00 pct)	 1976.79 (-0.83 pct)
    8	 3498.02 (0.00 pct)	 3657.16 (4.54 pct)
   16	 6202.01 (0.00 pct)	 6014.62 (-3.02 pct)
   32	 11544.55 (0.00 pct)	 11847.47 (2.62 pct)
   64	 21828.75 (0.00 pct)	 21754.85 (-0.33 pct)
  128	 31095.92 (0.00 pct)	 31643.35 (1.76 pct)
  256	 54828.12 (0.00 pct)	 55432.29 (1.10 pct)
  512	 54888.10 (0.00 pct)	 55917.91 (1.87 pct)
 1024	 54916.75 (0.00 pct)	 53468.79 (-2.63 pct)

o NPS2

Clients:	tip			sis_short
    1	 543.08 (0.00 pct)	 544.49 (0.25 pct)
    2	 1074.55 (0.00 pct)	 1060.33 (-1.32 pct)
    4	 1980.75 (0.00 pct)	 1992.86 (0.61 pct)
    8	 3628.36 (0.00 pct)	 3507.73 (-3.32 pct)
   16	 5806.00 (0.00 pct)	 5790.82 (-0.26 pct)
   32	 11351.94 (0.00 pct)	 10937.21 (-3.26 pct)
   64	 19987.40 (0.00 pct)	 20739.38 (3.76 pct)
  128	 29554.40 (0.00 pct)	 30011.99 (1.54 pct)
  256	 53594.11 (0.00 pct)	 51473.78 (-3.95 pct)
  512	 54304.03 (0.00 pct)	 52998.31 (-2.40 pct)
 1024	 54338.25 (0.00 pct)	 53265.51 (-1.97 pct)

o NPS4

Clients:	tip			sis_short
    1	 541.29 (0.00 pct)	 536.21 (-0.93 pct)
    2	 1045.15 (0.00 pct)	 1054.94 (0.93 pct)
    4	 1973.01 (0.00 pct)	 1988.63 (0.79 pct)
    8	 3490.55 (0.00 pct)	 3535.27 (1.28 pct)
   16	 5920.12 (0.00 pct)	 5846.04 (-1.25 pct)
   32	 10933.38 (0.00 pct)	 10944.33 (0.10 pct)
   64	 19628.34 (0.00 pct)	 19328.66 (1.01 pct)
  128	 29785.23 (0.00 pct)	 28749.48 (-4.55 pct)
  256	 51999.72 (0.00 pct)	 51336.20 (-1.27 pct)
  512	 53619.42 (0.00 pct)	 53269.04 (-0.65 pct)
 1024	 53956.57 (0.00 pct)	 53666.14 (-0.53 pct)


~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

10 Runs:

Test:		tip			sis_short
 Copy:	 320576.16 (0.00 pct)	 328194.56 (2.37 pct)
Scale:	 212869.80 (0.00 pct)	 216713.96 (1.80 pct)
  Add:	 241556.74 (0.00 pct)	 247467.26 (2.44 pct)
Triad:	 250637.58 (0.00 pct)	 245538.49 (-2.03 pct)

100 Runs:

Test:		tip			sis_short
 Copy:	 330058.38 (0.00 pct)	 329339.60 (-0.21 pct)
Scale:	 216475.85 (0.00 pct)	 219334.10 (1.32 pct)
  Add:	 243028.82 (0.00 pct)	 244037.77 (0.41 pct)
Triad:	 252907.98 (0.00 pct)	 257210.37 (1.70 pct)

o NPS2

10 Runs:

Test:		tip			sis_short
 Copy:	 339946.34 (0.00 pct)	 327261.79 (-3.73 pct)
Scale:	 217453.46 (0.00 pct)	 221366.66 (1.79 pct)
  Add:	 258099.63 (0.00 pct)	 258472.44 (0.14 pct)
Triad:	 264974.76 (0.00 pct)	 262618.99 (-0.88 pct)

100 Runs:

Test:		tip			sis_short
 Copy:	 335725.30 (0.00 pct)	 320797.67 (-4.44 pct)
Scale:	 229985.45 (0.00 pct)	 221706.62 (-3.59 pct)
  Add:	 260546.33 (0.00 pct)	 250668.80 (-3.79 pct)
Triad:	 267925.27 (0.00 pct)	 262959.86 (-1.85 pct)

o NPS4

10 Runs:

Test:		tip			sis_short
 Copy:   369037.34 (0.00 pct)    371514.46 (0.67 pct)
Scale:   238235.39 (0.00 pct)    237661.29 (-0.24 pct)
  Add:   263626.48 (0.00 pct)    263436.20 (-0.07 pct)
Triad:   280881.43 (0.00 pct)    288059.52 (2.55 pct)

100 Runs:

Test:		tip			sis_short
 Copy:	 339036.66 (0.00 pct)	 346904.09 (2.32 pct)
Scale:	 246638.02 (0.00 pct)	 230195.65 (-6.66 pct)
  Add:	 259898.86 (0.00 pct)	 244631.77 (-5.87 pct)
Triad:	 265719.02 (0.00 pct)	 264620.50 (-0.41 pct)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1:

tip		:	133514.00  (var: 2.07%)
sis_short	:	137664.67  (var: 1.45%)  (3.11%)

o NPS2:

tip		:	132193.33  (var: 1.46%)
sis_short	:	131189.33  (var: 1.69%)  (-0.75%)

o NPS4:

tip		:	133285.67  (var: 1.77%)
sis_short	:	133891.33  (var: 1.58%)  (0.45%)

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

Test			Metric	  Parallelism			tip		      sis_short
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48665321.00 (   0.00%)    48553432.30 (  -0.23%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6281376826.80 (   0.00%)  6277335150.50 (  -0.06%)
unixbench-syscall       Amean     unixbench-syscall-1        2689026.67 (   0.00%)     2682044.73 *   0.26%*
unixbench-syscall       Amean     unixbench-syscall-512      7352453.23 (   0.00%)     7290524.47 *  -0.84%*
unixbench-pipe          Hmean     unixbench-pipe-1           2467955.46 (   0.00%)     2426076.17 *  -1.70%*
unixbench-pipe          Hmean     unixbench-pipe-512       295937232.39 (   0.00%)   293462420.03 *  -0.84%*
unixbench-spawn         Hmean     unixbench-spawn-1             4164.75 (   0.00%)        4229.59 (   1.56%)
unixbench-spawn         Hmean     unixbench-spawn-512          79950.80 (   0.00%)       76439.30 (  -4.39%)
unixbench-execl         Hmean     unixbench-execl-1             4112.25 (   0.00%)        4151.37 (   0.95%)
unixbench-execl         Hmean     unixbench-execl-512          11785.88 (   0.00%)       11756.46 (  -0.25%)

o NPS2

Test			Metric	  Parallelism			tip		      sis_short
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      49671827.09 (   0.00%)    49077076.00 (  -1.20%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6282239821.90 (   0.00%)  6283671307.30 (   0.02%)
unixbench-syscall       Amean     unixbench-syscall-1        2688504.20 (   0.00%)     2676278.60 *   0.45%*
unixbench-syscall       Amean     unixbench-syscall-512      7321621.07 (   0.00%)     7784926.60 *   6.33%*
unixbench-pipe          Hmean     unixbench-pipe-1           2469941.97 (   0.00%)     2419584.09 *  -2.04%*
unixbench-pipe          Hmean     unixbench-pipe-512       296146392.10 (   0.00%)   293156913.86 *  -1.01%*
unixbench-spawn         Hmean     unixbench-spawn-1             5029.05 (   0.00%)        5015.18 (  -0.28%)
unixbench-spawn         Hmean     unixbench-spawn-512          77198.79 (   0.00%)       80409.23 *   4.16%*
unixbench-execl         Hmean     unixbench-execl-1             4092.59 (   0.00%)        4158.36 *   1.61%*
unixbench-execl         Hmean     unixbench-execl-512          12293.67 (   0.00%)       12169.31 (  -1.01%)

o NPS4

Test			Metric	  Parallelism			tip		      sis_short
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48944542.05 (   0.00%)    49490899.03 *   1.12%*
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6291259625.50 (   0.00%)  6299305899.90 (   0.13%)
unixbench-syscall       Amean     unixbench-syscall-1        2686991.73 (   0.00%)     2682940.53 *   0.15%*
unixbench-syscall       Amean     unixbench-syscall-512      7902201.47 (   0.00%)     7931906.47 (  -0.38%)
unixbench-pipe          Hmean     unixbench-pipe-1           2468813.43 (   0.00%)     2422272.88 *  -1.89%*
unixbench-pipe          Hmean     unixbench-pipe-512       297109244.52 (   0.00%)   294589928.27 *  -0.85%*
unixbench-spawn         Hmean     unixbench-spawn-1             5161.67 (   0.00%)        5012.58 (  -2.89%)
unixbench-spawn         Hmean     unixbench-spawn-512          78657.60 (   0.00%)       78572.80 (  -0.11%)
unixbench-execl         Hmean     unixbench-execl-1             4112.02 (   0.00%)        4122.16 (   0.25%)
unixbench-execl         Hmean     unixbench-execl-512          13700.99 (   0.00%)       14173.20 *   3.44%*

~~~~~~~~~~~
~ SpecJBB ~
~~~~~~~~~~~

o NPS1 - Normalized to baseline (tip)

Kernel			 tip		sis_short
Max-jOPS		100%		 98.53%
Critical-jOPS		100%		105.61%

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

o NPS1 - Normalized to baseline (tip)

Kernel		:	  tip		sis_short
8C/16T		:	100.00%		 100.54%
16C/32T		:	100.00%		 100.19%
32C/64T		:	100.00%		  98.08%
64C/128T	:	100.00%		  98.34% 


--------------- With Abel's suggestion added to v5 ---------------

I've added the hunk suggested by Abel in the thread on top of v5.
The following are results for the same set of benchmarks, but only
for the machine running in NPS1 mode.

sis_short_v5.1: 6.2.0-rc6 tip sched/core + this series + Abel's suggestion

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

o NPS1

Test:			tip		   sis_short_v5.1
 1-groups:	   4.38 (0.00 pct)	   4.08 (6.84 pct)
 2-groups:	   5.12 (0.00 pct)	   5.10 (0.39 pct)
 4-groups:	   4.21 (0.00 pct)	   4.23 (-0.47 pct)
 8-groups:	   4.68 (0.00 pct)	   4.69 (-0.21 pct)
16-groups:	   6.13 (0.00 pct)	   5.94 (3.09 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

o NPS1

#workers:	tip		   sis_short_v5.1
  1:	  36.00 (0.00 pct)	  36.00 (0.00 pct)
  2:	  37.00 (0.00 pct)	  39.00 (-5.40 pct)
  4:	  41.00 (0.00 pct)	  40.00 (2.43 pct)
  8:	  46.00 (0.00 pct)	  46.00 (0.00 pct)
 16:	  66.00 (0.00 pct)	  68.00 (-3.03 pct)
 32:	 111.00 (0.00 pct)	 112.00 (-0.90 pct)
 64:	 207.00 (0.00 pct)	 238.00 (-14.97 pct)
 64:	 227.00 (0.00 pct)	 219.00 (3.52 pct)
128:	 483.00 (0.00 pct)	 494.00 (-2.27 pct)
256:	 46272.00 (0.00 pct)	 41280.00 (10.78 pct)
512:	 78293.00 (0.00 pct)	 79325.00 (-1.31 pct)

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

o NPS1

Clients:	tip		   sis_short_v5.1
    1	 536.78 (0.00 pct)	 535.90 (-0.16 pct)
    2	 1050.74 (0.00 pct)	 1067.32 (1.57 pct)
    4	 1993.47 (0.00 pct)	 1971.63 (-1.09 pct)
    8	 3601.77 (0.00 pct)	 3599.17 (-0.07 pct)
   16	 6202.01 (0.00 pct)	 6115.08 (-1.40 pct)
   32	 11544.55 (0.00 pct)	 11423.52 (-1.04 pct)
   64	 21828.75 (0.00 pct)	 21403.94 (-1.94 pct)
  128	 31095.92 (0.00 pct)	 30783.55 (-1.00 pct)
  256	 54828.12 (0.00 pct)	 55328.94 (0.91 pct)
  512	 54888.10 (0.00 pct)	 53483.33 (-2.55 pct)
 1024	 48407.14 (0.00 pct)	 48998.95 (1.22 pct)

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

o NPS1

10 Runs:

Test:		tip		   sis_short_v5.1
 Copy:	 320576.16 (0.00 pct)	 331810.14 (3.50 pct)
Scale:	 212869.80 (0.00 pct)	 214725.82 (0.87 pct)
  Add:	 241556.74 (0.00 pct)	 242340.92 (0.32 pct)
Triad:	 250637.58 (0.00 pct)	 251271.53 (0.25 pct)

100 Runs:

Test:		tip		   sis_short_v5.1
 Copy:	 330058.38 (0.00 pct)	 331966.60 (0.57 pct)
Scale:	 216475.85 (0.00 pct)	 222777.84 (2.91 pct)
  Add:	 243028.82 (0.00 pct)	 250873.78 (3.22 pct)
Triad:	 252907.98 (0.00 pct)	 253791.20 (0.34 pct)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1:

tip		:	133514.00  (var: 2.07%)
sis_short_v5.1	:	129172.67  (var: 2.32%)  (-3.25%)  **

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

Test			Metric	  Parallelism			tip		      sis_short_v5.1
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      49266026.90 (   0.00%)    49054799.90 (  -0.43%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6285063007.68 (   0.00%)  6280424934.15 (  -0.07%)
unixbench-syscall       Amean     unixbench-syscall-1        2689026.67 (   0.00%)     2677968.03 *   0.41%*
unixbench-syscall       Amean     unixbench-syscall-512      7352453.23 (   0.00%)     7354325.40 (  -0.03%)
unixbench-pipe          Hmean     unixbench-pipe-1           2467955.46 (   0.00%)     2351117.60 *  -4.73%*
unixbench-pipe          Hmean     unixbench-pipe-512       295937232.39 (   0.00%)   295769918.99 (  -0.06%)
unixbench-spawn         Hmean     unixbench-spawn-1             4164.75 (   0.00%)        4331.89 *   4.01%*
unixbench-spawn         Hmean     unixbench-spawn-512          79626.61 (   0.00%)       77865.32 *  -2.21%*
unixbench-execl         Hmean     unixbench-execl-1             4112.25 (   0.00%)        4145.85 (   0.82%)
unixbench-execl         Hmean     unixbench-execl-512          11785.88 (   0.00%)       11935.41 (   1.27%)

~~~~~~~~~~~
~ SpecJBB ~
~~~~~~~~~~~

o NPS1 - Normalized to baseline (tip)

Kernel			 tip	     sis_short_v5.1
Max-jOPS		100%		 91.99%  ** (-8.01%)
Critical-jOPS		100%		 99.29%

~~~~~~~~~~~~~~~~~~
~ DeathStarBench ~
~~~~~~~~~~~~~~~~~~

o NPS1 - Throughput normalized to baseline (tip)

Kernel		:	  tip	      sis_short_v5.1
8C/16T		:	100.00%		  93.75%  ** (-6.25%)
16C/32T		:	100.00%		 100.43%
32C/64T		:	100.00%		 101.12%
64C/128T	:	100.00%		 100.21%

o Follow wake_affine_bias() if the waker's CPU and prev_cpu are on the same LLC?

With Abel's suggestion, some of the larger benchmarks regress.
I wonder if wake_affine_bias() can still be considered for
short-running tasks if the waker's CPU and the prev_cpu share
caches. In the DeathStarBench 8C/16T case, the services are all
pinned to the CPUs of the same MC domain. The regression observed
seems to arise from the missed opportunity to distribute load
among the CPUs sharing the same L3. I do not have data for this
currently, but I'll update the thread with any findings.
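
To make the "share caches" condition concrete, here is a small userspace
sketch that checks whether two CPUs fall into the same LLC group. It is
not kernel code and not part of the series. The CPU-list strings use the
kernel's list format (as exposed, for example, by the sysfs file
cache/index3/shared_cpu_list); the two example groups and the helper
names parse_cpu_list() and share_llc() are purely illustrative
assumptions.

```python
def parse_cpu_list(text):
    """Parse a kernel-style CPU list such as "0-7,128-135" into a set."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A bare number like "5" has no upper bound; treat it as "5-5".
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def share_llc(cpu_a, cpu_b, llc_cpu_lists):
    """Return True if both CPUs appear in the same LLC group."""
    groups = [parse_cpu_list(s) for s in llc_cpu_lists]
    return any(cpu_a in g and cpu_b in g for g in groups)


if __name__ == "__main__":
    # Hypothetical LLC groups, for illustration only.
    llcs = ["0-7,128-135", "8-15,136-143"]
    print(share_llc(2, 130, llcs))  # same group
    print(share_llc(2, 9, llcs))    # different groups
```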

I'll also queue up a Redis run from mmtests to see if I can reproduce
Abel's observations on my system; however, I'm not sure if the
utilization will be high enough to emulate the same scenario as
Abel's prod environment. If the migrations within the same MC 

On 2/3/2023 10:47 AM, Chen Yu wrote:
> The main purpose is to avoid too many cross CPU wake up when it is
> unnecessary. The frequent cross CPU wake up brings significant damage
> to some workloads, especially on high core count systems.
> 
> Inhibits the cross CPU wake-up by placing the wakee on waking CPU,
> if both the waker and wakee are short-duration tasks. The short
> duration task could become a trouble maker on high-load system,
> because it could bring frequent context switch. So this strategy
> only takes effect when the system is busy.  Besides, it is unreasonable
> to inhibit the idle CPU scan when there are still idle CPUs.
> 
> First, introduce the definition of a short-duration task. Then
> leverages the first patch to choose a local CPU for wakee.
> 
> Overall there is significant performance improvement on Intel
> 2 x 56C/112T platform.  Such as will-it-scale (1200+%),
> netperf(600+%) in some cases. And no noticeable impact on
> schbench, hackbench, tbench and an OLTP workload with a commercial RDBMS.
> 
> Seeking for test results on other platforms, such as Zen3 and Kunpeng
> Arm64. Appreciated Prateek and Yicong if you can have a try on this
> version.
> 
> Changes since v4:
> 1. Dietmar has commented on the task duration calculation. So refined
>    the commit log to reduce confusion.
> 2. Change [PATCH 1/2] to only record the average duration of a task.
>    So this change could benefit UTIL_EST_FASTER[1].
> 3. As v4 reported regression on Zen3 and Kunpeng Arm64, add back
>    the system average utilization restriction that, if the system
>    is not busy, do not enable the short wake up. Above logic has
>    shown improvement on Zen3[2].
> 4. Restrict the wakeup target to be current CPU, rather than both
>    current CPU and task's previous CPU. This could also benefit
>    wakeup optimization from interrupt in the future, which is
>    suggested by Yicong.
> 
> Changes since v3:
> 1. Honglei and Josh have concern that the threshold of short
>    task duration could be too long. Decreased the threshold from
>    sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8),
>    and the '8' comes from get_update_sysctl_factor().
> 2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
> 3. Move the calculation of average duration from put_prev_task_fair()
>    to dequeue_task_fair(). Because there is an issue in v3 that,
>    put_prev_task_fair() will not be invoked by pick_next_task_fair()
>    in fast path, thus the dur_avg could not be updated timely.
> 4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);"
>    on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
> 5. Move the scan for CPU with short duration task from select_idle_cpu()
>    to select_idle_siblings(), because there is no CPU scan involved, per
>    Yicong.
> 
> Changes since v2:
> 
> 1. Peter suggested comparing the duration of waker and the cost to
>    scan for an idle CPU: If the cost is higher than the task duration,
>    do not waste time finding an idle CPU, choose the local or previous
>    CPU directly. A prototype was created based on this suggestion.
>    However, according to the test result, this prototype does not inhibit
>    the cross CPU wakeup and did not bring improvement. Because the cost
>    to find an idle CPU is small in the problematic scenario. The root
>    cause of the problem is a race condition between scanning for an idle
>    CPU and task enqueue(please refer to the commit log in PATCH 2/2).
>    So v3 does not change the core logic of v2, with some refinement based
>    on Peter's suggestion.
> 
> 2. Simplify the logic to record the task duration per Peter and Abel's suggestion.
> 
> 
> [1] https://lore.kernel.org/lkml/c56855a7-14fd-4737-fc8b-8ea21487c5f6@arm.com/
> [2] https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@intel.com/
> 
> v4: https://lore.kernel.org/lkml/cover.1671158588.git.yu.c.chen@intel.com/
> v3: https://lore.kernel.org/lkml/cover.1669862147.git.yu.c.chen@intel.com/
> v2: https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@intel.com/
> v1: https://lore.kernel.org/lkml/20220915165407.1776363-1-yu.c.chen@intel.com/
> 
> Chen Yu (2):
>   sched/fair: Record the average duration of a task
>   sched/fair: Introduce SIS_SHORT to wake up short task on current CPU
> 
>  include/linux/sched.h   |  3 +++
>  kernel/sched/core.c     |  2 ++
>  kernel/sched/debug.c    |  1 +
>  kernel/sched/fair.c     | 39 +++++++++++++++++++++++++++++++++++++++
>  kernel/sched/features.h |  1 +
>  5 files changed, 46 insertions(+)
> 
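
As a reading aid, the policy described in the cover letter above can be
modeled roughly as follows. This is a simplified userspace sketch based
only on that description, not the code in the patches: the Task class,
the function names, and the boolean system_busy input are assumptions
(the cover letter does not spell out how "busy" is determined), and the
3 ms granularity is just an example value. The "divide by 8" threshold
comes from the v3 changelog.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    dur_avg_ns: int  # average run duration; loosely models p->se.dur_avg


def is_short_task(task, min_granularity_ns):
    """Short task: average duration below min_granularity / 8."""
    return task.dur_avg_ns < min_granularity_ns / 8


def wake_on_current_cpu(waker, wakee, system_busy, min_granularity_ns):
    """Place the wakee on the waking CPU only when the system is busy
    and both waker and wakee are short-duration tasks."""
    return (system_busy
            and is_short_task(waker, min_granularity_ns)
            and is_short_task(wakee, min_granularity_ns))


if __name__ == "__main__":
    gran = 3_000_000  # example value in ns, an assumption
    a = Task("waker", 100_000)
    b = Task("wakee", 200_000)
    print(wake_on_current_cpu(a, b, system_busy=True, min_granularity_ns=gran))
    print(wake_on_current_cpu(a, b, system_busy=False, min_granularity_ns=gran))
```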

The netperf results are still pending and I'll update the thread
with them in the coming week. If you would like me to test
or gather data for a specific workload on the test system,
please do let me know.
--
Thanks and Regards,
Prateek
