linux-kernel - Re: [PATCH v5 0/2] sched/fair: Wake short task on current CPU

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Y/MMD/6Jawc/ZeCz@chenyu5-mobl1>
Date:   Mon, 20 Feb 2023 13:58:39 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     K Prateek Nayak <kprateek.nayak@....com>
CC:     Abel Wu <wuyun.abel@...edance.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Peter Zijlstra <peterz@...radead.org>,
        "Juri Lelli" <juri.lelli@...hat.com>,
        Ingo Molnar <mingo@...hat.com>,
        "Vincent Guittot" <vincent.guittot@...aro.org>,
        Tim Chen <tim.c.chen@...el.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        Yicong Yang <yangyicong@...ilicon.com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        Honglei Wang <wanghonglei@...ichuxing.com>,
        Len Brown <len.brown@...el.com>,
        Chen Yu <yu.chen.surf@...il.com>,
        Tianchen Ding <dtcccc@...ux.alibaba.com>,
        Joel Fernandes <joel@...lfernandes.org>,
        Josh Don <joshdon@...gle.com>,
        "Hillf Danton" <hdanton@...a.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v5 0/2] sched/fair: Wake short task on current CPU

Hi Prateek,
On 2023-02-18 at 01:05:32 +0530, K Prateek Nayak wrote:
> Hello Chenyu and Abel,
> 
> I'll leave the detailed results from testing on a dual socket Zen3 system
> (2 x 64C/128T) below.
> 
Thanks for the test!
> tl;dr
> 
> o Most benchmark results see small wins or are comparable to tip.
> o SpecJBB Max-jOPS see a small hit but Critical-jOPS improve.
I assume that this change should be in acceptible variance rance, because in
previous version, we did not restrict the local wakeup as strictly as current
version, and we did not see a hit on Max-jOPS in v4. Anyway, I've
enhanced the restriction per Abel's feedback and launched some tests,
so that to make Redis and SpecJBB feel better.
> o ycsb-mongodb sees small uplift in NPS1 mode.
> o Numbers for Netperf runs are pending which I'll share in the
>   coming week.
> o Abel's suggestion on top of v5 seem promising but there are
>   few regressions I notice on larger workloads.
> 
> Detailed Results:
> 
> NPS Modes are used to logically divide single socket into
> multiple NUMA region.
> Following is the NUMA configuration for each NPS mode on the system:
> 
> NPS1: Each socket is a NUMA node.
>     Total 2 NUMA nodes in the dual socket machine.
> 
>     Node 0: 0-63,   128-191
>     Node 1: 64-127, 192-255
> 
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>     Total 4 NUMA nodes exist over 2 socket.
>    
>     Node 0: 0-31,   128-159
>     Node 1: 32-63,  160-191
>     Node 2: 64-95,  192-223
>     Node 3: 96-127, 223-255
> 
> NPS4: Each socket is logically divided into 4 NUMA regions.
>     Total 8 NUMA nodes exist over 2 socket.
>    
>     Node 0: 0-15,    128-143
>     Node 1: 16-31,   144-159
>     Node 2: 32-47,   160-175
>     Node 3: 48-63,   176-191
>     Node 4: 64-79,   192-207
>     Node 5: 80-95,   208-223
>     Node 6: 96-111,  223-231
>     Node 7: 112-127, 232-255
> 
> Benchmark Results:
> 
> Kernel versions:
> - tip:          6.2.0-rc6 tip sched/core
> - sis_short: 	6.2.0-rc6 tip sched/core + this series
> 
> When the testing started, the tip was at:
> commit 4d627628d758 "cpuidle: Fix poll_idle() noinstr annotation"
> 
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
> 
> o NPS1
> 
> Test:			tip			sis_short
>  1-groups:	   4.38 (0.00 pct)	   4.49 (-2.51 pct)
>  2-groups:	   5.12 (0.00 pct)	   5.20 (-1.56 pct)
>  4-groups:	   4.21 (0.00 pct)	   4.24 (-0.71 pct)
>  8-groups:	   4.68 (0.00 pct)	   4.73 (-1.06 pct)
> 16-groups:	   6.13 (0.00 pct)	   6.35 (-3.58 pct)
> 
> o NPS2
> 
> Test:			tip			sis_short
>  1-groups:	   4.51 (0.00 pct)	   4.36 (3.32 pct)
>  2-groups:	   4.31 (0.00 pct)	   4.35 (0.92 pct)
>  4-groups:	   4.17 (0.00 pct)	   4.08 (2.15 pct)
>  8-groups:	   4.58 (0.00 pct)	   4.49 (1.96 pct)
> 16-groups:	   5.74 (0.00 pct)	   5.93 (-3.31 pct)
> 
> o NPS4
> 
> Test:			tip			sis_short
>  1-groups:	   4.47 (0.00 pct)	   4.51 (-0.89 pct)
>  2-groups:	   4.97 (0.00 pct)	   5.04 (-1.40 pct)
>  4-groups:	   4.26 (0.00 pct)	   4.28 (-0.46 pct)
>  8-groups:	   5.46 (0.00 pct)	   5.56 (-1.83 pct)
> 16-groups:	   6.38 (0.00 pct)	   6.10 (4.38 pct)
> 
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
> 
> o NPS1
> 
> #workers:	tip			sis_short
>   1:	  36.00 (0.00 pct)	  27.00 (25.00 pct)
>   2:	  37.00 (0.00 pct)	  32.00 (13.51 pct)
>   4:	  41.00 (0.00 pct)	  34.00 (17.07 pct)
>   8:	  46.00 (0.00 pct)	  43.00 (6.52 pct)
>  16:	  66.00 (0.00 pct)	  66.00 (0.00 pct)
>  32:	 111.00 (0.00 pct)	 108.00 (2.70 pct)
>  64:	 207.00 (0.00 pct)	 206.00 (0.48 pct)
> 128:	 483.00 (0.00 pct)	 481.00 (0.41 pct)
> 256:	 46272.00 (0.00 pct)	 45120.00 (2.48 pct)
> 512:	 76160.00 (0.00 pct)	 77696.00 (-2.01 pct)
> 
> o NPS2
> 
> #workers:	tip			sis_short
>   1:	  33.00 (0.00 pct)	  31.00 (6.06 pct)
>   2:	  35.00 (0.00 pct)	  31.00 (11.42 pct)
>   4:	  38.00 (0.00 pct)	  38.00 (0.00 pct)
>   8:	  51.00 (0.00 pct)	  47.00 (7.84 pct)
>  16:	  64.00 (0.00 pct)	  67.00 (-4.68 pct)
>  32:	 118.00 (0.00 pct)	 116.00 (1.69 pct)
>  64:	 214.00 (0.00 pct)	 217.00 (-1.40 pct)
> 128:	 497.00 (0.00 pct)	 504.00 (-1.40 pct)
> 256:	 45632.00 (0.00 pct)	 44352.00 (2.80 pct)
> 512:	 81024.00 (0.00 pct)	 78464.00 (3.15 pct)
> 
> o NPS4
> 
> #workers:	tip			sis_short
>   1:	  33.00 (0.00 pct)	  32.00 (3.03 pct)
>   2:	  40.00 (0.00 pct)	  32.00 (20.00 pct)
>   4:	  42.00 (0.00 pct)	  38.00 (9.52 pct)
>   8:	  64.00 (0.00 pct)	  65.00 (-1.56 pct)
>  16:	  73.00 (0.00 pct)	  69.00 (5.47 pct)
>  32:	 112.00 (0.00 pct)	 112.00 (0.00 pct)
>  64:	 215.00 (0.00 pct)	 207.00 (3.72 pct)
> 128:	 615.00 (0.00 pct)	 593.00 (3.73 pct)
> 256:	 46144.00 (0.00 pct)	 45376.00 (1.66 pct)
> 512:	 78208.00 (0.00 pct)	 77696.00 (0.65 pct)
> 
> 
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
> 
> o NPS1
> 
> Clients:	tip			sis_short
>     1	 536.78 (0.00 pct)	 537.38 (0.11 pct)
>     2	 1050.74 (0.00 pct)	 1058.74 (0.76 pct)
>     4	 1993.47 (0.00 pct)	 1976.79 (-0.83 pct)
>     8	 3498.02 (0.00 pct)	 3657.16 (4.54 pct)
>    16	 6202.01 (0.00 pct)	 6014.62 (-3.02 pct)
>    32	 11544.55 (0.00 pct)	 11847.47 (2.62 pct)
>    64	 21828.75 (0.00 pct)	 21754.85 (-0.33 pct)
>   128	 31095.92 (0.00 pct)	 31643.35 (1.76 pct)
>   256	 54828.12 (0.00 pct)	 55432.29 (1.10 pct)
>   512	 54888.10 (0.00 pct)	 55917.91 (1.87 pct)
>  1024	 54916.75 (0.00 pct)	 53468.79 (-2.63 pct)
> 
> o NPS2
> 
> Clients:	tip			sis_short
>     1	 543.08 (0.00 pct)	 544.49 (0.25 pct)
>     2	 1074.55 (0.00 pct)	 1060.33 (-1.32 pct)
>     4	 1980.75 (0.00 pct)	 1992.86 (0.61 pct)
>     8	 3628.36 (0.00 pct)	 3507.73 (-3.32 pct)
>    16	 5806.00 (0.00 pct)	 5790.82 (-0.26 pct)
>    32	 11351.94 (0.00 pct)	 10937.21 (-3.26 pct)
>    64	 19987.40 (0.00 pct)	 20739.38 (3.76 pct)
>   128	 29554.40 (0.00 pct)	 30011.99 (1.54 pct)
>   256	 53594.11 (0.00 pct)	 51473.78 (-3.95 pct)
>   512	 54304.03 (0.00 pct)	 52998.31 (-2.40 pct)
>  1024	 54338.25 (0.00 pct)	 53265.51 (-1.97 pct)
> 
> o NPS4
> 
> Clients:	tip			sis_short
>     1	 541.29 (0.00 pct)	 536.21 (-0.93 pct)
>     2	 1045.15 (0.00 pct)	 1054.94 (0.93 pct)
>     4	 1973.01 (0.00 pct)	 1988.63 (0.79 pct)
>     8	 3490.55 (0.00 pct)	 3535.27 (1.28 pct)
>    16	 5920.12 (0.00 pct)	 5846.04 (-1.25 pct)
>    32	 10933.38 (0.00 pct)	 10944.33 (0.10pct)
>    64	 19628.34 (0.00 pct)	 19328.66 (1.01 pct)
>   128	 29785.23 (0.00 pct)	 28749.48 (-4.55 pct)
>   256	 51999.72 (0.00 pct)	 51336.20 (-1.27 pct)
>   512	 53619.42 (0.00 pct)	 53269.04 (-0.65 pct)
>  1024	 53956.57 (0.00 pct)	 53666.14 (-0.53 pct)
> 
> 
> ~~~~~~~~~~
> ~ stream ~
> ~~~~~~~~~~
> 
> o NPS1
> 
> 10 Runs:
> 
> Test:		tip			sis_short
>  Copy:	 320576.16 (0.00 pct)	 328194.56 (2.37 pct)
> Scale:	 212869.80 (0.00 pct)	 216713.96 (1.80 pct)
>   Add:	 241556.74 (0.00 pct)	 247467.26 (2.44 pct)
> Triad:	 250637.58 (0.00 pct)	 245538.49 (-2.03 pct)
> 
> 100 Runs:
> 
> Test:		tip			sis_short
>  Copy:	 330058.38 (0.00 pct)	 329339.60 (-0.21 pct)
> Scale:	 216475.85 (0.00 pct)	 219334.10 (1.32 pct)
>   Add:	 243028.82 (0.00 pct)	 244037.77 (0.41 pct)
> Triad:	 252907.98 (0.00 pct)	 257210.37 (1.70 pct)
> 
> o NPS2
> 
> 10 Runs:
> 
> Test:		tip			sis_short
>  Copy:	 339946.34 (0.00 pct)	 327261.79 (-3.73 pct)
> Scale:	 217453.46 (0.00 pct)	 221366.66 (1.79 pct)
>   Add:	 258099.63 (0.00 pct)	 258472.44 (0.14 pct)
> Triad:	 264974.76 (0.00 pct)	 262618.99 (-0.88 pct)
> 
> 100 Runs:
> 
> Test:		tip			sis_short
>  Copy:	 335725.30 (0.00 pct)	 320797.67 (-4.44 pct)
> Scale:	 229985.45 (0.00 pct)	 221706.62 (-3.59 pct)
>   Add:	 260546.33 (0.00 pct)	 250668.80 (-3.79 pct)
> Triad:	 267925.27 (0.00 pct)	 262959.86 (-1.85 pct)
> 
> o NPS4
> 
> 10 Runs:
> 
> Test:		tip			sis_short
>  Copy:   369037.34 (0.00 pct)    371514.46 (0.67 pct)
> Scale:   238235.39 (0.00 pct)    237661.29 (-0.24 pct)
>   Add:   263626.48 (0.00 pct)    263436.20 (-0.07 pct)
> Triad:   280881.43 (0.00 pct)    288059.52 (2.55 pct)
> 
> 100 Runs:
> 
> Test:		tip			sis_short
>  Copy:	 339036.66 (0.00 pct)	 346904.09 (2.32 pct)
> Scale:	 246638.02 (0.00 pct)	 230195.65 (-6.66 pct)
>   Add:	 259898.86 (0.00 pct)	 244631.77 (-5.87 pct)
> Triad:	 265719.02 (0.00 pct)	 264620.50 (-0.41 pct)
> 
> ~~~~~~~~~~~~~~~~
> ~ ycsb-mongodb ~
> ~~~~~~~~~~~~~~~~
> 
> o NPS1:
> 
> tip		:	133514.00  (var: 2.07%)
> sis-short	:	137664.67  (var: 1.45%)  (3.11%)
> 
> o NPS2:
> 
> tip		:	132193.33  (var: 1.46%)
> sis-short	:	131189.33  (var: 1.69%)	(-0.75%)
> 
> o NPS4:
> 
> tip		:	133285.67  (var: 1.77%)
> sis-short	:	133891.33  (var: 1.58%)  (0.45%)
> 
> ~~~~~~~~~~~~~
> ~ unixbench ~
> ~~~~~~~~~~~~~
> 
> o NPS1
> 
> Test			Metric	  Parallelism			tip		      sis_short
> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48665321.00 (   0.00%)    48553432.30 (  -0.23%)
> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6281376826.80 (   0.00%)  6277335150.50 (  -0.06%)
> unixbench-syscall       Amean     unixbench-syscall-1        2689026.67 (   0.00%)     2682044.73 *   0.26%*
> unixbench-syscall       Amean     unixbench-syscall-512      7352453.23 (   0.00%)     7290524.47 *  -0.84%*
> unixbench-pipe          Hmean     unixbench-pipe-1           2467955.46 (   0.00%)     2426076.17 *  -1.70%*
> unixbench-pipe          Hmean     unixbench-pipe-512       295937232.39 (   0.00%)   293462420.03 *  -0.84%*
> unixbench-spawn         Hmean     unixbench-spawn-1             4164.75 (   0.00%)        4229.59 (   1.56%)
> unixbench-spawn         Hmean     unixbench-spawn-512          79950.80 (   0.00%)       76439.30 (  -4.39%)
> unixbench-execl         Hmean     unixbench-execl-1             4112.25 (   0.00%)        4151.37 (   0.95%)
> unixbench-execl         Hmean     unixbench-execl-512          11785.88 (   0.00%)       11756.46 (  -0.25%)
> 
> o NPS2
> 
> Test			Metric	  Parallelism			tip		      sis_short
> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      49671827.09 (   0.00%)    49077076.00 (  -1.20%)
> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6282239821.90 (   0.00%)  6283671307.30 (   0.02%)
> unixbench-syscall       Amean     unixbench-syscall-1        2688504.20 (   0.00%)     2676278.60 *   0.45%*
> unixbench-syscall       Amean     unixbench-syscall-512      7321621.07 (   0.00%)     7784926.60 *   6.33%*
> unixbench-pipe          Hmean     unixbench-pipe-1           2469941.97 (   0.00%)     2419584.09 *  -2.04%*
> unixbench-pipe          Hmean     unixbench-pipe-512       296146392.10 (   0.00%)   293156913.86 *  -1.01%*
> unixbench-spawn         Hmean     unixbench-spawn-1             5029.05 (   0.00%)        5015.18 (  -0.28%)
> unixbench-spawn         Hmean     unixbench-spawn-512          77198.79 (   0.00%)       80409.23 *   4.16%*
> unixbench-execl         Hmean     unixbench-execl-1             4092.59 (   0.00%)        4158.36 *   1.61%*
> unixbench-execl         Hmean     unixbench-execl-512          12293.67 (   0.00%)       12169.31 (  -1.01%)
> 
> o NPS4
> 
> Test			Metric	  Parallelism			tip		      sis_short
> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48944542.05 (   0.00%)    49490899.03 *   1.12%*
> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6291259625.50 (   0.00%)  6299305899.90 (   0.13%)
> unixbench-syscall       Amean     unixbench-syscall-1        2686991.73 (   0.00%)     2682940.53 *   0.15%*
> unixbench-syscall       Amean     unixbench-syscall-512      7902201.47 (   0.00%)     7931906.47 (  -0.38%)
> unixbench-pipe          Hmean     unixbench-pipe-1           2468813.43 (   0.00%)     2422272.88 *  -1.89%*
> unixbench-pipe          Hmean     unixbench-pipe-512       297109244.52 (   0.00%)   294589928.27 *  -0.85%*
> unixbench-spawn         Hmean     unixbench-spawn-1             5161.67 (   0.00%)        5012.58 (  -2.89%)
> unixbench-spawn         Hmean     unixbench-spawn-512          78657.60 (   0.00%)       78572.80 (  -0.11%)
> unixbench-execl         Hmean     unixbench-execl-1             4112.02 (   0.00%)        4122.16 (   0.25%)
> unixbench-execl         Hmean     unixbench-execl-512          13700.99 (   0.00%)       14173.20 *   3.44%*
> 
> ~~~~~~~~~~~
> ~ SpecJBB ~
> ~~~~~~~~~~~
> 
> o NPS1 - Normalized to baseline (tip)
> 
> Kernel			 tip		sis_short
> Max-jOPS		100%		 98.53%
> Critical-jOPS		100%		105.61%
> 
> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
> 
> o NPS1 - Normalized to baseline (tip)
> 
> Kernel		:	  tip		sis_short
> 8C/16T		:	100.00%		 100.54%
> 16C/32T		:	100.00%		 100.19%
> 32C/64T		:	100.00%		  98.08%
> 64C/128T	:	100.00%		  98.34% 
> 
> 
> --------------- With Abel's suggestion added to v5 ---------------
> 
> I've added the hunk suggested by Abel in the thread to the v5 and
> following are results for the same set of benchmarks but only for
> machine running in NPS1 mode.
> 
> sis_short_v5.1: 6.2.0-rc6 tip sched/core + this series + Abel's suggestion
> 
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
> 
> o NPS1
> 
> Test:			tip		   sis_short_v5.1
>  1-groups:	   4.38 (0.00 pct)	   4.08 (6.84 pct)
>  2-groups:	   5.12 (0.00 pct)	   5.10 (0.39 pct)
>  4-groups:	   4.21 (0.00 pct)	   4.23 (-0.47 pct)
>  8-groups:	   4.68 (0.00 pct)	   4.69 (-0.21 pct)
> 16-groups:	   6.13 (0.00 pct)	   5.94 (3.09 pct)
> 
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
> 
> o NPS1
> 
> #workers:	tip		   sis_short_v5.1
>   1:	  36.00 (0.00 pct)	  36.00 (0.00 pct)
>   2:	  37.00 (0.00 pct)	  39.00 (-5.40 pct)
>   4:	  41.00 (0.00 pct)	  40.00 (2.43 pct)
>   8:	  46.00 (0.00 pct)	  46.00 (0.00 pct)
>  16:	  66.00 (0.00 pct)	  68.00 (-3.03 pct)
>  32:	 111.00 (0.00 pct)	 112.00 (-0.90 pct)
>  64:	 207.00 (0.00 pct)	 238.00 (-14.97 pct)
>  64:	 227.00 (0.00 pct)	 219.00 (3.52 pct)
> 128:	 483.00 (0.00 pct)	 494.00 (-2.27 pct)
> 256:	 46272.00 (0.00 pct)	 41280.00 (10.78 pct)
> 512:	 78293.00 (0.00 pct)	 79325.00 (-1.31 pct)
> 
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
> 
> o NPS1
> 
> Clients:	tip		   sis_short_v5.1
>     1	 536.78 (0.00 pct)	 535.90 (-0.16 pct)
>     2	 1050.74 (0.00 pct)	 1067.32 (1.57 pct)
>     4	 1993.47 (0.00 pct)	 1971.63 (-1.09 pct)
>     8	 3601.77 (0.00 pct)	 3599.17 (-0.07 pct)
>    16	 6202.01 (0.00 pct)	 6115.08 (-1.40 pct)
>    32	 11544.55 (0.00 pct)	 11423.52 (-1.04 pct)
>    64	 21828.75 (0.00 pct)	 21403.94 (-1.94 pct)
>   128	 31095.92 (0.00 pct)	 30783.55 (-1.00 pct)
>   256	 54828.12 (0.00 pct)	 55328.94 (0.91 pct)
>   512	 54888.10 (0.00 pct)	 53483.33 (-2.55 pct)
>  1024	 48407.14 (0.00 pct)	 48998.95 (1.22 pct)
> 
> ~~~~~~~~~~
> ~ stream ~
> ~~~~~~~~~~
> 
> o NPS1
> 
> 10 Runs:
> 
> Test:		tip		   sis_short_v5.1
>  Copy:	 320576.16 (0.00 pct)	 331810.14 (3.50 pct)
> Scale:	 212869.80 (0.00 pct)	 214725.82 (0.87 pct)
>   Add:	 241556.74 (0.00 pct)	 242340.92 (0.32 pct)
> Triad:	 250637.58 (0.00 pct)	 251271.53 (0.25 pct)
> 
> 100 Runs:
> 
> Test:		tip		   sis_short_v5.1
>  Copy:	 330058.38 (0.00 pct)	 331966.60 (0.57 pct)
> Scale:	 216475.85 (0.00 pct)	 222777.84 (2.91 pct)
>   Add:	 243028.82 (0.00 pct)	 250873.78 (3.22 pct)
> Triad:	 252907.98 (0.00 pct)	 253791.20 (0.34 pct)
> 
> ~~~~~~~~~~~~~~~~
> ~ ycsb-mongodb ~
> ~~~~~~~~~~~~~~~~
> 
> o NPS1:
> 
> tip		:	133514.00  (var: 2.07%)
> sis-short_v5.1	:	129172.67  (var: 2.32%)  (-3.25%)  **
> 
> ~~~~~~~~~~~~~
> ~ unixbench ~
> ~~~~~~~~~~~~~
> 
> o NPS1
> 
> Test			Metric	  Parallelism			tip		      sis_short_v5.1
> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      49266026.90 (   0.00%)    49054799.90 (  -0.43%)
> unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6285063007.68 (   0.00%)  6280424934.15 (  -0.07%)
> unixbench-syscall       Amean     unixbench-syscall-1        2689026.67 (   0.00%)     2677968.03 *   0.41%*
> unixbench-syscall       Amean     unixbench-syscall-512      7352453.23 (   0.00%)     7354325.40 (  -0.03%)
> unixbench-pipe          Hmean     unixbench-pipe-1           2467955.46 (   0.00%)     2351117.60 *  -4.73%*
> unixbench-pipe          Hmean     unixbench-pipe-512       295937232.39 (   0.00%)   295769918.99 (  -0.06%)
> unixbench-spawn         Hmean     unixbench-spawn-1             4164.75 (   0.00%)        4331.89 *   4.01%*
> unixbench-spawn         Hmean     unixbench-spawn-512          79626.61 (   0.00%)       77865.32 *  -2.21%*
> unixbench-execl         Hmean     unixbench-execl-1             4112.25 (   0.00%)        4145.85 (   0.82%)
> unixbench-execl         Hmean     unixbench-execl-512          11785.88 (   0.00%)       11935.41 (   1.27%)
> 
> ~~~~~~~~~~~
> ~ SpecJBB ~
> ~~~~~~~~~~~
> 
> o NPS1 - Normalized to baseline (tip)
> 
> Kernel			 tip	     sis_short_V5.1
> Max-jOPS		100%		 91.99%  ** (-8.01%)
> Critical-jOPS		100%		 99.29%
> 
> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
> 
> o NPS1 - Throughput normalized to baseline (tip)
> 
> Kernel		:	  tip	      sis_short_V5.1
> 8C/16T		:	100.00%		  93.75%  ** (-6.25%)
> 16C/32T		:	100.00%		 100.43%
> 32C/64T		:	100.00%		 101.12%
> 64C/128T	:	100.00%		 100.21%
> 
> o Follow wake_affine_bias() if waker's cpu and prev_cpu are on same LLC?
> 
> There are cases with Abel's suggestion where some of the larger
> benchmark regresses. I wonder if wake_affine_bias() can still be
> considered for short running tasks if the waker's CPU and the
> prev_cpu share caches. In DeathStarBench 8C/16T case, the
> services are all pinned to the CPUs of same MC domain. The
> regression observed seems to arise from the missed opportunity
> to distribute load among the CPUs sharing the same L3. I do not
> have data for this currently but I'll update the thread with any
> findings.
Good observation. Just like select_idle_sibling(), prev cpu should only
be chosen if the target cpu and prev cpu shares the LLC cache. My next
version still prefers current cpu than prev cpu, and add the wake_flips
check to aggregate tasks on current CPU only the waker and wakee wakes
up each other frequently. This could somehow mitigate the problem Abel
mentioned, too many short tasks are stacked on current CPU.

thanks,
Chenyu
> 
> I'll also queue up a Redis run from mmtest to see if I can reproduce
> Abel's observations on my system however I'm not sure if the
> utilization will be high enough to emulate the same scenario as
> Abel's prod environment. If the migrations within the same MC 
> 
> On 2/3/2023 10:47 AM, Chen Yu wrote:
> > The main purpose is to avoid too many cross CPU wake up when it is
> > unnecessary. The frequent cross CPU wake up brings significant damage
> > to some workloads, especially on high core count systems.
> > 
> > Inhibits the cross CPU wake-up by placing the wakee on waking CPU,
> > if both the waker and wakee are short-duration tasks. The short
> > duration task could become a trouble maker on high-load system,
> > because it could bring frequent context switch. So this strategy
> > only takes effect when the system is busy.  Besides, it is unreasonable
> > to inhibit the idle CPU scan when there are still idle CPUs.
> > 
> > First, introduce the definition of a short-duration task. Then
> > leverages the first patch to choose a local CPU for wakee.
> > 
> > Overall there is significant performance improvement on Intel
> > 2 x 56C/112T platform.  Such as will-it-scale (1200+%),
> > netperf(600+%) in some cases. And no noticeable impact on
> > schbench, hackbench, tbench and a OLTP workload with a commercial RDBMS.
> > 
> > Seeking for test results on other platforms, such as Zen3 and Kunpeng
> > Arm64. Appreciated Prateek and Yicong if you can have a try on this
> > version.
> > 
> > Changes since v4:
> > 1. Dietmar has commented on the task duration calculation. So refined
> >    the commit log to reduce confusion.
> > 2. Change [PATCH 1/2] to only record the average duration of a task.
> >    So this change could benefit UTIL_EST_FASTER[1].
> > 3. As v4 reported regression on Zen3 and Kunpeng Arm64, add back
> >    the system average utilization restriction that, if the system
> >    is not busy, do not enable the short wake up. Above logic has
> >    shown improvment on Zen3[2].
> > 4. Restrict the wakeup target to be current CPU, rather than both
> >    current CPU and task's previous CPU. This could also benefit
> >    wakeup optimization from interrupt in the future, which is
> >    suggested by Yicong.
> > 
> > Changes since v3:
> > 1. Honglei and Josh have concern that the threshold of short
> >    task duration could be too long. Decreased the threshold from
> >    sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8),
> >    and the '8' comes from get_update_sysctl_factor().
> > 2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
> > 3. Move the calculation of average duration from put_prev_task_fair()
> >    to dequeue_task_fair(). Because there is an issue in v3 that,
> >    put_prev_task_fair() will not be invoked by pick_next_task_fair()
> >    in fast path, thus the dur_avg could not be updated timely.
> > 4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);"
> >    on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
> > 5. Move the scan for CPU with short duration task from select_idle_cpu()
> >    to select_idle_siblings(), because there is no CPU scan involved, per
> >    Yicong.
> > 
> > Changes since v2:
> > 
> > 1. Peter suggested comparing the duration of waker and the cost to
> >    scan for an idle CPU: If the cost is higher than the task duration,
> >    do not waste time finding an idle CPU, choose the local or previous
> >    CPU directly. A prototype was created based on this suggestion.
> >    However, according to the test result, this prototype does not inhibit
> >    the cross CPU wakeup and did not bring improvement. Because the cost
> >    to find an idle CPU is small in the problematic scenario. The root
> >    cause of the problem is a race condition between scanning for an idle
> >    CPU and task enqueue(please refer to the commit log in PATCH 2/2).
> >    So v3 does not change the core logic of v2, with some refinement based
> >    on Peter's suggestion.
> > 
> > 2. Simplify the logic to record the task duration per Peter and Abel's suggestion.
> > 
> > 
> > [1] https://lore.kernel.org/lkml/c56855a7-14fd-4737-fc8b-8ea21487c5f6@arm.com/
> > [2] https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@intel.com/
> > 
> > v4: https://lore.kernel.org/lkml/cover.1671158588.git.yu.c.chen@intel.com/
> > v3: https://lore.kernel.org/lkml/cover.1669862147.git.yu.c.chen@intel.com/
> > v2: https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@intel.com/
> > v1: https://lore.kernel.org/lkml/20220915165407.1776363-1-yu.c.chen@intel.com/
> > 
> > Chen Yu (2):
> >   sched/fair: Record the average duration of a task
> >   sched/fair: Introduce SIS_SHORT to wake up short task on current CPU
> > 
> >  include/linux/sched.h   |  3 +++
> >  kernel/sched/core.c     |  2 ++
> >  kernel/sched/debug.c    |  1 +
> >  kernel/sched/fair.c     | 39 +++++++++++++++++++++++++++++++++++++++
> >  kernel/sched/features.h |  1 +
> >  5 files changed, 46 insertions(+)
> > 
> 
> The netperf results are still pending and I'll update the thread
> with the same in the coming week. If you would like me to test
> or gather some data for specific workload on the test system,
> please do let me know.
> --
I'll launch more tests and sent out the result later.

thanks,
Chenyu