linux-kernel - Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS

Message-ID: <01eb35f2-4a34-7fbc-da48-c9558dca7212@amd.com>
Date:   Tue, 7 Feb 2023 09:12:29 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     Abel Wu <wuyun.abel@...edance.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>, Mel Gorman <mgorman@...e.de>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Valentin Schneider <valentin.schneider@....com>
Cc:     Josh Don <joshdon@...gle.com>, Chen Yu <yu.c.chen@...el.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        Aubrey Li <aubrey.li@...el.com>,
        Qais Yousef <qais.yousef@....com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Rik van Riel <riel@...riel.com>,
        Yicong Yang <yangyicong@...wei.com>,
        Barry Song <21cnbao@...il.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v6 0/4] sched/fair: Improve scan efficiency of SIS

Hello Abel,

I've retested the patches on the updated tip and the results
are still promising.

tl;dr

o Hackbench sees improvements when the machine is overloaded.
o tbench shows improvements when the machine is overloaded.
o The unixbench regression seen previously seems to be unrelated
  to the patch as the spawn test scores are vastly different
  after a reboot/kexec for the same kernel.
o Other benchmarks show slight improvements or are comparable to
  the numbers on tip.

Following are the results from running standard benchmarks on a
dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS modes are used to logically divide a single socket into
multiple NUMA regions.
Following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    Total 2 NUMA nodes in the dual socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    Total 4 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 224-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    Total 8 NUMA nodes exist over 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  224-239
    Node 7: 112-127, 240-255
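
The node-to-CPU mapping is meant to follow a regular pattern, which can be sketched in a few lines of Python. This assumes the enumeration used on this machine (64 cores per socket, with the SMT sibling of core N numbered N + 128); the function name numa_nodes is made up for illustration.

```python
def numa_nodes(nps, sockets=2, cores_per_socket=64):
    """Return {node: [(first, last), (first, last)]} of inclusive CPU ranges.

    Assumption: the first hardware thread of every core is numbered
    0..(total_cores - 1) and its SMT sibling is that number plus
    total_cores, as on the test machine described above.
    """
    total_cores = sockets * cores_per_socket
    cores_per_node = cores_per_socket // nps
    layout = {}
    for node in range(sockets * nps):
        first = node * cores_per_node
        last = first + cores_per_node - 1
        layout[node] = [(first, last),
                        (first + total_cores, last + total_cores)]
    return layout


for nps in (1, 2, 4):
    print(f"NPS{nps}:")
    for node, (t0, t1) in numa_nodes(nps).items():
        print(f"    Node {node}: {t0[0]}-{t0[1]}, {t1[0]}-{t1[1]}")
```

Each node therefore owns one contiguous block of cores plus the matching block of SMT siblings.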

Following are the Kernel versions:

tip:            6.2.0-rc2 tip:sched/core at
                commit: bbd0b031509b "sched/rseq: Fix concurrency ID handling of usermodehelper kthreads"
sis_core:       tip + series

The series applied cleanly on tip.

Benchmark Results:

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

NPS1

Test:			tip			sis_core
 1-groups:	   4.36 (0.00 pct)	   4.17 (4.35 pct)
 2-groups:	   5.17 (0.00 pct)	   5.03 (2.70 pct)
 4-groups:	   4.17 (0.00 pct)	   4.14 (0.71 pct)
 8-groups:	   4.64 (0.00 pct)	   4.63 (0.21 pct)
16-groups:	   5.43 (0.00 pct)	   5.32 (2.02 pct)

NPS2

Test:			tip			sis_core
 1-groups:	   4.43 (0.00 pct)	   4.27 (3.61 pct)
 2-groups:	   4.61 (0.00 pct)	   4.92 (-6.72 pct)	*
 2-groups:	   4.52 (0.00 pct)	   4.55 (-0.66 pct)	[Verification Run]
 4-groups:	   4.25 (0.00 pct)	   4.10 (3.52 pct)
 8-groups:	   4.91 (0.00 pct)	   4.53 (7.73 pct)
16-groups:	   5.84 (0.00 pct)	   5.54 (5.13 pct)

NPS4

Test:			tip			sis_core
 1-groups:	   4.34 (0.00 pct)	   4.23 (2.53 pct)
 2-groups:	   4.64 (0.00 pct)	   4.84 (-4.31 pct)
 4-groups:	   4.20 (0.00 pct)	   4.17 (0.71 pct)
 8-groups:	   5.21 (0.00 pct)	   5.06 (2.87 pct)
16-groups:	   6.24 (0.00 pct)	   5.60 (10.25 pct)
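
For anyone reading the "pct" columns: a positive value means sis_core did better than tip. A minimal sketch of how such a figure can be derived is below. The helper name pct_change is hypothetical, and the truncation to two decimals is an assumption chosen because it matches the printed figures, not a description of the script actually used.

```python
import math


def pct_change(tip, new, lower_is_better):
    """Percentage change of `new` relative to `tip`; positive = improvement.

    Assumption: the result is truncated toward zero to two decimals,
    which is what the figures in the tables appear to do.
    """
    delta = (tip - new) if lower_is_better else (new - tip)
    pct = delta / tip * 100.0
    return math.trunc(pct * 100) / 100


# hackbench reports time, so lower is better (NPS1, 1-groups).
print(pct_change(4.36, 4.17, lower_is_better=True))           # 4.35
# tbench reports throughput, so higher is better (NPS1, 512 clients).
print(pct_change(46797.40, 54477.79, lower_is_better=False))  # 16.41
```

hackbench and schbench are treated as lower-is-better, while tbench and stream are treated as higher-is-better.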

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

NPS1

#workers:	tip			sis_core
  1:	  36.00 (0.00 pct)	  23.00 (36.11 pct)
  2:	  37.00 (0.00 pct)	  37.00 (0.00 pct)
  4:	  37.00 (0.00 pct)	  38.00 (-2.70 pct)
  8:	  47.00 (0.00 pct)	  52.00 (-10.63 pct)
 16:	  64.00 (0.00 pct)	  65.00 (-1.56 pct)
 32:	 109.00 (0.00 pct)	 111.00 (-1.83 pct)
 64:	 222.00 (0.00 pct)	 215.00 (3.15 pct)
128:	 515.00 (0.00 pct)	 486.00 (5.63 pct)
256:	 39744.00 (0.00 pct)	 47808.00 (-20.28 pct)	* (Machine Overloaded ~ 2 tasks per rq)
256:	 43242.00 (0.00 pct)	 42293.00 (2.19 pct)	[Verification Run]
512:	 81280.00 (0.00 pct)	 76416.00 (5.98 pct)

NPS2

#workers:	tip			sis_core
  1:	  27.00 (0.00 pct)	  27.00 (0.00 pct)
  2:	  31.00 (0.00 pct)	  30.00 (3.22 pct)
  4:	  38.00 (0.00 pct)	  37.00 (2.63 pct)
  8:	  50.00 (0.00 pct)	  46.00 (8.00 pct)
 16:	  66.00 (0.00 pct)	  68.00 (-3.03 pct)
 32:	 116.00 (0.00 pct)	 113.00 (2.58 pct)
 64:	 210.00 (0.00 pct)	 228.00 (-8.57 pct)	*
 64:	 206.00 (0.00 pct)	 219.00 (-6.31 pct)	[Verification Run]
128:	 523.00 (0.00 pct)	 559.00 (-6.88 pct)	*
128:	 474.00 (0.00 pct)	 497.00 (-4.85 pct)	[Verification Run]
256:	 44864.00 (0.00 pct)	 47040.00 (-4.85 pct)
512:	 78464.00 (0.00 pct)	 81280.00 (-3.58 pct)

NPS4

#workers:	tip			sis_core
  1:	  32.00 (0.00 pct)	  27.00 (15.62 pct)
  2:	  32.00 (0.00 pct)	  35.00 (-9.37 pct)
  4:	  34.00 (0.00 pct)	  41.00 (-20.58 pct)
  8:	  58.00 (0.00 pct)	  58.00 (0.00 pct)
 16:	  67.00 (0.00 pct)	  69.00 (-2.98 pct)
 32:	 118.00 (0.00 pct)	 112.00 (5.08 pct)
 64:	 224.00 (0.00 pct)	 209.00 (6.69 pct)
128:	 533.00 (0.00 pct)	 519.00 (2.62 pct)
256:	 43456.00 (0.00 pct)	 45248.00 (-4.12 pct)
512:	 78976.00 (0.00 pct)	 76160.00 (3.56 pct)


~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

NPS1

Clients:	tip			sis_core
    1	 539.96 (0.00 pct)	 538.19 (-0.32 pct)
    2	 1068.21 (0.00 pct)	 1063.04 (-0.48 pct)
    4	 1994.76 (0.00 pct)	 1990.47 (-0.21 pct)
    8	 3602.30 (0.00 pct)	 3496.07 (-2.94 pct)
   16	 6075.49 (0.00 pct)	 6061.74 (-0.22 pct)
   32	 11641.07 (0.00 pct)	 11904.58 (2.26 pct)
   64	 21529.16 (0.00 pct)	 22124.81 (2.76 pct)
  128	 30852.92 (0.00 pct)	 31258.56 (1.31 pct)
  256	 51901.20 (0.00 pct)	 53249.69 (2.59 pct)
  512	 46797.40 (0.00 pct)	 54477.79 (16.41 pct)
 1024	 46057.28 (0.00 pct)	 53676.58 (16.54 pct)

NPS2

Clients:	tip			sis_core
    1	 536.11 (0.00 pct)	 541.18 (0.94 pct)
    2	 1044.58 (0.00 pct)	 1064.16 (1.87 pct)
    4	 2043.92 (0.00 pct)	 2017.84 (-1.27 pct)
    8	 3572.50 (0.00 pct)	 3494.83 (-2.17 pct)
   16	 6040.97 (0.00 pct)	 5530.10 (-8.45 pct)	*
   16	 5814.03 (0.00 pct)	 6012.33 (3.41 pct)	[Verification Run]
   32	 10794.10 (0.00 pct)	 10841.68 (0.44 pct)
   64	 20905.89 (0.00 pct)	 21438.82 (2.54 pct)
  128	 30885.39 (0.00 pct)	 30064.78 (-2.65 pct)
  256	 48901.25 (0.00 pct)	 51395.08 (5.09 pct)
  512	 49673.91 (0.00 pct)	 51725.89 (4.13 pct)
 1024	 47626.34 (0.00 pct)	 52662.01 (10.57 pct)

NPS4

Clients:	tip			sis_core
    1	 544.91 (0.00 pct)	 544.66 (-0.04 pct)
    2	 1046.49 (0.00 pct)	 1072.42 (2.47 pct)
    4	 2007.11 (0.00 pct)	 1970.05 (-1.84 pct)
    8	 3590.66 (0.00 pct)	 3670.45 (2.22 pct)
   16	 5956.60 (0.00 pct)	 6045.07 (1.48 pct)
   32	 10431.73 (0.00 pct)	 10439.40 (0.07 pct)
   64	 21563.37 (0.00 pct)	 19344.05 (-10.29 pct)	*
   64	 19387.71 (0.00 pct)	 19050.47 (-1.73 pct)	[Verification Run]
  128	 30352.16 (0.00 pct)	 26998.85 (-11.04 pct)	*
  128	 29110.99 (0.00 pct)	 29690.37 (1.99 pct)	[Verification Run]
  256	 49504.51 (0.00 pct)	 50921.66 (2.86 pct)
  512	 44916.61 (0.00 pct)	 52176.11 (16.16 pct)
 1024	 49986.21 (0.00 pct)	 51639.91 (3.30 pct)


~~~~~~~~~~
~ stream ~
~~~~~~~~~~

NPS1

10 Runs:

Test:		tip			sis_core
 Copy:	 339390.30 (0.00 pct)	 324656.88 (-4.34 pct)
Scale:	 212472.78 (0.00 pct)	 210641.39 (-0.86 pct)
  Add:	 247598.48 (0.00 pct)	 241669.10 (-2.39 pct)
Triad:	 261852.07 (0.00 pct)	 252088.55 (-3.72 pct)

100 Runs:

Test:		tip			sis_core
 Copy:	 335938.02 (0.00 pct)	 331491.32 (-1.32 pct)
Scale:	 212597.92 (0.00 pct)	 218705.84 (2.87 pct)
  Add:	 248294.62 (0.00 pct)	 243830.42 (-1.79 pct)
Triad:	 258400.88 (0.00 pct)	 248178.42 (-3.95 pct)

NPS2

10 Runs:

Test:		tip			sis_core
 Copy:	 334500.32 (0.00 pct)	 335317.70 (0.24 pct)
Scale:	 216804.76 (0.00 pct)	 217862.71 (0.48 pct)
  Add:	 250787.33 (0.00 pct)	 258839.00 (3.21 pct)
Triad:	 259451.40 (0.00 pct)	 264847.88 (2.07 pct)

100 Runs:

Test:		tip			sis_core
 Copy:	 326385.13 (0.00 pct)	 338030.70 (3.56 pct)
Scale:	 216440.37 (0.00 pct)	 230053.24 (6.28 pct)
  Add:	 255062.22 (0.00 pct)	 259197.23 (1.62 pct)
Triad:	 265442.03 (0.00 pct)	 271365.65 (2.23 pct)

NPS4

10 Runs:

Test:		tip			sis_core
 Copy:   363927.86 (0.00 pct)    361014.15 (-0.80 pct)
Scale:   238190.49 (0.00 pct)    242176.02 (1.67 pct)
  Add:   262806.49 (0.00 pct)    266348.50 (1.34 pct)
Triad:   276492.33 (0.00 pct)    276769.10 (0.10 pct)

100 Runs:

Test:		tip			sis_core
 Copy:   365041.37 (0.00 pct)    349299.35 (-4.31 pct)
Scale:   239295.27 (0.00 pct)    229944.85 (-3.90 pct)
  Add:   264085.21 (0.00 pct)    252651.56 (-4.32 pct)
Triad:   279664.56 (0.00 pct)    274254.22 (-1.93 pct)

~~~~~~~~~~~~~~~~
~ ycsb-mongodb ~
~~~~~~~~~~~~~~~~

o NPS1

tip:                    131328.67 (var: 2.97%)
sis_core:               131702.33 (var: 3.61%)	(0.28%)

o NPS2:

tip:			132482.33 (var: 2.06%)
sis_core:		132338.33 (var: 0.97%)  (-0.11%)

o NPS4:

tip:                    134130.00 (var: 4.12%)
sis_core:               133224.33 (var: 4.13%)	(-0.67%)

~~~~~~~~~~~~~
~ unixbench ~
~~~~~~~~~~~~~

o NPS1

Test			Metric	  Parallelism			tip		      sis_core
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48770555.20 (   0.00%)    49025161.73 (   0.52%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6268185467.60 (   0.00%)  6266351964.20 (  -0.03%)
unixbench-syscall       Amean     unixbench-syscall-1        2685321.17 (   0.00%)     2694468.30 *  -0.34%*
unixbench-syscall       Amean     unixbench-syscall-512      7291476.20 (   0.00%)     7295087.67 (  -0.05%)
unixbench-pipe          Hmean     unixbench-pipe-1           2480858.53 (   0.00%)     2536923.44 *   2.26%*
unixbench-pipe          Hmean     unixbench-pipe-512       300739256.62 (   0.00%)   303470605.93 *   0.91%*
unixbench-spawn         Hmean     unixbench-spawn-1             4358.14 (   0.00%)        4104.88 (  -5.81%)	* (Known to be unstable)
unixbench-spawn         Hmean     unixbench-spawn-1             4711.00 (   0.00%)        4006.20 ( -14.96%)	[Verification Run]
unixbench-spawn         Hmean     unixbench-spawn-512          76497.32 (   0.00%)       75555.94 *  -1.23%*
unixbench-execl         Hmean     unixbench-execl-1             4147.12 (   0.00%)        4157.33 (   0.25%)
unixbench-execl         Hmean     unixbench-execl-512          12435.26 (   0.00%)       11992.43 (  -3.56%)

o NPS2

Test			Metric	  Parallelism			tip		      sis_core
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48872335.50 (   0.00%)    48902553.70 (   0.06%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6264134378.20 (   0.00%)  6260631689.40 (  -0.06%)
unixbench-syscall       Amean     unixbench-syscall-1        2683903.13 (   0.00%)     2694829.17 *  -0.41%*
unixbench-syscall       Amean     unixbench-syscall-512      7746773.60 (   0.00%)     7493782.67 *   3.27%*
unixbench-pipe          Hmean     unixbench-pipe-1           2476724.23 (   0.00%)     2537127.96 *   2.44%*
unixbench-pipe          Hmean     unixbench-pipe-512       300277350.41 (   0.00%)   302979776.19 *   0.90%*
unixbench-spawn         Hmean     unixbench-spawn-1             5026.50 (   0.00%)        4680.63 (  -6.88%)	*
unixbench-spawn         Hmean     unixbench-spawn-1             5421.70 (   0.00%)        5311.50 (  -2.03%)	[Verification Run]
unixbench-spawn         Hmean     unixbench-spawn-512          80549.70 (   0.00%)       78888.60 (  -2.06%)
unixbench-execl         Hmean     unixbench-execl-1             4151.70 (   0.00%)        3913.76 *  -5.73%*	*
unixbench-execl         Hmean     unixbench-execl-1             4304.30 (   0.00%)        4303.20 (  -0.02%)	[Verification Run]
unixbench-execl         Hmean     unixbench-execl-512          13605.15 (   0.00%)       13129.23 (  -3.50%)

o NPS4

Test			Metric	  Parallelism			tip		      sis_core
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-1      48506771.20 (   0.00%)    48894866.70 (   0.80%)
unixbench-dhry2reg      Hmean     unixbench-dhry2reg-512  6280954362.50 (   0.00%)  6282759876.40 (   0.03%)
unixbench-syscall       Amean     unixbench-syscall-1        2687259.30 (   0.00%)     2695379.93 *  -0.30%*
unixbench-syscall       Amean     unixbench-syscall-512      7350275.67 (   0.00%)     7366923.73 (  -0.23%)
unixbench-pipe          Hmean     unixbench-pipe-1           2478893.01 (   0.00%)     2540015.88 *   2.47%*
unixbench-pipe          Hmean     unixbench-pipe-512       301830155.61 (   0.00%)   304305539.27 *   0.82%*
unixbench-spawn         Hmean     unixbench-spawn-1             5208.55 (   0.00%)        5273.11 (   1.24%)
unixbench-spawn         Hmean     unixbench-spawn-512          80745.79 (   0.00%)       81940.71 *   1.48%*
unixbench-execl         Hmean     unixbench-execl-1             4072.72 (   0.00%)        4126.13 *   1.31%*
unixbench-execl         Hmean     unixbench-execl-512          13746.56 (   0.00%)       12848.77 (  -6.53%)	*
unixbench-execl         Hmean     unixbench-execl-512          13898.30 (   0.00%)       13959.70 (   0.44%)	[Verification Run]
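
On the Metric column: assuming Hmean and Amean denote the harmonic and arithmetic means, the difference between the two can be sketched as follows. The scores below are made up for illustration and are not taken from these runs.

```python
from statistics import harmonic_mean, mean

# Hypothetical per-iteration scores, for illustration only.
scores = [4000.0, 5000.0, 4000.0]

amean = mean(scores)           # sum(x) / n
hmean = harmonic_mean(scores)  # n / sum(1 / x)

print(f"Amean: {amean:.2f}")
print(f"Hmean: {hmean:.2f}")
```

The harmonic mean is never larger than the arithmetic mean for positive samples, so a single unusually fast iteration inflates it less.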

On 10/19/2022 5:58 PM, Abel Wu wrote:
> This patchset tries to improve SIS scan efficiency by recording idle
> cpus in a cpumask for each LLC which will be used as a target cpuset
> in the domain scan. The cpus are recorded at CORE granule to avoid
> tasks being stack on same core.
> 
> v5 -> v6:
>  - Rename SIS_FILTER to SIS_CORE as it can only be activated when
>    SMT is enabled and better describes the behavior of CORE granule
>    update & load delivery.
>  - Removed the part of limited scan for idle cores since it might be
>    better to open another thread to discuss the strategies such as
>    limited or scaled depth. But keep the part of full scan for idle
>    cores when LLC is overloaded because SIS_CORE can greatly reduce
>    the overhead of full scan in such case.
>  - Removed the state of sd_is_busy which indicates an LLC is fully
>    busy and we can safely skip the SIS domain scan. I would prefer
>    leave this to SIS_UTIL.
>  - The filter generation mechanism is replaced by in-place updates
>    during domain scan to better deal with partial scan failures.
>  - Collect Reviewed-bys from Tim Chen
> 
> v4 -> v5:
>  - Add limited scan for idle cores when overloaded, suggested by Mel
>  - Split out several patches since they are irrelevant to this scope
>  - Add quick check on ttwu_pending before core update
>  - Wrap the filter into SIS_FILTER feature, suggested by Chen Yu
>  - Move the main filter logic to the idle path, because the newidle
>    balance can bail out early if rq->avg_idle is small enough and
>    lose chances to update the filter.
> 
> v3 -> v4:
>  - Update filter in load_balance rather than in the tick
>  - Now the filter contains unoccupied cpus rather than overloaded ones
>  - Added mechanisms to deal with the false positive cases
> 
> v2 -> v3:
>  - Removed sched-idle balance feature and focus on SIS
>  - Take non-CFS tasks into consideration
>  - Several fixes/improvement suggested by Josh Don
> 
> v1 -> v2:
>  - Several optimizations on sched-idle balancing
>  - Ignore asym topos in can_migrate_task
>  - Add more benchmarks including SIS efficiency
>  - Re-organize patch as suggested by Mel Gorman
> 
> Abel Wu (4):
>   sched/fair: Skip core update if task pending
>   sched/fair: Ignore SIS_UTIL when has_idle_core
>   sched/fair: Introduce SIS_CORE
>   sched/fair: Deal with SIS scan failures
> 
>  include/linux/sched/topology.h |  15 ++++
>  kernel/sched/fair.c            | 122 +++++++++++++++++++++++++++++----
>  kernel/sched/features.h        |   7 ++
>  kernel/sched/sched.h           |   3 +
>  kernel/sched/topology.c        |   8 ++-
>  5 files changed, 141 insertions(+), 14 deletions(-)
> 
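
To make the idea in the cover letter concrete for readers of the thread, here is a toy model in Python. It is based only on the description quoted above (a per-LLC mask of idle cpus, recorded at core granularity, corrected in place during the scan). It is not the kernel code, and every name in it (ToyLLC, idle_mask, update_core, select_idle_cpu) is hypothetical. For simplicity, SMT siblings are numbered adjacently here, unlike on the test machine.

```python
class ToyLLC:
    """Toy model of one LLC with `smt` hardware threads per core."""

    def __init__(self, ncores, smt=2):
        self.smt = smt
        self.ncpus = ncores * smt
        self.busy = set()                         # CPUs running a task
        self.idle_mask = set(range(self.ncpus))   # CPUs of cores believed idle

    def core_cpus(self, cpu):
        first = cpu - cpu % self.smt
        return range(first, first + self.smt)

    def core_is_idle(self, cpu):
        return all(c not in self.busy for c in self.core_cpus(cpu))

    def update_core(self, cpu):
        """Record or clear a whole core in the mask (core granularity)."""
        if self.core_is_idle(cpu):
            self.idle_mask.update(self.core_cpus(cpu))
        else:
            self.idle_mask.difference_update(self.core_cpus(cpu))

    def select_idle_cpu(self):
        """Scan only the recorded CPUs; drop stale entries along the way."""
        for cpu in sorted(self.idle_mask):
            if cpu in self.busy:          # stale entry: fix the mask in place
                self.idle_mask.discard(cpu)
                continue
            return cpu
        return None


llc = ToyLLC(ncores=2)
cpu = llc.select_idle_cpu()    # CPU 0
llc.busy.add(cpu)
llc.update_core(cpu)           # core 0 is busy now: CPUs 0 and 1 leave the mask
print(llc.select_idle_cpu())   # the next wakeup lands on the other core
```

In this model the scan cost depends on the size of the mask rather than the size of the LLC, which is where the savings on an overloaded machine would come from.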

Testing with a couple of larger workloads like SpecJBB is still underway.
I'll update the thread with the results once they are done. The idea
is promising. I'll also try to run schbench / hackbench pinned in a
manner such that all wakeups happen on an external LLC to spot any
impact of rapid changes to the idle cpu mask of an external LLC.
Please let me know if you would like me to test or get data for any
particular benchmark from my test setup.
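
For the pinned runs, one possible way to build the affinity masks is sketched below. It assumes that 8 consecutive cores share an LLC and that the SMT sibling of core N is N + 128. Both are assumptions about the enumeration, and the helper names llc_cpus and pin_to_llc are hypothetical.

```python
import os


def llc_cpus(llc, cores_per_llc=8, total_cores=128):
    """CPUs belonging to one LLC.

    Assumptions: `cores_per_llc` consecutive cores share an L3, and the
    SMT sibling of core N is numbered N + total_cores.
    """
    first = llc * cores_per_llc
    cores = range(first, first + cores_per_llc)
    return set(cores) | {c + total_cores for c in cores}


def pin_to_llc(pid, llc):
    """Restrict `pid` (0 = calling process) to the CPUs of one LLC (Linux only)."""
    os.sched_setaffinity(pid, llc_cpus(llc))
```

Pinning the wakers to one LLC and the wakees to another, for example pin_to_llc(waker_pid, 0) and pin_to_llc(wakee_pid, 1), should force every wakeup to target an external LLC.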

--
Thanks and Regards,
Prateek