Message-ID: <ZOc7i7wM0x4hF4vL@BLR-5CG11610CF.amd.com>
Date:   Thu, 24 Aug 2023 16:44:19 +0530
From:   "Gautham R. Shenoy" <gautham.shenoy@....com>
To:     David Vernet <void@...ifault.com>
Cc:     linux-kernel@...r.kernel.org, peterz@...radead.org,
        mingo@...hat.com, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, vschneid@...hat.com, tj@...nel.org,
        roman.gushchin@...ux.dev, kprateek.nayak@....com,
        aaron.lu@...el.com, wuyun.abel@...edance.com, kernel-team@...a.com
Subject: Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS

Hello David,

On Fri, Aug 18, 2023 at 02:19:03PM +0530, Gautham R. Shenoy wrote:
> Hello David,
> 
> On Fri, Aug 18, 2023 at 12:03:55AM -0500, David Vernet wrote:
> > On Thu, Aug 17, 2023 at 02:12:03PM +0530, Gautham R. Shenoy wrote:
> > > Hello David,
> > 
> > Hello Gautham,
> > 
> > Thanks a lot as always for running some benchmarks and analyzing these
> > changes.
> > 
> > > On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
> > > > Changes
> > > > -------
> > > > 
> > > > This is v3 of the shared runqueue patchset. This patch set is based off
> > > > of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
> > > > bandwidth in use") on the sched/core branch of tip.git.
> > > 
> > > 
> > > I tested the patches on Zen3 and Zen4 EPYC Servers like last time. I
> > > notice that apart from hackbench, every other benchmark is showing
> > > regressions with this patch series. Quick summary of my observations:
> > 
> > Just to verify per our prior conversation [0], was this latest set of
> > benchmarks run with boost disabled?
> 
> Boost is enabled by default. I will queue a run tonight with boost
> disabled.

Apologies for the delay. I didn't see any change with boost disabled,
nor with the optimization reverted that bails out of newidle_balance()
for the SMT and MC domains when there is no task to be pulled from the
shared-runq. I reran the whole thing once again, just to rule out any
possible variance, and the results came out the same: with boost
disabled and the optimization reverted, the numbers don't change much.

It doesn't appear that the optimization is the cause of the increase
in the number of load-balancing attempts at the DIE and the NUMA
domains. I have shared the newidle_balance() counts with and without
SHARED_RUNQ below for tbench, and the counts are significantly higher
at 64 and 128 clients. I also captured the counts/s of
find_busiest_group() using funccount.py, which tells the same story.
So the drop in tbench performance with your patches correlates
strongly with the increase in load-balancing attempts.
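
For reference, the counts/s were collected by sampling funccount once
a second, roughly along these lines (the duration shown here is
illustrative, sized to cover the benchmark run):

    # funccount -i 1 -d 60 'find_busiest_group'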

A newidle balance is undertaken only if the overload flag is set and
the expected idle duration is greater than the average load-balancing
cost. It is hard to see why the shared runq should cause the overload
flag to be set!
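
The gate I am referring to is roughly the following check near the top
of newidle_balance() (paraphrased from my reading of
kernel/sched/fair.c; the exact form may differ slightly on the tree
under test):

        /*
         * Paraphrased sketch: bail out of the newidle balance unless
         * the root domain is overloaded and the expected idle time
         * exceeds the cached max newidle load-balancing cost.
         */
        if (!READ_ONCE(this_rq->rd->overload) ||
            (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
                if (sd)
                        update_next_balance(sd, &next_balance);
                goto out;
        }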


Detailed Results are as follows:
=============================================================
Test Machine : 2 Socket Zen4 with 128 cores per socket, SMT enabled.

tip             : commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop
                  when cfs bandwidth in use")
v3              : v3 of the shared_runq patch
v3-tgfix        : v3+ Aaron's RFC v1 patch to ratelimit the updates to tg->load_avg
v3-tgfix-no-opt : v3-tgfix + reverted the optimization to bail out of
                  newidle-balance for SMT and MC domains when there
                  are no tasks in the shared-runq

In the results below, I have chosen the first row, first column in
each table as the baseline, so that we also get an idea of the
scalability issues as the number of groups/clients/workers increases.
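
To make the normalization and the "pct imp" column concrete: for a
higher-is-better metric, the improvement is taken against the tip
value in the same row, e.g. for tbench at 128 clients:

    pct imp(v3) = (21.74 - 102.26) / 102.26 * 100 ~= -78.74%

For lower-is-better metrics (hackbench, schbench latencies), the sign
is flipped.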

==================================================================
Test          : hackbench 
Units         : Normalized time in seconds 
Interpretation: Lower is better 
Statistic     : AMean 
==================================================================
Case:         tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
 1-groups     1.00 [ -0.00]( 4.22)     0.92 [  7.75]( 9.09)     0.88 [ 11.53](10.61)     0.85 [ 15.31]( 8.20)
 2-groups     0.88 [ -0.00](11.65)     0.85 [  2.95](10.77)     0.88 [ -0.91]( 9.69)     0.88 [ -0.23]( 9.20)
 4-groups     1.08 [ -0.00]( 3.70)     0.93 [ 13.86](11.03)     0.90 [ 16.08]( 9.57)     0.83 [ 22.92]( 6.98)
 8-groups     1.32 [ -0.00]( 0.63)     1.16 [ 12.33]( 9.05)     1.21 [  8.72]( 5.54)     1.17 [ 11.13]( 5.29)
16-groups     1.71 [ -0.00]( 0.63)     1.93 [-12.65]( 4.68)     1.27 [ 25.87]( 1.31)     1.25 [ 27.15]( 1.10)


==================================================================
Test          : tbench 
Units         : Normalized throughput 
Interpretation: Higher is better 
Statistic     : AMean 
==================================================================
Clients:   tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
    1      1.00 [  0.00]( 0.18)      0.99 [ -0.99]( 0.18)      0.98 [ -2.08]( 0.10)      0.98 [ -2.19]( 0.24)
    2      1.95 [  0.00]( 0.65)      1.93 [ -1.04]( 0.72)      1.95 [ -0.37]( 0.31)      1.92 [ -1.73]( 0.39)
    4      3.80 [  0.00]( 0.59)      3.78 [ -0.53]( 0.37)      3.73 [ -1.66]( 0.58)      3.77 [ -0.79]( 0.97)
    8      7.49 [  0.00]( 0.37)      7.41 [ -1.12]( 0.39)      7.24 [ -3.42]( 1.99)      7.39 [ -1.39]( 1.53)
   16     14.78 [  0.00]( 0.84)     14.60 [ -1.24]( 1.51)     14.30 [ -3.28]( 1.28)     14.46 [ -2.18]( 0.78)
   32     28.18 [  0.00]( 1.26)     26.59 [ -5.65]( 0.46)     27.70 [ -1.71]( 0.92)     27.08 [ -3.90]( 0.83)
   64     55.05 [  0.00]( 1.56)     18.25 [-66.85]( 0.25)     48.07 [-12.68]( 1.51)     47.46 [-13.79]( 2.70)
  128    102.26 [  0.00]( 1.03)     21.74 [-78.74]( 0.65)     54.65 [-46.56]( 1.35)     54.69 [-46.52]( 1.16)
  256    156.69 [  0.00]( 0.27)     25.47 [-83.74]( 0.07)    130.85 [-16.49]( 0.57)    125.00 [-20.23]( 0.35)
  512    223.22 [  0.00]( 8.25)    236.98 [  6.17](17.10)    274.47 [ 22.96]( 0.44)    276.95 [ 24.07]( 3.37)
 1024    237.98 [  0.00]( 1.09)    299.72 [ 25.94]( 0.24)    304.89 [ 28.12]( 0.73)    300.37 [ 26.22]( 1.16)
 2048    242.13 [  0.00]( 0.37)    311.38 [ 28.60]( 0.24)    299.82 [ 23.82]( 1.35)    291.32 [ 20.31]( 0.66)


I reran tbench for v3-tgfix-no-opt, to collect the newidle balance
counts via schedstat as well as the find_busiest_group() counts via
funccount.py.

Below is a comparison of the newidle balance counts across the
different sched-domains for the "v3-tgfix-no-opt" kernel with
NO_SHARED_RUNQ vs SHARED_RUNQ (first count: NO_SHARED_RUNQ, second:
SHARED_RUNQ). We see a huge blowup in the DIE and the NUMA domains at
64 and 128 clients. The value within |xx.yy| is the percentage change
relative to NO_SHARED_RUNQ, shown only where the difference is
significant.
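
For example, for the DIE domain at 64 clients below:
(59596 - 1288) / 1288 * 100 ~= 4527.02%.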

============== SMT load_balance with CPU_NEWLY_IDLE ===============================
   1 clients: count : 1986, 1960 
   2 clients: count : 5777, 6543     |  13.26|
   4 clients: count : 16775, 15274   |  -8.95|
   8 clients: count : 37086, 32715   | -11.79|
  16 clients: count : 69627, 65652   |  -5.71|
  32 clients: count : 152288, 42723  | -71.95|
  64 clients: count : 216396, 169545 | -21.65|
 128 clients: count : 219570, 649880 | 195.98|
 256 clients: count : 443595, 951933 | 114.60|
 512 clients: count : 5498, 1949     | -64.55|
1024 clients: count : 60, 3          | -95.00|
================ MC load_balance with CPU_NEWLY_IDLE ===============================
   1 clients: count : 1954, 1943
   2 clients: count : 5775, 6541      |  13.26|
   4 clients: count : 15468, 15087 
   8 clients: count : 31941, 32140 
  16 clients: count : 57312, 62553    |   9.14|
  32 clients: count : 125791, 34386   | -72.66|
  64 clients: count : 181406, 133978  | -26.14|
 128 clients: count : 191143, 607594  | 217.87|
 256 clients: count : 388696, 584568  |  50.39| 
 512 clients: count : 2677, 218       | -91.86|
1024 clients: count : 22, 3           | -86.36|
=============== DIE load_balance with CPU_NEWLY_IDLE ===============================
   1 clients: count : 10, 15          |   50.00|
   2 clients: count : 15, 56          |  273.33|
   4 clients: count : 65, 149         |  129.23|
   8 clients: count : 242, 412        |   70.25|
  16 clients: count : 509, 1235       |  142.63|
  32 clients: count : 909, 1371       |   50.83|
  64 clients: count : 1288, 59596     | 4527.02| <===
 128 clients: count : 666, 281426     |42156.16| <===
 256 clients: count : 213, 1463       |  586.85|
 512 clients: count : 28, 23          |  -17.86|
1024 clients: count : 10, 3           |  -70.00|
============== NUMA load_balance with CPU_NEWLY_IDLE ===============================
   1 clients: count : 9, 9 
   2 clients: count : 13, 14
   4 clients: count : 21, 21
   8 clients: count : 27, 29
  16 clients: count : 29, 50         |   72.41|
  32 clients: count : 29, 67         |  131.03|
  64 clients: count : 28, 9138       |32535.71|  <===
 128 clients: count : 25, 24234      |96836.00|  <===
 256 clients: count : 12, 11
 512 clients: count : 7, 3  
1024 clients: count : 4, 3 


Further, I collected the find_busiest_group() counts/s using
funccount.py.

Notice that with 128 clients, most of the SHARED_RUNQ samples fall
into a bucket that is more than 2x the buckets holding most of the
NO_SHARED_RUNQ samples.

128 clients: find_busiest_group() count/s
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fbg count bucket       NO_SHARED_RUNQ   SHARED_RUNQ
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2000000 - 2500000) :     23
[2500000 - 3000000) :     19               
[3000000 - 3500000) :     19               1
[3500000 - 4000000) :      3               3
[7500000 - 8000000) :                      5
[8000000 - 8500000) :                     54   <===

With 1024 clients, there is not a whole lot of difference in the
find_busiest_group() distribution with and without the SHARED_RUNQ.

1024 clients: find_busiest_group() count/s
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fbg count bucket       NO_SHARED_RUNQ   SHARED_RUNQ
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[  4000 -   5000) :      1
[  7000 -   8000) :      2                  2
[  8000 -   9000) :      1                  2
[  9000 -  10000) :     57                 44  <===
[ 10000 -  11000) :      3                 13
[ 18000 -  19000) :      1                  1



==================================================================
Test          : stream (10  Runs)
Units         : Normalized Bandwidth, MB/s 
Interpretation: Higher is better 
Statistic     : HMean 
==================================================================
Test:     tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
 Copy     1.00 [  0.00]( 0.53)     1.00 [  0.01]( 0.77)     1.00 [ -0.22]( 0.55)     1.00 [  0.12]( 0.71)
Scale     0.95 [  0.00]( 0.23)     0.95 [  0.21]( 0.63)     0.95 [  0.13]( 0.22)     0.95 [  0.02]( 0.87)
  Add     0.97 [  0.00]( 0.27)     0.98 [  0.40]( 0.59)     0.98 [  0.52]( 0.31)     0.98 [  0.16]( 0.85)
Triad     0.98 [  0.00]( 0.28)     0.98 [  0.33]( 0.55)     0.98 [  0.34]( 0.29)     0.98 [  0.05]( 0.96)


==================================================================
Test          : stream (100 Runs)
Units         : Normalized Bandwidth, MB/s 
Interpretation: Higher is better 
Statistic     : HMean 
==================================================================
Test:     tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
 Copy     1.00 [  0.00]( 1.01)     1.00 [ -0.38]( 0.34)     1.00 [  0.08]( 1.19)     1.00 [ -0.18]( 0.38)
Scale     0.95 [  0.00]( 0.46)     0.95 [ -0.39]( 0.52)     0.94 [ -0.72]( 0.34)     0.94 [ -0.66]( 0.40)
  Add     0.98 [  0.00]( 0.16)     0.98 [ -0.40]( 0.53)     0.97 [ -0.80]( 0.26)     0.97 [ -0.79]( 0.34)
Triad     0.98 [  0.00]( 0.14)     0.98 [ -0.35]( 0.54)     0.97 [ -0.79]( 0.17)     0.97 [ -0.79]( 0.28)


==================================================================
Test          : netperf 
Units         : Normalized Throughput per client
Interpretation: Higher is better 
Statistic     : AMean 
==================================================================
Clients:        tip[pct imp](CV)            v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
 1-clients      1.00 [  0.00]( 0.84)     0.99 [ -0.64]( 0.10)     0.97 [ -2.61]( 0.29)     0.98 [ -2.24]( 0.16)
 2-clients      1.00 [  0.00]( 0.47)     0.99 [ -1.07]( 0.42)     0.98 [ -2.27]( 0.33)     0.97 [ -2.75]( 0.24)
 4-clients      1.01 [  0.00]( 0.45)     0.99 [ -1.41]( 0.39)     0.98 [ -2.82]( 0.31)     0.97 [ -3.23]( 0.23)
 8-clients      1.00 [  0.00]( 0.39)     0.99 [ -1.95]( 0.29)     0.98 [ -2.78]( 0.25)     0.97 [ -3.62]( 0.39)
16-clients      1.00 [  0.00]( 1.81)     0.97 [ -2.77]( 0.41)     0.97 [ -3.26]( 0.35)     0.96 [ -3.99]( 1.45)
32-clients      1.00 [  0.00]( 1.87)     0.39 [-60.63]( 1.29)     0.95 [ -4.68]( 1.45)     0.95 [ -4.89]( 1.41)
64-clients      0.98 [  0.00]( 2.70)     0.24 [-75.29]( 1.26)     0.66 [-33.23]( 0.99)     0.65 [-34.05]( 2.39)
128-clients     0.90 [  0.00]( 2.48)     0.14 [-84.47]( 3.63)     0.36 [-60.00]( 1.37)     0.36 [-60.36]( 1.54)
256-clients     0.67 [  0.00]( 2.91)     0.08 [-87.79]( 9.27)     0.54 [-20.38]( 3.69)     0.52 [-22.94]( 3.81)
512-clients     0.36 [  0.00]( 8.11)     0.51 [ 39.96]( 4.92)     0.38 [  5.12]( 6.24)     0.39 [  5.88]( 6.13)


==================================================================
Test          : schbench throughput
Units         : Normalized Requests per second 
Interpretation: Higher is better 
Statistic     : Median 
==================================================================
#workers: tip[pct imp](CV)          v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
  1      1.00 [  0.00]( 0.24)      1.01 [  0.93]( 0.00)      1.01 [  0.93]( 0.24)      1.00 [  0.47]( 0.24)
  2      2.01 [  0.00]( 0.12)      2.03 [  0.93]( 0.00)      2.03 [  1.16]( 0.00)      2.01 [  0.00]( 0.12)
  4      4.03 [  0.00]( 0.12)      4.06 [  0.70]( 0.00)      4.07 [  0.93]( 0.00)      4.02 [ -0.23]( 0.24)
  8      8.05 [  0.00]( 0.00)      8.12 [  0.93]( 0.00)      8.14 [  1.16]( 0.00)      8.07 [  0.23]( 0.00)
 16     16.17 [  0.00]( 0.12)     16.24 [  0.46]( 0.12)     16.28 [  0.69]( 0.00)     16.17 [  0.00]( 0.12)
 32     32.34 [  0.00]( 0.12)     32.49 [  0.46]( 0.00)     32.56 [  0.69]( 0.00)     32.34 [  0.00]( 0.00)
 64     64.52 [  0.00]( 0.12)     64.82 [  0.46]( 0.00)     64.97 [  0.70]( 0.00)     64.52 [  0.00]( 0.00)
128    127.25 [  0.00]( 1.48)    121.57 [ -4.47]( 0.38)    120.37 [ -5.41]( 0.13)    120.07 [ -5.64]( 0.34)
256    135.33 [  0.00]( 0.11)    136.52 [  0.88]( 0.11)    136.22 [  0.66]( 0.11)    136.52 [  0.88]( 0.11)
512    107.81 [  0.00]( 0.29)    109.91 [  1.94]( 0.92)    109.91 [  1.94]( 0.14)    109.91 [  1.94]( 0.14)


==================================================================
Test          : schbench wakeup-latency 
Units         : Normalized 99th percentile latency in us 
Interpretation: Lower is better 
Statistic     : Median 
==================================================================

#workers: tip[pct imp](CV)          v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
  1       1.00 [ -0.00](14.08)       0.80 [ 20.00](11.92)       1.00 [ -0.00]( 9.68)       1.40 [-40.00](18.75)
  2       1.20 [ -0.00]( 4.43)       1.10 [  8.33]( 4.84)       1.10 [  8.33]( 0.00)       1.10 [  8.33]( 4.56)
  4       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 4.56)       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 0.00)
  8       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 4.56)       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 0.00)
 16       1.10 [ -0.00]( 4.84)       1.20 [ -9.09]( 0.00)       1.10 [ -0.00]( 0.00)       1.10 [ -0.00]( 0.00)
 32       1.00 [ -0.00]( 0.00)       1.10 [-10.00]( 0.00)       1.10 [-10.00]( 0.00)       1.00 [ -0.00]( 0.00)
 64       1.00 [ -0.00]( 5.34)       1.10 [-10.00]( 0.00)       1.10 [-10.00]( 0.00)       1.10 [-10.00]( 0.00)
128       1.20 [ -0.00]( 4.19)       2.10 [-75.00]( 2.50)       2.10 [-75.00]( 2.50)       2.10 [-75.00]( 0.00)
256       5.90 [ -0.00]( 0.00)      12.10 [-105.08](14.03)     11.10 [-88.14]( 4.53)      12.70 [-115.25]( 5.17)
512    2627.20 [ -0.00]( 1.21)    2288.00 [ 12.91]( 9.76)    2377.60 [  9.50]( 2.40)    2281.60 [ 13.15]( 0.77)


==================================================================
Test          : schbench request-latency 
Units         : Normalized 99th percentile latency in us 
Interpretation: Lower is better 
Statistic     : Median 
==================================================================
#workers: tip[pct imp](CV)          v3[pct imp](CV)      v3-tgfix[pct imp](CV)    v3-tgfix-no-opt[pct imp](CV)
  1     1.00 [ -0.00]( 0.35)     1.00 [  0.34]( 0.17)     0.99 [  0.67]( 0.30)     1.00 [ -0.34]( 0.00)
  2     1.00 [ -0.00]( 0.17)     1.00 [  0.34]( 0.00)     0.99 [  1.01]( 0.00)     1.00 [ -0.34]( 0.17)
  4     1.00 [ -0.00]( 0.00)     1.00 [  0.34]( 0.00)     0.99 [  1.01]( 0.00)     1.00 [ -0.00]( 0.17)
  8     1.00 [ -0.00]( 0.17)     1.00 [  0.34]( 0.17)     0.99 [  1.34]( 0.18)     1.00 [  0.34]( 0.17)
 16     1.00 [ -0.00]( 0.00)     1.00 [  0.67]( 0.17)     0.99 [  1.34]( 0.35)     1.00 [ -0.00]( 0.00)
 32     1.00 [ -0.00]( 0.00)     1.00 [  0.67]( 0.00)     0.99 [  1.34]( 0.00)     1.00 [ -0.00]( 0.00)
 64     1.00 [ -0.00]( 0.00)     1.00 [  0.34]( 0.17)     1.00 [  0.67]( 0.00)     1.00 [ -0.00]( 0.17)
128     1.82 [ -0.00]( 0.83)     1.85 [ -1.48]( 0.00)     1.85 [ -1.85]( 0.37)     1.85 [ -1.85]( 0.19)
256     1.94 [ -0.00]( 0.18)     1.96 [ -1.04]( 0.36)     1.95 [ -0.69]( 0.18)     1.95 [ -0.35]( 0.18)
512    13.27 [ -0.00]( 5.00)    16.32 [-23.00]( 8.33)    16.16 [-21.78]( 1.05)    15.46 [-16.51]( 0.89)

 
--
Thanks and Regards
gautham.
