Message-ID: <20250316102916.10614-1-kprateek.nayak@amd.com>
Date: Sun, 16 Mar 2025 10:29:10 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Chen Yu <yu.c.chen@...el.com>,
	<linux-kernel@...r.kernel.org>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
	<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
	<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, David Vernet
	<void@...ifault.com>, "Gautham R. Shenoy" <gautham.shenoy@....com>, "Swapnil
 Sapkal" <swapnil.sapkal@....com>, Shrikanth Hegde <sshegde@...ux.ibm.com>, "K
 Prateek Nayak" <kprateek.nayak@....com>
Subject: [RFC PATCH 09/08] [ANNOTATE] sched/fair: Stats versioning and invalidation

I would have loved to spin another version of this, but being slightly
short on time before OSPM, I decided to add these bits on top of the
RFC. Sorry for the inconvenience.

Stats versioning
================

Earlier experiments looked at aggressive stats caching and reuse: each
load balancing instance computed and cached the stats for non-local
groups in the hope that they would be reused later.

With stats versioning, the load balancing CPU only caches the stats for
the local hierarchy. Instead of the jiffy-based "last_update" freshness
check, this moves to versioning based on the sched_clock_cpu() value.
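
To illustrate the idea, below is a minimal sketch, not the actual
patch: the struct and helper names (sg_stats_cache, cache_sg_stats(),
sg_stats_fresh()) and the max_age_ns cutoff are hypothetical stand-ins
for whatever the series actually uses.

  /* Sketch in the context of kernel/sched/fair.c; names are hypothetical. */
  struct sg_stats_cache {
          struct sg_lb_stats      stats;    /* cached group stats */
          u64                     version;  /* sched_clock_cpu() at compute time */
  };

  static void cache_sg_stats(int cpu, struct sg_stats_cache *sc,
                             struct sg_lb_stats *sgs)
  {
          sc->stats = *sgs;
          /* Tag the snapshot with the local sched clock. */
          WRITE_ONCE(sc->version, sched_clock_cpu(cpu));
  }

  static bool sg_stats_fresh(struct sg_stats_cache *sc, u64 now,
                             u64 max_age_ns)
  {
          u64 ver = READ_ONCE(sc->version);

          /* A zero version marks the cache invalid (see below). */
          return ver && now - ver < max_age_ns;
  }

Since sched_clock_cpu() is nanosecond granular, the freshness cutoff is
decoupled from HZ, unlike the jiffy-based stamp.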

Cached stats are invalidated once the CPU doing load balancing is done,
allowing fresher stats to be propagated. Stats computed by a concurrent
load balancing instance can now be reused, which lets idle and newidle
balance make effective use of them.
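
As a sketch of that flow, building on the hypothetical helpers above:

  static void invalidate_sg_stats(struct sg_stats_cache *sc)
  {
          /* version == 0 marks the cached stats stale. */
          WRITE_ONCE(sc->version, 0);
  }

  static bool try_reuse_sg_stats(int cpu, struct sg_stats_cache *sc,
                                 struct sg_lb_stats *sgs, u64 max_age_ns)
  {
          if (!sg_stats_fresh(sc, sched_clock_cpu(cpu), max_age_ns))
                  return false;

          /* Adopt stats published by a concurrent balancing instance. */
          *sgs = sc->stats;
          return true;
  }

The balancing CPU would call invalidate_sg_stats() on its local
hierarchy once it is done, so the next pass publishes a fresh snapshot
instead of serving a stale one.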

Stats versioning nuances are explained in Patch 11/08. Since idle and
newidle balance can reuse stats, the aggregation has been changed to
account for reduced capacity while foregoing the computation of total
capacity.
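
As a rough, hypothetical sketch of that direction (aggregate_sgs_reuse()
and the capacity check below are illustrative, not the patch):

  static void aggregate_sgs_reuse(struct sd_lb_stats *sds,
                                  struct sg_lb_stats *sgs,
                                  bool *capacity_reduced)
  {
          sds->total_load += sgs->group_load;

          /*
           * Note whether any group runs below its nominal capacity
           * instead of maintaining sds->total_capacity the way the
           * fresh aggregation path does.
           */
          if (sgs->group_capacity < sgs->group_weight * SCHED_CAPACITY_SCALE)
                  *capacity_reduced = true;
  }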

Benchmark results are as follows:

  ==================================================================
  Test          : hackbench
  Units         : Normalized time in seconds
  Interpretation: Lower is better
  Statistic     : AMean
  ==================================================================
  Case:           tip[pct imp](CV)      versioning[pct imp](CV)
   1-groups     1.00 [ -0.00](10.12)     1.00 [  0.44](13.86)
   2-groups     1.00 [ -0.00]( 6.92)     1.04 [ -4.32]( 3.00)
   4-groups     1.00 [ -0.00]( 3.14)     1.00 [ -0.21]( 2.16)
   8-groups     1.00 [ -0.00]( 1.35)     1.01 [ -1.25]( 1.32)
  16-groups     1.00 [ -0.00]( 1.32)     1.01 [ -0.50]( 2.00)

  ==================================================================
  Test          : tbench
  Units         : Normalized throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:    tip[pct imp](CV)      versioning[pct imp](CV)
      1     1.00 [  0.00]( 0.43)     0.98 [ -1.65]( 0.15)
      2     1.00 [  0.00]( 0.58)     1.01 [  1.27]( 0.49)
      4     1.00 [  0.00]( 0.54)     1.00 [  0.47]( 0.40)
      8     1.00 [  0.00]( 0.49)     1.00 [ -0.44]( 1.18)
     16     1.00 [  0.00]( 1.06)     1.00 [ -0.07]( 1.14)
     32     1.00 [  0.00]( 1.27)     1.00 [  0.02]( 0.11)
     64     1.00 [  0.00]( 1.54)     0.99 [ -1.12]( 1.09)
    128     1.00 [  0.00]( 0.38)     0.98 [ -2.43]( 1.00)
    256     1.00 [  0.00]( 1.85)     0.99 [ -0.50]( 0.94)
    512     1.00 [  0.00]( 0.31)     0.99 [ -1.03]( 0.35)
   1024     1.00 [  0.00]( 0.19)     0.99 [ -0.56]( 0.42)

  ==================================================================
  Test          : stream-10
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)      versioning[pct imp](CV)
   Copy     1.00 [  0.00](11.31)     1.08 [  7.51]( 4.74)
  Scale     1.00 [  0.00]( 6.62)     1.00 [ -0.31]( 7.45)
    Add     1.00 [  0.00]( 7.06)     1.02 [  2.50]( 7.34)
  Triad     1.00 [  0.00]( 8.91)     1.08 [  7.78]( 2.88)

  ==================================================================
  Test          : stream-100
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)     versioning[pct imp](CV)
   Copy     1.00 [  0.00]( 2.01)     1.02 [  1.82]( 1.26)
  Scale     1.00 [  0.00]( 1.49)     1.00 [  0.26]( 0.80)
    Add     1.00 [  0.00]( 2.67)     1.01 [  0.98]( 1.29)
  Triad     1.00 [  0.00]( 2.19)     1.02 [  2.06]( 1.01)

  ==================================================================
  Test          : netperf
  Units         : Normalized Throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:         tip[pct imp](CV)      versioning[pct imp](CV)
   1-clients     1.00 [  0.00]( 1.43)     0.99 [ -0.72]( 0.81)
   2-clients     1.00 [  0.00]( 1.02)     1.00 [ -0.09]( 1.11)
   4-clients     1.00 [  0.00]( 0.83)     1.00 [  0.31]( 0.29)
   8-clients     1.00 [  0.00]( 0.73)     1.00 [ -0.25]( 0.61)
  16-clients     1.00 [  0.00]( 0.97)     1.00 [ -0.26]( 0.89)
  32-clients     1.00 [  0.00]( 0.88)     0.99 [ -0.61]( 0.82)
  64-clients     1.00 [  0.00]( 1.49)     0.99 [ -1.11]( 1.77)
  128-clients    1.00 [  0.00]( 1.05)     1.00 [ -0.03]( 1.13)
  256-clients    1.00 [  0.00]( 3.85)     1.00 [ -0.24]( 2.63)
  512-clients    1.00 [  0.00](59.63)     0.99 [ -0.74](59.01)

  ==================================================================
  Test          : schbench
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:     tip[pct imp](CV)     versioning[pct imp](CV)
    1         1.00 [ -0.00]( 6.67)     0.93 [  6.67](15.25)
    2         1.00 [ -0.00](10.18)     0.83 [ 17.39]( 7.15)
    4         1.00 [ -0.00]( 4.49)     1.04 [ -4.26]( 6.12)
    8         1.00 [ -0.00]( 6.68)     1.06 [ -5.66](12.98)
   16         1.00 [ -0.00]( 1.87)     1.00 [ -0.00]( 3.38)
   32         1.00 [ -0.00]( 4.01)     0.98 [  2.20]( 4.79)
   64         1.00 [ -0.00]( 3.21)     1.02 [ -1.68]( 0.84)
  128         1.00 [ -0.00](44.13)     1.16 [-15.98](14.99)
  256         1.00 [ -0.00](14.46)     0.90 [  9.99](17.45)
  512         1.00 [ -0.00]( 1.95)     0.98 [  1.54]( 1.13)

  ==================================================================
  Test          : new-schbench-requests-per-second
  Units         : Normalized Requests per second
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  #workers:    tip[pct imp](CV)      versioning[pct imp](CV)
    1        1.00 [  0.00]( 0.46)     1.00 [  0.00]( 0.26)
    2        1.00 [  0.00]( 0.15)     1.00 [ -0.29]( 0.15)
    4        1.00 [  0.00]( 0.15)     1.00 [ -0.29]( 0.30)
    8        1.00 [  0.00]( 0.15)     1.00 [ -0.29]( 0.26)
   16        1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
   32        1.00 [  0.00]( 3.40)     1.06 [  5.93]( 1.22)
   64        1.00 [  0.00]( 7.09)     1.00 [  0.00]( 0.20)
  128        1.00 [  0.00]( 0.00)     0.98 [ -1.52]( 0.34)
  256        1.00 [  0.00]( 1.12)     0.98 [ -2.41]( 1.19)
  512        1.00 [  0.00]( 0.22)     1.00 [  0.00]( 0.43)

  ==================================================================
  Test          : new-schbench-wakeup-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:    tip[pct imp](CV)      versioning[pct imp](CV)
    1        1.00 [ -0.00](19.72)     1.00 [ -0.00]( 8.37)
    2        1.00 [ -0.00](15.96)     1.09 [ -9.09](11.08)
    4        1.00 [ -0.00]( 3.87)     1.15 [-15.38](17.44)
    8        1.00 [ -0.00]( 8.15)     0.92 [  8.33]( 8.85)
   16        1.00 [ -0.00]( 3.87)     1.23 [-23.08]( 5.59)
   32        1.00 [ -0.00](12.99)     0.73 [ 26.67](16.75)
   64        1.00 [ -0.00]( 6.20)     1.25 [-25.00]( 2.63)
  128        1.00 [ -0.00]( 0.96)     1.62 [-62.37]( 1.30)
  256        1.00 [ -0.00]( 2.76)     0.82 [ 17.89](10.56)
  512        1.00 [ -0.00]( 0.20)     1.00 [ -0.00]( 0.34)

  ==================================================================
  Test          : new-schbench-request-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:    tip[pct imp](CV)      versioning[pct imp](CV)
    1        1.00 [ -0.00]( 1.07)     1.02 [ -2.34]( 0.13)
    2        1.00 [ -0.00]( 0.14)     1.04 [ -3.97]( 0.13)
    4        1.00 [ -0.00]( 1.39)     1.03 [ -3.15]( 0.13)
    8        1.00 [ -0.00]( 0.36)     1.03 [ -3.43]( 0.66)
   16        1.00 [ -0.00]( 1.18)     0.99 [  0.79]( 1.22)
   32        1.00 [ -0.00]( 8.42)     0.82 [ 18.29]( 9.02)
   64        1.00 [ -0.00]( 4.85)     1.00 [ -0.44]( 1.61)
  128        1.00 [ -0.00]( 0.28)     1.06 [ -5.64]( 1.10)
  256        1.00 [ -0.00](10.52)     0.81 [ 19.18](12.55)
  512        1.00 [ -0.00]( 0.69)     1.00 [  0.33]( 1.27)

  ==================================================================
  Test          : Various longer running benchmarks
  Units         : %diff in throughput reported
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  Benchmarks:                 %diff

  ycsb-cassandra             -0.76%
  ycsb-mongodb                0.49%

  deathstarbench-1x          -2.37%
  deathstarbench-2x           0.12%
  deathstarbench-3x           2.30%
  deathstarbench-6x           1.88%

  hammerdb+mysql 16VU         3.85%
  hammerdb+mysql 64VU         0.27%

Following are the schedstat diffs for sched-messaging with 4 groups and
16 groups:

o 4-groups:

  ----------------------------------------------------------------------------------------------------
  CPU <ALL CPUS SUMMARY>
  ----------------------------------------------------------------------------------------------------
  DESC                                                                    COUNT1      COUNT2   PCT_CHANGE    PCT_CHANGE1 PCT_CHANGE2
  ----------------------------------------------------------------------------------------------------
  sched_yield() count                                              :           0,          0  |     0.00% |
  Legacy counter can be ignored                                    :           0,          0  |     0.00% |
  schedule() called                                                :      174683,     176871  |     1.25% |
  schedule() left the processor idle                               :       86742,      88113  |     1.58% |  (    49.66%,     49.82% )
  try_to_wake_up() was called                                      :       87675,      88622  |     1.08% |
  try_to_wake_up() was called to wake up the local cpu             :          28,         26  |    -7.14% |  (     0.03%,      0.03% )
  total runtime by tasks on this processor (in jiffies)            :  2124248214, 2118780927  |    -0.26% |
  total waittime by tasks on this processor (in jiffies)           :    24160304,   16912073  |   -30.00% |  (     1.14%,      0.80% )
  total timeslices run on this cpu                                 :       87936,      88753  |     0.93% |
  ----------------------------------------------------------------------------------------------------

  ---------------------------------------- <Category newidle> ----------------------------------------
  SMT:

  load_balance() total time to balance on newly idle               :      449650,     465044  |     3.42% |
  load_balance() stats reused on newly idle                        :           0,          0  |     0.00% |
  load_balance() stats recomputed on newly idle                    :        2493,       2679  |     7.46% |

  MC:

  load_balance() total time to balance on newly idle               :      660742,     610346  |    -7.63% |
  load_balance() stats reused on newly idle                        :           0,       1898  |     0.00% |
  load_balance() stats recomputed on newly idle                    :        3985,       3527  |   -11.49% |

  PKG:

  load_balance() total time to balance on newly idle               :      725938,     530707  |   -26.89% |
  load_balance() stats reused on newly idle                        :           0,        401  |     0.00% |
  load_balance() stats recomputed on newly idle                    :         722,        474  |   -34.35% |

  NUMA:

  load_balance() total time to balance on newly idle               :      406862,     410386  |     0.87% |
  load_balance() stats reused on newly idle                        :           0,         36  |     0.00% |
  load_balance() stats recomputed on newly idle                    :          48,         39  |   -18.75% |

o 16-groups:

  ----------------------------------------------------------------------------------------------------
  CPU <ALL CPUS SUMMARY>
  ----------------------------------------------------------------------------------------------------
  DESC                                                                    COUNT1      COUNT2   PCT_CHANGE    PCT_CHANGE1 PCT_CHANGE2
  ----------------------------------------------------------------------------------------------------
  sched_yield() count                                              :           0,          0  |     0.00% |
  Legacy counter can be ignored                                    :           0,          0  |     0.00% |
  schedule() called                                                :      566558,     554784  |    -2.08% |
  schedule() left the processor idle                               :      222161,     212164  |    -4.50% |  (    39.21%,     38.24% )
  try_to_wake_up() was called                                      :      325303,     322690  |    -0.80% |
  try_to_wake_up() was called to wake up the local cpu             :         990,       1017  |     2.73% |  (     0.30%,      0.32% )
  total runtime by tasks on this processor (in jiffies)            :  8807593610, 9142526964  |     3.80% |
  total waittime by tasks on this processor (in jiffies)           :  4093286876, 4314147489  |     5.40% |  (    46.47%,     47.19% )
  total timeslices run on this cpu                                 :      344281,     342495  |    -0.52% |
  ----------------------------------------------------------------------------------------------------

  ---------------------------------------- <Category newidle> ----------------------------------------
  SMT:

  load_balance() total time to balance on newly idle               :     9841719,   11615891  |    18.03% |
  load_balance() stats reused on newly idle                        :           0,          0  |     0.00% |
  load_balance() stats recomputed on newly idle                    :       28103,      27084  |    -3.63% |

  MC:

  load_balance() total time to balance on newly idle               :    20079305,   18103792  |    -9.84% |
  load_balance() stats reused on newly idle                        :           0,      37820  |     0.00% |
  load_balance() stats recomputed on newly idle                    :       63885,      33518  |   -47.53% |

  PKG:

  load_balance() total time to balance on newly idle               :    17972213,   16430220  |    -8.58% |
  load_balance() stats reused on newly idle                        :           0,       8461  |     0.00% |
  load_balance() stats recomputed on newly idle                    :       11513,       6318  |   -45.12% |

  NUMA:

  load_balance() total time to balance on newly idle               :    11050651,    9890509  |   -10.50% |
  load_balance() stats reused on newly idle                        :           0,        496  |     0.00% |
  load_balance() stats recomputed on newly idle                    :         827,        524  |   -36.64% |

---

Note: perf sched stats cannot properly aggregate "min" and "max" fields
yet.

Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
-- 
2.43.0

