Message-ID: <20251208083602.31898-1-kprateek.nayak@amd.com>
Date: Mon, 8 Dec 2025 09:26:46 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Anna-Maria Behnsen <anna-maria@...utronix.de>,
	Frederic Weisbecker <frederic@...nel.org>, Thomas Gleixner
	<tglx@...utronix.de>
CC: <linux-kernel@...r.kernel.org>, Dietmar Eggemann
	<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
	<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
	<vschneid@...hat.com>, K Prateek Nayak <kprateek.nayak@....com>, "Gautham R.
 Shenoy" <gautham.shenoy@....com>, Swapnil Sapkal <swapnil.sapkal@....com>,
	Shrikanth Hegde <sshegde@...ux.ibm.com>, Chen Yu <yu.c.chen@...el.com>
Subject: [RESEND RFC PATCH v2 00/29] sched/fair: Push-based load balancing

Resending the series with the correct email IDs this time. Sorry for the
noise.

This is the combined successor to the following two series:

https://lore.kernel.org/lkml/20250904041516.3046-1-kprateek.nayak@amd.com/
https://lore.kernel.org/lkml/20250409111539.23791-1-kprateek.nayak@amd.com/

Most bits are the same except for more initial cleanups. The changelog
is attached towards the end. This topic will be discussed at LPC'25 in
the "Scheduler and Real-Time MC"; jump to the "LPC 2025" section below
for what will be discussed.


Problem statement
=================

This series tackles three problems:

1. Busy load balancing always uses the first CPU of
   group_balance_mask(), which puts all of the load balancing
   responsibility on a single CPU.

2. The "nohz.nr_idle" and "nohz.idle_cpus" are global system-wide
   variables that can run into scalability bottlenecks on a large
   core-count system.

3. Periodic balancing can take a long time to even out imbalances on
   systems with a mostly large and flat sched-domain hierarchy.
   Preempted tasks may wait for a long time behind other runnable
   tasks, increasing tail latencies.

This series aims to address the combined problems listed above.


Implementation details
======================

Note: Sections marked EXPERIMENTAL are known to introduce regressions
for certain benchmarks. These have been discussed in detail in the next
section. These patches may also be incomplete from a schedstats
accounting standpoint.


o Busy balancing optimization

The busy balance CPU is always the first CPU of
"sched_group->sgc->group_balance_mask". The "sgc" object is shared by
all the CPUs on the "group_balance_mask", even for the overlapping
domains.

To keep overheads minimal, a simple "busy_balance_cpu" field is
maintained in the shared "sgc" object. A CPU is nominated to handle the
busy balancing, and once it is done with its turn, it nominates the
next CPU on the group_balance_mask (a rough sketch follows the list
below).

  - Advantages: The responsibility for busy balancing is rotated among
    the CPUs on the group_balance_mask. Maintaining the
    "busy_balance_cpu" only requires READ_ONCE() / WRITE_ONCE()
    accesses, making it relatively cheap.

  - Disadvantages: The currently nominated "busy_balance_cpu" can run
    for a long time with bh disabled, which can prevent the balancing
    work from running; however, this is no worse than the current state
    where the first CPU keeps running with bh disabled for a prolonged
    period of time.
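
For illustration, a minimal sketch of the rotation, assuming the
"busy_balance_cpu" field lives in the shared "sgc" object (the helper
names below are hypothetical, not the exact ones from the series):

  static inline bool should_busy_balance(struct sched_group *sg, int cpu)
  {
  	/* Only the currently nominated CPU performs the busy balance. */
  	return READ_ONCE(sg->sgc->busy_balance_cpu) == cpu;
  }

  static inline void rotate_busy_balance_cpu(struct sched_group *sg, int cpu)
  {
  	int next = cpumask_next(cpu, group_balance_mask(sg));

  	/* Wrap back to the first CPU of the mask when falling off it. */
  	if (next >= nr_cpu_ids)
  		next = cpumask_first(group_balance_mask(sg));

  	/* Plain annotated store; readers always see a valid CPU. */
  	WRITE_ONCE(sg->sgc->busy_balance_cpu, next);
  }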


o Centralized "nohz" accounting optimizations

The centralized "nohz" tracking maintains the number and list of CPUs
that are in nohz idle state. These are done via atomic operations on
variables shared across the system which is less than ideal.

Peter suggested breaking the mask down and embedding it into the
sched_domain hierarchy which would minimize atomic operations on the
global variables.

There are 2 possible ways to implement this:

1. Maintain the idle CPUs mask in sched_domain_shared. Also construct
   a hierarchy of the sched_domain_shared objects, which can be used to
   propagate a signal up to the topmost domain.

   - Advantages: Distributed tracking. Fewer atomic operations on the
     global variables.

   - Disadvantages: The number of atomic ops can scale with the depth
     of the hierarchy, with multiple cache lines possibly being shared
     between multiple NUMA domains.

2. [Implemented in this series] Maintain the idle CPUs mask in
   sched_domain_shared. String all the sched_domain_shared objects into
   a global list that is used to traverse all nohz idle CPUs during
   load balancing.

   Maintain a global "nr_doms" indicator that is only updated when the
   first CPU is added to the LLC-local mask / the last CPU leaves the
   LLC-local mask.

   - Advantages: Distributed tracking. Simpler implementation.

   - Disadvantages: The number of atomic ops on the global "nr_doms"
     can scale with the number of LLC domains; however, these boundary
     conditions still change less frequently than the counters in the
     current global scheme.

The nohz_idle_cpus mask is also inherently optimized: a CPU is retained
on the mask until its first tick, rather than being cleared immediately
when ticks are enabled again / at idle exit. A sketch of the scheme
follows.
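
A minimal sketch of this per-LLC scheme, assuming a "nr_nohz_idle"
counter alongside the "nohz_idle_cpus_mask" in sched_domain_shared and
an atomic global "nr_doms" (field names beyond those mentioned above
are illustrative, not the actual patch):

  static atomic_t nr_doms;

  static void nohz_mark_idle(int cpu, struct sched_domain_shared *sds)
  {
  	if (cpumask_test_and_set_cpu(cpu, sds->nohz_idle_cpus_mask))
  		return;

  	/* Touch the global only when the first CPU of this LLC goes idle. */
  	if (atomic_inc_return(&sds->nr_nohz_idle) == 1)
  		atomic_inc(&nr_doms);
  }

  static void nohz_clear_idle(int cpu, struct sched_domain_shared *sds)
  {
  	if (!cpumask_test_and_clear_cpu(cpu, sds->nohz_idle_cpus_mask))
  		return;

  	/* ... and only when the last CPU of this LLC leaves nohz idle. */
  	if (atomic_dec_return(&sds->nr_nohz_idle) == 0)
  		atomic_dec(&nr_doms);
  }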


o [EXPERIMENTAL] Push-based load balancing

Proactively push tasks to idle CPUs within an LLC domain. Push-based
load balancing is found to be a delicate balancing act where delaying
the running of tasks, especially those with small runtimes, can lead to
performance regressions.

There are cases, especially with larger benchmarks, where pushing the
tasks more proactively helps performance; however, a number of
microbenchmarks suffer as a result of the additional work the busy CPU
has to do to push a preempted task. A rough sketch of the push path
follows.
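
In the sketch below, find_idle_cpu_in_llc() and pick_pushable_task()
are hypothetical placeholders for the framework; real code also needs
rq locking and migration-safety checks:

  static void push_one_task(struct rq *rq)
  {
  	struct task_struct *p;
  	int cpu;

  	/* No preempted task waiting behind the current one. */
  	if (rq->nr_running <= 1)
  		return;

  	cpu = find_idle_cpu_in_llc(rq);		/* hypothetical helper */
  	if (cpu < 0)
  		return;

  	p = pick_pushable_task(rq);		/* hypothetical helper */
  	if (!p)
  		return;

  	/* Proactively move the task instead of waiting to be pulled. */
  	deactivate_task(rq, p, 0);
  	set_task_cpu(p, cpu);
  	activate_task(cpu_rq(cpu), p, 0);
  	resched_curr(cpu_rq(cpu));
  }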


o [EXPERIMENTAL] Optimizing Intra-NUMA newidle balancing

On a CONFIG_PREEMPTION-enabled kernel, newidle balance pulls only one
task to keep the latency of balancing low. Despite that effort, the
newidle balance still ends up computing a great deal of stats just to
pull a single task at best.

Instead of following the usual path, directly traverse the CPUs during
newidle balance in search of a CPU to pull load from (see the sketch
below).

This too is found to have interesting effects on benchmarks, where CPUs
can converge on a single target to pull tasks from, causing some amount
of lock contention.

More interestingly, a number of benchmarks seem to regress if the
newidle balance yields as soon as it spots (nr_running > 1 ||
ttwu_pending) instead of just proceeding to scan the entire domain and
bailing at the end.
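
A sketch of the direct traversal (the function name is illustrative),
scanning the whole domain and bailing only at the end, per the
observation above:

  static int newidle_find_src_cpu(struct rq *this_rq, struct sched_domain *sd)
  {
  	int src = -1, cpu;

  	/* Look for a CPU with pullable tasks; no stats computation. */
  	for_each_cpu_wrap(cpu, sched_domain_span(sd), this_rq->cpu + 1) {
  		if (cpu_rq(cpu)->nr_running > 1) {
  			src = cpu;
  			break;
  		}
  	}

  	/* Only bail at the end if work arrived here in the meantime. */
  	if (this_rq->nr_running > 0 || this_rq->ttwu_pending)
  		return -1;

  	return src;
  }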


Benchmark results
=================

Results for some variants are incomplete as a result of setup issues
(and my sheer incompetence in reverting some of the changes I made when
analyzing the benchmarks).

I'll update these as and when the runs (and re-runs) complete, but at
the moment, this is how the different [EXPERIMENTAL] bits stack up from
a benchmarking perspective on a dual-socket 3rd Generation EPYC system
(2 x 64C/128T):

  ==================================================================
  Test          : hackbench
  Units         : Normalized time in seconds
  Interpretation: Lower is better
  Statistic     : AMean
  ==================================================================
  Case:           tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV) push+newidle[pct imp](CV)
   1-groups     1.00 [ -0.00]( 5.58)     1.01 [ -1.10](14.77)     1.01 [ -0.66]( 8.80)     1.03 [ -2.85]( 6.13)
   2-groups     1.00 [ -0.00]( 9.58)     1.02 [ -2.41]( 3.09)     1.00 [  0.22]( 5.62)     1.02 [ -1.97]( 4.54)
   4-groups     1.00 [ -0.00]( 2.11)     0.99 [  1.48]( 2.30)     1.00 [ -0.21]( 2.60)     1.03 [ -2.54]( 2.82)
   8-groups     1.00 [ -0.00]( 2.07)     1.02 [ -2.31]( 2.98)     1.15 [-14.79]( 2.15)     1.13 [-12.63]( 2.57)
  16-groups     1.00 [ -0.00]( 3.55)     1.09 [ -8.57]( 7.80)     1.04 [ -3.64]( 3.89)     1.04 [ -4.33]( 1.36)


  ==================================================================
  Test          : tbench
  Units         : Normalized throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:    tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)  push+newidle[pct imp](CV)
      1     1.00 [  0.00]( 0.29)     1.01 [  0.63]( 0.68)     1.00 [ -0.15]( 0.96)       0.99 [ -1.46]( 0.25)
      2     1.00 [  0.00]( 0.55)     1.00 [ -0.09]( 0.21)     1.00 [  0.47]( 0.46)       0.99 [ -1.36]( 0.54)
      4     1.00 [  0.00]( 0.33)     0.99 [ -0.83]( 0.54)     1.01 [  0.76]( 0.36)       0.98 [ -1.51]( 0.20)
      8     1.00 [  0.00]( 0.75)     1.00 [ -0.42]( 1.14)     1.01 [  0.96]( 0.49)       0.99 [ -0.64]( 0.34)
     16     1.00 [  0.00]( 0.98)     0.99 [ -0.70]( 1.23)     0.97 [ -2.55]( 0.73)       0.98 [ -1.80]( 1.62)
     32     1.00 [  0.00]( 0.04)     0.98 [ -2.32]( 1.14)     0.98 [ -1.94]( 0.86)       0.98 [ -2.02]( 0.64)
     64     1.00 [  0.00]( 1.27)     0.94 [ -5.51]( 3.69)     0.97 [ -3.45]( 1.28)       0.99 [ -1.49]( 1.68)
    128     1.00 [  0.00]( 0.69)     1.00 [ -0.05]( 2.34)     1.01 [  0.79]( 0.93)       0.99 [ -1.16]( 0.68)
    256     1.00 [  0.00]( 5.60)     0.97 [ -2.67]( 5.28)     1.00 [  0.34]( 1.23)       0.98 [ -2.16]( 7.10)
    512     1.00 [  0.00]( 0.90)     1.00 [ -0.38]( 0.86)     1.01 [  0.53]( 0.10)       0.98 [ -1.88]( 0.09)
   1024     1.00 [  0.00]( 0.25)     0.99 [ -1.01]( 0.37)     1.01 [  0.91]( 0.53)       0.98 [ -1.58]( 0.32)


  ==================================================================
  Test          : stream-10
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)      push_only[pct imp](CV)   newidle_only[pct imp](CV)  push+newidle[pct imp](CV)
   Copy     1.00 [  0.00]( 4.37)     0.97 [ -2.82]( 8.57)     0.99 [ -1.31]( 6.75)       0.97 [ -3.34]( 6.18)
  Scale     1.00 [  0.00]( 2.75)     0.99 [ -0.73]( 3.62)     0.99 [ -0.86]( 3.73)       0.99 [ -1.49]( 5.39)
    Add     1.00 [  0.00]( 3.54)     0.98 [ -2.40]( 3.99)     0.98 [ -1.51]( 4.12)       0.97 [ -3.27]( 6.28)
  Triad     1.00 [  0.00]( 4.41)     0.98 [ -1.71]( 7.00)     1.01 [  0.55]( 3.77)       0.96 [ -4.32]( 7.49)


  ==================================================================
  Test          : stream-100
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)   push+newidle[pct imp](CV)
   Copy     1.00 [  0.00]( 3.25)     0.96 [ -4.08]( 3.07)     0.98 [ -1.56]( 3.45)       0.97 [ -2.74]( 2.00)
  Scale     1.00 [  0.00]( 1.49)     0.98 [ -2.25]( 4.13)     0.98 [ -1.86]( 4.32)       0.99 [ -1.19]( 1.43)
    Add     1.00 [  0.00]( 1.75)     1.00 [ -0.47]( 2.17)     1.00 [ -0.14]( 1.31)       0.99 [ -0.81]( 2.26)
  Triad     1.00 [  0.00]( 1.95)     0.97 [ -2.82]( 4.63)     0.95 [ -4.65]( 6.59)       0.97 [ -2.80]( 4.84)


  ==================================================================
  Test          : netperf
  Units         : Normalized Throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:         tip[pct imp](CV)       push_only[pct imp](CV)  newidle_only[pct imp](CV)   push+newidle[pct imp](CV)
   1-clients     1.00 [  0.00]( 0.25)     0.98 [ -1.51]( 0.56)      0.99 [ -1.37]( 0.32)       0.98 [ -1.91]( 0.38)
   2-clients     1.00 [  0.00]( 0.39)     0.99 [ -1.26]( 1.05)      0.99 [ -0.99]( 0.75)       0.98 [ -2.16]( 0.57)
   4-clients     1.00 [  0.00]( 0.67)     0.99 [ -0.73]( 0.68)      1.00 [ -0.22]( 0.46)       0.98 [ -1.70]( 0.30)
   8-clients     1.00 [  0.00]( 0.46)     0.99 [ -1.09]( 0.50)      1.00 [ -0.27]( 0.44)       0.98 [ -1.84]( 0.59)
  16-clients     1.00 [  0.00]( 0.76)     0.99 [ -0.79]( 0.48)      1.00 [ -0.24]( 1.35)       0.99 [ -1.31]( 0.74)
  32-clients     1.00 [  0.00]( 0.82)     0.99 [ -0.91]( 0.80)      1.00 [ -0.04]( 1.16)       0.99 [ -1.27]( 0.83)
  64-clients     1.00 [  0.00]( 1.63)     0.99 [ -0.97]( 1.37)      1.00 [  0.13]( 1.47)       0.99 [ -1.17]( 1.60)
  128-clients    1.00 [  0.00]( 1.30)     0.99 [ -1.07]( 1.42)      0.99 [ -0.92]( 1.41)       0.98 [ -1.77]( 1.19)
  256-clients    1.00 [  0.00]( 5.43)     1.02 [  1.53]( 6.74)      1.02 [  1.54]( 3.40)       1.00 [  0.25]( 6.01)
  512-clients    1.00 [  0.00](55.62)     1.00 [ -0.25](54.85)      0.98 [ -1.91](52.43)       0.98 [ -1.88](51.45)


  ==================================================================
  Test          : schbench
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)    push+newidle[pct imp](CV)
    1     1.00 [ -0.00]( 2.50)     1.00 [ -0.00](35.19)     0.88 [ 12.50](31.97)         0.95 [  5.00](33.07)
    2     1.00 [ -0.00]( 8.58)     1.02 [ -2.44]( 6.45)     1.02 [ -2.44]( 9.52)         1.00 [ -0.00]( 2.44)
    4     1.00 [ -0.00]( 7.36)     1.02 [ -2.22]( 3.30)     0.98 [  2.22]( 8.29)         1.02 [ -2.22](13.95)
    8     1.00 [ -0.00]( 8.73)     1.10 [ -9.62]( 9.02)     1.06 [ -5.77]( 6.68)         1.04 [ -3.85]( 6.46)
   16     1.00 [ -0.00]( 4.34)     1.05 [ -4.84]( 4.01)     1.03 [ -3.23]( 1.82)         1.06 [ -6.45]( 4.07)
   32     1.00 [ -0.00]( 3.27)     1.06 [ -6.19]( 4.01)     0.99 [  1.03]( 2.08)         1.00 [ -0.00]( 2.06)
   64     1.00 [ -0.00]( 2.05)     1.01 [ -1.02]( 1.27)     1.01 [ -1.02]( 5.11)         0.91 [  9.18]( 6.51)
  128     1.00 [ -0.00]( 6.08)     0.95 [  5.49]( 4.91)     1.09 [ -8.59]( 8.22)         1.08 [ -7.88](11.81)
  256     1.00 [ -0.00]( 3.28)     0.94 [  6.24]( 4.22)     1.04 [ -3.72]( 6.18)         1.04 [ -4.10]( 3.62)
  512     1.00 [ -0.00]( 2.23)     0.98 [  2.29]( 1.92)     0.98 [  1.90]( 6.93)         1.02 [ -1.90]( 1.51)


  ==================================================================
  Test          : new-schbench-requests-per-second
  Units         : Normalized Requests per second
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)  push+newidle[pct imp](CV)
    1     1.00 [  0.00]( 0.14)     1.00 [  0.00]( 0.29)     1.00 [  0.00]( 0.14)       0.99 [ -0.56]( 0.91)
    2     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.14)       1.00 [  0.00]( 0.00)
    4     1.00 [  0.00]( 0.14)     1.00 [  0.00]( 0.14)     1.00 [  0.28]( 0.14)       1.00 [  0.28]( 0.14)
    8     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)       1.00 [  0.00]( 0.00)
   16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)       1.00 [  0.00]( 0.00)
   32     1.00 [  0.00]( 4.75)     1.01 [  1.13]( 0.29)     1.00 [  0.00]( 3.77)       0.99 [ -0.57]( 0.51)
   64     1.00 [  0.00]( 1.17)     1.01 [  0.69](13.90)     1.00 [  0.00](13.33)       1.01 [  0.69](13.35)
  128     1.00 [  0.00]( 0.00)     1.00 [  0.34]( 0.18)     1.01 [  0.68]( 0.00)       1.00 [  0.34]( 0.18)
  256     1.00 [  0.00]( 0.56)     1.00 [ -0.49]( 1.24)     1.00 [  0.25]( 1.47)       1.01 [  0.99]( 1.20)
  512     1.00 [  0.00]( 0.96)     1.00 [ -0.37]( 0.88)     1.00 [ -0.37]( 1.58)       1.00 [ -0.37]( 0.88)


  ==================================================================
  Test          : new-schbench-wakeup-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)   push+newidle[pct imp](CV)
    1     1.00 [ -0.00](24.81)     0.75 [ 25.00](24.12)     0.67 [ 33.33]( 6.74)        0.67 [ 33.33](11.18)
    2     1.00 [ -0.00]( 4.08)     0.77 [ 23.08]( 9.68)     0.92 [  7.69](21.56)        0.77 [ 23.08]( 8.94)
    4     1.00 [ -0.00]( 0.00)     1.08 [ -7.69](10.00)     0.85 [ 15.38]( 9.99)        0.85 [ 15.38]( 8.13)
    8     1.00 [ -0.00](12.91)     1.09 [ -9.09]( 4.43)     0.82 [ 18.18](19.99)        0.82 [ 18.18](23.66)
   16     1.00 [ -0.00](12.06)     1.18 [-18.18]( 8.37)     1.18 [-18.18](15.10)        1.18 [-18.18](15.10)
   32     1.00 [ -0.00](22.13)     1.00 [ -0.00]( 5.00)     1.10 [-10.00](19.86)        1.00 [ -0.00]( 5.34)
   64     1.00 [ -0.00](11.07)     1.00 [ -0.00](16.90)     0.92 [  7.69](15.49)        1.00 [ -0.00](13.62)
  128     1.00 [ -0.00]( 9.04)     0.98 [  2.48]( 3.01)     0.99 [  1.49]( 6.96)        0.98 [  1.98]( 5.42)
  256     1.00 [ -0.00]( 0.24)     1.00 [ -0.00]( 0.00)     1.00 [ -0.24]( 0.12)        1.00 [ -0.24]( 0.32)
  512     1.00 [ -0.00]( 0.34)     1.00 [ -0.00]( 0.40)     1.00 [  0.38]( 0.34)        0.99 [  1.15]( 0.20)


  ==================================================================
  Test          : new-schbench-request-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers: tip[pct imp](CV)      push_only[pct imp](CV)   newidle_only[pct imp](CV)  push+newidle[pct imp](CV)
    1     1.00 [ -0.00]( 0.90)     0.99 [  0.84]( 1.82)     0.99 [  0.56]( 1.10)        1.03 [ -2.53]( 1.88)
    2     1.00 [ -0.00]( 0.00)     1.01 [ -0.57]( 0.29)     1.01 [ -0.86]( 0.81)        1.02 [ -2.28]( 1.04)
    4     1.00 [ -0.00]( 1.02)     0.98 [  1.69]( 0.15)     0.99 [  0.84]( 1.02)        1.01 [ -0.84]( 1.67)
    8     1.00 [ -0.00]( 0.15)     1.01 [ -0.57]( 0.51)     1.00 [ -0.00]( 0.26)        1.00 [ -0.00]( 0.39)
   16     1.00 [ -0.00]( 0.53)     1.01 [ -0.57]( 0.64)     1.00 [ -0.29]( 0.39)        1.01 [ -0.86]( 0.81)
   32     1.00 [ -0.00](35.40)     0.98 [  1.62]( 0.49)     0.99 [  0.81](10.03)        1.00 [ -0.00]( 0.48)
   64     1.00 [ -0.00]( 5.24)     0.92 [  7.82](26.28)     1.03 [ -2.52]( 6.65)        0.62 [ 38.02](32.78)
  128     1.00 [ -0.00]( 2.02)     0.99 [  0.75]( 1.40)     1.16 [-16.14]( 2.15)        1.17 [-16.89]( 3.16)
  256     1.00 [ -0.00]( 3.41)     0.96 [  4.08]( 3.32)     1.07 [ -7.13]( 2.60)        1.10 [ -9.94]( 4.96)
  512     1.00 [ -0.00]( 1.45)     1.00 [  0.43]( 2.77)     0.99 [  1.06]( 0.73)        0.98 [  1.92]( 0.40)


  ==================================================================
  Test          : Various longer running benchmarks
  Units         : %diff in throughput reported
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  Benchmarks:                push_only    newidle_only  push+newidle
  ycsb-cassandra              -3%             -3%          -1%
  ycsb-mongodb                -2%             -2%          -1%
  deathstarbench-1x           24%             16%
  deathstarbench-2x           12%             14%
  deathstarbench-3x           17%             14%
  deathstarbench-6x


LPC 2025
========

Further experiments that were carried out will be discussed at LPC'25
in the "Scheduler and Real-Time MC" between 11:08AM and 11:30AM on 11th
December, 2025, in Hall B4.

Discussion points include:

o "sd->shared" assignment optimization.
o "nohz.idle_cpus" mask optimization
o Busy balance CPU rotation.
o Effective detection of when it is favorable to push tasks.
o The overheads of maintaining masks (even with optimizations).
o The delicate dance of newidle balance.

Please do drop by, or reach out to me directly if this work interests
you.


Changelog
=========

This series is based on tip:sched/core at commit 3eb593560146 ("Merge
tag 'v6.18-rc7' into sched/core, to pick up fixes"). All the comparisons
above were done against the same base.

o rfc v1.. rfc v2

- Collected tags on Patch 1 from Shrikanth (thanks a ton for the
  review).

- Added for_each_cpu_and_wrap() and cleaned up a couple of sites using
  the newly introduced macro (usage is sketched after this list).

- Simplified conditions that referenced per-CPU "llc_size" and
  "sd->shared" using the fact that only sd_llc has sd->shared assigned.

- Merged the two series; the idea, however, is largely the same.
  Push-based load balancing is guarded behind CONFIG_NO_HZ_COMMON since
  a bunch of NO_HZ_COMMON-specific bits were put behind the config
  option.

- The idea of an overloaded_mask was dropped since the overhead of
  maintaining the mask (without any usage) was visible in many
  benchmark results.

- The idea of a shallow_idle_cpus mask was dropped since the overhead
  of maintaining the mask (without any usage) was visible in benchmarks
  like tbench that leave the CPUs idle for very short durations.

- Added the patch to rotate the "busy_balance_cpu".

- Renamed "idle_cpus_mask" to "nohz_idle_cpus_mask" in anticipation of
  adding the "shallow_idle_cpus" mask which didn't pan out.
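
For reference, a hypothetical usage of the new for_each_cpu_and_wrap()
macro mentioned above, mirroring how for_each_cpu_wrap() is used today
(the exact signature is my assumption, not necessarily what the patch
implements):

  /* Walk the intersection of the LLC span and the task's affinity,
   * starting the search at @target and wrapping around the mask. */
  for_each_cpu_and_wrap(cpu, sched_domain_span(sd), p->cpus_ptr, target) {
  	if (available_idle_cpu(cpu))
  		return cpu;
  }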


Note: Patches marked EXPERIMENTAL may be incomplete from a schedstats
accounting standpoint.

---
K Prateek Nayak (28):
  sched/fair: Simplify set_cpu_sd_state_*() with guards
  sched/fair: Use rq->nohz_tick_stopped in update_nohz_stats()
  sched/topology: Optimize sd->shared allocation and assignment
  sched/fair: Simplify the entry condition for update_idle_cpu_scan()
  sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()
  cpumask: Introduce for_each_cpu_and_wrap() and bitfield helpers
  sched/fair: Use for_each_cpu_and_wrap() in select_idle_capacity()
  sched/fair: Use for_each_cpu_and_wrap() in select_idle_cpu()
  sched/fair: Rotate the CPU responsible for busy load balancing
  sched/fair: Use xchg() to set sd->nohz_idle state
  sched/topology: Attach new hierarchy in rq_attach_root()
  sched/fair: Fixup sd->nohz_idle state during hotplug / cpuset
  sched/fair: Account idle cpus instead of busy cpus in sd->shared
  sched/topology: Introduce fallback sd->shared assignment
  sched/topology: Introduce percpu sd_nohz for nohz state tracking
  sched/topology: Introduce "nohz_idle_cpus_mask" in sd->shared
  sched/topology: Introduce "nohz_shared_list" to keep track of
    sd->shared
  sched/fair: Reorder the barrier in nohz_balance_enter_idle()
  sched/fair: Extract the main _nohz_idle_balance() loop into a helper
  sched/fair: Convert find_new_ilb() to use nohz_shared_list
  sched/fair: Introduce sched_asym_prefer_idle() for ILB kick
  sched/fair: Convert sched_balance_nohz_idle() to use nohz_shared_list
  sched/fair: Remove "nohz.idle_cpus_mask"
  sched/fair: Optimize global "nohz.nr_cpus" tracking
  sched/topology: Add basic debug information for "nohz_shared_list"
  [EXPERIMENTAL] sched/fair: Proactive idle balance using push mechanism
  [EXPERIMENTAL] sched/fair: Add a local counter to rate limit task push
  [EXPERIMENTAL] sched/fair: Faster alternate for intra-NUMA newidle
    balance

Vincent Guittot (1):
  [EXPERIMENTAL] sched/fair: Add push task framework

 include/linux/cpumask.h        |  20 +
 include/linux/find.h           |  37 ++
 include/linux/sched/topology.h |  18 +-
 kernel/sched/core.c            |   4 +-
 kernel/sched/fair.c            | 828 ++++++++++++++++++++++++++-------
 kernel/sched/sched.h           |  10 +-
 kernel/sched/topology.c        | 386 +++++++++++++--
 7 files changed, 1076 insertions(+), 227 deletions(-)


base-commit: 3eb59356014674fa1b506a122cc59b57089a4d08
-- 
2.43.0

