linux-kernel - Re: [RESEND RFC PATCH v2 00/29] sched/fair: Push-based load balancing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <48ced285-f0d2-4da1-9955-12f9c4d7692d@linux.ibm.com>
Date: Mon, 8 Dec 2025 19:34:28 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: linux-kernel@...r.kernel.org, Dietmar Eggemann
 <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
        Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
        "Gautham R. Shenoy" <gautham.shenoy@....com>,
        Swapnil Sapkal <swapnil.sapkal@....com>, Chen Yu <yu.c.chen@...el.com>,
        Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Anna-Maria Behnsen <anna-maria@...utronix.de>,
        Frederic Weisbecker <frederic@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RESEND RFC PATCH v2 00/29] sched/fair: Push-based load balancing



On 12/8/25 2:56 PM, K Prateek Nayak wrote:
> Resending the series with the correct Email IDs this time. Sorry for the
> noise.
> 
> This is the combined successor to the following two series:
> 
> https://lore.kernel.org/lkml/20250904041516.3046-1-kprateek.nayak@amd.com/
> https://lore.kernel.org/lkml/20250409111539.23791-1-kprateek.nayak@amd.com/
> 
> Most bits are same except for more initial cleanups. Changelog is
> attached towards the end. This topic will be discussed at LPC'25 in the
> "Scheduler and Real-Time MC" - jump to "LPC 2025" to know what will be
> discussed.
> 
> 
> Problem statement
> =================
> 
> This series tackles three problems:
> 
> 1. The busy load balancing always uses the first CPU of
>     group_balance_mask() for load balancing which puts all the load
>     balancing responsibility on a single CPU.
> 
> 2. The "nohz.nr_idle" and "nohz.idle_cpus" are global system-wide
>     variables that can run into scalability bottlenecks on a large
>     core-count system.
> 

I think they are nohz.nr_cpus and nohz.idle_cpus_mask. isn't it?


> 3. Periodic balance can take a long time to even out imbalance on
>     systems with mostly large and flat sched-domain hierarchy. Preempted
>     tasks may wait for a long time behind other runnable tasks increasing
>     the tail latencies.
> 
> This series aims at addressing the combined problems listed above.
> 
> 
> Implementation details
> ======================
> 
> Note: Sections marked EXPERIMENTAL are known to introduce regeressions
> for certain benchmarks. These have been discussed in details in the next
> section. These patches also may be incomplete from a schedstats
> accounting standpoint.
> 
> 
> o Busy balancing optimization
> 
> The busy balance CPU is always the first cpu of
> "sched_group->scg->group_balance_mask". "sgc" object is shared by all
> the CPUs on the "group_balance_mask", even for the overlapping domains.
> 
> To keep overheads minimal, a simple "busy_balance_cpu" is maintained in
> the shared "sgc" object. A CPU is nominated to handle the busy
> balancing. Once the CPU is done with its turn, it nominates the next CPU
> on the group_balance_mask.
> 
>    - Advantages: The responsibility of busy balance is rotated among the
>      CPUs on the group_balance_mask. Maintaining the "busy_balance_cpu"
>      only requires a READ_ONCE() / WRITE_ONCE() modifications making it
>      relatively cheap.
> 
>    - Disadvantages: The currently nominated "busy_balance_cpu" can run
>      for a long time with bh disabled that can prevent balancing work
>      from running however, it is no worse that the current state where
>      the first CPU continues running with bh disabled for a prolonged
>      period of time.
> 
> 
> o Centralized "nohz" accounting optimizations
> 
> The centralized "nohz" tracking maintains the number and list of CPUs
> that are in nohz idle state. These are done via atomic operations on
> variables shared across the system which is less than ideal.
> 
> Peter suggested breaking the mask down and embedding it into the
> sched_domain hierarchy which would minimize atomic operations on the
> global variables.
> 
> There are 2 possible ways to implement this:
> 
> 1. Maintain the idle CPUs mask in sched_domain_shared. Also construct a
>     hierarchy of the sched_domain_shared objects which can be used to
>     propagate a signal up to the topmost domain.
> 
>     - Advantages: Distributed tracking. Less atomic operations on the
>       global variables.
> 
>     - Disadvantages: Number of atomic ops can scale with the depth of the
>       hierarchy with multiple cache lines being possibly shared between
>       multiple NUMA domains.
> 
> 2. [Implemented in this series] Maintain the idle CPUs mask in
>     sched_domain_shared. String all the sched_domain_shared objects in a
>     global list which is used for traversing all nohz idle CPUs during
>     load balancing.
> 
>     Maintain a global "nrr_doms" indicator that is only updated when the
>     first CPU is added to the LLC local mask / last CPU leaves the LLC
>     local mask.
> 
>     - Advantages: Distributed tracking. Simpler implementation.
> 
>     - Disadvantages: Number of atomic ops to global "nr_doms" can scale
>       with the number of LLC domains, however the changes in the boundary
>       conditions are still less frequent than the current global scheme.
> 
> The nohz_idle_cpus mask is also inherently optimized by retaining a CPU
> on the mask until the first tick and is not immediately cleared when the
> ticks are enabled again / at idle exit.
> 
> 
> o [EXPERIMENTAL] Push-based load balancing
> 
> Proactively push tasks to idle CPUs within an LLC domain. Push-based
> load balancing is found to be a delicate balancing act where delaying
> running the tasks, especially if their runtime is small, can lead to
> performance regressions.
> 
> There are cases, especially with larger benchmarks where pushing the
> tasks more proactively helps with performance however, a number of
> microbenchmark suffer as a result of additional work the busy CPU has to
> do to to push a preempted task.
> 
> 
> o [EXPERIMENTAL] Optimizing Intra-NUMA newidle balancing
> 
> On a CONFIG_PREEMPTION enabled kernel, newidle balance only pulls one
> task to keep the latency of balancing low. Despite the effort to keep
> the latency low, the newidle balance ends up computing a great deal of
> stats just to pull a single task at best.
> 
> Instead of following the usual path, directly traverse CPUs for newidle
> balance in search of CPUs to pull load from.
> 
> This too is found to have interesting effects on benchmarks where CPUs
> can converge on single target to pull tasks from causing some amount of
> lock contention.
> 
> More interestingly, a number of benchmarks seem to regress if the
> newidle balance yields on spotting (nr_running > 1 || ttwu_pending)
> instead of just proceeding to scan the entire domain and bail at the
> end.
> 
> 
> Benchmark results
> =================
> 
> Results for some variants are incomplete as a result of setup issues
> (and my sheer incompetence to revert some of the changes I made when
> analyzing the benchmarks)
> 
> I'll update these as and when the runs (and re-runs) complete but as the
> moment, this is how the different [EXPERIMENTAL] bits stack up from
> benchmarking perspective on a dual socket 3rd Generation EPYC system (2
> x 64C/128T)
> 
>    ==================================================================
>    Test          : hackbench
>    Units         : Normalized time in seconds
>    Interpretation: Lower is better
>    Statistic     : AMean
>    ==================================================================
>    Case:           tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV) push+newidle[pct imp](CV)
>     1-groups     1.00 [ -0.00]( 5.58)     1.01 [ -1.10](14.77)     1.01 [ -0.66]( 8.80)     1.03 [ -2.85]( 6.13)
>     2-groups     1.00 [ -0.00]( 9.58)     1.02 [ -2.41]( 3.09)     1.00 [  0.22]( 5.62)     1.02 [ -1.97]( 4.54)
>     4-groups     1.00 [ -0.00]( 2.11)     0.99 [  1.48]( 2.30)     1.00 [ -0.21]( 2.60)     1.03 [ -2.54]( 2.82)
>     8-groups     1.00 [ -0.00]( 2.07)     1.02 [ -2.31]( 2.98)     1.15 [-14.79]( 2.15)     1.13 [-12.63]( 2.57)
>    16-groups     1.00 [ -0.00]( 3.55)     1.09 [ -8.57]( 7.80)     1.04 [ -3.64]( 3.89)     1.04 [ -4.33]( 1.36)
> 
> 
>    ==================================================================
>    Test          : tbench
>    Units         : Normalized throughput
>    Interpretation: Higher is better
>    Statistic     : AMean
>    ==================================================================
>    Clients:    tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)  push+newidle[pct imp](CV)
>        1     1.00 [  0.00]( 0.29)     1.01 [  0.63]( 0.68)     1.00 [ -0.15]( 0.96)       0.99 [ -1.46]( 0.25)
>        2     1.00 [  0.00]( 0.55)     1.00 [ -0.09]( 0.21)     1.00 [  0.47]( 0.46)       0.99 [ -1.36]( 0.54)
>        4     1.00 [  0.00]( 0.33)     0.99 [ -0.83]( 0.54)     1.01 [  0.76]( 0.36)       0.98 [ -1.51]( 0.20)
>        8     1.00 [  0.00]( 0.75)     1.00 [ -0.42]( 1.14)     1.01 [  0.96]( 0.49)       0.99 [ -0.64]( 0.34)
>       16     1.00 [  0.00]( 0.98)     0.99 [ -0.70]( 1.23)     0.97 [ -2.55]( 0.73)       0.98 [ -1.80]( 1.62)
>       32     1.00 [  0.00]( 0.04)     0.98 [ -2.32]( 1.14)     0.98 [ -1.94]( 0.86)       0.98 [ -2.02]( 0.64)
>       64     1.00 [  0.00]( 1.27)     0.94 [ -5.51]( 3.69)     0.97 [ -3.45]( 1.28)       0.99 [ -1.49]( 1.68)
>      128     1.00 [  0.00]( 0.69)     1.00 [ -0.05]( 2.34)     1.01 [  0.79]( 0.93)       0.99 [ -1.16]( 0.68)
>      256     1.00 [  0.00]( 5.60)     0.97 [ -2.67]( 5.28)     1.00 [  0.34]( 1.23)       0.98 [ -2.16]( 7.10)
>      512     1.00 [  0.00]( 0.90)     1.00 [ -0.38]( 0.86)     1.01 [  0.53]( 0.10)       0.98 [ -1.88]( 0.09)
>     1024     1.00 [  0.00]( 0.25)     0.99 [ -1.01]( 0.37)     1.01 [  0.91]( 0.53)       0.98 [ -1.58]( 0.32)
> 
> 
>    ==================================================================
>    Test          : stream-10
>    Units         : Normalized Bandwidth, MB/s
>    Interpretation: Higher is better
>    Statistic     : HMean
>    ==================================================================
>    Test:       tip[pct imp](CV)      push_only[pct imp](CV)   newidle_only[pct imp](CV)  push+newidle[pct imp](CV)
>     Copy     1.00 [  0.00]( 4.37)     0.97 [ -2.82]( 8.57)     0.99 [ -1.31]( 6.75)       0.97 [ -3.34]( 6.18)
>    Scale     1.00 [  0.00]( 2.75)     0.99 [ -0.73]( 3.62)     0.99 [ -0.86]( 3.73)       0.99 [ -1.49]( 5.39)
>      Add     1.00 [  0.00]( 3.54)     0.98 [ -2.40]( 3.99)     0.98 [ -1.51]( 4.12)       0.97 [ -3.27]( 6.28)
>    Triad     1.00 [  0.00]( 4.41)     0.98 [ -1.71]( 7.00)     1.01 [  0.55]( 3.77)       0.96 [ -4.32]( 7.49)
> 
> 
>    ==================================================================
>    Test          : stream-100
>    Units         : Normalized Bandwidth, MB/s
>    Interpretation: Higher is better
>    Statistic     : HMean
>    ==================================================================
>    Test:       tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)   push+newidle[pct imp](CV)
>     Copy     1.00 [  0.00]( 3.25)     0.96 [ -4.08]( 3.07)     0.98 [ -1.56]( 3.45)       0.97 [ -2.74]( 2.00)
>    Scale     1.00 [  0.00]( 1.49)     0.98 [ -2.25]( 4.13)     0.98 [ -1.86]( 4.32)       0.99 [ -1.19]( 1.43)
>      Add     1.00 [  0.00]( 1.75)     1.00 [ -0.47]( 2.17)     1.00 [ -0.14]( 1.31)       0.99 [ -0.81]( 2.26)
>    Triad     1.00 [  0.00]( 1.95)     0.97 [ -2.82]( 4.63)     0.95 [ -4.65]( 6.59)       0.97 [ -2.80]( 4.84)
> 
> 
>    ==================================================================
>    Test          : netperf
>    Units         : Normalized Througput
>    Interpretation: Higher is better
>    Statistic     : AMean
>    ==================================================================
>    Clients:         tip[pct imp](CV)       push_only[pct imp](CV)  newidle_only[pct imp](CV)   push+newidle[pct imp](CV)
>     1-clients     1.00 [  0.00]( 0.25)     0.98 [ -1.51]( 0.56)      0.99 [ -1.37]( 0.32)       0.98 [ -1.91]( 0.38)
>     2-clients     1.00 [  0.00]( 0.39)     0.99 [ -1.26]( 1.05)      0.99 [ -0.99]( 0.75)       0.98 [ -2.16]( 0.57)
>     4-clients     1.00 [  0.00]( 0.67)     0.99 [ -0.73]( 0.68)      1.00 [ -0.22]( 0.46)       0.98 [ -1.70]( 0.30)
>     8-clients     1.00 [  0.00]( 0.46)     0.99 [ -1.09]( 0.50)      1.00 [ -0.27]( 0.44)       0.98 [ -1.84]( 0.59)
>    16-clients     1.00 [  0.00]( 0.76)     0.99 [ -0.79]( 0.48)      1.00 [ -0.24]( 1.35)       0.99 [ -1.31]( 0.74)
>    32-clients     1.00 [  0.00]( 0.82)     0.99 [ -0.91]( 0.80)      1.00 [ -0.04]( 1.16)       0.99 [ -1.27]( 0.83)
>    64-clients     1.00 [  0.00]( 1.63)     0.99 [ -0.97]( 1.37)      1.00 [  0.13]( 1.47)       0.99 [ -1.17]( 1.60)
>    128-clients    1.00 [  0.00]( 1.30)     0.99 [ -1.07]( 1.42)      0.99 [ -0.92]( 1.41)       0.98 [ -1.77]( 1.19)
>    256-clients    1.00 [  0.00]( 5.43)     1.02 [  1.53]( 6.74)      1.02 [  1.54]( 3.40)       1.00 [  0.25]( 6.01)
>    512-clients    1.00 [  0.00](55.62)     1.00 [ -0.25](54.85)      0.98 [ -1.91](52.43)       0.98 [ -1.88](51.45)
> 
> 
>    ==================================================================
>    Test          : schbench
>    Units         : Normalized 99th percentile latency in us
>    Interpretation: Lower is better
>    Statistic     : Median
>    ==================================================================
>    #workers: tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)    push+newidle[pct imp](CV)
>      1     1.00 [ -0.00]( 2.50)     1.00 [ -0.00](35.19)     0.88 [ 12.50](31.97)         0.95 [  5.00](33.07)
>      2     1.00 [ -0.00]( 8.58)     1.02 [ -2.44]( 6.45)     1.02 [ -2.44]( 9.52)         1.00 [ -0.00]( 2.44)
>      4     1.00 [ -0.00]( 7.36)     1.02 [ -2.22]( 3.30)     0.98 [  2.22]( 8.29)         1.02 [ -2.22](13.95)
>      8     1.00 [ -0.00]( 8.73)     1.10 [ -9.62]( 9.02)     1.06 [ -5.77]( 6.68)         1.04 [ -3.85]( 6.46)
>     16     1.00 [ -0.00]( 4.34)     1.05 [ -4.84]( 4.01)     1.03 [ -3.23]( 1.82)         1.06 [ -6.45]( 4.07)
>     32     1.00 [ -0.00]( 3.27)     1.06 [ -6.19]( 4.01)     0.99 [  1.03]( 2.08)         1.00 [ -0.00]( 2.06)
>     64     1.00 [ -0.00]( 2.05)     1.01 [ -1.02]( 1.27)     1.01 [ -1.02]( 5.11)         0.91 [  9.18]( 6.51)
>    128     1.00 [ -0.00]( 6.08)     0.95 [  5.49]( 4.91)     1.09 [ -8.59]( 8.22)         1.08 [ -7.88](11.81)
>    256     1.00 [ -0.00]( 3.28)     0.94 [  6.24]( 4.22)     1.04 [ -3.72]( 6.18)         1.04 [ -4.10]( 3.62)
>    512     1.00 [ -0.00]( 2.23)     0.98 [  2.29]( 1.92)     0.98 [  1.90]( 6.93)         1.02 [ -1.90]( 1.51)
> 
> 
>    ==================================================================
>    Test          : new-schbench-requests-per-second
>    Units         : Normalized Requests per second
>    Interpretation: Higher is better
>    Statistic     : Median
>    ==================================================================
>    #workers: tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)  push+newidle[pct imp](CV)
>      1     1.00 [  0.00]( 0.14)     1.00 [  0.00]( 0.29)     1.00 [  0.00]( 0.14)       0.99 [ -0.56]( 0.91)
>      2     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.14)       1.00 [  0.00]( 0.00)
>      4     1.00 [  0.00]( 0.14)     1.00 [  0.00]( 0.14)     1.00 [  0.28]( 0.14)       1.00 [  0.28]( 0.14)
>      8     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)       1.00 [  0.00]( 0.00)
>     16     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)       1.00 [  0.00]( 0.00)
>     32     1.00 [  0.00]( 4.75)     1.01 [  1.13]( 0.29)     1.00 [  0.00]( 3.77)       0.99 [ -0.57]( 0.51)
>     64     1.00 [  0.00]( 1.17)     1.01 [  0.69](13.90)     1.00 [  0.00](13.33)       1.01 [  0.69](13.35)
>    128     1.00 [  0.00]( 0.00)     1.00 [  0.34]( 0.18)     1.01 [  0.68]( 0.00)       1.00 [  0.34]( 0.18)
>    256     1.00 [  0.00]( 0.56)     1.00 [ -0.49]( 1.24)     1.00 [  0.25]( 1.47)       1.01 [  0.99]( 1.20)
>    512     1.00 [  0.00]( 0.96)     1.00 [ -0.37]( 0.88)     1.00 [ -0.37]( 1.58)       1.00 [ -0.37]( 0.88)
> 
> 
>    ==================================================================
>    Test          : new-schbench-wakeup-latency
>    Units         : Normalized 99th percentile latency in us
>    Interpretation: Lower is better
>    Statistic     : Median
>    ==================================================================
>    #workers: tip[pct imp](CV)      push_only[pct imp](CV)  newidle_only[pct imp](CV)   push+newidle[pct imp](CV)
>      1     1.00 [ -0.00](24.81)     0.75 [ 25.00](24.12)     0.67 [ 33.33]( 6.74)        0.67 [ 33.33](11.18)
>      2     1.00 [ -0.00]( 4.08)     0.77 [ 23.08]( 9.68)     0.92 [  7.69](21.56)        0.77 [ 23.08]( 8.94)
>      4     1.00 [ -0.00]( 0.00)     1.08 [ -7.69](10.00)     0.85 [ 15.38]( 9.99)        0.85 [ 15.38]( 8.13)
>      8     1.00 [ -0.00](12.91)     1.09 [ -9.09]( 4.43)     0.82 [ 18.18](19.99)        0.82 [ 18.18](23.66)
>     16     1.00 [ -0.00](12.06)     1.18 [-18.18]( 8.37)     1.18 [-18.18](15.10)        1.18 [-18.18](15.10)
>     32     1.00 [ -0.00](22.13)     1.00 [ -0.00]( 5.00)     1.10 [-10.00](19.86)        1.00 [ -0.00]( 5.34)
>     64     1.00 [ -0.00](11.07)     1.00 [ -0.00](16.90)     0.92 [  7.69](15.49)        1.00 [ -0.00](13.62)
>    128     1.00 [ -0.00]( 9.04)     0.98 [  2.48]( 3.01)     0.99 [  1.49]( 6.96)        0.98 [  1.98]( 5.42)
>    256     1.00 [ -0.00]( 0.24)     1.00 [ -0.00]( 0.00)     1.00 [ -0.24]( 0.12)        1.00 [ -0.24]( 0.32)
>    512     1.00 [ -0.00]( 0.34)     1.00 [ -0.00]( 0.40)     1.00 [  0.38]( 0.34)        0.99 [  1.15]( 0.20)
> 
> 
>    ==================================================================
>    Test          : new-schbench-request-latency
>    Units         : Normalized 99th percentile latency in us
>    Interpretation: Lower is better
>    Statistic     : Median
>    ==================================================================
>    #workers: tip[pct imp](CV)      push_only[pct imp](CV)   newidle_only[pct imp](CV)  push+newidle[pct imp](CV)
>      1     1.00 [ -0.00]( 0.90)     0.99 [  0.84]( 1.82)     0.99 [  0.56]( 1.10)        1.03 [ -2.53]( 1.88)
>      2     1.00 [ -0.00]( 0.00)     1.01 [ -0.57]( 0.29)     1.01 [ -0.86]( 0.81)        1.02 [ -2.28]( 1.04)
>      4     1.00 [ -0.00]( 1.02)     0.98 [  1.69]( 0.15)     0.99 [  0.84]( 1.02)        1.01 [ -0.84]( 1.67)
>      8     1.00 [ -0.00]( 0.15)     1.01 [ -0.57]( 0.51)     1.00 [ -0.00]( 0.26)        1.00 [ -0.00]( 0.39)
>     16     1.00 [ -0.00]( 0.53)     1.01 [ -0.57]( 0.64)     1.00 [ -0.29]( 0.39)        1.01 [ -0.86]( 0.81)
>     32     1.00 [ -0.00](35.40)     0.98 [  1.62]( 0.49)     0.99 [  0.81](10.03)        1.00 [ -0.00]( 0.48)
>     64     1.00 [ -0.00]( 5.24)     0.92 [  7.82](26.28)     1.03 [ -2.52]( 6.65)        0.62 [ 38.02](32.78)
>    128     1.00 [ -0.00]( 2.02)     0.99 [  0.75]( 1.40)     1.16 [-16.14]( 2.15)        1.17 [-16.89]( 3.16)
>    256     1.00 [ -0.00]( 3.41)     0.96 [  4.08]( 3.32)     1.07 [ -7.13]( 2.60)        1.10 [ -9.94]( 4.96)
>    512     1.00 [ -0.00]( 1.45)     1.00 [  0.43]( 2.77)     0.99 [  1.06]( 0.73)        0.98 [  1.92]( 0.40)
> 
> 
>    ==================================================================
>    Test          : Various longer running benchmarks
>    Units         : %diff in throughput reported
>    Interpretation: Higher is better
>    Statistic     : Median
>    ==================================================================
>    Benchmarks:                push_only    newidle_only  push+newidle
>    ycsb-cassandra              -3%             -3%          -1%
>    ycsb-mongodb                -2%             -2%          -1%
>    deathstarbench-1x           24%             16%
>    deathstarbench-2x           12%             14%
>    deathstarbench-3x           17%             14%
>    deathstarbench-6x
> 
> 
> LPC 2025
> ========
> 
> Further experiments carried out will be discussed at LPC'25 in the
> "Scheduler and Real-Time MC" between 11:08AM and 11:30AM on 11th
> December, 2025 in Hall B4.
> 
> Discussion points include:
> 
> o "sd->shared" assignment optimization.
> o "nohz.idle_cpus" mask optimization
> o Busy balance CPU rotation.
> o Effective detection of when it is favorable to push tasks.
> o The overheads of maintaining masks (even with optimizations).
> o The delicate dance of newidle balance.
> 
> Please do drop by, or reach out to me directly if this work interests
> you.
> 
> 
> Changelog
> =========
> 
> This series is based on tip:sched/core at commit 3eb593560146 ("Merge
> tag 'v6.18-rc7' into sched/core, to pick up fixes"). All the comparisons
> above are done with the same.
> 
> o rfc v1.. rfc v2
> 
> - Collected tags on Patch 1 from Srikanth (Thanks a ton for the review)
> 
> - Added a for_each_cpu_and_wrap() and cleaned up couple of sites using
>    the newly introduced macro.
> 
> - Simplified conditions that referenced per-CPU "llc_size" and
>    "sd->shared" using the fact that only sd_llc has sd->shared assigned.
> 
> - Merged the two series however, the idea is largely the same. Push
>    based load balancing is guarded behing CONFIG_NO_HZ_COMMON since a
>    bunch of NO_HZ_COMMON specific bits were put behind the config option.
> 
> - Idea of overloaded_mask was dropped since the overhead to maintain
>    the mask (without any usage) was visible in many benchmark results.
> 
> - Idea of shallow_idle_cpus mask was dropped since the overhead to
>    maintain the mask (without any usage) was visible in benchmarks like
>    tbench that left the CPUs idle for very short duration.
> 
> - Added the patch to rotate the "busy_balance_cpu".
> 
> - Renamed "idle_cpus_mask" to "nohz_idle_cpus_mask" in anticipation of
>    adding the "shallow_idle_cpus" mask which didn't pan out.
> 
> 
> Note: Patched marked EXPERIMENTAL may be incomplete from a schedstats
> accounting standpoint.
> 
> ---
> K Prateek Nayak (28):
>    sched/fair: Simplify set_cpu_sd_state_*() with guards
>    sched/fair: Use rq->nohz_tick_stopped in update_nohz_stats()
>    sched/topology: Optimize sd->shared allocation and assignment
>    sched/fair: Simplify the entry condition for update_idle_cpu_scan()
>    sched/fair: Simplity SIS_UTIL handling in select_idle_cpu()
>    cpumask: Introduce for_each_cpu_and_wrap() and bitfield helpers
>    sched/fair: Use for_each_cpu_and_wrap() in select_idle_capacity()
>    sched/fair: Use for_each_cpu_and_wrap() in select_idle_cpu()
>    sched/fair: Rotate the CPU resposible for busy load balancing
>    sched/fair: Use xchg() to set sd->nohz_idle state
>    sched/topology: Attach new hierarchy in rq_attach_root()
>    sched/fair: Fixup sd->nohz_idle state during hotplug / cpuset
>    sched/fair: Account idle cpus instead of busy cpus in sd->shared
>    sched/topology: Introduce fallback sd->shared assignment
>    sched/topology: Introduce percpu sd_nohz for nohz state tracking
>    sched/topology: Introduce "nohz_idle_cpus_mask" in sd->shared
>    sched/topology: Introduce "nohz_shared_list" to keep track of
>      sd->shared
>    sched/fair: Reorder the barrier in nohz_balance_enter_idle()
>    sched/fair: Extract the main _nohz_idle_balance() loop into a helper
>    sched/fair: Convert find_new_ilb() to use nohz_shared_list
>    sched/fair: Introduce sched_asym_prefer_idle() for ILB kick
>    sched/fair: Convert sched_balance_nohz_idle() to use nohz_shared_list
>    sched/fair: Remove "nohz.idle_cpus_mask"
>    sched/fair: Optimize global "nohz.nr_cpus" tracking
>    sched/topology: Add basic debug information for "nohz_shared_list"
>    [EXPERIMENTAL] sched/fair: Proactive idle balance using push mechanism
>    [EXPERIMENTAL] sched/fair: Add a local counter to rate limit task push
>    [EXPERIMENTAL] sched/fair: Faster alternate for intra-NUMA newidle
>      balance
> 
> Vincent Guittot (1):
>    [EXPERIMENTAL] sched/fair: Add push task framework
> 
>   include/linux/cpumask.h        |  20 +
>   include/linux/find.h           |  37 ++
>   include/linux/sched/topology.h |  18 +-
>   kernel/sched/core.c            |   4 +-
>   kernel/sched/fair.c            | 828 ++++++++++++++++++++++++++-------
>   kernel/sched/sched.h           |  10 +-
>   kernel/sched/topology.c        | 386 +++++++++++++--
>   7 files changed, 1076 insertions(+), 227 deletions(-)
> 
> 
> base-commit: 3eb59356014674fa1b506a122cc59b57089a4d08