Message-ID: <20251208083602.31898-1-kprateek.nayak@amd.com>
Date: Mon, 8 Dec 2025 09:26:46 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Anna-Maria Behnsen <anna-maria@...utronix.de>,
Frederic Weisbecker <frederic@...nel.org>, Thomas Gleixner
<tglx@...utronix.de>
CC: <linux-kernel@...r.kernel.org>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, K Prateek Nayak <kprateek.nayak@....com>, "Gautham R.
Shenoy" <gautham.shenoy@....com>, Swapnil Sapkal <swapnil.sapkal@....com>,
Shrikanth Hegde <sshegde@...ux.ibm.com>, Chen Yu <yu.c.chen@...el.com>
Subject: [RESEND RFC PATCH v2 00/29] sched/fair: Push-based load balancing
Resending the series with the correct Email IDs this time. Sorry for the
noise.
This is the combined successor to the following two series:
https://lore.kernel.org/lkml/20250904041516.3046-1-kprateek.nayak@amd.com/
https://lore.kernel.org/lkml/20250409111539.23791-1-kprateek.nayak@amd.com/
Most bits are the same except for more initial cleanups. The changelog is
attached towards the end. This topic will be discussed at LPC'25 in the
"Scheduler and Real-Time MC" - jump to the "LPC 2025" section below for
the discussion points.
Problem statement
=================
This series tackles three problems:
1. Busy load balancing always uses the first CPU of
   group_balance_mask(), which puts all the load balancing
   responsibility on a single CPU.
2. The "nohz.nr_idle" and "nohz.idle_cpus" are global system-wide
variables that can run into scalability bottlenecks on a large
core-count system.
3. Periodic balance can take a long time to even out imbalances on
   systems with a mostly large and flat sched-domain hierarchy.
   Preempted tasks may wait for a long time behind other runnable
   tasks, increasing tail latencies.
This series aims to address the combined set of problems listed above.
Implementation details
======================
Note: Sections marked EXPERIMENTAL are known to introduce regressions
for certain benchmarks. These have been discussed in detail in the next
section. These patches may also be incomplete from a schedstats
accounting standpoint.
o Busy balancing optimization
The busy balance CPU is always the first CPU of
"sched_group->sgc->group_balance_mask". The "sgc" object is shared by
all the CPUs on the "group_balance_mask", even for the overlapping
domains.
To keep overheads minimal, a simple "busy_balance_cpu" is maintained in
the shared "sgc" object. A CPU is nominated to handle the busy
balancing. Once the CPU is done with its turn, it nominates the next CPU
on the group_balance_mask (a minimal sketch follows the list below).
- Advantages: The responsibility of busy balancing is rotated among
  the CPUs on the group_balance_mask. Maintaining the
  "busy_balance_cpu" only requires READ_ONCE() / WRITE_ONCE()
  accesses, making it relatively cheap.
- Disadvantages: The currently nominated "busy_balance_cpu" can run
  for a long time with bh disabled, which can prevent the balancing
  work from running. However, this is no worse than the current state
  where the first CPU keeps running with bh disabled for a prolonged
  period of time.
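The sketch below outlines the rotation idea. It is illustrative only:
the "busy_balance_cpu" field and both helpers are assumptions made for
this example and not necessarily how the patches implement it. Only the
nominated CPU performs the busy balancing, and it hands the duty to the
next CPU in the balance mask once it is done.

	/* Only the currently nominated CPU performs busy load balancing. */
	static int should_do_busy_balance(struct lb_env *env)
	{
		struct sched_group_capacity *sgc = env->sd->groups->sgc;

		return READ_ONCE(sgc->busy_balance_cpu) == env->dst_cpu;
	}

	/* Hand the busy balancing duty to the next CPU in the balance mask. */
	static void rotate_busy_balance_cpu(struct lb_env *env)
	{
		struct sched_group_capacity *sgc = env->sd->groups->sgc;
		struct cpumask *mask = group_balance_mask(env->sd->groups);
		int next = cpumask_next(env->dst_cpu, mask);

		if (next >= nr_cpu_ids)
			next = cpumask_first(mask);

		WRITE_ONCE(sgc->busy_balance_cpu, next);
	}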
o Centralized "nohz" accounting optimizations
The centralized "nohz" tracking maintains the number and list of CPUs
that are in nohz idle state. These are done via atomic operations on
variables shared across the system which is less than ideal.
Peter suggested breaking the mask down and embedding it into the
sched_domain hierarchy which would minimize atomic operations on the
global variables.
There are 2 possible ways to implement this:
1. Maintain the idle CPUs mask in sched_domain_shared. Also construct a
   hierarchy of the sched_domain_shared objects which can be used to
   propagate a signal up to the topmost domain.
   - Advantages: Distributed tracking. Fewer atomic operations on the
     global variables.
   - Disadvantages: The number of atomic ops can scale with the depth
     of the hierarchy, with multiple cache lines possibly being shared
     between multiple NUMA domains.
2. [Implemented in this series] Maintain the idle CPUs mask in
   sched_domain_shared. String all the sched_domain_shared objects on a
   global list which is used for traversing all nohz idle CPUs during
   load balancing.
   Maintain a global "nr_doms" indicator that is only updated when the
   first CPU is added to the LLC-local mask / the last CPU leaves the
   LLC-local mask.
   - Advantages: Distributed tracking. Simpler implementation.
   - Disadvantages: The number of atomic ops on the global "nr_doms"
     can scale with the number of LLC domains; however, these boundary
     conditions are still hit less frequently than the updates in the
     current global scheme.
The nohz_idle_cpus mask is also inherently optimized by retaining a CPU
on the mask until its first tick; the CPU is not immediately cleared
from the mask when the tick is enabled again / at idle exit.
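A minimal sketch of the boundary-only update described above, assuming
a per-LLC "nohz_idle_cpus_mask" cpumask in sched_domain_shared and a
global "nr_doms" counter (names, types, and serialization are
assumptions for illustration; the actual patches also clear the mask
lazily as noted above):

	static atomic_t nohz_nr_doms = ATOMIC_INIT(0);

	static void nohz_mark_cpu_idle(int cpu, struct sched_domain_shared *sds)
	{
		/* First idle CPU in this LLC: account the domain globally. */
		if (cpumask_empty(sds->nohz_idle_cpus_mask))
			atomic_inc(&nohz_nr_doms);

		cpumask_set_cpu(cpu, sds->nohz_idle_cpus_mask);
	}

	static void nohz_clear_cpu_idle(int cpu, struct sched_domain_shared *sds)
	{
		cpumask_clear_cpu(cpu, sds->nohz_idle_cpus_mask);

		/* Last idle CPU left this LLC: drop the domain from the count. */
		if (cpumask_empty(sds->nohz_idle_cpus_mask))
			atomic_dec(&nohz_nr_doms);
	}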
o [EXPERIMENTAL] Push-based load balancing
Proactively push tasks to idle CPUs within an LLC domain. Push-based
load balancing is found to be a delicate balancing act where delaying
the execution of tasks, especially if their runtime is small, can lead
to performance regressions.

There are cases, especially with larger benchmarks, where pushing the
tasks more proactively helps performance; however, a number of
microbenchmarks suffer as a result of the additional work the busy CPU
has to do to push a preempted task.
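Purely as an illustration of the idea, a push target within the LLC
could be chosen along the lines below, reusing the per-LLC idle mask
from the previous section. The actual push framework (from Vincent's
patch) differs and also has to hand the task over safely; the
"nohz_idle_cpus_mask" field and the helper below are assumptions:

	/* Pick an idle CPU in the same LLC to push a preempted task to. */
	static int find_push_target(struct rq *busy_rq)
	{
		struct sched_domain_shared *sds;
		int this_cpu = cpu_of(busy_rq);
		int cpu;

		/* Caller is expected to hold rcu_read_lock(). */
		sds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
		if (!sds)
			return -1;

		/* Prefer an idle CPU close to the busy one within the LLC. */
		for_each_cpu_wrap(cpu, sds->nohz_idle_cpus_mask, this_cpu + 1) {
			if (cpu != this_cpu && idle_cpu(cpu))
				return cpu;
		}

		return -1;
	}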
o [EXPERIMENTAL] Optimizing Intra-NUMA newidle balancing
On a CONFIG_PREEMPTION enabled kernel, newidle balance only pulls one
task to keep the latency of balancing low. Despite the effort to keep
the latency low, newidle balance ends up computing a great deal of
stats just to pull a single task at best.

Instead of following the usual path, directly traverse the CPUs during
newidle balance in search of a CPU to pull load from (a rough sketch
follows below).

This too is found to have interesting effects on benchmarks where CPUs
can converge on a single target to pull tasks from, causing some amount
of lock contention.

More interestingly, a number of benchmarks seem to regress if the
newidle balance yields on spotting (nr_running > 1 || ttwu_pending)
instead of just proceeding to scan the entire domain and bailing at the
end.
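A rough sketch of such a direct scan is below. It only identifies a
donor CPU; the actual pull would still go through the existing
detach/attach machinery. The early-yield check is the one described
above as hurting some benchmarks, and all names here are illustrative
rather than taken from the patches:

	/* Scan the domain for a runqueue that can spare a single task. */
	static int newidle_scan_for_donor(struct rq *this_rq, struct sched_domain *sd)
	{
		int this_cpu = cpu_of(this_rq);
		int cpu;

		for_each_cpu_wrap(cpu, sched_domain_span(sd), this_cpu + 1) {
			if (cpu == this_cpu)
				continue;

			/*
			 * Early yield: work showed up on this CPU while
			 * scanning, so stop looking for a task to pull.
			 */
			if (this_rq->nr_running > 1 || this_rq->ttwu_pending)
				return -1;

			/* A runqueue with more than one task can spare one. */
			if (cpu_rq(cpu)->nr_running > 1)
				return cpu;
		}

		return -1;
	}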
Benchmark results
=================
Results for some variants are incomplete as a result of setup issues
(and my sheer incompetence in reverting some of the changes I made while
analyzing the benchmarks).

I'll update these as and when the runs (and re-runs) complete but at the
moment, this is how the different [EXPERIMENTAL] bits stack up from a
benchmarking perspective on a dual socket 3rd Generation EPYC system
(2 x 64C/128T):
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
1-groups 1.00 [ -0.00]( 5.58) 1.01 [ -1.10](14.77) 1.01 [ -0.66]( 8.80) 1.03 [ -2.85]( 6.13)
2-groups 1.00 [ -0.00]( 9.58) 1.02 [ -2.41]( 3.09) 1.00 [ 0.22]( 5.62) 1.02 [ -1.97]( 4.54)
4-groups 1.00 [ -0.00]( 2.11) 0.99 [ 1.48]( 2.30) 1.00 [ -0.21]( 2.60) 1.03 [ -2.54]( 2.82)
8-groups 1.00 [ -0.00]( 2.07) 1.02 [ -2.31]( 2.98) 1.15 [-14.79]( 2.15) 1.13 [-12.63]( 2.57)
16-groups 1.00 [ -0.00]( 3.55) 1.09 [ -8.57]( 7.80) 1.04 [ -3.64]( 3.89) 1.04 [ -4.33]( 1.36)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
1 1.00 [ 0.00]( 0.29) 1.01 [ 0.63]( 0.68) 1.00 [ -0.15]( 0.96) 0.99 [ -1.46]( 0.25)
2 1.00 [ 0.00]( 0.55) 1.00 [ -0.09]( 0.21) 1.00 [ 0.47]( 0.46) 0.99 [ -1.36]( 0.54)
4 1.00 [ 0.00]( 0.33) 0.99 [ -0.83]( 0.54) 1.01 [ 0.76]( 0.36) 0.98 [ -1.51]( 0.20)
8 1.00 [ 0.00]( 0.75) 1.00 [ -0.42]( 1.14) 1.01 [ 0.96]( 0.49) 0.99 [ -0.64]( 0.34)
16 1.00 [ 0.00]( 0.98) 0.99 [ -0.70]( 1.23) 0.97 [ -2.55]( 0.73) 0.98 [ -1.80]( 1.62)
32 1.00 [ 0.00]( 0.04) 0.98 [ -2.32]( 1.14) 0.98 [ -1.94]( 0.86) 0.98 [ -2.02]( 0.64)
64 1.00 [ 0.00]( 1.27) 0.94 [ -5.51]( 3.69) 0.97 [ -3.45]( 1.28) 0.99 [ -1.49]( 1.68)
128 1.00 [ 0.00]( 0.69) 1.00 [ -0.05]( 2.34) 1.01 [ 0.79]( 0.93) 0.99 [ -1.16]( 0.68)
256 1.00 [ 0.00]( 5.60) 0.97 [ -2.67]( 5.28) 1.00 [ 0.34]( 1.23) 0.98 [ -2.16]( 7.10)
512 1.00 [ 0.00]( 0.90) 1.00 [ -0.38]( 0.86) 1.01 [ 0.53]( 0.10) 0.98 [ -1.88]( 0.09)
1024 1.00 [ 0.00]( 0.25) 0.99 [ -1.01]( 0.37) 1.01 [ 0.91]( 0.53) 0.98 [ -1.58]( 0.32)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
Copy 1.00 [ 0.00]( 4.37) 0.97 [ -2.82]( 8.57) 0.99 [ -1.31]( 6.75) 0.97 [ -3.34]( 6.18)
Scale 1.00 [ 0.00]( 2.75) 0.99 [ -0.73]( 3.62) 0.99 [ -0.86]( 3.73) 0.99 [ -1.49]( 5.39)
Add 1.00 [ 0.00]( 3.54) 0.98 [ -2.40]( 3.99) 0.98 [ -1.51]( 4.12) 0.97 [ -3.27]( 6.28)
Triad 1.00 [ 0.00]( 4.41) 0.98 [ -1.71]( 7.00) 1.01 [ 0.55]( 3.77) 0.96 [ -4.32]( 7.49)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
Copy 1.00 [ 0.00]( 3.25) 0.96 [ -4.08]( 3.07) 0.98 [ -1.56]( 3.45) 0.97 [ -2.74]( 2.00)
Scale 1.00 [ 0.00]( 1.49) 0.98 [ -2.25]( 4.13) 0.98 [ -1.86]( 4.32) 0.99 [ -1.19]( 1.43)
Add 1.00 [ 0.00]( 1.75) 1.00 [ -0.47]( 2.17) 1.00 [ -0.14]( 1.31) 0.99 [ -0.81]( 2.26)
Triad 1.00 [ 0.00]( 1.95) 0.97 [ -2.82]( 4.63) 0.95 [ -4.65]( 6.59) 0.97 [ -2.80]( 4.84)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.25) 0.98 [ -1.51]( 0.56) 0.99 [ -1.37]( 0.32) 0.98 [ -1.91]( 0.38)
2-clients 1.00 [ 0.00]( 0.39) 0.99 [ -1.26]( 1.05) 0.99 [ -0.99]( 0.75) 0.98 [ -2.16]( 0.57)
4-clients 1.00 [ 0.00]( 0.67) 0.99 [ -0.73]( 0.68) 1.00 [ -0.22]( 0.46) 0.98 [ -1.70]( 0.30)
8-clients 1.00 [ 0.00]( 0.46) 0.99 [ -1.09]( 0.50) 1.00 [ -0.27]( 0.44) 0.98 [ -1.84]( 0.59)
16-clients 1.00 [ 0.00]( 0.76) 0.99 [ -0.79]( 0.48) 1.00 [ -0.24]( 1.35) 0.99 [ -1.31]( 0.74)
32-clients 1.00 [ 0.00]( 0.82) 0.99 [ -0.91]( 0.80) 1.00 [ -0.04]( 1.16) 0.99 [ -1.27]( 0.83)
64-clients 1.00 [ 0.00]( 1.63) 0.99 [ -0.97]( 1.37) 1.00 [ 0.13]( 1.47) 0.99 [ -1.17]( 1.60)
128-clients 1.00 [ 0.00]( 1.30) 0.99 [ -1.07]( 1.42) 0.99 [ -0.92]( 1.41) 0.98 [ -1.77]( 1.19)
256-clients 1.00 [ 0.00]( 5.43) 1.02 [ 1.53]( 6.74) 1.02 [ 1.54]( 3.40) 1.00 [ 0.25]( 6.01)
512-clients 1.00 [ 0.00](55.62) 1.00 [ -0.25](54.85) 0.98 [ -1.91](52.43) 0.98 [ -1.88](51.45)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
1 1.00 [ -0.00]( 2.50) 1.00 [ -0.00](35.19) 0.88 [ 12.50](31.97) 0.95 [ 5.00](33.07)
2 1.00 [ -0.00]( 8.58) 1.02 [ -2.44]( 6.45) 1.02 [ -2.44]( 9.52) 1.00 [ -0.00]( 2.44)
4 1.00 [ -0.00]( 7.36) 1.02 [ -2.22]( 3.30) 0.98 [ 2.22]( 8.29) 1.02 [ -2.22](13.95)
8 1.00 [ -0.00]( 8.73) 1.10 [ -9.62]( 9.02) 1.06 [ -5.77]( 6.68) 1.04 [ -3.85]( 6.46)
16 1.00 [ -0.00]( 4.34) 1.05 [ -4.84]( 4.01) 1.03 [ -3.23]( 1.82) 1.06 [ -6.45]( 4.07)
32 1.00 [ -0.00]( 3.27) 1.06 [ -6.19]( 4.01) 0.99 [ 1.03]( 2.08) 1.00 [ -0.00]( 2.06)
64 1.00 [ -0.00]( 2.05) 1.01 [ -1.02]( 1.27) 1.01 [ -1.02]( 5.11) 0.91 [ 9.18]( 6.51)
128 1.00 [ -0.00]( 6.08) 0.95 [ 5.49]( 4.91) 1.09 [ -8.59]( 8.22) 1.08 [ -7.88](11.81)
256 1.00 [ -0.00]( 3.28) 0.94 [ 6.24]( 4.22) 1.04 [ -3.72]( 6.18) 1.04 [ -4.10]( 3.62)
512 1.00 [ -0.00]( 2.23) 0.98 [ 2.29]( 1.92) 0.98 [ 1.90]( 6.93) 1.02 [ -1.90]( 1.51)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
1 1.00 [ 0.00]( 0.14) 1.00 [ 0.00]( 0.29) 1.00 [ 0.00]( 0.14) 0.99 [ -0.56]( 0.91)
2 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.14) 1.00 [ 0.00]( 0.00)
4 1.00 [ 0.00]( 0.14) 1.00 [ 0.00]( 0.14) 1.00 [ 0.28]( 0.14) 1.00 [ 0.28]( 0.14)
8 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
32 1.00 [ 0.00]( 4.75) 1.01 [ 1.13]( 0.29) 1.00 [ 0.00]( 3.77) 0.99 [ -0.57]( 0.51)
64 1.00 [ 0.00]( 1.17) 1.01 [ 0.69](13.90) 1.00 [ 0.00](13.33) 1.01 [ 0.69](13.35)
128 1.00 [ 0.00]( 0.00) 1.00 [ 0.34]( 0.18) 1.01 [ 0.68]( 0.00) 1.00 [ 0.34]( 0.18)
256 1.00 [ 0.00]( 0.56) 1.00 [ -0.49]( 1.24) 1.00 [ 0.25]( 1.47) 1.01 [ 0.99]( 1.20)
512 1.00 [ 0.00]( 0.96) 1.00 [ -0.37]( 0.88) 1.00 [ -0.37]( 1.58) 1.00 [ -0.37]( 0.88)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
1 1.00 [ -0.00](24.81) 0.75 [ 25.00](24.12) 0.67 [ 33.33]( 6.74) 0.67 [ 33.33](11.18)
2 1.00 [ -0.00]( 4.08) 0.77 [ 23.08]( 9.68) 0.92 [ 7.69](21.56) 0.77 [ 23.08]( 8.94)
4 1.00 [ -0.00]( 0.00) 1.08 [ -7.69](10.00) 0.85 [ 15.38]( 9.99) 0.85 [ 15.38]( 8.13)
8 1.00 [ -0.00](12.91) 1.09 [ -9.09]( 4.43) 0.82 [ 18.18](19.99) 0.82 [ 18.18](23.66)
16 1.00 [ -0.00](12.06) 1.18 [-18.18]( 8.37) 1.18 [-18.18](15.10) 1.18 [-18.18](15.10)
32 1.00 [ -0.00](22.13) 1.00 [ -0.00]( 5.00) 1.10 [-10.00](19.86) 1.00 [ -0.00]( 5.34)
64 1.00 [ -0.00](11.07) 1.00 [ -0.00](16.90) 0.92 [ 7.69](15.49) 1.00 [ -0.00](13.62)
128 1.00 [ -0.00]( 9.04) 0.98 [ 2.48]( 3.01) 0.99 [ 1.49]( 6.96) 0.98 [ 1.98]( 5.42)
256 1.00 [ -0.00]( 0.24) 1.00 [ -0.00]( 0.00) 1.00 [ -0.24]( 0.12) 1.00 [ -0.24]( 0.32)
512 1.00 [ -0.00]( 0.34) 1.00 [ -0.00]( 0.40) 1.00 [ 0.38]( 0.34) 0.99 [ 1.15]( 0.20)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
1 1.00 [ -0.00]( 0.90) 0.99 [ 0.84]( 1.82) 0.99 [ 0.56]( 1.10) 1.03 [ -2.53]( 1.88)
2 1.00 [ -0.00]( 0.00) 1.01 [ -0.57]( 0.29) 1.01 [ -0.86]( 0.81) 1.02 [ -2.28]( 1.04)
4 1.00 [ -0.00]( 1.02) 0.98 [ 1.69]( 0.15) 0.99 [ 0.84]( 1.02) 1.01 [ -0.84]( 1.67)
8 1.00 [ -0.00]( 0.15) 1.01 [ -0.57]( 0.51) 1.00 [ -0.00]( 0.26) 1.00 [ -0.00]( 0.39)
16 1.00 [ -0.00]( 0.53) 1.01 [ -0.57]( 0.64) 1.00 [ -0.29]( 0.39) 1.01 [ -0.86]( 0.81)
32 1.00 [ -0.00](35.40) 0.98 [ 1.62]( 0.49) 0.99 [ 0.81](10.03) 1.00 [ -0.00]( 0.48)
64 1.00 [ -0.00]( 5.24) 0.92 [ 7.82](26.28) 1.03 [ -2.52]( 6.65) 0.62 [ 38.02](32.78)
128 1.00 [ -0.00]( 2.02) 0.99 [ 0.75]( 1.40) 1.16 [-16.14]( 2.15) 1.17 [-16.89]( 3.16)
256 1.00 [ -0.00]( 3.41) 0.96 [ 4.08]( 3.32) 1.07 [ -7.13]( 2.60) 1.10 [ -9.94]( 4.96)
512 1.00 [ -0.00]( 1.45) 1.00 [ 0.43]( 2.77) 0.99 [ 1.06]( 0.73) 0.98 [ 1.92]( 0.40)
==================================================================
Test : Various longer running benchmarks
Units : %diff in throughput reported
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmarks: push_only newidle_only push+newidle
ycsb-cassandra -3% -3% -1%
ycsb-mongodb -2% -2% -1%
deathstarbench-1x 24% 16%
deathstarbench-2x 12% 14%
deathstarbench-3x 17% 14%
deathstarbench-6x
LPC 2025
========
Further experiments that were carried out will be discussed at LPC'25
in the "Scheduler and Real-Time MC" between 11:08AM and 11:30AM on 11th
December, 2025 in Hall B4.
Discussion points include:
o "sd->shared" assignment optimization.
o "nohz.idle_cpus" mask optimization
o Busy balance CPU rotation.
o Effective detection of when it is favorable to push tasks.
o The overheads of maintaining masks (even with optimizations).
o The delicate dance of newidle balance.
Please do drop by, or reach out to me directly if this work interests
you.
Changelog
=========
This series is based on tip:sched/core at commit 3eb593560146 ("Merge
tag 'v6.18-rc7' into sched/core, to pick up fixes"). All the comparisons
above were done against the same base.
o rfc v1.. rfc v2
- Collected tags on Patch 1 from Shrikanth (Thanks a ton for the review)
- Added for_each_cpu_and_wrap() and cleaned up a couple of sites using
  the newly introduced macro (see the usage sketch after this list).
- Simplified conditions that referenced per-CPU "llc_size" and
"sd->shared" using the fact that only sd_llc has sd->shared assigned.
- Merged the two series; the idea, however, is largely the same.
  Push-based load balancing is guarded behind CONFIG_NO_HZ_COMMON since
  a bunch of NO_HZ_COMMON-specific bits were put behind the config
  option.
- Idea of overloaded_mask was dropped since the overhead to maintain
the mask (without any usage) was visible in many benchmark results.
- Idea of shallow_idle_cpus mask was dropped since the overhead to
maintain the mask (without any usage) was visible in benchmarks like
tbench that left the CPUs idle for very short duration.
- Added the patch to rotate the "busy_balance_cpu".
- Renamed "idle_cpus_mask" to "nohz_idle_cpus_mask" in anticipation of
adding the "shallow_idle_cpus" mask which didn't pan out.
Note: Patches marked EXPERIMENTAL may be incomplete from a schedstats
accounting standpoint.
---
K Prateek Nayak (28):
sched/fair: Simplify set_cpu_sd_state_*() with guards
sched/fair: Use rq->nohz_tick_stopped in update_nohz_stats()
sched/topology: Optimize sd->shared allocation and assignment
sched/fair: Simplify the entry condition for update_idle_cpu_scan()
sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()
cpumask: Introduce for_each_cpu_and_wrap() and bitfield helpers
sched/fair: Use for_each_cpu_and_wrap() in select_idle_capacity()
sched/fair: Use for_each_cpu_and_wrap() in select_idle_cpu()
sched/fair: Rotate the CPU responsible for busy load balancing
sched/fair: Use xchg() to set sd->nohz_idle state
sched/topology: Attach new hierarchy in rq_attach_root()
sched/fair: Fixup sd->nohz_idle state during hotplug / cpuset
sched/fair: Account idle cpus instead of busy cpus in sd->shared
sched/topology: Introduce fallback sd->shared assignment
sched/topology: Introduce percpu sd_nohz for nohz state tracking
sched/topology: Introduce "nohz_idle_cpus_mask" in sd->shared
sched/topology: Introduce "nohz_shared_list" to keep track of
sd->shared
sched/fair: Reorder the barrier in nohz_balance_enter_idle()
sched/fair: Extract the main _nohz_idle_balance() loop into a helper
sched/fair: Convert find_new_ilb() to use nohz_shared_list
sched/fair: Introduce sched_asym_prefer_idle() for ILB kick
sched/fair: Convert sched_balance_nohz_idle() to use nohz_shared_list
sched/fair: Remove "nohz.idle_cpus_mask"
sched/fair: Optimize global "nohz.nr_cpus" tracking
sched/topology: Add basic debug information for "nohz_shared_list"
[EXPERIMENTAL] sched/fair: Proactive idle balance using push mechanism
[EXPERIMENTAL] sched/fair: Add a local counter to rate limit task push
[EXPERIMENTAL] sched/fair: Faster alternate for intra-NUMA newidle
balance
Vincent Guittot (1):
[EXPERIMENTAL] sched/fair: Add push task framework
include/linux/cpumask.h | 20 +
include/linux/find.h | 37 ++
include/linux/sched/topology.h | 18 +-
kernel/sched/core.c | 4 +-
kernel/sched/fair.c | 828 ++++++++++++++++++++++++++-------
kernel/sched/sched.h | 10 +-
kernel/sched/topology.c | 386 +++++++++++++--
7 files changed, 1076 insertions(+), 227 deletions(-)
base-commit: 3eb59356014674fa1b506a122cc59b57089a4d08
--
2.43.0