Message-ID: <48ced285-f0d2-4da1-9955-12f9c4d7692d@linux.ibm.com>
Date: Mon, 8 Dec 2025 19:34:28 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: linux-kernel@...r.kernel.org, Dietmar Eggemann
<dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
"Gautham R. Shenoy" <gautham.shenoy@....com>,
Swapnil Sapkal <swapnil.sapkal@....com>, Chen Yu <yu.c.chen@...el.com>,
Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Anna-Maria Behnsen <anna-maria@...utronix.de>,
Frederic Weisbecker <frederic@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RESEND RFC PATCH v2 00/29] sched/fair: Push-based load balancing
On 12/8/25 2:56 PM, K Prateek Nayak wrote:
> Resending the series with the correct Email IDs this time. Sorry for the
> noise.
>
> This is the combined successor to the following two series:
>
> https://lore.kernel.org/lkml/20250904041516.3046-1-kprateek.nayak@amd.com/
> https://lore.kernel.org/lkml/20250409111539.23791-1-kprateek.nayak@amd.com/
>
> Most bits are the same except for more initial cleanups. Changelog is
> attached towards the end. This topic will be discussed at LPC'25 in the
> "Scheduler and Real-Time MC" - jump to "LPC 2025" to know what will be
> discussed.
>
>
> Problem statement
> =================
>
> This series tackles three problems:
>
> 1. The busy load balancing always uses the first CPU of
> group_balance_mask() for load balancing which puts all the load
> balancing responsibility on a single CPU.
>
> 2. The "nohz.nr_idle" and "nohz.idle_cpus" are global system-wide
> variables that can run into scalability bottlenecks on a large
> core-count system.
>
I think these are nohz.nr_cpus and nohz.idle_cpus_mask, aren't they?
> 3. Periodic balance can take a long time to even out imbalance on
> systems with mostly large and flat sched-domain hierarchy. Preempted
> tasks may wait for a long time behind other runnable tasks increasing
> the tail latencies.
>
> This series aims at addressing the combined problems listed above.
>
>
> Implementation details
> ======================
>
> Note: Sections marked EXPERIMENTAL are known to introduce regressions
> for certain benchmarks. These have been discussed in detail in the next
> section. These patches also may be incomplete from a schedstats
> accounting standpoint.
>
>
> o Busy balancing optimization
>
> The busy balance CPU is always the first CPU of
> "sched_group->sgc->group_balance_mask". The "sgc" object is shared by all
> the CPUs on the "group_balance_mask", even for the overlapping domains.
>
> To keep overheads minimal, a simple "busy_balance_cpu" is maintained in
> the shared "sgc" object. A CPU is nominated to handle the busy
> balancing. Once the CPU is done with its turn, it nominates the next CPU
> on the group_balance_mask.
>
> - Advantages: The responsibility of busy balance is rotated among the
> CPUs on the group_balance_mask. Maintaining the "busy_balance_cpu"
> only requires READ_ONCE() / WRITE_ONCE() accesses, making it
> relatively cheap.
>
> - Disadvantages: The currently nominated "busy_balance_cpu" can run
> for a long time with bh disabled, which can prevent balancing work
> from running. However, this is no worse than the current state where
> the first CPU continues running with bh disabled for a prolonged
> period of time.
>
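To check my understanding of the rotation described above, here is a minimal userspace sketch. It is not the code from the patch: "struct group_capacity_sketch", next_cpu_wrap(), should_busy_balance() and finish_busy_balance() are hypothetical names, the mask is limited to 64 CPUs, and relaxed C11 atomics stand in for READ_ONCE() / WRITE_ONCE().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 64

/* Hypothetical stand-in for the shared "sgc" object. */
struct group_capacity_sketch {
	uint64_t balance_mask;        /* bit N set => CPU N is on group_balance_mask */
	_Atomic int busy_balance_cpu; /* CPU currently nominated for busy balancing */
};

/* Next CPU on @mask after @cpu, wrapping around; -1 if @mask is empty. */
static int next_cpu_wrap(uint64_t mask, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);

	for (int i = 1; i <= NR_CPUS; i++) {
		int candidate = (cpu + i) % NR_CPUS;

		if (mask & (UINT64_C(1) << candidate))
			return candidate;
	}
	return -1;
}

/* Only the nominated CPU takes up busy balancing for the group. */
static bool should_busy_balance(struct group_capacity_sketch *sgc, int this_cpu)
{
	return atomic_load_explicit(&sgc->busy_balance_cpu,
				    memory_order_relaxed) == this_cpu;
}

/* Once its turn is done, the CPU nominates the next CPU on the mask. */
static void finish_busy_balance(struct group_capacity_sketch *sgc, int this_cpu)
{
	int next = next_cpu_wrap(sgc->balance_mask, this_cpu);

	if (next >= 0)
		atomic_store_explicit(&sgc->busy_balance_cpu, next,
				      memory_order_relaxed);
}
```

With a mask of CPUs {0, 2, 4} and CPU 0 nominated first, the nomination in this sketch moves 0 -> 2 -> 4 -> 0. If the mask has a single CPU, it keeps nominating itself, which matches the current behaviour.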
>
> o Centralized "nohz" accounting optimizations
>
> The centralized "nohz" tracking maintains the number and list of CPUs
> that are in nohz idle state. These are done via atomic operations on
> variables shared across the system which is less than ideal.
>
> Peter suggested breaking the mask down and embedding it into the
> sched_domain hierarchy which would minimize atomic operations on the
> global variables.
>
> There are 2 possible ways to implement this:
>
> 1. Maintain the idle CPUs mask in sched_domain_shared. Also construct a
> hierarchy of the sched_domain_shared objects which can be used to
> propagate a signal up to the topmost domain.
>
> - Advantages: Distributed tracking. Fewer atomic operations on the
> global variables.
>
> - Disadvantages: Number of atomic ops can scale with the depth of the
> hierarchy with multiple cache lines being possibly shared between
> multiple NUMA domains.
>
> 2. [Implemented in this series] Maintain the idle CPUs mask in
> sched_domain_shared. String all the sched_domain_shared objects in a
> global list which is used for traversing all nohz idle CPUs during
> load balancing.
>
> Maintain a global "nr_doms" indicator that is only updated when the
> first CPU is added to the LLC local mask / last CPU leaves the LLC
> local mask.
>
> - Advantages: Distributed tracking. Simpler implementation.
>
> - Disadvantages: Number of atomic ops to global "nr_doms" can scale
> with the number of LLC domains, however the changes in the boundary
> conditions are still less frequent than the current global scheme.
>
> The nohz_idle_cpus mask is also inherently optimized by retaining a CPU
> on the mask until the first tick and is not immediately cleared when the
> ticks are enabled again / at idle exit.
>
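The boundary-condition accounting of option 2 can be sketched in userspace as below. This is only an illustration under my own assumptions: "struct llc_shared_sketch" and the two helpers are hypothetical names, each LLC is limited to 64 CPUs indexed locally, and the global list of sched_domain_shared objects is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-LLC shared state, standing in for sched_domain_shared. */
struct llc_shared_sketch {
	_Atomic uint64_t nohz_idle_cpus_mask; /* bit N => LLC-local CPU N is nohz idle */
};

/* Global count of LLC domains that have at least one nohz idle CPU. */
static _Atomic int nr_doms;

static void llc_enter_nohz_idle(struct llc_shared_sketch *sds, int cpu)
{
	uint64_t bit = UINT64_C(1) << cpu;
	uint64_t old = atomic_fetch_or(&sds->nohz_idle_cpus_mask, bit);

	/* Only the empty -> non-empty transition touches the global counter. */
	if (old == 0)
		atomic_fetch_add(&nr_doms, 1);
}

static void llc_exit_nohz_idle(struct llc_shared_sketch *sds, int cpu)
{
	uint64_t bit = UINT64_C(1) << cpu;
	uint64_t old = atomic_fetch_and(&sds->nohz_idle_cpus_mask, ~bit);

	/* Only the non-empty -> empty transition touches the global counter. */
	if (old == bit)
		atomic_fetch_sub(&nr_doms, 1);
}
```

In this sketch the mask update and the counter update are two separate atomic operations, so a concurrent reader can briefly see "nr_doms" out of step with the masks. I have not checked how the series orders these.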
>
> o [EXPERIMENTAL] Push-based load balancing
>
> Proactively push tasks to idle CPUs within an LLC domain. Push-based
> load balancing is found to be a delicate balancing act where delaying
> running the tasks, especially if their runtime is small, can lead to
> performance regressions.
>
> There are cases, especially with larger benchmarks, where pushing the
> tasks more proactively helps with performance. However, a number of
> microbenchmarks suffer as a result of the additional work the busy CPU
> has to do to push a preempted task.
>
>
> o [EXPERIMENTAL] Optimizing Intra-NUMA newidle balancing
>
> On a CONFIG_PREEMPTION enabled kernel, newidle balance only pulls one
> task to keep the latency of balancing low. Despite the effort to keep
> the latency low, the newidle balance ends up computing a great deal of
> stats just to pull a single task at best.
>
> Instead of following the usual path, directly traverse CPUs for newidle
> balance in search of CPUs to pull load from.
>
> This too is found to have interesting effects on benchmarks where CPUs
> can converge on a single target to pull tasks from, causing some amount
> of lock contention.
>
> More interestingly, a number of benchmarks seem to regress if the
> newidle balance yields on spotting (nr_running > 1 || ttwu_pending)
> instead of just proceeding to scan the entire domain and bail at the
> end.
>
>
> Benchmark results
> =================
>
> Results for some variants are incomplete as a result of setup issues
> (and my sheer incompetence to revert some of the changes I made when
> analyzing the benchmarks).
>
> I'll update these as and when the runs (and re-runs) complete but at the
> moment, this is how the different [EXPERIMENTAL] bits stack up from a
> benchmarking perspective on a dual socket 3rd Generation EPYC system (2
> x 64C/128T):
>
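For anyone reading the tables below: my reading of the "[pct imp]" column, inferred only from the numbers and not stated in the cover letter, is that it is the percentage change against tip with the sign flipped for "Lower is better" tests, so a positive value is always an improvement. A small sketch of that assumed convention (pct_improvement() is a hypothetical helper):

```c
#include <stdbool.h>

/*
 * Assumed convention: @normalized is the result divided by the tip result.
 * A positive return value means "better than tip" for both kinds of test.
 */
static double pct_improvement(double normalized, bool lower_is_better)
{
	double delta = (normalized - 1.0) * 100.0;

	return lower_is_better ? -delta : delta;
}
```

The normalized values in the tables are rounded to two decimals, so recomputing from them only approximates the bracketed figures.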
> ==================================================================
> Test : hackbench
> Units : Normalized time in seconds
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> Case: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
> 1-groups 1.00 [ -0.00]( 5.58) 1.01 [ -1.10](14.77) 1.01 [ -0.66]( 8.80) 1.03 [ -2.85]( 6.13)
> 2-groups 1.00 [ -0.00]( 9.58) 1.02 [ -2.41]( 3.09) 1.00 [ 0.22]( 5.62) 1.02 [ -1.97]( 4.54)
> 4-groups 1.00 [ -0.00]( 2.11) 0.99 [ 1.48]( 2.30) 1.00 [ -0.21]( 2.60) 1.03 [ -2.54]( 2.82)
> 8-groups 1.00 [ -0.00]( 2.07) 1.02 [ -2.31]( 2.98) 1.15 [-14.79]( 2.15) 1.13 [-12.63]( 2.57)
> 16-groups 1.00 [ -0.00]( 3.55) 1.09 [ -8.57]( 7.80) 1.04 [ -3.64]( 3.89) 1.04 [ -4.33]( 1.36)
>
>
> ==================================================================
> Test : tbench
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
> 1 1.00 [ 0.00]( 0.29) 1.01 [ 0.63]( 0.68) 1.00 [ -0.15]( 0.96) 0.99 [ -1.46]( 0.25)
> 2 1.00 [ 0.00]( 0.55) 1.00 [ -0.09]( 0.21) 1.00 [ 0.47]( 0.46) 0.99 [ -1.36]( 0.54)
> 4 1.00 [ 0.00]( 0.33) 0.99 [ -0.83]( 0.54) 1.01 [ 0.76]( 0.36) 0.98 [ -1.51]( 0.20)
> 8 1.00 [ 0.00]( 0.75) 1.00 [ -0.42]( 1.14) 1.01 [ 0.96]( 0.49) 0.99 [ -0.64]( 0.34)
> 16 1.00 [ 0.00]( 0.98) 0.99 [ -0.70]( 1.23) 0.97 [ -2.55]( 0.73) 0.98 [ -1.80]( 1.62)
> 32 1.00 [ 0.00]( 0.04) 0.98 [ -2.32]( 1.14) 0.98 [ -1.94]( 0.86) 0.98 [ -2.02]( 0.64)
> 64 1.00 [ 0.00]( 1.27) 0.94 [ -5.51]( 3.69) 0.97 [ -3.45]( 1.28) 0.99 [ -1.49]( 1.68)
> 128 1.00 [ 0.00]( 0.69) 1.00 [ -0.05]( 2.34) 1.01 [ 0.79]( 0.93) 0.99 [ -1.16]( 0.68)
> 256 1.00 [ 0.00]( 5.60) 0.97 [ -2.67]( 5.28) 1.00 [ 0.34]( 1.23) 0.98 [ -2.16]( 7.10)
> 512 1.00 [ 0.00]( 0.90) 1.00 [ -0.38]( 0.86) 1.01 [ 0.53]( 0.10) 0.98 [ -1.88]( 0.09)
> 1024 1.00 [ 0.00]( 0.25) 0.99 [ -1.01]( 0.37) 1.01 [ 0.91]( 0.53) 0.98 [ -1.58]( 0.32)
>
>
> ==================================================================
> Test : stream-10
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
> Copy 1.00 [ 0.00]( 4.37) 0.97 [ -2.82]( 8.57) 0.99 [ -1.31]( 6.75) 0.97 [ -3.34]( 6.18)
> Scale 1.00 [ 0.00]( 2.75) 0.99 [ -0.73]( 3.62) 0.99 [ -0.86]( 3.73) 0.99 [ -1.49]( 5.39)
> Add 1.00 [ 0.00]( 3.54) 0.98 [ -2.40]( 3.99) 0.98 [ -1.51]( 4.12) 0.97 [ -3.27]( 6.28)
> Triad 1.00 [ 0.00]( 4.41) 0.98 [ -1.71]( 7.00) 1.01 [ 0.55]( 3.77) 0.96 [ -4.32]( 7.49)
>
>
> ==================================================================
> Test : stream-100
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
> Copy 1.00 [ 0.00]( 3.25) 0.96 [ -4.08]( 3.07) 0.98 [ -1.56]( 3.45) 0.97 [ -2.74]( 2.00)
> Scale 1.00 [ 0.00]( 1.49) 0.98 [ -2.25]( 4.13) 0.98 [ -1.86]( 4.32) 0.99 [ -1.19]( 1.43)
> Add 1.00 [ 0.00]( 1.75) 1.00 [ -0.47]( 2.17) 1.00 [ -0.14]( 1.31) 0.99 [ -0.81]( 2.26)
> Triad 1.00 [ 0.00]( 1.95) 0.97 [ -2.82]( 4.63) 0.95 [ -4.65]( 6.59) 0.97 [ -2.80]( 4.84)
>
>
> ==================================================================
> Test : netperf
> Units : Normalized Throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
> 1-clients 1.00 [ 0.00]( 0.25) 0.98 [ -1.51]( 0.56) 0.99 [ -1.37]( 0.32) 0.98 [ -1.91]( 0.38)
> 2-clients 1.00 [ 0.00]( 0.39) 0.99 [ -1.26]( 1.05) 0.99 [ -0.99]( 0.75) 0.98 [ -2.16]( 0.57)
> 4-clients 1.00 [ 0.00]( 0.67) 0.99 [ -0.73]( 0.68) 1.00 [ -0.22]( 0.46) 0.98 [ -1.70]( 0.30)
> 8-clients 1.00 [ 0.00]( 0.46) 0.99 [ -1.09]( 0.50) 1.00 [ -0.27]( 0.44) 0.98 [ -1.84]( 0.59)
> 16-clients 1.00 [ 0.00]( 0.76) 0.99 [ -0.79]( 0.48) 1.00 [ -0.24]( 1.35) 0.99 [ -1.31]( 0.74)
> 32-clients 1.00 [ 0.00]( 0.82) 0.99 [ -0.91]( 0.80) 1.00 [ -0.04]( 1.16) 0.99 [ -1.27]( 0.83)
> 64-clients 1.00 [ 0.00]( 1.63) 0.99 [ -0.97]( 1.37) 1.00 [ 0.13]( 1.47) 0.99 [ -1.17]( 1.60)
> 128-clients 1.00 [ 0.00]( 1.30) 0.99 [ -1.07]( 1.42) 0.99 [ -0.92]( 1.41) 0.98 [ -1.77]( 1.19)
> 256-clients 1.00 [ 0.00]( 5.43) 1.02 [ 1.53]( 6.74) 1.02 [ 1.54]( 3.40) 1.00 [ 0.25]( 6.01)
> 512-clients 1.00 [ 0.00](55.62) 1.00 [ -0.25](54.85) 0.98 [ -1.91](52.43) 0.98 [ -1.88](51.45)
>
>
> ==================================================================
> Test : schbench
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
> 1 1.00 [ -0.00]( 2.50) 1.00 [ -0.00](35.19) 0.88 [ 12.50](31.97) 0.95 [ 5.00](33.07)
> 2 1.00 [ -0.00]( 8.58) 1.02 [ -2.44]( 6.45) 1.02 [ -2.44]( 9.52) 1.00 [ -0.00]( 2.44)
> 4 1.00 [ -0.00]( 7.36) 1.02 [ -2.22]( 3.30) 0.98 [ 2.22]( 8.29) 1.02 [ -2.22](13.95)
> 8 1.00 [ -0.00]( 8.73) 1.10 [ -9.62]( 9.02) 1.06 [ -5.77]( 6.68) 1.04 [ -3.85]( 6.46)
> 16 1.00 [ -0.00]( 4.34) 1.05 [ -4.84]( 4.01) 1.03 [ -3.23]( 1.82) 1.06 [ -6.45]( 4.07)
> 32 1.00 [ -0.00]( 3.27) 1.06 [ -6.19]( 4.01) 0.99 [ 1.03]( 2.08) 1.00 [ -0.00]( 2.06)
> 64 1.00 [ -0.00]( 2.05) 1.01 [ -1.02]( 1.27) 1.01 [ -1.02]( 5.11) 0.91 [ 9.18]( 6.51)
> 128 1.00 [ -0.00]( 6.08) 0.95 [ 5.49]( 4.91) 1.09 [ -8.59]( 8.22) 1.08 [ -7.88](11.81)
> 256 1.00 [ -0.00]( 3.28) 0.94 [ 6.24]( 4.22) 1.04 [ -3.72]( 6.18) 1.04 [ -4.10]( 3.62)
> 512 1.00 [ -0.00]( 2.23) 0.98 [ 2.29]( 1.92) 0.98 [ 1.90]( 6.93) 1.02 [ -1.90]( 1.51)
>
>
> ==================================================================
> Test : new-schbench-requests-per-second
> Units : Normalized Requests per second
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
> 1 1.00 [ 0.00]( 0.14) 1.00 [ 0.00]( 0.29) 1.00 [ 0.00]( 0.14) 0.99 [ -0.56]( 0.91)
> 2 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.14) 1.00 [ 0.00]( 0.00)
> 4 1.00 [ 0.00]( 0.14) 1.00 [ 0.00]( 0.14) 1.00 [ 0.28]( 0.14) 1.00 [ 0.28]( 0.14)
> 8 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
> 16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
> 32 1.00 [ 0.00]( 4.75) 1.01 [ 1.13]( 0.29) 1.00 [ 0.00]( 3.77) 0.99 [ -0.57]( 0.51)
> 64 1.00 [ 0.00]( 1.17) 1.01 [ 0.69](13.90) 1.00 [ 0.00](13.33) 1.01 [ 0.69](13.35)
> 128 1.00 [ 0.00]( 0.00) 1.00 [ 0.34]( 0.18) 1.01 [ 0.68]( 0.00) 1.00 [ 0.34]( 0.18)
> 256 1.00 [ 0.00]( 0.56) 1.00 [ -0.49]( 1.24) 1.00 [ 0.25]( 1.47) 1.01 [ 0.99]( 1.20)
> 512 1.00 [ 0.00]( 0.96) 1.00 [ -0.37]( 0.88) 1.00 [ -0.37]( 1.58) 1.00 [ -0.37]( 0.88)
>
>
> ==================================================================
> Test : new-schbench-wakeup-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
> 1 1.00 [ -0.00](24.81) 0.75 [ 25.00](24.12) 0.67 [ 33.33]( 6.74) 0.67 [ 33.33](11.18)
> 2 1.00 [ -0.00]( 4.08) 0.77 [ 23.08]( 9.68) 0.92 [ 7.69](21.56) 0.77 [ 23.08]( 8.94)
> 4 1.00 [ -0.00]( 0.00) 1.08 [ -7.69](10.00) 0.85 [ 15.38]( 9.99) 0.85 [ 15.38]( 8.13)
> 8 1.00 [ -0.00](12.91) 1.09 [ -9.09]( 4.43) 0.82 [ 18.18](19.99) 0.82 [ 18.18](23.66)
> 16 1.00 [ -0.00](12.06) 1.18 [-18.18]( 8.37) 1.18 [-18.18](15.10) 1.18 [-18.18](15.10)
> 32 1.00 [ -0.00](22.13) 1.00 [ -0.00]( 5.00) 1.10 [-10.00](19.86) 1.00 [ -0.00]( 5.34)
> 64 1.00 [ -0.00](11.07) 1.00 [ -0.00](16.90) 0.92 [ 7.69](15.49) 1.00 [ -0.00](13.62)
> 128 1.00 [ -0.00]( 9.04) 0.98 [ 2.48]( 3.01) 0.99 [ 1.49]( 6.96) 0.98 [ 1.98]( 5.42)
> 256 1.00 [ -0.00]( 0.24) 1.00 [ -0.00]( 0.00) 1.00 [ -0.24]( 0.12) 1.00 [ -0.24]( 0.32)
> 512 1.00 [ -0.00]( 0.34) 1.00 [ -0.00]( 0.40) 1.00 [ 0.38]( 0.34) 0.99 [ 1.15]( 0.20)
>
>
> ==================================================================
> Test : new-schbench-request-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) push_only[pct imp](CV) newidle_only[pct imp](CV) push+newidle[pct imp](CV)
> 1 1.00 [ -0.00]( 0.90) 0.99 [ 0.84]( 1.82) 0.99 [ 0.56]( 1.10) 1.03 [ -2.53]( 1.88)
> 2 1.00 [ -0.00]( 0.00) 1.01 [ -0.57]( 0.29) 1.01 [ -0.86]( 0.81) 1.02 [ -2.28]( 1.04)
> 4 1.00 [ -0.00]( 1.02) 0.98 [ 1.69]( 0.15) 0.99 [ 0.84]( 1.02) 1.01 [ -0.84]( 1.67)
> 8 1.00 [ -0.00]( 0.15) 1.01 [ -0.57]( 0.51) 1.00 [ -0.00]( 0.26) 1.00 [ -0.00]( 0.39)
> 16 1.00 [ -0.00]( 0.53) 1.01 [ -0.57]( 0.64) 1.00 [ -0.29]( 0.39) 1.01 [ -0.86]( 0.81)
> 32 1.00 [ -0.00](35.40) 0.98 [ 1.62]( 0.49) 0.99 [ 0.81](10.03) 1.00 [ -0.00]( 0.48)
> 64 1.00 [ -0.00]( 5.24) 0.92 [ 7.82](26.28) 1.03 [ -2.52]( 6.65) 0.62 [ 38.02](32.78)
> 128 1.00 [ -0.00]( 2.02) 0.99 [ 0.75]( 1.40) 1.16 [-16.14]( 2.15) 1.17 [-16.89]( 3.16)
> 256 1.00 [ -0.00]( 3.41) 0.96 [ 4.08]( 3.32) 1.07 [ -7.13]( 2.60) 1.10 [ -9.94]( 4.96)
> 512 1.00 [ -0.00]( 1.45) 1.00 [ 0.43]( 2.77) 0.99 [ 1.06]( 0.73) 0.98 [ 1.92]( 0.40)
>
>
> ==================================================================
> Test : Various longer running benchmarks
> Units : %diff in throughput reported
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> Benchmarks: push_only newidle_only push+newidle
> ycsb-cassandra -3% -3% -1%
> ycsb-mongodb -2% -2% -1%
> deathstarbench-1x 24% 16%
> deathstarbench-2x 12% 14%
> deathstarbench-3x 17% 14%
> deathstarbench-6x
>
>
> LPC 2025
> ========
>
> Further experiments carried out will be discussed at LPC'25 in the
> "Scheduler and Real-Time MC" between 11:08AM and 11:30AM on 11th
> December, 2025 in Hall B4.
>
> Discussion points include:
>
> o "sd->shared" assignment optimization.
> o "nohz.idle_cpus" mask optimization
> o Busy balance CPU rotation.
> o Effective detection of when it is favorable to push tasks.
> o The overheads of maintaining masks (even with optimizations).
> o The delicate dance of newidle balance.
>
> Please do drop by, or reach out to me directly if this work interests
> you.
>
>
> Changelog
> =========
>
> This series is based on tip:sched/core at commit 3eb593560146 ("Merge
> tag 'v6.18-rc7' into sched/core, to pick up fixes"). All the comparisons
> above are done with the same.
>
> o rfc v1.. rfc v2
>
> - Collected tags on Patch 1 from Shrikanth (Thanks a ton for the review)
>
> - Added a for_each_cpu_and_wrap() and cleaned up a couple of sites using
> the newly introduced macro.
>
> - Simplified conditions that referenced per-CPU "llc_size" and
> "sd->shared" using the fact that only sd_llc has sd->shared assigned.
>
> - Merged the two series; however, the idea is largely the same. Push
> based load balancing is guarded behind CONFIG_NO_HZ_COMMON since a
> bunch of NO_HZ_COMMON specific bits were put behind the config option.
>
> - Idea of overloaded_mask was dropped since the overhead to maintain
> the mask (without any usage) was visible in many benchmark results.
>
> - Idea of shallow_idle_cpus mask was dropped since the overhead to
> maintain the mask (without any usage) was visible in benchmarks like
> tbench that left the CPUs idle for very short duration.
>
> - Added the patch to rotate the "busy_balance_cpu".
>
> - Renamed "idle_cpus_mask" to "nohz_idle_cpus_mask" in anticipation of
> adding the "shallow_idle_cpus" mask which didn't pan out.
>
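On the for_each_cpu_and_wrap() item in the changelog above: going by the name, I assume it visits the CPUs set in both masks, starting from a given CPU and wrapping around once. A userspace sketch of that assumed behaviour follows. cpus_and_wrap() is a hypothetical helper over 64-bit masks, not the macro from the patch, and the real signature may differ.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 64

/*
 * Collect every CPU set in both @mask1 and @mask2 into @out, in the order
 * they are visited when starting at @start and wrapping around once.
 * Returns the number of CPUs written to @out.
 */
static int cpus_and_wrap(uint64_t mask1, uint64_t mask2, int start,
			 int out[NR_CPUS])
{
	uint64_t both = mask1 & mask2;
	int n = 0;

	assert(start >= 0 && start < NR_CPUS);

	for (int i = 0; i < NR_CPUS; i++) {
		int cpu = (start + i) % NR_CPUS;

		if (both & (UINT64_C(1) << cpu))
			out[n++] = cpu;
	}
	return n;
}
```

For masks {1, 2, 4, 5, 7} and {0, 2, 4, 5, 6} with a start of 4, this visits CPUs 4, 5 and then 2.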
>
> Note: Patches marked EXPERIMENTAL may be incomplete from a schedstats
> accounting standpoint.
>
> ---
> K Prateek Nayak (28):
> sched/fair: Simplify set_cpu_sd_state_*() with guards
> sched/fair: Use rq->nohz_tick_stopped in update_nohz_stats()
> sched/topology: Optimize sd->shared allocation and assignment
> sched/fair: Simplify the entry condition for update_idle_cpu_scan()
> sched/fair: Simplity SIS_UTIL handling in select_idle_cpu()
> cpumask: Introduce for_each_cpu_and_wrap() and bitfield helpers
> sched/fair: Use for_each_cpu_and_wrap() in select_idle_capacity()
> sched/fair: Use for_each_cpu_and_wrap() in select_idle_cpu()
> sched/fair: Rotate the CPU resposible for busy load balancing
> sched/fair: Use xchg() to set sd->nohz_idle state
> sched/topology: Attach new hierarchy in rq_attach_root()
> sched/fair: Fixup sd->nohz_idle state during hotplug / cpuset
> sched/fair: Account idle cpus instead of busy cpus in sd->shared
> sched/topology: Introduce fallback sd->shared assignment
> sched/topology: Introduce percpu sd_nohz for nohz state tracking
> sched/topology: Introduce "nohz_idle_cpus_mask" in sd->shared
> sched/topology: Introduce "nohz_shared_list" to keep track of
> sd->shared
> sched/fair: Reorder the barrier in nohz_balance_enter_idle()
> sched/fair: Extract the main _nohz_idle_balance() loop into a helper
> sched/fair: Convert find_new_ilb() to use nohz_shared_list
> sched/fair: Introduce sched_asym_prefer_idle() for ILB kick
> sched/fair: Convert sched_balance_nohz_idle() to use nohz_shared_list
> sched/fair: Remove "nohz.idle_cpus_mask"
> sched/fair: Optimize global "nohz.nr_cpus" tracking
> sched/topology: Add basic debug information for "nohz_shared_list"
> [EXPERIMENTAL] sched/fair: Proactive idle balance using push mechanism
> [EXPERIMENTAL] sched/fair: Add a local counter to rate limit task push
> [EXPERIMENTAL] sched/fair: Faster alternate for intra-NUMA newidle
> balance
>
> Vincent Guittot (1):
> [EXPERIMENTAL] sched/fair: Add push task framework
>
> include/linux/cpumask.h | 20 +
> include/linux/find.h | 37 ++
> include/linux/sched/topology.h | 18 +-
> kernel/sched/core.c | 4 +-
> kernel/sched/fair.c | 828 ++++++++++++++++++++++++++-------
> kernel/sched/sched.h | 10 +-
> kernel/sched/topology.c | 386 +++++++++++++--
> 7 files changed, 1076 insertions(+), 227 deletions(-)
>
>
> base-commit: 3eb59356014674fa1b506a122cc59b57089a4d08