Message-ID: <20250904041516.3046-1-kprateek.nayak@amd.com>
Date: Thu, 4 Sep 2025 04:14:56 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Anna-Maria Behnsen <anna-maria@...utronix.de>,
Frederic Weisbecker <frederic@...nel.org>, Thomas Gleixner
<tglx@...utronix.de>, <linux-kernel@...r.kernel.org>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, K Prateek Nayak
<kprateek.nayak@....com>, "Gautham R. Shenoy" <gautham.shenoy@....com>,
Swapnil Sapkal <swapnil.sapkal@....com>
Subject: [RFC PATCH 00/19] sched/fair: Distributed nohz idle CPU tracking for idle load balancing
Hello folks,
Introduction
============
Atomic operations on a single global variable are costly and may result
in noticeable performance overhead, as highlighted by the reports in
[1][2][3]. When reviewing the push based load balancing series [4],
Peter noted that "nohz.idle_cpus_mask" and "nohz.nr_cpus" can become
the next point of contention, and favored the idea of splitting the
global cpumask and the idle state indicator into per-LLC ones.
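For reference, the current centralized bookkeeping in
kernel/sched/fair.c looks roughly as follows (a simplified sketch;
early-exit and tick-stopped checks are omitted):

static struct {
	cpumask_var_t idle_cpus_mask;
	atomic_t nr_cpus;
	/* other fields omitted */
} nohz ____cacheline_aligned;

void nohz_balance_enter_idle(int cpu)
{
	/*
	 * Every CPU entering nohz idle writes to the same global
	 * cpumask and atomic counter; on large systems the resulting
	 * cacheline bouncing shows up as the contention reported in
	 * [1][2][3].
	 */
	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
	atomic_inc(&nohz.nr_cpus);
}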
Global indicators have a few key advantages, namely:
- They are independent of the sd hierarchy and are not affected by
  hotplug or cpuset activity.
- The implementation is simple: both setting the signal and traversing
  the cpumask are straightforward.
However, a distributed tracking infrastructure can significantly
reduce contention when this data is frequently accessed and modified,
which will be the case when it is used to implement a push-based load
balancing mechanism for the fair class based on the RFC [5].
Implementation
==============
There are a few ways to split the centralized nohz tracking using the
sched_domain topology, namely:
1. Maintain a cpumask per sd_llc_shared and, starting at sd_llc,
   propagate a signal up the sched domain hierarchy to indicate the
   presence of nohz idle CPUs in the hierarchy (a hypothetical sketch
   follows the pros/cons below).
o PROS:
- Distributed tracking
- Atomic updates past the LLC level are only done at the boundary
conditions (#idle_cpus in LLC goes from 0 -> 1 or back from 1 -> 0)
- During hotplug, the new hierarchy is automatically initialized with
  the correct nohz idle states (idle_cpus, idle_cpus_mask).
o CONS:
- If there are multiple PKG/NODE/NUMA levels above the MC domain,
  there can be multiple atomic operations, and in exotic topologies
  with single-CPU groups, a single CPU transitioning in and out of
  nohz idle frequently can cause a storm of these atomic operations
  and interfere with other CPUs in the same hierarchy.
- Requires constructing a sd_llc_shared hierarchy to get the full
view of the system.
- An equivalent view of "nohz.idle_cpus_mask" needs to be constructed
by traversing the sd hierarchy.
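A hypothetical sketch of idea 1 follows; the "nr_nohz_idle" field and
the helper name are illustrative (not from this series) and assume
every level above the LLC has a valid sd->shared:

static void nohz_idle_enter_hier(struct sched_domain *llc_sd, int cpu)
{
	struct sched_domain_shared *sds = llc_sd->shared;
	struct sched_domain *sd;

	cpumask_set_cpu(cpu, sds->idle_cpus_mask);

	/* Only the LLC's 0 -> 1 transition is propagated upwards. */
	if (atomic_inc_return(&sds->nr_nohz_idle) != 1)
		return;

	/*
	 * With multiple PKG/NODE/NUMA levels above the MC domain,
	 * this walk is where the storm of atomic operations noted
	 * above can originate.
	 */
	for (sd = llc_sd->parent; sd; sd = sd->parent)
		atomic_inc(&sd->shared->nr_nohz_idle);
}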
2. [Implemented in this series] Maintain a cpumask per sd_llc_shared
   and keep track of all sd_llc_shared instances in a global list.
o PROS:
- Distributed tracking
- Atomic updates to the central "nohz.nr_cpus" tracking are only done
  at the boundary conditions, i.e. when the number of idle CPUs in the
  LLC goes from 0 -> 1 or back from 1 -> 0.
o CONS:
- Maintaining the central list of all the "sd->shared" instances
  tracking the nohz idle states of CPUs in a domain adds more
  complication in the topology bits.
- Correcting the nohz signals during hotplug or cpuset activity is
tricky since the local "sd->shared" data can influence the global
"nohz.nr_doms" indicator.
- An equivalent view of "nohz.idle_cpus_mask" needs to be constructed
by traversing the global RCU protected "nohz_shared_list".
Since hotplug and cpuset activities are rare, the latter approach is
implemented, as it saves on the atomic operations that would otherwise
happen far more frequently.
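A sketch of the implemented scheme (per-LLC state plus a global
RCU-protected list; the list member name "nohz_list" and the per-LLC
counter name "nr_idle_cpus" are illustrative, the exact layout in the
series may differ):

static void nohz_idle_enter_llc(struct sched_domain_shared *sds, int cpu)
{
	cpumask_set_cpu(cpu, sds->idle_cpus_mask);

	/* Only the LLC's 0 -> 1 transition touches the global count. */
	if (atomic_inc_return(&sds->nr_idle_cpus) == 1)
		atomic_inc(&nohz.nr_doms);
}

/*
 * An equivalent of the old "nohz.idle_cpus_mask" view is reconstructed
 * by walking the global list, e.g. when looking for an ILB target.
 */
static int find_nohz_idle_cpu(void)
{
	struct sched_domain_shared *sds;
	int cpu;

	guard(rcu)();
	list_for_each_entry_rcu(sds, &nohz_shared_list, nohz_list) {
		cpu = cpumask_first(sds->idle_cpus_mask);
		if (cpu < nr_cpu_ids)
			return cpu;
	}

	return nr_cpu_ids;
}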
Structure
=========
The series has been divided as follows:
Patch 01-03 - Trivial cleanup and optimization of the current
infrastructure.
Patch 04-08 - Preparation in topology.c and the idle tracking infrastructure
to support the distributed tracking.
Patch 09-11 - Introduce new members for distributed nohz tracking.
Patch 12-13 - Preparation in fair.c to convert users of
"nohz.idle_cpus_mask" to use the new infrastructure.
Patch 14-17 - Convert the bits that use the centralized nohz tracking to
use the distributed version.
Patch 18 - Convert "nohz.nr_cpus" to "nohz.nr_doms" and optimize the
central tracking to only modify the global count at the
boundary condition.
Patch 19 - Simple debug patch for sanity testing.
Benchmarking
============
Following are the results from benchmarking on a dual socket 3rd
Generation EPYC system with 128C/256T (Boost on, C2 disabled):
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) sd_nohz[pct imp](CV)
1-groups 1.00 [ -0.00]( 8.58) 0.96 [ 3.99]( 5.47)
2-groups 1.00 [ -0.00]( 3.53) 1.00 [ -0.00]( 3.57)
4-groups 1.00 [ -0.00]( 1.63) 0.98 [ 1.88]( 1.46)
8-groups 1.00 [ -0.00]( 1.80) 0.99 [ 0.68]( 2.12)
16-groups 1.00 [ -0.00]( 2.95) 1.01 [ -0.75]( 1.78)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) sd_nohz[pct imp](CV)
1 1.00 [ 0.00]( 1.34) 1.02 [ 2.00]( 0.23)
2 1.00 [ 0.00]( 0.65) 1.01 [ 1.22]( 0.18)
4 1.00 [ 0.00]( 0.58) 1.00 [ 0.17]( 0.77)
8 1.00 [ 0.00]( 0.55) 1.01 [ 0.86]( 0.16)
16 1.00 [ 0.00]( 0.52) 0.99 [ -0.59]( 0.66)
32 1.00 [ 0.00]( 1.27) 0.99 [ -1.40]( 2.55)
64 1.00 [ 0.00]( 1.60) 1.00 [ -0.09]( 2.04)
128 1.00 [ 0.00]( 0.14) 1.02 [ 1.55]( 0.79)
256 1.00 [ 0.00]( 0.75) 1.01 [ 1.00]( 1.28)
512 1.00 [ 0.00]( 0.18) 1.01 [ 0.60]( 0.36)
1024 1.00 [ 0.00]( 0.05) 1.01 [ 1.47]( 0.44)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) sd_nohz[pct imp](CV)
Copy 1.00 [ 0.00](10.72) 0.97 [ -2.76](11.31)
Scale 1.00 [ 0.00]( 5.00) 0.97 [ -3.14]( 7.20)
Add 1.00 [ 0.00]( 5.75) 0.95 [ -4.71]( 7.33)
Triad 1.00 [ 0.00]( 5.83) 0.97 [ -2.60](10.21)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) sd_nohz[pct imp](CV)
Copy 1.00 [ 0.00]( 2.31) 1.01 [ 0.72]( 1.21)
Scale 1.00 [ 0.00]( 4.58) 1.00 [ 0.44]( 4.69)
Add 1.00 [ 0.00]( 1.12) 0.99 [ -0.54]( 4.21)
Triad 1.00 [ 0.00]( 2.21) 0.98 [ -1.82]( 5.94)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) sd_nohz[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.86) 1.01 [ 0.71]( 0.45)
2-clients 1.00 [ 0.00]( 0.56) 1.01 [ 0.86]( 0.63)
4-clients 1.00 [ 0.00]( 0.50) 1.01 [ 0.74]( 0.48)
8-clients 1.00 [ 0.00]( 0.70) 1.01 [ 0.61]( 0.39)
16-clients 1.00 [ 0.00]( 0.44) 1.00 [ 0.49]( 0.57)
32-clients 1.00 [ 0.00]( 0.54) 1.00 [ 0.16]( 0.90)
64-clients 1.00 [ 0.00]( 1.66) 1.00 [ 0.45]( 1.42)
128-clients 1.00 [ 0.00]( 1.12) 1.01 [ 0.57]( 0.91)
256-clients 1.00 [ 0.00]( 3.98) 0.98 [ -1.70]( 4.77)
512-clients 1.00 [ 0.00](51.74) 0.97 [ -2.58](44.26)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sd_nohz[pct imp](CV)
1 1.00 [ -0.00](15.32) 1.00 [ -0.00]( 8.79)
2 1.00 [ -0.00](19.67) 1.00 [ -0.00](10.14)
4 1.00 [ -0.00](13.01) 1.06 [ -6.38](15.61)
8 1.00 [ -0.00](10.33) 0.98 [ 1.79]( 3.21)
16 1.00 [ -0.00]( 5.00) 1.02 [ -1.67]( 2.79)
32 1.00 [ -0.00]( 1.05) 1.01 [ -1.05]( 8.56)
64 1.00 [ -0.00]( 2.37) 0.94 [ 5.67]( 0.83)
128 1.00 [ -0.00](13.58) 0.90 [ 9.98]( 6.67)
256 1.00 [ -0.00](34.92) 1.88 [-88.43](12.68)
512 1.00 [ -0.00]( 1.72) 1.03 [ -2.57]( 1.26)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sd_nohz[pct imp](CV)
1 1.00 [ 0.00]( 0.15) 1.00 [ -0.30]( 0.55)
2 1.00 [ 0.00]( 0.30) 1.01 [ 0.59]( 0.15)
4 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.15)
8 1.00 [ 0.00]( 0.15) 1.00 [ 0.29]( 0.00)
16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
32 1.00 [ 0.00]( 3.80) 1.05 [ 4.86]( 0.55)
64 1.00 [ 0.00]( 0.20) 0.99 [ -0.76]( 5.08)
128 1.00 [ 0.00]( 0.20) 1.00 [ 0.38]( 0.20)
256 1.00 [ 0.00]( 0.70) 1.02 [ 2.50]( 0.63)
512 1.00 [ 0.00]( 0.25) 1.01 [ 0.95]( 0.00)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sd_nohz[pct imp](CV)
1 1.00 [ -0.00](13.47) 1.14 [-14.29](11.92)
2 1.00 [ -0.00](16.40) 1.12 [-12.50]( 0.00)
4 1.00 [ -0.00]( 9.94) 0.89 [ 11.11](11.92)
8 1.00 [ -0.00]( 5.53) 1.11 [-11.11](14.13)
16 1.00 [ -0.00](13.22) 1.00 [ -0.00]( 0.00)
32 1.00 [ -0.00](11.71) 0.83 [ 16.67]( 0.00)
64 1.00 [ -0.00]( 3.87) 1.08 [ -7.69]( 6.39)
128 1.00 [ -0.00]( 3.51) 0.98 [ 1.56]( 4.83)
256 1.00 [ -0.00]( 4.91) 1.18 [-17.95]( 8.31)
512 1.00 [ -0.00]( 0.20) 1.00 [ -0.39]( 0.20)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sd_nohz[pct imp](CV)
1 1.00 [ -0.00]( 0.13) 0.98 [ 1.78]( 2.11)
2 1.00 [ -0.00]( 0.13) 0.94 [ 6.11]( 2.83)
4 1.00 [ -0.00]( 2.40) 0.96 [ 3.82]( 1.21)
8 1.00 [ -0.00]( 0.00) 0.96 [ 3.82]( 0.55)
16 1.00 [ -0.00]( 2.88) 0.99 [ 1.32]( 1.20)
32 1.00 [ -0.00](12.96) 0.89 [ 11.17]( 2.34)
64 1.00 [ -0.00]( 4.80) 0.99 [ 0.88]( 5.56)
128 1.00 [ -0.00]( 2.49) 1.02 [ -1.81]( 2.72)
256 1.00 [ -0.00]( 3.95) 1.12 [-11.52]( 7.78)
512 1.00 [ -0.00]( 0.88) 1.00 [ -0.00]( 1.02)
==================================================================
Test : Various longer running benchmarks
Units : %diff in throughput reported
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmarks: %diff
ycsb-cassandra 1.36%
ycsb-mongodb -1.66%
deathstarbench-1x 0.04%
deathstarbench-2x -0.10%
deathstarbench-3x 3.84%
deathstarbench-6x 1.54%
hammerdb+mysql 16VU 0.03%
hammerdb+mysql 64VU 1.22%
The schbench datapoints were rerun, and the regressions turned out to
be mostly the result of run-to-run variance. All the datapoints show
noticeable variance, and the results can swing either way depending on
the noise in the system.
References
==========
[1] https://lore.kernel.org/lkml/20240531205452.65781-1-tim.c.chen@linux.intel.com/
[2] https://lore.kernel.org/lkml/20250416035823.1846307-1-tim.c.chen@linux.intel.com/
[3] https://lore.kernel.org/lkml/20250423174634.3009657-1-edumazet@google.com/
[4] https://lore.kernel.org/lkml/20250410102945.GD30687@noisy.programming.kicks-ass.net/
[5] https://lore.kernel.org/lkml/20250409111539.23791-1-kprateek.nayak@amd.com/
This series was tested on tip:sched/core at commit 1b5f1454091e
("sched/idle: Remove play_idle()") with commit 99b773d720ae ("sched/psi:
Fix psi_seq initialization") from v6.17-rc1 cherry-picked on top.
CONFIG_HZ_PERIODIC was only build and boot tested.

The series is based on tip:sched/core at commit 5b726e9bf954
("sched/fair: Get rid of throttled_lb_pair()").
Special thanks to Gautham for his help with the tricky bits and an early
review of the series.
---
K Prateek Nayak (19):
sched/fair: Simplify set_cpu_sd_state_*() with guards
sched/topology: Optimize sd->shared allocation and assignment
sched/fair: Use rq->nohz_tick_stopped in update_nohz_stats()
sched/fair: Use xchg() to set sd->nohz_idle state
sched/topology: Attach new hierarchy in rq_attach_root()
sched/fair: Fixup sd->nohz_idle state during hotplug / cpuset
sched/fair: Account idle cpus instead of busy cpus in sd->shared
sched/topology: Introduce fallback sd->shared assignment
sched/topology: Introduce percpu sd_nohz for nohz state tracking
sched/topology: Introduce "idle_cpus_mask" in sd->shared
sched/topology: Introduce "nohz_shared_list" to keep track of
sd->shared
sched/fair: Reorder the barrier in nohz_balance_enter_idle()
sched/fair: Extract the main _nohz_idle_balance() loop into a helper
sched/fair: Convert find_new_ilb() to use nohz_shared_list
sched/fair: Introduce sched_asym_prefer_idle() for ILB kick
sched/fair: Convert sched_balance_nohz_idle() to use nohz_shared_list
sched/fair: Remove "nohz.idle_cpus_mask"
sched/fair: Optimize global "nohz.nr_cpus" tracking
sched/topology: Add basic debug information for "nohz_shared_list"
include/linux/sched/topology.h | 14 +-
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 432 ++++++++++++++++++++++++---------
kernel/sched/sched.h | 6 +-
kernel/sched/topology.c | 381 +++++++++++++++++++++++++----
5 files changed, 663 insertions(+), 172 deletions(-)
base-commit: 5b726e9bf9544a349090879a513a5e00da486c14
--
2.34.1