Message-ID: <20250904041516.3046-1-kprateek.nayak@amd.com>
Date: Thu, 4 Sep 2025 04:14:56 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Anna-Maria Behnsen <anna-maria@...utronix.de>,
Frederic Weisbecker <frederic@...nel.org>, Thomas Gleixner
<tglx@...utronix.de>, <linux-kernel@...r.kernel.org>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, K Prateek Nayak
<kprateek.nayak@....com>, "Gautham R. Shenoy" <gautham.shenoy@....com>,
Swapnil Sapkal <swapnil.sapkal@....com>
Subject: [RFC PATCH 00/19] sched/fair: Distributed nohz idle CPU tracking for idle load balancing
Hello folks,
Introduction
============
Atomic operations on a single global variable are costly and may result
in noticeable performance overhead, as highlighted by the reports in
[1][2][3]. When reviewing the push based load balancing series [4],
Peter noted that "nohz.idle_cpus_mask" and "nohz.nr_cpus" can become
the next point of contention, and favored the idea of splitting the
global cpumask and the idle state indicator into per-LLC ones.
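For reference, the current centralized bookkeeping in
kernel/sched/fair.c looks roughly as follows (a simplified sketch;
early-exit and tick-stopped checks are omitted):

static struct {
	cpumask_var_t idle_cpus_mask;
	atomic_t nr_cpus;
	/* other fields omitted */
} nohz ____cacheline_aligned;

void nohz_balance_enter_idle(int cpu)
{
	/*
	 * Every CPU entering nohz idle writes to the same global
	 * cpumask and atomic counter; on large systems the resulting
	 * cacheline bouncing shows up as the contention reported in
	 * [1][2][3].
	 */
	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
	atomic_inc(&nohz.nr_cpus);
}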
Global indicators have a few key advantages, namely:
- They are independent of the sd hierarchy and are not affected by
  hotplug or cpuset activity.
- The implementation is simple: both setting the signal and traversing
  the cpumask are straightforward.
However, a distributed tracking infrastructure can significantly
reduce contention when this data is frequently accessed and modified,
which will be the case when it is used to implement a push-based load
balancing mechanism for the fair class based on the RFC [5].
Implementation
==============
There are a few ways to split the centralized nohz tracking using the
sched_domain topology, namely:
1. Maintain a cpumask per sd_llc_shared and, starting at sd_llc,
   propagate a signal up the sched domain hierarchy to indicate the
   presence of nohz idle CPUs in the hierarchy (a hypothetical sketch
   follows the pros/cons below).
o PROS:
- Distributed tracking
- Atomic updates past the LLC level are only done at the boundary
conditions (#idle_cpus in LLC goes from 0 -> 1 or back from 1 -> 0)
- During hotplug, the new hierarchy is automatically initialized with
  the correct nohz idle states (idle_cpus, idle_cpus_mask).
o CONS:
- If there are multiple PKG/NODE/NUMA levels above the MC domain,
  there can be multiple atomic operations, and in exotic topologies
  with single-CPU groups, a single CPU transitioning in and out of
  nohz idle frequently can cause a storm of these atomic operations
  and interfere with other CPUs in the same hierarchy.
- Requires constructing a sd_llc_shared hierarchy to get the full
view of the system.
- An equivalent view of "nohz.idle_cpus_mask" needs to be constructed
by traversing the sd hierarchy.
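A hypothetical sketch of idea 1 follows; the "nr_nohz_idle" field and
the helper name are illustrative (not from this series) and assume
every level above the LLC has a valid sd->shared:

static void nohz_idle_enter_hier(struct sched_domain *llc_sd, int cpu)
{
	struct sched_domain_shared *sds = llc_sd->shared;
	struct sched_domain *sd;

	cpumask_set_cpu(cpu, sds->idle_cpus_mask);

	/* Only the LLC's 0 -> 1 transition is propagated upwards. */
	if (atomic_inc_return(&sds->nr_nohz_idle) != 1)
		return;

	/*
	 * With multiple PKG/NODE/NUMA levels above the MC domain,
	 * this walk is where the storm of atomic operations noted
	 * above can originate.
	 */
	for (sd = llc_sd->parent; sd; sd = sd->parent)
		atomic_inc(&sd->shared->nr_nohz_idle);
}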
2. [Implemented in this series] Maintain a cpumask per sd_llc_shared
   and keep track of all sd_llc_shared instances in a global list.
o PROS:
- Distributed tracking
- Atomic updates to the central "nohz.nr_cpus" tracking are only done
  at the boundary conditions, i.e. when the number of idle CPUs in the
  LLC goes from 0 -> 1 or back from 1 -> 0.
o CONS:
- Maintaining the central list of all the "sd->shared" instances
  tracking the nohz idle states of CPUs in a domain adds more
  complication in the topology bits.
- Correcting the nohz signals during hotplug or cpuset activity is
tricky since the local "sd->shared" data can influence the global
"nohz.nr_doms" indicator.
- An equivalent view of "nohz.idle_cpus_mask" needs to be constructed
by traversing the global RCU protected "nohz_shared_list".
Since hotplug and cpuset activities are rare, the latter approach is
implemented, as it saves on the atomic operations that would otherwise
happen far more frequently.
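A sketch of the implemented scheme (per-LLC state plus a global
RCU-protected list; the list member name "nohz_list" and the per-LLC
counter name "nr_idle_cpus" are illustrative, the exact layout in the
series may differ):

static void nohz_idle_enter_llc(struct sched_domain_shared *sds, int cpu)
{
	cpumask_set_cpu(cpu, sds->idle_cpus_mask);

	/* Only the LLC's 0 -> 1 transition touches the global count. */
	if (atomic_inc_return(&sds->nr_idle_cpus) == 1)
		atomic_inc(&nohz.nr_doms);
}

/*
 * An equivalent of the old "nohz.idle_cpus_mask" view is reconstructed
 * by walking the global list, e.g. when looking for an ILB target.
 */
static int find_nohz_idle_cpu(void)
{
	struct sched_domain_shared *sds;
	int cpu;

	guard(rcu)();
	list_for_each_entry_rcu(sds, &nohz_shared_list, nohz_list) {
		cpu = cpumask_first(sds->idle_cpus_mask);
		if (cpu < nr_cpu_ids)
			return cpu;
	}

	return nr_cpu_ids;
}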
Structure
=========
The series has been divided as follows:
Patch 01-03 - Trivial cleanup and optimization of the current
infrastructure.
Patch 04-08 - Preparation in topology.c and the idle tracking infrastructure
to support the distributed tracking.
Patch 09-11 - Introduce new members for distributed nohz tracking.
Patch 12-13 - Preparation in fair.c to convert users of
"nohz.idle_cpus_mask" to use the new infrastructure.
Patch 14-17 - Convert the bits that use the centralized nohz tracking to
use the distributed version.
Patch 18 - Convert "nohz.nr_cpus" to "nohz.nr_doms" and optimize the
central tracking to only modify the global count at the
boundary condition.
Patch 19 - Simple debug patch for sanity testing.
Benchmarking
============
Following are the results from benchmarking on a dual socket 3rd
Generation EPYC system with 128C/256T (Boost on, C2 disabled):
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) sd_nohz[pct imp](CV)
1-groups 1.00 [ -0.00]( 8.58) 0.96 [ 3.99]( 5.47)
2-groups 1.00 [ -0.00]( 3.53) 1.00 [ -0.00]( 3.57)
4-groups 1.00 [ -0.00]( 1.63) 0.98 [ 1.88]( 1.46)
8-groups 1.00 [ -0.00]( 1.80) 0.99 [ 0.68]( 2.12)
16-groups 1.00 [ -0.00]( 2.95) 1.01 [ -0.75]( 1.78)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) sd_nohz[pct imp](CV)
1 1.00 [ 0.00]( 1.34) 1.02 [ 2.00]( 0.23)
2 1.00 [ 0.00]( 0.65) 1.01 [ 1.22]( 0.18)
4 1.00 [ 0.00]( 0.58) 1.00 [ 0.17]( 0.77)
8 1.00 [ 0.00]( 0.55) 1.01 [ 0.86]( 0.16)
16 1.00 [ 0.00]( 0.52) 0.99 [ -0.59]( 0.66)
32 1.00 [ 0.00]( 1.27) 0.99 [ -1.40]( 2.55)
64 1.00 [ 0.00]( 1.60) 1.00 [ -0.09]( 2.04)
128 1.00 [ 0.00]( 0.14) 1.02 [ 1.55]( 0.79)
256 1.00 [ 0.00]( 0.75) 1.01 [ 1.00]( 1.28)
512 1.00 [ 0.00]( 0.18) 1.01 [ 0.60]( 0.36)
1024 1.00 [ 0.00]( 0.05) 1.01 [ 1.47]( 0.44)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) sd_nohz[pct imp](CV)
Copy 1.00 [ 0.00](10.72) 0.97 [ -2.76](11.31)
Scale 1.00 [ 0.00]( 5.00) 0.97 [ -3.14]( 7.20)
Add 1.00 [ 0.00]( 5.75) 0.95 [ -4.71]( 7.33)
Triad 1.00 [ 0.00]( 5.83) 0.97 [ -2.60](10.21)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) sd_nohz[pct imp](CV)
Copy 1.00 [ 0.00]( 2.31) 1.01 [ 0.72]( 1.21)
Scale 1.00 [ 0.00]( 4.58) 1.00 [ 0.44]( 4.69)
Add 1.00 [ 0.00]( 1.12) 0.99 [ -0.54]( 4.21)
Triad 1.00 [ 0.00]( 2.21) 0.98 [ -1.82]( 5.94)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) sd_nohz[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.86) 1.01 [ 0.71]( 0.45)
2-clients 1.00 [ 0.00]( 0.56) 1.01 [ 0.86]( 0.63)
4-clients 1.00 [ 0.00]( 0.50) 1.01 [ 0.74]( 0.48)
8-clients 1.00 [ 0.00]( 0.70) 1.01 [ 0.61]( 0.39)
16-clients 1.00 [ 0.00]( 0.44) 1.00 [ 0.49]( 0.57)
32-clients 1.00 [ 0.00]( 0.54) 1.00 [ 0.16]( 0.90)
64-clients 1.00 [ 0.00]( 1.66) 1.00 [ 0.45]( 1.42)
128-clients 1.00 [ 0.00]( 1.12) 1.01 [ 0.57]( 0.91)
256-clients 1.00 [ 0.00]( 3.98) 0.98 [ -1.70]( 4.77)
512-clients 1.00 [ 0.00](51.74) 0.97 [ -2.58](44.26)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sd_nohz[pct imp](CV)
1 1.00 [ -0.00](15.32) 1.00 [ -0.00]( 8.79)
2 1.00 [ -0.00](19.67) 1.00 [ -0.00](10.14)
4 1.00 [ -0.00](13.01) 1.06 [ -6.38](15.61)
8 1.00 [ -0.00](10.33) 0.98 [ 1.79]( 3.21)
16 1.00 [ -0.00]( 5.00) 1.02 [ -1.67]( 2.79)
32 1.00 [ -0.00]( 1.05) 1.01 [ -1.05]( 8.56)
64 1.00 [ -0.00]( 2.37) 0.94 [ 5.67]( 0.83)
128 1.00 [ -0.00](13.58) 0.90 [ 9.98]( 6.67)
256 1.00 [ -0.00](34.92) 1.88 [-88.43](12.68)
512 1.00 [ -0.00]( 1.72) 1.03 [ -2.57]( 1.26)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sd_nohz[pct imp](CV)
1 1.00 [ 0.00]( 0.15) 1.00 [ -0.30]( 0.55)
2 1.00 [ 0.00]( 0.30) 1.01 [ 0.59]( 0.15)
4 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.15)
8 1.00 [ 0.00]( 0.15) 1.00 [ 0.29]( 0.00)
16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
32 1.00 [ 0.00]( 3.80) 1.05 [ 4.86]( 0.55)
64 1.00 [ 0.00]( 0.20) 0.99 [ -0.76]( 5.08)
128 1.00 [ 0.00]( 0.20) 1.00 [ 0.38]( 0.20)
256 1.00 [ 0.00]( 0.70) 1.02 [ 2.50]( 0.63)
512 1.00 [ 0.00]( 0.25) 1.01 [ 0.95]( 0.00)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sd_nohz[pct imp](CV)
1 1.00 [ -0.00](13.47) 1.14 [-14.29](11.92)
2 1.00 [ -0.00](16.40) 1.12 [-12.50]( 0.00)
4 1.00 [ -0.00]( 9.94) 0.89 [ 11.11](11.92)
8 1.00 [ -0.00]( 5.53) 1.11 [-11.11](14.13)
16 1.00 [ -0.00](13.22) 1.00 [ -0.00]( 0.00)
32 1.00 [ -0.00](11.71) 0.83 [ 16.67]( 0.00)
64 1.00 [ -0.00]( 3.87) 1.08 [ -7.69]( 6.39)
128 1.00 [ -0.00]( 3.51) 0.98 [ 1.56]( 4.83)
256 1.00 [ -0.00]( 4.91) 1.18 [-17.95]( 8.31)
512 1.00 [ -0.00]( 0.20) 1.00 [ -0.39]( 0.20)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sd_nohz[pct imp](CV)
1 1.00 [ -0.00]( 0.13) 0.98 [ 1.78]( 2.11)
2 1.00 [ -0.00]( 0.13) 0.94 [ 6.11]( 2.83)
4 1.00 [ -0.00]( 2.40) 0.96 [ 3.82]( 1.21)
8 1.00 [ -0.00]( 0.00) 0.96 [ 3.82]( 0.55)
16 1.00 [ -0.00]( 2.88) 0.99 [ 1.32]( 1.20)
32 1.00 [ -0.00](12.96) 0.89 [ 11.17]( 2.34)
64 1.00 [ -0.00]( 4.80) 0.99 [ 0.88]( 5.56)
128 1.00 [ -0.00]( 2.49) 1.02 [ -1.81]( 2.72)
256 1.00 [ -0.00]( 3.95) 1.12 [-11.52]( 7.78)
512 1.00 [ -0.00]( 0.88) 1.00 [ -0.00]( 1.02)
==================================================================
Test : Various longer running benchmarks
Units : %diff in throughput reported
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmarks: %diff
ycsb-cassandra 1.36%
ycsb-mongodb -1.66%
deathstarbench-1x 0.04%
deathstarbench-2x -0.10%
deathstarbench-3x 3.84%
deathstarbench-6x 1.54%
hammerdb+mysql 16VU 0.03%
hammerdb+mysql 64VU 1.22%
The schbench datapoints were rerun, and the regressions turned out to
be mostly the result of run-to-run variance. All the datapoints show
noticeable variance, and the results can swing either way depending on
the noise in the system.
References
==========
[1] https://lore.kernel.org/lkml/20240531205452.65781-1-tim.c.chen@linux.intel.com/
[2] https://lore.kernel.org/lkml/20250416035823.1846307-1-tim.c.chen@linux.intel.com/
[3] https://lore.kernel.org/lkml/20250423174634.3009657-1-edumazet@google.com/
[4] https://lore.kernel.org/lkml/20250410102945.GD30687@noisy.programming.kicks-ass.net/
[5] https://lore.kernel.org/lkml/20250409111539.23791-1-kprateek.nayak@amd.com/
This series was tested on tip:sched/core at commit 1b5f1454091e
("sched/idle: Remove play_idle()") with commit 99b773d720ae ("sched/psi:
Fix psi_seq initialization") from v6.17-rc1 cherry-picked on top.
CONFIG_HZ_PERIODIC was only build and boot tested.

The series is based on tip:sched/core at commit 5b726e9bf954
("sched/fair: Get rid of throttled_lb_pair()").
Special thanks to Gautham for his help with the tricky bits and an early
review of the series.
---
K Prateek Nayak (19):
sched/fair: Simplify set_cpu_sd_state_*() with guards
sched/topology: Optimize sd->shared allocation and assignment
sched/fair: Use rq->nohz_tick_stopped in update_nohz_stats()
sched/fair: Use xchg() to set sd->nohz_idle state
sched/topology: Attach new hierarchy in rq_attach_root()
sched/fair: Fixup sd->nohz_idle state during hotplug / cpuset
sched/fair: Account idle cpus instead of busy cpus in sd->shared
sched/topology: Introduce fallback sd->shared assignment
sched/topology: Introduce percpu sd_nohz for nohz state tracking
sched/topology: Introduce "idle_cpus_mask" in sd->shared
sched/topology: Introduce "nohz_shared_list" to keep track of
sd->shared
sched/fair: Reorder the barrier in nohz_balance_enter_idle()
sched/fair: Extract the main _nohz_idle_balance() loop into a helper
sched/fair: Convert find_new_ilb() to use nohz_shared_list
sched/fair: Introduce sched_asym_prefer_idle() for ILB kick
sched/fair: Convert sched_balance_nohz_idle() to use nohz_shared_list
sched/fair: Remove "nohz.idle_cpus_mask"
sched/fair: Optimize global "nohz.nr_cpus" tracking
sched/topology: Add basic debug information for "nohz_shared_list"
include/linux/sched/topology.h | 14 +-
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 432 ++++++++++++++++++++++++---------
kernel/sched/sched.h | 6 +-
kernel/sched/topology.c | 381 +++++++++++++++++++++++++----
5 files changed, 663 insertions(+), 172 deletions(-)
base-commit: 5b726e9bf9544a349090879a513a5e00da486c14
--
2.34.1