Message-Id: <cover.1770760558.git.tim.c.chen@linux.intel.com>
Date: Tue, 10 Feb 2026 14:18:40 -0800
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Vincent Guittot <vincent.guittot@...aro.org>
Cc: Tim Chen <tim.c.chen@...ux.intel.com>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>,
Shrikanth Hegde <sshegde@...ux.ibm.com>,
Jianyong Wu <jianyong.wu@...look.com>,
Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>,
Vern Hao <vernhao@...cent.com>,
Vern Hao <haoxing990@...il.com>,
Len Brown <len.brown@...el.com>,
Aubrey Li <aubrey.li@...el.com>,
Zhao Liu <zhao1.liu@...el.com>,
Chen Yu <yu.chen.surf@...il.com>,
Chen Yu <yu.c.chen@...el.com>,
Adam Li <adamli@...amperecomputing.com>,
Aaron Lu <ziqianlu@...edance.com>,
Tim Chen <tim.c.chen@...el.com>,
Josh Don <joshdon@...gle.com>,
Gavin Guo <gavinguo@...lia.com>,
Qais Yousef <qyousef@...alina.io>,
Libo Chen <libchen@...estorage.com>,
linux-kernel@...r.kernel.org
Subject: [PATCH v3 00/21] Cache Aware Scheduling
This patch series introduces infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data within
the same Last Level Cache (LLC) domain. By improving cache locality,
the scheduler can reduce cache bouncing and cache misses, ultimately
improving data access efficiency. The design builds on the initial
prototype from Peter [1].
This initial implementation treats threads within the same process as
entities that are likely to share data. During load balancing, the
scheduler attempts to aggregate such threads onto the same LLC domain
whenever possible.
Most of the feedback received on v2 has been addressed. There were
discussions around grouping tasks using mechanisms other than process
membership. While we agree that more flexible grouping is desirable, this
series intentionally focuses on establishing the basic process-based
grouping first, with alternative grouping mechanisms to be explored
in a follow-on series. As a step in that direction, cache-aware
scheduling statistics have been separated from the mm structure into a
new sched_cache_stats structure. Thanks for the useful feedback at
LPC 2025 and on v2; we plan to start a separate thread to discuss
possible user interfaces.
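For illustration only, a per-process statistics structure along those
lines might look roughly like the sketch below; the field names are
assumptions made for this example, not the exact layout in the series:

  struct sched_cache_stats {
          int             preferred_llc;   /* LLC this process should aggregate on */
          int             nr_llcs;         /* number of per-LLC slots below */
          unsigned long   *llc_occupancy;  /* recent per-LLC occupancy of the process */
  };

Keeping these fields in a structure of their own, rather than embedding
them directly in the mm structure, should make it easier to attach the
same statistics to grouping entities other than a process later on.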
The load balancing algorithms remain largely unchanged. The main
changes in v3 are:
1. Cache-aware scheduling is skipped after repeated load balance
failures (up to cache_nice_tries). This avoids repeatedly attempting
cache-aware migrations when no movable tasks prefer the destination
LLC.
2. The busiest runqueue is no longer sorted to select tasks that prefer
the destination LLC. This sorting was costly, and equivalent
behavior can be achieved by skipping tasks that do not prefer the
destination LLC during cache-aware migrations (see the sketch after
this list).
3. The LLC ID is now derived directly from sched_domain_topology_level
data, which simplifies the ID derivation.
4. Accounting of the number of tasks preferring each LLC is now kept in
the lowest-level sched domain per CPU. This simplifies handling of
LLC resizing and changes in the number of LLC domains.
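For changes 1 and 2, the resulting filtering is conceptually similar to
the sketch below. This is a schematic model only; the type and field
names (prefer_llc, dst_llc, nr_balance_failed, cache_nice_tries) mirror
concepts in the series but are not the actual patch code:

  struct task_hint {
          int     prefer_llc;             /* LLC the task's process prefers, -1 if none */
  };

  struct lb_ctx {
          int     dst_llc;                /* LLC of the destination CPU */
          int     nr_balance_failed;      /* consecutive failed balance attempts */
          int     cache_nice_tries;       /* give-up threshold for cache-aware moves */
  };

  /*
   * Return 1 if the cache-aware filter should skip pulling this task,
   * 0 if normal migration rules should decide.
   */
  static int cache_aware_skip_task(const struct task_hint *p, const struct lb_ctx *ctx)
  {
          /*
           * Change 1: after repeated load balance failures, stop applying
           * the cache-aware filter and fall back to normal migration rules.
           */
          if (ctx->nr_balance_failed > ctx->cache_nice_tries)
                  return 0;

          /*
           * Change 2: instead of sorting the busiest runqueue, simply skip
           * tasks that do not prefer the destination LLC.
           */
          if (p->prefer_llc >= 0 && p->prefer_llc != ctx->dst_llc)
                  return 1;

          return 0;
  }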
Test results:
The patch series was applied and tested on v6.19-rc3.
See: https://github.com/timcchen1298/linux/commits/cache_aware_v3
The first test platform is a 2-socket Intel Sapphire Rapids system with
30 cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.
The second test platform is an AMD Genoa. There are 4 nodes and 32 CPUs
per node. Each node has 2 CCXs, and each CCX has 16 CPUs.
hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched
on these two platforms.
[TL;DR]
Sapphire Rapids:
hackbench shows significant improvement when the number of
active threads is below the capacity of an LLC.
schbench shows overall wakeup latency improvement.
ChaCha20-xiangshan (RISC-V simulator) shows good throughput
improvement. No obvious difference was observed in
the Hmean of netperf/stream/stress-ng.
Genoa:
Significant improvement is observed in hackbench when
the number of active threads is lower than the number
of CPUs within one LLC. On v2, Aaron reported improvement
in hackbench/redis when the system is underloaded.
ChaCha20-xiangshan shows a large throughput improvement.
Phoronix tested v1 and reported good improvements in 30+
cases [2]. No obvious difference was observed in
the Hmean of netperf/stream/stress-ng.
Details:
Due to length constraints, data showing little difference from the
baseline are not presented.
Sapphire Rapids:
[hackbench pipe]
case load baseline(std%) compare%( std%)
threads-pipe-2 1-groups 1.00 ( 3.19) +29.06 ( 3.31)*
threads-pipe-2 2-groups 1.00 ( 9.61) +19.19 ( 0.55)*
threads-pipe-2 4-groups 1.00 ( 6.69) +15.02 ( 1.34)*
threads-pipe-2 8-groups 1.00 ( 1.83) +25.59 ( 1.46)*
threads-pipe-4 1-groups 1.00 ( 3.41) +28.63 ( 1.17)*
threads-pipe-4 2-groups 1.00 ( 15.62) +19.51 ( 0.82)
threads-pipe-4 4-groups 1.00 ( 0.19) +27.05 ( 0.74)*
threads-pipe-4 8-groups 1.00 ( 4.32) +5.64 ( 3.18)
threads-pipe-8 1-groups 1.00 ( 0.44) +24.68 ( 0.49)*
threads-pipe-8 2-groups 1.00 ( 2.03) +23.76 ( 0.52)*
threads-pipe-8 4-groups 1.00 ( 3.77) +7.16 ( 1.58)
threads-pipe-8 8-groups 1.00 ( 4.53) +6.88 ( 2.36)
threads-pipe-16 1-groups 1.00 ( 1.71) +28.46 ( 0.68)*
threads-pipe-16 2-groups 1.00 ( 4.25) -0.23 ( 0.97)
threads-pipe-16 4-groups 1.00 ( 0.64) -0.95 ( 3.74)
threads-pipe-16 8-groups 1.00 ( 1.23) +1.77 ( 0.31)
Note: The default number of fds in hackbench is changed from 20 to various
values to ensure that threads fit within a single LLC, especially on AMD
systems. Take "threads-pipe-8, 2-groups" as an example: the number of fds
is 8, and 2 groups are created.
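For reference, assuming the usual hackbench tool, the "threads-pipe-8,
2-groups" case corresponds roughly to an invocation like the following
(option spelling may vary between hackbench versions):

  hackbench -p -T -g 2 -f 8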
[schbench]
The 99th percentile wakeup latency shows overall improvement, while
the 99th percentile request latency exhibits some increased run-to-run
variance. The cache-aware scheduling logic, which scans all online CPUs
to identify the hottest LLC, may be the root cause of the elevated
request latency. It delays the task from returning to user space
due to the costly task_cache_work(). This issue should be mitigated by
restricting the scan to a limited set of NUMA nodes [3], and the fix is
planned to be integrated after the current version is in good shape.
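For context, the scan mentioned above is conceptually similar to the
model below (illustrative C only, not the kernel code; names and bounds
are assumptions): walk every online CPU, accumulate the process's
occupancy per LLC, and pick the LLC with the largest share, which makes
the work scale with the number of online CPUs:

  #define MAX_LLCS        64              /* illustrative bound, not from the series */

  /* cpu_occ[i]: this process's recent occupancy on CPU i; cpu_llc[i]: LLC id of CPU i */
  static int pick_hottest_llc(const unsigned long *cpu_occ, const int *cpu_llc,
                              int nr_cpus, int nr_llcs)
  {
          unsigned long occ[MAX_LLCS] = { 0 };
          int i, hottest = -1;

          for (i = 0; i < nr_cpus; i++)           /* the O(nr_cpus) walk that makes */
                  occ[cpu_llc[i]] += cpu_occ[i];  /* the per-task work costly */

          for (i = 0; i < nr_llcs; i++)
                  if (hottest < 0 || occ[i] > occ[hottest])
                          hottest = i;

          return hottest;
  }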
99th Wakeup Latencies Base (mean±std) Compare (mean±std) Change
--------------------------------------------------------------------------------
thread = 2 13.33(1.15) 13.00(1.73) +2.48%
thread = 4 12.33(1.53) 9.67(1.53) +21.57%
thread = 8 10.00(0.00) 10.67(0.58) -6.70%
thread = 16 10.00(1.00) 9.33(0.58) +6.70%
thread = 32 10.33(0.58) 9.67(1.53) +6.39%
thread = 64 10.33(0.58) 9.33(1.53) +9.68%
thread = 128 12.67(0.58) 12.00(0.00) +5.29%
Run-to-run variance regression at 1 message thread + 8 workers:
Request Latencies 99.0th 3981.33(260.16) 4877.33(1880.57) -22.51%
[chacha20]
Time reduced by 20%
Genoa:
[hackbench pipe]
The default number of fds is 20, which exceeds the number of CPUs
in an LLC, so the number of fds is adjusted to 2, 4, 8 and 16 respectively.
Excluding results with large run-to-run variance, a 20% ~ 50%
improvement is observed when the system is underloaded:
case load baseline(std%) compare%( std%)
threads-pipe-2 1-groups 1.00 ( 4.04) +47.22 ( 4.77)*
threads-pipe-2 2-groups 1.00 ( 5.04) +33.79 ( 8.92)*
threads-pipe-2 4-groups 1.00 ( 5.82) +5.93 ( 7.97)
threads-pipe-2 8-groups 1.00 ( 16.15) -4.11 ( 6.85)
threads-pipe-4 1-groups 1.00 ( 7.28) +50.43 ( 2.39)*
threads-pipe-4 2-groups 1.00 ( 10.77) -4.31 ( 7.71)
threads-pipe-4 4-groups 1.00 ( 11.16) +8.12 ( 11.21)
threads-pipe-4 8-groups 1.00 ( 12.79) -10.10 ( 12.92)
threads-pipe-8 1-groups 1.00 ( 5.57) -1.50 ( 6.55)
threads-pipe-8 2-groups 1.00 ( 10.72) +0.69 ( 6.38)
threads-pipe-8 4-groups 1.00 ( 7.04) +19.70 ( 5.58)*
threads-pipe-8 8-groups 1.00 ( 7.11) +27.46 ( 2.34)*
threads-pipe-16 1-groups 1.00 ( 2.86) -12.82 ( 8.97)
threads-pipe-16 2-groups 1.00 ( 8.55) +2.96 ( 1.65)
threads-pipe-16 4-groups 1.00 ( 5.12) +20.49 ( 5.33)*
threads-pipe-16 8-groups 1.00 ( 3.23) +9.06 ( 2.87)
[chacha20]
baseline:
Host time spent: 51432ms
sched_cache:
Host time spent: 28664ms
Time reduced by about 44%
[1] https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
[2] https://www.phoronix.com/review/cache-aware-scheduling-amd-turin
[3] https://lore.kernel.org/all/865b852e3fdef6561c9e0a5be9a94aec8a68cdea.1760206683.git.tim.c.chen@linux.intel.com/
Change history:
**v3 Changes:**
1. Cache-aware scheduling is skipped after repeated load balance
failures (up to cache_nice_tries). This avoids repeatedly attempting
cache-aware migrations when no movable tasks prefer the destination
LLC.
2. The busiest runqueue is no longer sorted to select tasks that prefer
the destination LLC. This sorting was costly, and equivalent
behavior can be achieved by skipping tasks that do not prefer the
destination LLC during cache-aware migrations.
3. Accounting of the number of tasks preferring each LLC is now kept in
the lowest-level sched domain per CPU. This simplifies handling of
LLC resizing and changes in the number of LLC domains.
4. Other changes from v2 are detailed in each patch's change log.
**v2 Changes:**
v2 link: https://lore.kernel.org/all/cover.1764801860.git.tim.c.chen@linux.intel.com/
1. Align NUMA balancing and cache affinity by
prioritizing NUMA balancing when their decisions differ.
2. Dynamically resize per-LLC statistics structures based on the LLC
size.
3. Switch to a contiguous LLC-ID space so these IDs can be used
directly as array indices for LLC statistics.
4. Add clarification comments.
5. Add 3 debug patches (not meant for merging).
6. Other changes to address feedback from the review of the v1 patch set
(see individual patch change log).
**v1**
v1 link: https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
Chen Yu (10):
sched/cache: Record per LLC utilization to guide cache aware
scheduling decisions
sched/cache: Introduce helper functions to enforce LLC migration
policy
sched/cache: Make LLC id continuous
sched/cache: Disable cache aware scheduling for processes with high
thread counts
sched/cache: Avoid cache-aware scheduling for memory-heavy processes
sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
sched/cache: Allow the user space to turn on and off cache aware
scheduling
sched/cache: Add user control to adjust the aggressiveness of
cache-aware scheduling
-- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy
for each process via proc fs
-- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load
balance statistics
Peter Zijlstra (Intel) (1):
sched/cache: Introduce infrastructure for cache-aware load balancing
Tim Chen (10):
sched/cache: Assign preferred LLC ID to processes
sched/cache: Track LLC-preferred tasks per runqueue
sched/cache: Introduce per CPU's tasks LLC preference counter
sched/cache: Calculate the percpu sd task LLC preference
sched/cache: Count tasks prefering destination LLC in a sched group
sched/cache: Check local_group only once in update_sg_lb_stats()
sched/cache: Prioritize tasks preferring destination LLC during
balancing
sched/cache: Add migrate_llc_task migration type for cache-aware
balancing
sched/cache: Handle moving single tasks to/from their preferred LLC
sched/cache: Respect LLC preference in task migration and detach
fs/proc/base.c | 31 +
include/linux/cacheinfo.h | 21 +-
include/linux/mm_types.h | 43 ++
include/linux/sched.h | 32 +
include/linux/sched/topology.h | 8 +
include/trace/events/sched.h | 79 +++
init/Kconfig | 11 +
init/init_task.c | 3 +
kernel/fork.c | 6 +
kernel/sched/core.c | 11 +
kernel/sched/debug.c | 55 ++
kernel/sched/fair.c | 1088 +++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 44 ++
kernel/sched/topology.c | 194 +++++-
14 files changed, 1598 insertions(+), 28 deletions(-)
--
2.32.0