Message-Id: <cover.1764801860.git.tim.c.chen@linux.intel.com>
Date: Wed, 3 Dec 2025 15:07:19 -0800
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Vincent Guittot <vincent.guittot@...aro.org>
Cc: Tim Chen <tim.c.chen@...ux.intel.com>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>,
Shrikanth Hegde <sshegde@...ux.ibm.com>,
Jianyong Wu <jianyong.wu@...look.com>,
Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>,
Vern Hao <vernhao@...cent.com>,
Vern Hao <haoxing990@...il.com>,
Len Brown <len.brown@...el.com>,
Aubrey Li <aubrey.li@...el.com>,
Zhao Liu <zhao1.liu@...el.com>,
Chen Yu <yu.chen.surf@...il.com>,
Chen Yu <yu.c.chen@...el.com>,
Adam Li <adamli@...amperecomputing.com>,
Aaron Lu <ziqianlu@...edance.com>,
Tim Chen <tim.c.chen@...el.com>,
linux-kernel@...r.kernel.org
Subject: [PATCH v2 00/23] Cache aware scheduling
This patch series introduces infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data on the
same Last Level Cache (LLC) domain. By improving cache locality, the
scheduler can reduce cache bouncing and cache misses, ultimately
improving data access efficiency. The design builds on the initial
prototype from Peter [1].
In this initial implementation, threads within the same process are
treated as entities that are likely to share data. During load
balancing, the scheduler attempts to aggregate these threads onto the
same LLC domain whenever possible.
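As a toy illustration of the aggregation idea only (this is not the
kernel code, and the selection rule here is an assumption made for the
example), one can model a process as a map from thread id to the LLC it
currently runs on, pick the LLC that already hosts the most threads,
and list the threads a balancer would try to pull over. The names
preferred_llc() and threads_to_aggregate() are hypothetical.

```python
from collections import Counter


def preferred_llc(thread_llcs):
    """Return the LLC id hosting the most threads of one process.

    thread_llcs maps a thread id to the LLC it currently runs on.
    Ties go to the lowest LLC id so the result is deterministic.
    This rule is an assumption for illustration, not the series' policy.
    """
    counts = Counter(thread_llcs.values())
    return min(counts, key=lambda llc: (-counts[llc], llc))


def threads_to_aggregate(thread_llcs):
    """List the threads that currently sit outside the preferred LLC."""
    target = preferred_llc(thread_llcs)
    return sorted(t for t, llc in thread_llcs.items() if llc != target)


# Hypothetical process with four threads spread over two LLCs.
placement = {101: 0, 102: 1, 103: 1, 104: 1}
print(preferred_llc(placement))         # prints 1
print(threads_to_aggregate(placement))  # prints [101]
```

The real series also weighs per-LLC utilization, thread counts and
memory footprint (see the patch titles below), none of which this toy
model attempts to capture.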
We would like to thank everyone who provided feedback on the v1
series [1]. Most of the comments have been addressed in this revision.
Several broader suggestions surfaced during review, and we believe
they are best approached in follow-up work once the foundational
cache-aware scheduling infrastructure is merged:
1. **Generalizing task grouping beyond processes.**
While v2 focuses on grouping threads within a single process, other
classes of workloads naturally share data and could benefit from LLC
co-location, such as:
a) Tasks from different processes that operate on shared data.
b) Tasks belonging to the same NUMA group.
c) Tasks with strong waker/wakee relationships.
d) User-defined groups via cgroups or other user interfaces.
2. **Configurable cache-aware scheduling policies.**
The current iteration implements a global cache-aware scheduling
policy. Future work may introduce per-process or per-task-group
policies, exposed through prctl() or other mechanisms.
**v2 Changes:**
1. Align NUMA balancing and cache affinity by
prioritizing NUMA balancing when their decisions differ.
2. Dynamically resize per-LLC statistics structures based on the LLC
size.
3. Switch to a contiguous LLC-ID space so these IDs can be used
directly as array indices for LLC statistics.
4. Add clarification comments.
5. Add 3 debug patches (not meant for merging).
6. Other changes to address feedback from review of the v1 patch set
(see individual patch change log).
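To illustrate why a contiguous LLC-ID space (item 3) is convenient,
here is a minimal user-space sketch, assuming raw LLC ids that are
sparse. The raw id values and the helper name make_contiguous() are
made up for the example; the kernel patches do this in C inside the
topology code.

```python
def make_contiguous(raw_llc_ids):
    """Map arbitrary (possibly sparse) LLC ids to 0..n-1 in first-seen order."""
    mapping = {}
    for raw in raw_llc_ids:
        if raw not in mapping:
            mapping[raw] = len(mapping)
    return mapping


# Hypothetical raw LLC id for each of six CPUs; the ids are sparse.
cpu_to_raw_llc = [0, 0, 16, 16, 32, 32]

mapping = make_contiguous(cpu_to_raw_llc)

# With contiguous ids, per-LLC statistics fit in a dense array and the
# id itself is the index; no lookup table or hash is needed afterwards.
llc_stats = [0] * len(mapping)
for raw in cpu_to_raw_llc:
    llc_stats[mapping[raw]] += 1

print(mapping)    # prints {0: 0, 16: 1, 32: 2}
print(llc_stats)  # prints [2, 2, 2]
```

Indexing by the raw ids would instead need an array sized for the
largest raw id (33 entries here for 3 LLCs).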
Test results:
The patch series was applied and tested on v6.18-rc7.
See: https://github.com/timcchen1298/linux/commits/cache_aware_v2
The first test platform is a 2-socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.
The second test platform is an AMD Genoa. There are 4 nodes and 32 CPUs
per node. Each node has 2 CCXs and each CCX has 16 CPUs.
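The CPU counts per LLC quoted above can be cross-checked with a few
lines of arithmetic. This is only a sanity check; the factor of 2
hardware threads per core on Sapphire Rapids is an assumption inferred
from "30 cores per socket" and "60 CPUs per LLC", and cpus_per_llc()
is a hypothetical helper.

```python
def cpus_per_llc(total_cpus, num_llcs):
    """Evenly divide logical CPUs across LLC domains."""
    if total_cpus % num_llcs:
        raise ValueError("CPUs do not divide evenly across LLCs")
    return total_cpus // num_llcs


# Sapphire Rapids: 2 sockets x 30 cores x 2 threads (assumed SMT2),
# one LLC per socket.
spr_cpus = 2 * 30 * 2
spr_per_llc = cpus_per_llc(spr_cpus, 2)

# Genoa: 4 nodes x 32 CPUs, 2 CCXs per node, one LLC per CCX.
genoa_cpus = 4 * 32
genoa_llcs = 4 * 2
genoa_per_llc = cpus_per_llc(genoa_cpus, genoa_llcs)

print(spr_cpus, spr_per_llc)                  # prints 120 60
print(genoa_cpus, genoa_llcs, genoa_per_llc)  # prints 128 8 16
```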
hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched on
these two platforms.
[TL;DR]
Sapphire Rapids:
hackbench shows significant improvement when the number of
different active threads is below the capacity of an LLC.
schbench shows overall wakeup latency improvement.
ChaCha20-xiangshan shows good throughput improvement.
Genoa:
ChaCha20-xiangshan shows huge throughput improvement.
No obvious difference is observed in hackbench/schbench
/netperf/stream/stress-ng.
Phoronix tested v1 and reported good improvements
in 33 cases [2].
Detail:
Due to length constraints, only part of the data is presented.
Sapphire Rapids:
hackbench thread pipes
        groups        baseline              sched_cache
Amean        1     38.8224 (  0.00%)     26.4582 * 31.85%*
Amean        3     38.2358 (  0.00%)     38.0758 (  0.42%)
Amean        5     40.7282 (  0.00%)     41.1568 ( -1.05%)
Amean        7     51.1720 (  0.00%)     50.6646 (  0.99%)
Amean       12     63.1562 (  0.00%)     63.3516 ( -0.31%)
Amean       16     73.9584 (  0.00%)     75.5596 ( -2.17%)
Max          1     39.4140 (  0.00%)     26.7590 ( 32.11%)
Max          3     40.8310 (  0.00%)     39.8000 (  2.53%)
Max          5     42.2150 (  0.00%)     42.4860 ( -0.64%)
Max          7     52.1800 (  0.00%)     51.9370 (  0.47%)
Max         12     63.9430 (  0.00%)     64.2820 ( -0.53%)
Max         16     74.3710 (  0.00%)     76.4170 ( -2.75%)
Further hackbench tests using other numbers of fds:
case-fd           groups      baseline (std%)    compare% (std%)
threads-pipe-2    1-groups    1.00 (  1.25)      +38.52 (  1.33)
threads-pipe-2    2-groups    1.00 ( 12.52)      +12.74 (  1.31)
threads-pipe-2    4-groups    1.00 (  7.91)      +12.29 (  1.86)
threads-pipe-4    1-groups    1.00 (  0.55)      +34.99 (  0.45)
threads-pipe-4    2-groups    1.00 ( 16.00)      +27.32 (  0.75)
threads-pipe-4    4-groups    1.00 ( 17.37)      +25.75 (  0.20)
threads-pipe-8    1-groups    1.00 (  0.74)      +27.13 (  0.44)
threads-pipe-8    2-groups    1.00 (  8.82)      +23.79 (  0.32)
threads-pipe-8    4-groups    1.00 (  1.30)      +27.64 (  0.51)
threads-pipe-16   1-groups    1.00 (  1.03)      +30.55 (  0.27)
threads-pipe-16   2-groups    1.00 (  6.43)      +29.52 (  0.20)
threads-pipe-16   4-groups    1.00 (  1.36)       -1.85 (  1.43)
threads-pipe-20   1-groups    1.00 (  0.45)      +30.88 (  0.42)
threads-pipe-20   2-groups    1.00 (  1.95)       -0.81 (  5.84)
threads-pipe-20   4-groups    1.00 (  2.09)       -1.77 (  7.57)
stream:
                     baseline          sched_cache
GB/sec copy-2     36.48 (  0.00%)    36.55 (  0.18%)
GB/sec scale-2    36.83 (  0.00%)    36.97 (  0.38%)
GB/sec add-2      37.92 (  0.00%)    38.03 (  0.31%)
GB/sec triad-2    37.83 (  0.00%)    37.97 (  0.37%)
stress-ng context switch:
baseline sched_cache
Min context-1 2957.81 ( 0.00%) 2966.17 ( 0.28%)
Min context-2 5931.68 ( 0.00%) 5930.17 ( -0.03%)
Min context-4 11874.20 ( 0.00%) 11875.68 ( 0.01%)
Min context-8 23755.30 ( 0.00%) 23762.43 ( 0.03%)
Min context-16 47535.14 ( 0.00%) 47526.46 ( -0.02%)
Min context-32 95078.66 ( 0.00%) 94356.39 ( -0.76%)
Min context-64 190074.62 ( 0.00%) 190042.93 ( -0.02%)
Min context-128 371107.12 ( 0.00%) 371008.10 ( -0.03%)
Min context-256 578443.73 ( 0.00%) 579037.86 ( 0.10%)
Min context-480 580203.34 ( 0.00%) 580499.43 ( 0.05%)
Hmean context-1 2964.59 ( 0.00%) 2967.69 ( 0.10%)
Hmean context-2 5936.41 ( 0.00%) 5935.51 ( -0.02%)
Hmean context-4 11879.56 ( 0.00%) 11881.70 ( 0.02%)
Hmean context-8 23771.92 ( 0.00%) 23770.28 ( -0.01%)
Hmean context-16 47552.23 ( 0.00%) 47538.01 ( -0.03%)
Hmean context-32 95102.67 ( 0.00%) 94969.43 ( -0.14%)
Hmean context-64 190129.74 ( 0.00%) 190088.68 ( -0.02%)
Hmean context-128 371291.95 ( 0.00%) 371114.82 ( -0.05%)
Hmean context-256 578907.96 ( 0.00%) 579338.99 ( 0.07%)
Hmean context-480 580541.78 ( 0.00%) 580726.13 ( 0.03%)
Max context-1 2967.93 ( 0.00%) 2968.90 ( 0.03%)
Max context-2 5942.37 ( 0.00%) 5940.40 ( -0.03%)
Max context-4 11885.25 ( 0.00%) 11886.43 ( 0.01%)
Max context-8 23784.17 ( 0.00%) 23783.31 ( -0.00%)
Max context-16 47576.84 ( 0.00%) 47561.42 ( -0.03%)
Max context-32 95139.03 ( 0.00%) 95094.86 ( -0.05%)
Max context-64 190180.08 ( 0.00%) 190123.31 ( -0.03%)
Max context-128 371451.73 ( 0.00%) 371240.25 ( -0.06%)
Max context-256 579355.24 ( 0.00%) 579731.37 ( 0.06%)
Max context-480 580750.44 ( 0.00%) 581118.33 ( 0.06%)
BHmean-50 context-1 2966.80 ( 0.00%) 2968.82 ( 0.07%)
BHmean-50 context-2 5939.32 ( 0.00%) 5939.49 ( 0.00%)
BHmean-50 context-4 11883.02 ( 0.00%) 11886.08 ( 0.03%)
BHmean-50 context-8 23778.40 ( 0.00%) 23775.90 ( -0.01%)
BHmean-50 context-16 47568.31 ( 0.00%) 47546.19 ( -0.05%)
BHmean-50 context-32 95125.84 ( 0.00%) 95087.06 ( -0.04%)
BHmean-50 context-64 190165.37 ( 0.00%) 190117.94 ( -0.02%)
BHmean-50 context-128 371405.28 ( 0.00%) 371168.75 ( -0.06%)
BHmean-50 context-256 579137.11 ( 0.00%) 579609.35 ( 0.08%)
BHmean-50 context-480 580646.72 ( 0.00%) 580920.46 ( 0.05%)
BHmean-95 context-1 2965.72 ( 0.00%) 2967.94 ( 0.07%)
BHmean-95 context-2 5937.20 ( 0.00%) 5936.40 ( -0.01%)
BHmean-95 context-4 11880.45 ( 0.00%) 11882.71 ( 0.02%)
BHmean-95 context-8 23774.69 ( 0.00%) 23771.59 ( -0.01%)
BHmean-95 context-16 47555.08 ( 0.00%) 47539.93 ( -0.03%)
BHmean-95 context-32 95106.67 ( 0.00%) 95072.38 ( -0.04%)
BHmean-95 context-64 190138.93 ( 0.00%) 190096.30 ( -0.02%)
BHmean-95 context-128 371322.78 ( 0.00%) 371132.61 ( -0.05%)
BHmean-95 context-256 578985.41 ( 0.00%) 579389.21 ( 0.07%)
BHmean-95 context-480 580598.22 ( 0.00%) 580763.93 ( 0.03%)
BHmean-99 context-1 2965.72 ( 0.00%) 2967.94 ( 0.07%)
BHmean-99 context-2 5937.20 ( 0.00%) 5936.40 ( -0.01%)
BHmean-99 context-4 11880.45 ( 0.00%) 11882.71 ( 0.02%)
BHmean-99 context-8 23774.69 ( 0.00%) 23771.59 ( -0.01%)
BHmean-99 context-16 47555.08 ( 0.00%) 47539.93 ( -0.03%)
BHmean-99 context-32 95106.67 ( 0.00%) 95072.38 ( -0.04%)
BHmean-99 context-64 190138.93 ( 0.00%) 190096.30 ( -0.02%)
BHmean-99 context-128 371322.78 ( 0.00%) 371132.61 ( -0.05%)
BHmean-99 context-256 578985.41 ( 0.00%) 579389.21 ( 0.07%)
BHmean-99 context-480 580598.22 ( 0.00%) 580763.93 ( 0.03%)
schbench thread = 1
Metric Base (mean±std) Compare (mean±std) Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th 10.71(0.76) 9.86(1.46) +7.94%
Request Latencies 99.0th 4036.00(6.53) 4054.29(10.03) -0.45%
RPS 50.0th 267.29(0.49) 266.86(0.38) -0.16%
Average RPS 268.42(0.16) 267.86(0.31) -0.21%
schbench thread = 2
Metric Base (mean±std) Compare (mean±std) Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th 11.43(1.13) 8.00(2.00) +30.01%
Request Latencies 99.0th 4007.43(34.52) 3967.43(70.03) +1.00%
RPS 50.0th 536.71(0.76) 536.14(1.57) -0.11%
Average RPS 536.59(0.55) 535.33(1.34) -0.23%
schbench thread = 4
Metric Base (mean±std) Compare (mean±std) Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th 9.57(0.79) 6.14(1.46) +35.84%
Request Latencies 99.0th 3789.14(31.47) 3810.86(48.97) -0.57%
RPS 50.0th 1074.00(0.00) 1073.43(2.76) -0.05%
Average RPS 1075.03(1.07) 1072.93(2.13) -0.20%
schbench thread = 8
Metric Base (mean±std) Compare (mean±std) Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th 9.29(0.49) 6.57(1.81) +29.28%
Request Latencies 99.0th 3756.00(19.60) 3769.71(23.87) -0.37%
RPS 50.0th 2152.57(4.28) 2152.57(4.28) 0.00%
Average RPS 2151.07(2.71) 2150.58(3.41) -0.02%
schbench thread = 16
Metric Base (mean±std) Compare (mean±std) Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th 9.43(0.53) 6.86(0.90) +27.25%
Request Latencies 99.0th 3780.00(32.98) 3774.29(11.04) +0.15%
RPS 50.0th 4305.14(8.55) 4307.43(7.81) +0.05%
Average RPS 4303.47(5.74) 4301.71(4.35) -0.04%
schbench thread = 32
Metric Base (mean±std) Compare (mean±std) Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th 10.14(0.38) 6.86(0.69) +32.35%
Request Latencies 99.0th 3764.00(21.66) 3806.29(32.24) -1.12%
RPS 50.0th 8624.00(0.00) 8619.43(12.09) -0.05%
Average RPS 8607.36(5.29) 8602.69(7.08) -0.05%
schbench thread = 64
Metric Base (mean±std) Compare (mean±std) Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th 11.71(0.49) 8.43(1.81) +28.01%
Request Latencies 99.0th 3796.00(62.48) 3860.25(147.35) -1.69%
RPS 50.0th 17238.86(24.19) 16411.43(88.95) -4.80%
Average RPS 17209.02(10.18) 16389.73(100.27) -4.76%
schbench thread = 128
Metric Base (mean±std) Compare (mean±std) Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th 13.29(0.49) 12.00(0.00) +9.71%
Request Latencies 99.0th 7893.71(11.04) 7909.71(17.10) -0.20%
RPS 50.0th 32013.71(194.52) 32068.57(50.35) +0.17%
Average RPS 31762.03(238.18) 31884.81(300.85) +0.39%
schbench thread = 239
Metric Base (mean±std) Compare (mean±std) Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th 13.29(0.49) 14.43(0.53) -8.58%
Request Latencies 99.0th 8174.86(8.55) 8244.57(12.09) -0.85%
RPS 50.0th 30624.00(0.00) 30614.86(24.19) -0.03%
Average RPS 30695.86(11.03) 30673.35(17.31) -0.07%
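For readers checking the schbench "Change" column: the sign appears to
be normalized so that a positive value is always an improvement, i.e.
lower is better for latencies and higher is better for RPS. A small
sketch of that convention follows; change_pct() is a hypothetical
helper, not part of any benchmark tool. It reproduces the values in
the tables above from the printed means.

```python
def change_pct(base, compare, lower_is_better):
    """Percent change relative to base, signed so that positive = better."""
    delta = (base - compare) if lower_is_better else (compare - base)
    return 100.0 * delta / base


# thread = 1, wakeup latency 99.0th: lower is better.
print(round(change_pct(10.71, 9.86, lower_is_better=True), 2))

# thread = 1, RPS 50.0th: higher is better.
print(round(change_pct(267.29, 266.86, lower_is_better=False), 2))

# thread = 239, wakeup latency 99.0th: a regression shows up as negative.
print(round(change_pct(13.29, 14.43, lower_is_better=True), 2))
```

Because the tables print rounded means, recomputing from them may
differ from the reported change in the last digit for some rows.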
chacha20:
baseline:
Host time spent: 66,320ms
sched_cache:
Host time spent: 53,859ms
Time reduced by 18%, throughput increased by 23%
Genoa:
chacha20
baseline:
Host time spent: 51,848ms
sched_cache:
Host time spent: 28,439ms
Time reduced by 45%, throughput increased by 82%
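The relation between "time reduced" and "throughput increased" in the
chacha20 results can be worked out as follows, assuming the amount of
work is the same in both runs so that throughput is proportional to
1/time. The helper names are hypothetical.

```python
def time_reduction_pct(t_base_ms, t_new_ms):
    """Percentage of run time saved relative to the baseline."""
    return 100.0 * (t_base_ms - t_new_ms) / t_base_ms


def throughput_gain_pct(t_base_ms, t_new_ms):
    """Throughput gain for fixed work: throughput scales as 1/time."""
    return 100.0 * (t_base_ms / t_new_ms - 1.0)


# Host times from the results above, in milliseconds.
results = {
    "Sapphire Rapids": (66320, 53859),
    "Genoa": (51848, 28439),
}

for platform, (base, new) in results.items():
    print(platform,
          round(time_reduction_pct(base, new), 2),
          round(throughput_gain_pct(base, new), 2))
```

The percentages quoted above match the integer part of these values.
Note that the two numbers are not interchangeable: a 45% cut in run
time corresponds to an 82% gain in throughput, since 1 / (1 - 0.45)
is about 1.82.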
[1] https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
[2] https://www.phoronix.com/review/cache-aware-scheduling-amd-turin
Chen Yu (10):
sched/cache: Record per-LLC utilization to guide cache-aware
scheduling decisions
sched/cache: Introduce helper functions to enforce LLC migration
policy
sched/cache: Introduce sched_cache_present to enable cache aware
scheduling for multi LLCs NUMA node
sched/cache: Record the number of active threads per process for
cache-aware scheduling
sched/cache: Disable cache aware scheduling for processes with high
thread counts
sched/cache: Avoid cache-aware scheduling for memory-heavy processes
sched/cache: Add user control to adjust the parameters of cache-aware
scheduling
-- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware
load balancing
-- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load
balance statistics
-- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy
for each process via proc fs
Peter Zijlstra (Intel) (1):
sched/cache: Introduce infrastructure for cache-aware load balancing
Tim Chen (12):
sched/cache: Make LLC id continuous
sched/cache: Assign preferred LLC ID to processes
sched/cache: Track LLC-preferred tasks per runqueue
sched/cache: Introduce per runqueue task LLC preference counter
sched/cache: Calculate the per runqueue task LLC preference
sched/cache: Count tasks prefering destination LLC in a sched group
sched/cache: Check local_group only once in update_sg_lb_stats()
sched/cache: Prioritize tasks preferring destination LLC during
balancing
sched/cache: Add migrate_llc_task migration type for cache-aware
balancing
sched/cache: Handle moving single tasks to/from their preferred LLC
sched/cache: Consider LLC preference when selecting tasks for load
balancing
sched/cache: Respect LLC preference in task migration and detach
fs/proc/base.c | 22 +
include/linux/cacheinfo.h | 21 +-
include/linux/mm_types.h | 60 ++
include/linux/sched.h | 19 +
include/linux/sched/topology.h | 5 +
include/trace/events/sched.h | 31 +
init/Kconfig | 11 +
init/init_task.c | 4 +
kernel/fork.c | 6 +
kernel/sched/core.c | 12 +
kernel/sched/debug.c | 62 ++
kernel/sched/fair.c | 1034 +++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 39 ++
kernel/sched/stats.c | 5 +-
kernel/sched/topology.c | 239 +++++++-
15 files changed, 1543 insertions(+), 27 deletions(-)
--
2.32.0