Message-Id: <cover.1760206683.git.tim.c.chen@linux.intel.com>
Date: Sat, 11 Oct 2025 11:24:37 -0700
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	"Gautham R . Shenoy" <gautham.shenoy@....com>
Cc: Tim Chen <tim.c.chen@...ux.intel.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>,
	Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>,
	Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
	Hillf Danton <hdanton@...a.com>,
	Shrikanth Hegde <sshegde@...ux.ibm.com>,
	Jianyong Wu <jianyong.wu@...look.com>,
	Yangyu Chen <cyy@...self.name>,
	Tingyin Duan <tingyin.duan@...il.com>,
	Vern Hao <vernhao@...cent.com>,
	Len Brown <len.brown@...el.com>,
	Aubrey Li <aubrey.li@...el.com>,
	Zhao Liu <zhao1.liu@...el.com>,
	Chen Yu <yu.chen.surf@...il.com>,
	Chen Yu <yu.c.chen@...el.com>,
	Libo Chen <libo.chen@...cle.com>,
	Adam Li <adamli@...amperecomputing.com>,
	Tim Chen <tim.c.chen@...el.com>,
	linux-kernel@...r.kernel.org
Subject: [PATCH 00/19] Cache Aware Scheduling 

There have been 4 RFC postings of this patch set. We've incorporated
the feedback and comments, and would now like to post this patch set
for consideration for inclusion in mainline. The patches are based on
the original patch proposed by Peter [1].

The goal of the patch series is to aggregate tasks sharing data
into the same LLC cache domain, thereby reducing cache bouncing and
cache misses and improving data access efficiency. In the current
implementation, threads within the same process are treated
as entities that potentially share resources.
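
To make this concrete, here is a minimal userspace sketch of the idea:
threads of one process share a single preferred-LLC hint derived from
where the process accumulates runtime. All names here are hypothetical;
the actual patches attach this state to the process's mm and use their
own helpers.

#include <stdio.h>

#define NR_LLC 4

/* Hypothetical per-process state; the series keeps comparable state
 * per mm so that all threads of a process share one hint. */
struct process_llc_hint {
	int preferred_llc;               /* LLC id threads gravitate toward */
	unsigned long occupancy[NR_LLC]; /* runtime observed per LLC this epoch */
};

/* Pick the LLC where the process accumulated the most runtime. */
static int pick_preferred_llc(const struct process_llc_hint *h)
{
	int best = 0;

	for (int i = 1; i < NR_LLC; i++)
		if (h->occupancy[i] > h->occupancy[best])
			best = i;
	return best;
}

int main(void)
{
	struct process_llc_hint h = {
		.preferred_llc = -1,
		.occupancy = { 120, 900, 80, 40 },
	};

	h.preferred_llc = pick_preferred_llc(&h);
	printf("preferred LLC: %d\n", h.preferred_llc); /* prints 1 */
	return 0;
}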
 
The changes from the v4 RFC patches are minor. Most are commit log
and code clean-ups per feedback. Several bugs were fixed:
1. A memory leak: the cache aware scheduling structure was not freed when struct mm was freed.
2. A false sharing regression involving nr_running_avg.
3. A bug in initializing cache aware scheduling structures on systems with no L3.

Peter suggested enhancing the patch set to allow task aggregation into
secondary LLCs when the preferred LLC becomes overloaded. We have not
implemented that in this version. In our previous testing, maintaining
stable LLC preferences proved important to avoid excessive task
migrations, which can undermine cache locality benefits. Additionally,
migrating tasks between primary and secondary LLCs often caused cache
bouncing, making the locality gains from using a secondary LLC marginal.
We would have to take a closer look to see whether such a scheme can
be done without those problems.

The following tunables under /sys/kernel/debug/sched/ control
the behavior of cache aware scheduling:

1. llc_aggr_tolerance
Controls how aggressively we aggregate tasks into their preferred LLC,
based on a process's RSS size and number of running threads. Processes
with a smaller memory footprint and fewer tasks benefit more from
aggregation. Varies between 0 and 100:
	0:   Cache aware scheduling is disabled.
	1:   Processes with RSS greater than the LLC size, or with more
	     running threads than the number of CPU cores per LLC, skip
	     aggregation.
	100: Aggressive; a process's threads are aggregated regardless
	     of RSS or running threads.
For example, with a 32MB L3 cache and 8 cores per L3:
	llc_aggr_tolerance=1  -> processes with RSS > 32MB, or
	                         nr_running_avg > 8, are skipped.
	llc_aggr_tolerance=99 -> processes with RSS > 784GB, or
	                         nr_running_avg > 785, are skipped.
	784GB = (1 + (99 - 1) * 256) * 32MB
	785   = 1 + (99 - 1) * 8

Currently this knob is a global control. Since different workloads have
different requirements for task consolidation, it would be ideal to
introduce per-process control for this knob via prctl in the future.
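
As a cross-check of the arithmetic above, the following standalone C
snippet reproduces the quoted 784GB/785 thresholds for
llc_aggr_tolerance=99 (names are illustrative, not taken from the
series):

#include <stdio.h>

int main(void)
{
	unsigned long long llc_mb = 32;	/* 32MB L3, as in the example */
	unsigned int cores_per_llc = 8;
	int tolerance = 99;

	/* RSS limit: (1 + (tolerance - 1) * 256) * LLC size */
	unsigned long long rss_mb = (1ULL + (tolerance - 1) * 256ULL) * llc_mb;
	/* Thread limit: 1 + (tolerance - 1) * cores per LLC */
	unsigned int threads = 1 + (tolerance - 1) * cores_per_llc;

	printf("RSS limit: %lluGB, thread limit: %u\n",
	       rss_mb / 1024, threads);	/* prints 784GB, 785 */
	return 0;
}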
 
2. llc_overload_pct, llc_imb_pct
We always try to move a task to its preferred LLC if that LLC's average
core utilization is below llc_overload_pct (default: 50%). Otherwise,
the task is moved only if the resulting utilization imbalance stays
within llc_imb_pct (default: 20%). This prevents overloading the
preferred LLC.
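
The exact comparison is left implicit above; below is a minimal sketch
of one plausible reading, where llc_imb_pct bounds the utilization gap
between the preferred LLC and the task's current LLC. All names are
hypothetical; the real check lives in kernel/sched/fair.c and may
differ.

#include <stdbool.h>
#include <stdio.h>

/* Defaults quoted in this cover letter. */
#define LLC_OVERLOAD_PCT 50
#define LLC_IMB_PCT      20

static bool can_move_to_preferred_llc(unsigned int pref_util_pct,
				      unsigned int src_util_pct)
{
	/* Preferred LLC is lightly loaded: always allow the move. */
	if (pref_util_pct < LLC_OVERLOAD_PCT)
		return true;
	/* Otherwise tolerate only a bounded imbalance toward it
	 * (an assumed reading of llc_imb_pct). */
	return pref_util_pct <= src_util_pct + LLC_IMB_PCT;
}

int main(void)
{
	printf("%d\n", can_move_to_preferred_llc(40, 70)); /* 1: below overload */
	printf("%d\n", can_move_to_preferred_llc(65, 50)); /* 1: within margin  */
	printf("%d\n", can_move_to_preferred_llc(80, 50)); /* 0: too imbalanced */
	return 0;
}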
 
3. llc_epoch_period
Controls how often the scheduler collects the LLC occupancy of a process
(default: 10 msec).
 
4. llc_epoch_affinity_timeout
If a process has not run for llc_epoch_affinity_timeout (default: 50 msec),
it loses its cache preference.
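
A minimal sketch of the epoch/timeout bookkeeping implied by tunables
3 and 4, with hypothetical names (the kernel keeps such state per mm
and samples it from scheduler context):

#include <stdio.h>

#define LLC_EPOCH_PERIOD_MS           10 /* occupancy sampling period */
#define LLC_EPOCH_AFFINITY_TIMEOUT_MS 50 /* idle time before the hint expires */

struct llc_pref {
	int preferred_llc;         /* -1 means no preference */
	unsigned long last_ran_ms; /* last time any thread of the process ran */
};

/* Called periodically: drop a stale preference so old placement
 * decisions do not outlive the process's activity. */
static void maybe_expire_pref(struct llc_pref *p, unsigned long now_ms)
{
	if (now_ms - p->last_ran_ms > LLC_EPOCH_AFFINITY_TIMEOUT_MS)
		p->preferred_llc = -1;
}

int main(void)
{
	struct llc_pref p = { .preferred_llc = 2, .last_ran_ms = 1000 };

	maybe_expire_pref(&p, 1100); /* idle 100ms > 50ms timeout */
	printf("preferred LLC: %d\n", p.preferred_llc); /* -1: preference lost */
	return 0;
}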

Test results:
The first test platform is a 2-socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.

The second test platform is an AMD Milan. There are 2 nodes and 64 CPUs
per node. Each node has 8 CCXs and each CCX has 8 CPUs.

The third test platform is an AMD Genoa. There are 4 nodes and 32 CPUs per node.
Each node has 2 CCXs and each CCX has 16 CPUs.

[TL;DR]
Sapphire Rapids:
hackbench shows significant improvement when there is 1 group
with different numbers of fd pairs (threads) within this process.
schbench shows overall wakeup latency improvement.
ChaCha20-xiangshan shows ~10% throughput improvement. Other
micro-workloads did not show much difference.

Milan:
No obvious difference is observed so far.

Genoa:
ChaCha20-xiangshan shows 44% throughput improvement.

[Sapphire Rapids details]

[hackbench]
hackbench shows overall improvement when there is only 1
group, with different numbers of fd pairs. This is the
expected behavior because this test scenario benefits
most from cache aware load balancing. Other group counts
show little difference (using the default fd = 20).

       groups              baseline            sched_cache
Min       1      37.5960 (   0.00%)     26.4340 (  29.69%)
Min       3      38.7050 (   0.00%)     38.6920 (   0.03%)
Min       5      39.4550 (   0.00%)     38.6280 (   2.10%)
Min       7      51.4270 (   0.00%)     50.6790 (   1.45%)
Min       12     62.8540 (   0.00%)     63.6590 (  -1.28%)
Min       16     74.0160 (   0.00%)     74.7480 (  -0.99%)
Amean     1      38.4768 (   0.00%)     26.7146 *  30.57%*
Amean     3      39.0750 (   0.00%)     39.5586 (  -1.24%)
Amean     5      41.5178 (   0.00%)     41.2766 (   0.58%)
Amean     7      52.1164 (   0.00%)     51.5152 (   1.15%)
Amean     12     63.9052 (   0.00%)     64.0420 (  -0.21%)
Amean     16     74.5812 (   0.00%)     75.4318 (  -1.14%)
BAmean-99 1      38.2027 (   0.00%)     26.5500 (  30.50%)
BAmean-99 3      38.8725 (   0.00%)     39.2225 (  -0.90%)
BAmean-99 5      41.1898 (   0.00%)     41.0037 (   0.45%)
BAmean-99 7      51.8645 (   0.00%)     51.4453 (   0.81%)
BAmean-99 12     63.6317 (   0.00%)     63.9307 (  -0.47%)
BAmean-99 16     74.4528 (   0.00%)     75.2113 (  -1.02%)

[schbench]
Wakeup Latencies 99.0th improvement is observed.

threads          baseline             sched_cache          change
1                13.80(1.10)          14.80(2.86)          -7.25%
2                12.00(1.00)          8.00(2.12)           +33.33%
4                9.00(0.00)           5.60(0.89)           +37.78%
8                9.00(0.00)           6.40(1.14)           +28.89%
16               9.20(0.45)           6.20(0.84)           +32.61%
32               9.60(0.55)           7.00(0.71)           +27.08%
64               10.80(0.45)          8.40(0.55)           +22.22%
128              12.60(0.55)          11.40(0.55)          +9.52%
239              14.00(0.00)          14.20(0.45)          -1.43%

[stream]
Not much difference is observed.
                             baseline            sched_cache
GB/sec copy-2        35.00 (   0.00%)       34.79 (  -0.60%)
GB/sec scale-2       24.04 (   0.00%)       23.90 (  -0.58%)
GB/sec add-2         28.98 (   0.00%)       28.92 (  -0.22%)
GB/sec triad-2       28.32 (   0.00%)       28.31 (  -0.04%)

[netperf]
Not much difference is observed (considering the stdev).

         nr_pairs         baseline             sched_cache

Hmean     60      1023.44 (   0.00%)     1021.87 (  -0.15%)
BHmean-99 60      1023.78 (   0.00%)     1022.22 (  -0.15%)
Hmean     120      792.09 (   0.00%)      793.75 (   0.21%)
BHmean-99 120      792.36 (   0.00%)      794.04 (   0.21%)
Hmean     180      513.42 (   0.00%)      513.53 (   0.02%)
BHmean-99 180      513.81 (   0.00%)      513.80 (  -0.00%)
Hmean     240      387.09 (   0.00%)      387.33 (   0.06%)
BHmean-99 240      387.18 (   0.00%)      387.45 (   0.07%)
Hmean     300      316.04 (   0.00%)      315.68 (  -0.12%)
BHmean-99 300      316.12 (   0.00%)      315.77 (  -0.11%)
Hmean     360      496.38 (   0.00%)      455.49 (  -8.24%)
BHmean-99 360      499.88 (   0.00%)      458.17 (  -8.34%)
Hmean     420      497.32 (   0.00%)      501.84 (   0.91%)
BHmean-99 420      499.90 (   0.00%)      504.56 (   0.93%)
Hmean     480      417.62 (   0.00%)      432.25 (   3.50%)
BHmean-99 480      419.96 (   0.00%)      434.43 (   3.45%)

In the above case of 360 pairs, although there is a performance
drop of 8.24%, the corresponding:
HCoeffVar   360    23.78 (   0.00%)       29.52 ( -24.15%)
shows that the regression is within the run-to-run variance.

[Milan details]

default settings:
[hackbench]

Min       1      50.8170 (   0.00%)     51.1890 (  -0.73%)
Min       3      59.3610 (   0.00%)     58.6080 (   1.27%)
Min       5      94.9760 (   0.00%)     96.0210 (  -1.10%)
Min       7     123.3270 (   0.00%)    124.1680 (  -0.68%)
Min       12    179.2000 (   0.00%)    181.8390 (  -1.47%)
Min       16    238.8680 (   0.00%)    242.6390 (  -1.58%)
Amean     1      51.6614 (   0.00%)     51.3630 (   0.58%)
Amean     3      60.1886 (   0.00%)     59.4542 (   1.22%)
Amean     5      95.7602 (   0.00%)     96.8338 (  -1.12%)
Amean     7     124.0332 (   0.00%)    124.4406 (  -0.33%)
Amean     12    181.0324 (   0.00%)    182.9220 (  -1.04%)
Amean     16    239.5556 (   0.00%)    243.3556 *  -1.59%*
BAmean-99 1      51.5335 (   0.00%)     51.3338 (   0.39%)
BAmean-99 3      59.7848 (   0.00%)     59.0958 (   1.15%)
BAmean-99 5      95.6698 (   0.00%)     96.5450 (  -0.91%)
BAmean-99 7     123.8478 (   0.00%)    124.3760 (  -0.43%)
BAmean-99 12    180.8035 (   0.00%)    182.5135 (  -0.95%)
BAmean-99 16    239.1933 (   0.00%)    243.0570 (  -1.62%)

[schbench]

threads          baseline             sched_cache          change
1                12.00(2.00)          11.00(0.71)          +8.33%
2                12.40(0.89)          13.80(0.84)          -11.29%
4                14.20(0.45)          14.80(0.45)          -4.23%
8                16.00(0.00)          15.80(0.45)          +1.25%
16               16.00(0.00)          16.00(0.71)          0.00%
32               19.40(0.55)          18.60(0.55)          +4.12%
63               22.20(0.45)          23.20(0.45)          -4.50%

[stream]
No obvious difference is found.
export STREAM_SIZE=$((128000000))

                     baseline               sched_cache
GB/sec copy-16       726.48 (   0.00%)      715.60 (  -1.50%)
GB/sec scale-16      577.71 (   0.00%)      577.03 (  -0.12%)
GB/sec add-16        678.85 (   0.00%)      672.87 (  -0.88%)
GB/sec triad-16      735.52 (   0.00%)      729.05 (  -0.88%)


[netperf]
Not much difference is observed.

         nr_pairs          baseline           sched_cache
Hmean     32       755.98 (   0.00%)      755.17 (  -0.11%)
BHmean-99 32       756.42 (   0.00%)      755.40 (  -0.13%)
Hmean     64       677.38 (   0.00%)      669.75 (  -1.13%)
BHmean-99 64       677.50 (   0.00%)      669.86 (  -1.13%)
Hmean     96       498.52 (   0.00%)      496.73 (  -0.36%)
BHmean-99 96       498.69 (   0.00%)      496.93 (  -0.35%)
Hmean     128      604.38 (   0.00%)      604.22 (  -0.03%)
BHmean-99 128      604.87 (   0.00%)      604.87 (   0.00%)
Hmean     160      471.67 (   0.00%)      468.29 (  -0.72%)
BHmean-99 160      474.34 (   0.00%)      471.05 (  -0.69%)
Hmean     192      381.18 (   0.00%)      384.88 (   0.97%)
BHmean-99 192      383.30 (   0.00%)      386.82 (   0.92%)
Hmean     224      327.79 (   0.00%)      326.05 (  -0.53%)
BHmean-99 224      329.85 (   0.00%)      327.87 (  -0.60%)
Hmean     256      284.61 (   0.00%)      300.52 (   5.59%)
BHmean-99 256      286.41 (   0.00%)      302.06 (   5.47%)

[Genoa details]
[ChaCha20-xiangshan]
ChaCha20-xiangshan is a simple benchmark using a static build of an
8-thread Verilator model of XiangShan (RISC-V). The README file can be
found here [2]. The score depends on how aggressively the user sets
/sys/kernel/debug/sched/llc_aggr_tolerance. Using the default values,
not much difference is observed, while setting
/sys/kernel/debug/sched/llc_aggr_tolerance to 100, a 44% improvement is
observed.

baseline:
Host time spent: 50,868ms

sched_cache:
Host time spent: 28,349ms

The time has been reduced by 44%.

Thanks to everyone who participated and provided valuable suggestions
on the previous versions. Your comments and tests on this latest
version are also greatly appreciated.

Tim

[1] https://lore.kernel.org/lkml/20250325120952.GJ36322@noisy.programming.kicks-ass.net/

[2] https://github.com/yu-chen-surf/chacha20-xiangshan/blob/master/README.eng.md

RFC v4:
[3] https://lore.kernel.org/all/cover.1754712565.git.tim.c.chen@linux.intel.com/

RFC v3:
[4] https://lore.kernel.org/all/cover.1750268218.git.tim.c.chen@linux.intel.com/

RFC v2:
[5] https://lore.kernel.org/lkml/cover.1745199017.git.yu.c.chen@intel.com/


Chen Yu (7):
  sched/fair: Record per-LLC utilization to guide cache-aware scheduling
    decisions
  sched/fair: Introduce helper functions to enforce LLC migration policy
  sched/fair: Introduce a static key to enable cache aware only for
    multi LLCs
  sched/fair: Exclude processes with many threads from cache-aware
    scheduling
  sched/fair: Disable cache aware scheduling for processes with high
    thread counts
  sched/fair: Avoid cache-aware scheduling for memory-heavy processes
  sched/fair: Add user control to adjust the tolerance of cache-aware
    scheduling

Peter Zijlstra (Intel) (1):
  sched/fair: Add infrastructure for cache-aware load balancing

Tim Chen (11):
  sched/fair: Add LLC index mapping for CPUs
  sched/fair: Assign preferred LLC ID to processes
  sched/fair: Track LLC-preferred tasks per runqueue
  sched/fair: Introduce per runqueue task LLC preference counter
  sched/fair: Count tasks prefering each LLC in a sched group
  sched/fair: Prioritize tasks preferring destination LLC during
    balancing
  sched/fair: Identify busiest sched_group for LLC-aware load balancing
  sched/fair: Add migrate_llc_task migration type for cache-aware
    balancing
  sched/fair: Handle moving single tasks to/from their preferred LLC
  sched/fair: Consider LLC preference when selecting tasks for load
    balancing
  sched/fair: Respect LLC preference in task migration and detach

 include/linux/cacheinfo.h      |   21 +-
 include/linux/mm_types.h       |   45 ++
 include/linux/sched.h          |    5 +
 include/linux/sched/topology.h |    4 +
 include/linux/threads.h        |   10 +
 init/Kconfig                   |   20 +
 init/init_task.c               |    3 +
 kernel/fork.c                  |    6 +
 kernel/sched/core.c            |   18 +
 kernel/sched/debug.c           |   56 ++
 kernel/sched/fair.c            | 1022 +++++++++++++++++++++++++++++++-
 kernel/sched/features.h        |    1 +
 kernel/sched/sched.h           |   27 +
 kernel/sched/topology.c        |   61 +-
 14 files changed, 1283 insertions(+), 16 deletions(-)

-- 
2.32.0

