linux-kernel - Re: [PATCH 00/19] Cache Aware Scheduling

Message-ID: <5b82842ff20995cd50b422dad844664089dcd0c7.camel@linux.intel.com>
Date: Tue, 14 Oct 2025 14:48:09 -0700
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, K
 Prateek Nayak <kprateek.nayak@....com>, "Gautham R . Shenoy"
 <gautham.shenoy@....com>, Vincent Guittot	 <vincent.guittot@...aro.org>,
 Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
 <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
 Segall	 <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin
 Schneider	 <vschneid@...hat.com>, Hillf Danton <hdanton@...a.com>,
 Shrikanth Hegde	 <sshegde@...ux.ibm.com>, Jianyong Wu
 <jianyong.wu@...look.com>, Yangyu Chen	 <cyy@...self.name>, Tingyin Duan
 <tingyin.duan@...il.com>, Vern Hao	 <vernhao@...cent.com>, Len Brown
 <len.brown@...el.com>, Aubrey Li	 <aubrey.li@...el.com>, Zhao Liu
 <zhao1.liu@...el.com>, Chen Yu	 <yu.chen.surf@...il.com>, Chen Yu
 <yu.c.chen@...el.com>, Libo Chen	 <libo.chen@...cle.com>, Adam Li
 <adamli@...amperecomputing.com>, Tim Chen	 <tim.c.chen@...el.com>,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH 00/19] Cache Aware Scheduling

On Tue, 2025-10-14 at 17:43 +0530, Madadi Vineeth Reddy wrote:
> Hi Tim,
> Thanks for the patch.
> 
> On 11/10/25 23:54, Tim Chen wrote:
> > There have been 4 RFC postings of this patch set. We've incorporated
> > the feedback and comments and would now like to post this patch set
> > for consideration for inclusion in mainline. The patches are based on
> > the original patch proposed by Peter[1].
> > 
> 
> [snip]
> 
> > The following tunables under /sys/kernel/debug/sched/ control
> > the behavior of cache aware scheduling:
> > 
> > 1. llc_aggr_tolerance
> > Controls how aggressively we aggregate tasks to their preferred LLC, based
> > on a process's RSS size and number of running threads. Processes that have
> > a smaller memory footprint and fewer tasks benefit more from aggregation.
> > Varies between 0 and 100:
> >         0:  Cache aware scheduling is disabled.
> >         1:  Processes with RSS greater than the LLC size, or with more
> >             running threads than the number of CPU cores per LLC, skip
> >             aggregation.
> >       100:  Aggressive; a process's threads are aggregated regardless of
> >             RSS or running threads.
> > For example, with a 32MB L3 cache and 8 cores per L3:
> >     llc_aggr_tolerance=1  -> processes with RSS > 32MB, or
> >                              nr_running_avg > 8, are skipped.
> >     llc_aggr_tolerance=99 -> processes with RSS > 784GB, or
> >                              nr_running_avg > 785, are skipped.
> >     784GB = (1 + (99 - 1) * 256) * 32MB.
> >     785   = (1 + (99 - 1) * 8).
> > 
> > Currently this knob is a global control. Considering that different workloads have
> > different requirements for task consolidation, it would be ideal to introduce
> > per process control for this knob via prctl in the future.
> >  
> > 2. llc_overload_pct, llc_imb_pct
> > We'll always try to move a task to its preferred LLC if that LLC's average core
> > utilization is below llc_overload_pct (default 50%). Otherwise, the utilization
> > of the preferred LLC has to be not more than llc_imb_pct (default 20%) to move a task
> > to it. This prevents overloading the preferred LLC.
> >  
> > 3. llc_epoch_period
> > Controls how often the scheduler collects the LLC occupancy of a process (default 10 msec)
> >  
> > 4. llc_epoch_affinity_timeout
> > If a process has not run for llc_epoch_affinity_timeout (default 50 msec),
> > it loses its cache preference.
> 
> How are these default values arrived at? Is it based on some theory or
> based on the results of the runs?

Right now the default value of llc_aggr_tolerance is fairly conservative.
We make sure that we don't cause regressions to workloads we tested.
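
To make the scaling concrete, here is a minimal Python sketch of the RSS
cutoff arithmetic from the example in the cover letter. It is only an
illustration, not the kernel code: rss_cutoff_bytes() is a hypothetical
helper name, and modelling 0 as a zero cutoff and 100 as an infinite cutoff
is an assumption made for the sketch.

```python
import math

MB = 1 << 20
GB = 1 << 30

def rss_cutoff_bytes(tolerance, llc_bytes):
    """Hypothetical helper: RSS above which a process skips aggregation."""
    if not 0 <= tolerance <= 100:
        raise ValueError("llc_aggr_tolerance ranges from 0 to 100")
    if tolerance == 0:
        return 0          # assumption: disabled is modelled as a zero cutoff
    if tolerance == 100:
        return math.inf   # assumption: aggregate regardless of RSS
    # Scale factor from the cover letter example: 1 + (tolerance - 1) * 256
    return (1 + (tolerance - 1) * 256) * llc_bytes

print(rss_cutoff_bytes(1, 32 * MB) // MB)    # 32
print(rss_cutoff_bytes(99, 32 * MB) // GB)   # 784
```

With a 32MB LLC this reproduces the 32MB and 784GB figures quoted above.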

The defaults for llc_overload_pct and llc_imb_pct were chosen from
experiments we did with Len's Yogini micro-benchmark; these values gave
good aggregation without overloading the LLC.
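
A minimal Python sketch of how the two knobs could combine is below. It is
one reading of the description in the cover letter, not the patch itself:
may_move_to_preferred() is a hypothetical name, utilization is expressed as
a percentage of LLC capacity, and treating llc_imb_pct as the margin by
which the preferred LLC may exceed the source LLC is an assumption.

```python
def may_move_to_preferred(pref_util_pct, src_util_pct,
                          overload_pct=50, imb_pct=20):
    """Hypothetical check: may a task move to its preferred LLC?"""
    # Below the overload threshold the move is always attempted.
    if pref_util_pct < overload_pct:
        return True
    # Assumption: otherwise the preferred LLC may be at most imb_pct
    # percent busier than the LLC the task would leave.
    return pref_util_pct * 100 <= src_util_pct * (100 + imb_pct)

print(may_move_to_preferred(40, 90))   # True: preferred LLC is lightly used
print(may_move_to_preferred(60, 55))   # True: 60 is within 20% of 55
print(may_move_to_preferred(60, 40))   # False: would overload the preferred LLC
```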

llc_epoch_period and llc_epoch_affinity_timeout come from Peter's
original patch; they seem to work fairly well, so we left them as is.
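
The relationship between the two defaults can be sketched as follows. This
is an illustrative Python model only: current_preference() and
epochs_elapsed() are hypothetical names, and using a strict "greater than"
comparison for the timeout is an assumption.

```python
NO_PREFERENCE = -1   # same "no preference" value as a preferred LLC of -1

def epochs_elapsed(idle_ms, period_ms=10):
    """Number of whole llc_epoch_period intervals in an idle stretch."""
    return idle_ms // period_ms

def current_preference(pref_llc, last_ran_ms, now_ms, timeout_ms=50):
    """Drop the cache preference once the process has been idle too long."""
    if now_ms - last_ran_ms > timeout_ms:   # assumption: strict comparison
        return NO_PREFERENCE
    return pref_llc

print(epochs_elapsed(50))               # 5
print(current_preference(3, 0, 40))     # 3
print(current_preference(3, 0, 60))     # -1
```

With the defaults, a process keeps its preference for about five epochs
after it last ran.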

> 
> > 
> > Test results:
> > The first test platform is a 2 socket Intel Sapphire Rapids with 30
> > cores per socket. DRAM interleaving is enabled in the BIOS so it
> > essentially has one NUMA node with two last level caches. There are 60
> > CPUs associated with each last level cache.
> > 
> > The second test platform is an AMD Milan. There are 2 Nodes and 64 CPUs
> > per node. Each node has 8 CCXs and each CCX has 8 CPUs.
> > 
> > The third test platform is an AMD Genoa. There are 4 Nodes and 32 CPUs per node.
> > Each node has 2 CCXs and each CCX has 16 CPUs.
> > 
> > [TL;DR]
> > Sapphire Rapids:
> > hackbench shows significant improvement when there is 1 group
> > with different number of fd pairs(threads) within this process.
> > schbench shows overall wakeup latency improvement.
> > ChaCha20-xiangshan shows ~10% throughput improvement. Other
> > micro-workloads did not show much difference.
> > 
> > Milan:
> > No obvious difference is observed so far.
> > 
> > Genoa:
> > ChaCha20-xiangshan shows 44% throughput improvement.
> > 
> > [Sapphire Rapids details]
> > 
> > [hackbench]
> > Hackbench shows overall improvement when there is only 1
> > group, with different numbers of fd (pairs). This is the
> > expected behavior because this test scenario benefits
> > most from cache aware load balance. Other group counts
> > show little difference (using the default fd = 20).
> > 
> >        groups              baseline            sched_cache
> > Min       1      37.5960 (   0.00%)     26.4340 (  29.69%)
> > Min       3      38.7050 (   0.00%)     38.6920 (   0.03%)
> > Min       5      39.4550 (   0.00%)     38.6280 (   2.10%)
> > Min       7      51.4270 (   0.00%)     50.6790 (   1.45%)
> > Min       12     62.8540 (   0.00%)     63.6590 (  -1.28%)
> > Min       16     74.0160 (   0.00%)     74.7480 (  -0.99%)
> > Amean     1      38.4768 (   0.00%)     26.7146 *  30.57%*
> > Amean     3      39.0750 (   0.00%)     39.5586 (  -1.24%)
> > Amean     5      41.5178 (   0.00%)     41.2766 (   0.58%)
> > Amean     7      52.1164 (   0.00%)     51.5152 (   1.15%)
> > Amean     12     63.9052 (   0.00%)     64.0420 (  -0.21%)
> > Amean     16     74.5812 (   0.00%)     75.4318 (  -1.14%)
> > BAmean-99 1      38.2027 (   0.00%)     26.5500 (  30.50%)
> > BAmean-99 3      38.8725 (   0.00%)     39.2225 (  -0.90%)
> > BAmean-99 5      41.1898 (   0.00%)     41.0037 (   0.45%)
> > BAmean-99 7      51.8645 (   0.00%)     51.4453 (   0.81%)
> > BAmean-99 12     63.6317 (   0.00%)     63.9307 (  -0.47%)
> > BAmean-99 16     74.4528 (   0.00%)     75.2113 (  -1.02%)
> > 
> > [schbench]
> > An improvement in 99.0th percentile wakeup latency is observed.
> > 
> > threads          baseline             sched_cache          change
> > 1                13.80(1.10)          14.80(2.86)          -7.25%
> > 2                12.00(1.00)          8.00(2.12)           +33.33%
> > 4                9.00(0.00)           5.60(0.89)           +37.78%
> > 8                9.00(0.00)           6.40(1.14)           +28.89%
> > 16               9.20(0.45)           6.20(0.84)           +32.61%
> > 32               9.60(0.55)           7.00(0.71)           +27.08%
> > 64               10.80(0.45)          8.40(0.55)           +22.22%
> > 128              12.60(0.55)          11.40(0.55)          +9.52%
> > 239              14.00(0.00)          14.20(0.45)          -1.43%
> > 
> > [stream]
> > Not much difference is observed.
> >                              baseline            sched_cache
> > GB/sec copy-2        35.00 (   0.00%)       34.79 (  -0.60%)
> > GB/sec scale-2       24.04 (   0.00%)       23.90 (  -0.58%)
> > GB/sec add-2         28.98 (   0.00%)       28.92 (  -0.22%)
> > GB/sec triad-2       28.32 (   0.00%)       28.31 (  -0.04%)
> > 
> > [netperf]
> > Not much difference is observed (considering the stdev).
> > 
> >          nr_pairs          baseline           sched_cache
> > 
> > Hmean     60      1023.44 (   0.00%)     1021.87 (  -0.15%)
> > BHmean-99 60      1023.78 (   0.00%)     1022.22 (  -0.15%)
> > Hmean     120      792.09 (   0.00%)      793.75 (   0.21%)
> > BHmean-99 120      792.36 (   0.00%)      794.04 (   0.21%)
> > Hmean     180      513.42 (   0.00%)      513.53 (   0.02%)
> > BHmean-99 180      513.81 (   0.00%)      513.80 (  -0.00%)
> > Hmean     240      387.09 (   0.00%)      387.33 (   0.06%)
> > BHmean-99 240      387.18 (   0.00%)      387.45 (   0.07%)
> > Hmean     300      316.04 (   0.00%)      315.68 (  -0.12%)
> > BHmean-99 300      316.12 (   0.00%)      315.77 (  -0.11%)
> > Hmean     360      496.38 (   0.00%)      455.49 (  -8.24%)
> > BHmean-99 360      499.88 (   0.00%)      458.17 (  -8.34%)
> > Hmean     420      497.32 (   0.00%)      501.84 (   0.91%)
> > BHmean-99 420      499.90 (   0.00%)      504.56 (   0.93%)
> > Hmean     480      417.62 (   0.00%)      432.25 (   3.50%)
> > BHmean-99 480      419.96 (   0.00%)      434.43 (   3.45%)
> > 
> > In the above case of 360 pairs, although there is a performance
> > drop of 8.24%, the corresponding:
> > HCoeffVar   360    23.78 (   0.00%)       29.52 ( -24.15%)
> > shows that the regression is within the run-to-run variance.
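
The comparison being made for the 360-pair case can be sketched in Python.
This is a crude heuristic for illustration, not a statistical test:
relative_change_pct() and within_noise() are hypothetical names, and
comparing the change against the smaller coefficient of variation is an
assumption about how "within run-to-run variance" is judged.

```python
def relative_change_pct(baseline, patched):
    """Percentage change of patched relative to baseline."""
    return (patched - baseline) / baseline * 100.0

def within_noise(change_pct, coeff_var_pcts):
    """Assumed heuristic: the change is smaller than the run-to-run spread."""
    return abs(change_pct) < min(coeff_var_pcts)

change = relative_change_pct(496.38, 455.49)   # Hmean, 360 pairs
print(round(change, 2))                        # -8.24
print(within_noise(change, (23.78, 29.52)))    # True
```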
> > 
> > [Milan details]
> > 
> > default settings:
> > [hackbench]
> > 
> > Min       1      50.8170 (   0.00%)     51.1890 (  -0.73%)
> > Min       3      59.3610 (   0.00%)     58.6080 (   1.27%)
> > Min       5      94.9760 (   0.00%)     96.0210 (  -1.10%)
> > Min       7     123.3270 (   0.00%)    124.1680 (  -0.68%)
> > Min       12    179.2000 (   0.00%)    181.8390 (  -1.47%)
> > Min       16    238.8680 (   0.00%)    242.6390 (  -1.58%)
> > Amean     1      51.6614 (   0.00%)     51.3630 (   0.58%)
> > Amean     3      60.1886 (   0.00%)     59.4542 (   1.22%)
> > Amean     5      95.7602 (   0.00%)     96.8338 (  -1.12%)
> > Amean     7     124.0332 (   0.00%)    124.4406 (  -0.33%)
> > Amean     12    181.0324 (   0.00%)    182.9220 (  -1.04%)
> > Amean     16    239.5556 (   0.00%)    243.3556 *  -1.59%*
> > BAmean-99 1      51.5335 (   0.00%)     51.3338 (   0.39%)
> > BAmean-99 3      59.7848 (   0.00%)     59.0958 (   1.15%)
> > BAmean-99 5      95.6698 (   0.00%)     96.5450 (  -0.91%)
> > BAmean-99 7     123.8478 (   0.00%)    124.3760 (  -0.43%)
> > BAmean-99 12    180.8035 (   0.00%)    182.5135 (  -0.95%)
> > BAmean-99 16    239.1933 (   0.00%)    243.0570 (  -1.62%)
> > 
> > [schbench]
> > 
> > threads          baseline             sched_cache          change
> > 1                12.00(2.00)          11.00(0.71)          +8.33%
> > 2                12.40(0.89)          13.80(0.84)          -11.29%
> > 4                14.20(0.45)          14.80(0.45)          -4.23%
> > 8                16.00(0.00)          15.80(0.45)          +1.25%
> > 16               16.00(0.00)          16.00(0.71)          0.00%
> > 32               19.40(0.55)          18.60(0.55)          +4.12%
> > 63               22.20(0.45)          23.20(0.45)          -4.50%
> > 
> > [stream]
> > No obvious difference is found.
> > export STREAM_SIZE=$((128000000))
> > 
> >                      baseline               sched_cache
> > GB/sec copy-16       726.48 (   0.00%)      715.60 (  -1.50%)
> > GB/sec scale-16      577.71 (   0.00%)      577.03 (  -0.12%)
> > GB/sec add-16        678.85 (   0.00%)      672.87 (  -0.88%)
> > GB/sec triad-16      735.52 (   0.00%)      729.05 (  -0.88%)
> > 
> > 
> > [netperf]
> > Not much difference is observed.
> > 
> >          nr_pairs          baseline           sched_cache
> > Hmean     32       755.98 (   0.00%)      755.17 (  -0.11%)
> > BHmean-99 32       756.42 (   0.00%)      755.40 (  -0.13%)
> > Hmean     64       677.38 (   0.00%)      669.75 (  -1.13%)
> > BHmean-99 64       677.50 (   0.00%)      669.86 (  -1.13%)
> > Hmean     96       498.52 (   0.00%)      496.73 (  -0.36%)
> > BHmean-99 96       498.69 (   0.00%)      496.93 (  -0.35%)
> > Hmean     128      604.38 (   0.00%)      604.22 (  -0.03%)
> > BHmean-99 128      604.87 (   0.00%)      604.87 (   0.00%)
> > Hmean     160      471.67 (   0.00%)      468.29 (  -0.72%)
> > BHmean-99 160      474.34 (   0.00%)      471.05 (  -0.69%)
> > Hmean     192      381.18 (   0.00%)      384.88 (   0.97%)
> > BHmean-99 192      383.30 (   0.00%)      386.82 (   0.92%)
> > Hmean     224      327.79 (   0.00%)      326.05 (  -0.53%)
> > BHmean-99 224      329.85 (   0.00%)      327.87 (  -0.60%)
> > Hmean     256      284.61 (   0.00%)      300.52 (   5.59%)
> > BHmean-99 256      286.41 (   0.00%)      302.06 (   5.47%)
> > 
> > [Genoa details]
> > [ChaCha20-xiangshan]
> > ChaCha20-xiangshan is a simple benchmark using a static build of an
> > 8-thread Verilator of XiangShan (RISC-V). The README file can be
> > found here[2]. The score depends on how aggressively the user sets
> > /sys/kernel/debug/sched/llc_aggr_tolerance. Using the default values,
> > not much difference is observed. When setting
> > /sys/kernel/debug/sched/llc_aggr_tolerance to 100, a 44% improvement is
> > observed.
> > 
> > baseline:
> > Host time spent: 50,868ms
> > 
> > sched_cache:
> > Host time spent: 28,349ms
> > 
> > The time has been reduced by 44%.
> 
> Milan showed no improvement across all benchmarks, which could be due to the 
> CCX topology (8 CCXs × 8 CPUs) where the LLC domain is too small for this
> optimization to be effective. Moreover there could be overhead due to additional
> computations.
> 
> The ChaCha20-xiangshan improvement on Genoa when llc_aggr_tolerance is set to 100
> seems to be due to the relatively low thread count. Please provide the numbers
> with default values too. I would also like to see numbers for varying loads.

I'll ask Chen Yu who did the Xiangshan experiments if he has those numbers.

> 
> In Power 10 and Power 11, the LLC size is 4 threads which is even smaller. Not
> expecting improvements here but will run some workloads and share the data.
> 
> I have not gone through the entire series yet, but in a situation such as a
> two-node NUMA system where a task's preferred LLC is on the wrong NUMA node
> for its memory, which takes precedence?

We take the preferred NUMA node into consideration, but we do not force a task
to go to the preferred node.

As I recall, we initially limited the consideration to only LLCs in the
preferred node. But we encountered regressions in hackbench and schbench:
when the preferred node did not have any occupancy, the preferred LLC
was set to -1 (no preference), which resulted in extra task migrations.
Also, the preferred node for hackbench and schbench was volatile,
as they have a small memory footprint.  Chen Yu, please chime in if there
were other reasons you remember.

We'll need to revisit this part of the code to take care of such
corner cases. I think ideally we should move tasks to the least loaded LLC
in the preferred node (even if no LLCs have occupancy in the preferred node),
as long as the preferred NUMA node doesn't change too often.
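
A rough Python sketch of that idea follows. It is not code from the series:
pick_llc() and the dictionary fields are hypothetical, and choosing the
LLC with the highest occupancy when some occupancy exists is an assumption.
The check that the preferred node is stable is left out.

```python
NO_PREFERENCE = -1

def pick_llc(llcs, preferred_node):
    """Hypothetical policy: choose a preferred LLC inside the preferred node."""
    candidates = [llc for llc in llcs if llc["node"] == preferred_node]
    if not candidates:
        return NO_PREFERENCE
    occupied = [llc for llc in candidates if llc["occupancy"] > 0]
    if occupied:
        # Assumption: keep following the LLC where the process runs most.
        return max(occupied, key=lambda llc: llc["occupancy"])["id"]
    # No occupancy in the preferred node: fall back to the least loaded
    # LLC there instead of returning "no preference".
    return min(candidates, key=lambda llc: llc["util"])["id"]

llcs = [
    {"id": 0, "node": 0, "occupancy": 7, "util": 30},
    {"id": 1, "node": 1, "occupancy": 0, "util": 60},
    {"id": 2, "node": 1, "occupancy": 0, "util": 20},
]
print(pick_llc(llcs, preferred_node=1))   # 2
print(pick_llc(llcs, preferred_node=0))   # 0
```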


> 
> Also, what about the workloads that don't share data like stress-ng? 
> 

We can test those.  Ideally the controls that prevent over-aggregation to the
preferred LLC would keep stress-ng happy.

> It will
> be good to make sure that most other workloads don't suffer. As mentioned,
> per process knob for llc_aggr_tolerance could help.

Agreed. We are planning to add a per-process knob in the next version.  One thought is to use
prctl. Any other suggestions are welcome.

Tim

> 
> Thanks,
> Madadi Vineeth Reddy
> 
> > 
> > Thanks to everyone who participated and provided valuable suggestions for
> > the previous versions. Your comments and tests on the latest version are
> > also greatly appreciated in advance.
> > 
> > Tim
> > 
> > [1] https://lore.kernel.org/lkml/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> > 
> > [2] https://github.com/yu-chen-surf/chacha20-xiangshan/blob/master/README.eng.md
> > 
> > RFC v4:
> > [3] https://lore.kernel.org/all/cover.1754712565.git.tim.c.chen@linux.intel.com/
> > 
> > RFC v3
> > [4] https://lore.kernel.org/all/cover.1750268218.git.tim.c.chen@linux.intel.com/
> > 
> > RFC v2:
> > [5] https://lore.kernel.org/lkml/cover.1745199017.git.yu.c.chen@intel.com/
> > 
> > 
> > Chen Yu (7):
> >   sched/fair: Record per-LLC utilization to guide cache-aware scheduling
> >     decisions
> >   sched/fair: Introduce helper functions to enforce LLC migration policy
> >   sched/fair: Introduce a static key to enable cache aware only for
> >     multi LLCs
> >   sched/fair: Exclude processes with many threads from cache-aware
> >     scheduling
> >   sched/fair: Disable cache aware scheduling for processes with high
> >     thread counts
> >   sched/fair: Avoid cache-aware scheduling for memory-heavy processes
> >   sched/fair: Add user control to adjust the tolerance of cache-aware
> >     scheduling
> > 
> > Peter Zijlstra (Intel) (1):
> >   sched/fair: Add infrastructure for cache-aware load balancing
> > 
> > Tim Chen (11):
> >   sched/fair: Add LLC index mapping for CPUs
> >   sched/fair: Assign preferred LLC ID to processes
> >   sched/fair: Track LLC-preferred tasks per runqueue
> >   sched/fair: Introduce per runqueue task LLC preference counter
> >   sched/fair: Count tasks prefering each LLC in a sched group
> >   sched/fair: Prioritize tasks preferring destination LLC during
> >     balancing
> >   sched/fair: Identify busiest sched_group for LLC-aware load balancing
> >   sched/fair: Add migrate_llc_task migration type for cache-aware
> >     balancing
> >   sched/fair: Handle moving single tasks to/from their preferred LLC
> >   sched/fair: Consider LLC preference when selecting tasks for load
> >     balancing
> >   sched/fair: Respect LLC preference in task migration and detach
> > 
> >  include/linux/cacheinfo.h      |   21 +-
> >  include/linux/mm_types.h       |   45 ++
> >  include/linux/sched.h          |    5 +
> >  include/linux/sched/topology.h |    4 +
> >  include/linux/threads.h        |   10 +
> >  init/Kconfig                   |   20 +
> >  init/init_task.c               |    3 +
> >  kernel/fork.c                  |    6 +
> >  kernel/sched/core.c            |   18 +
> >  kernel/sched/debug.c           |   56 ++
> >  kernel/sched/fair.c            | 1022 +++++++++++++++++++++++++++++++-
> >  kernel/sched/features.h        |    1 +
> >  kernel/sched/sched.h           |   27 +
> >  kernel/sched/topology.c        |   61 +-
> >  14 files changed, 1283 insertions(+), 16 deletions(-)
> > 
>