[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <a54c33fd-504d-4eb4-9b0e-ad4a3e2156ad@linux.ibm.com>
Date: Tue, 14 Oct 2025 17:43:51 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: Tim Chen <tim.c.chen@...ux.intel.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Hillf Danton <hdanton@...a.com>,
Shrikanth Hegde <sshegde@...ux.ibm.com>,
Jianyong Wu <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>,
Len Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>,
Zhao Liu <zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>,
Chen Yu <yu.c.chen@...el.com>, Libo Chen <libo.chen@...cle.com>,
Adam Li <adamli@...amperecomputing.com>,
Tim Chen <tim.c.chen@...el.com>, linux-kernel@...r.kernel.org,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [PATCH 00/19] Cache Aware Scheduling
Hi Tim,
Thanks for the patch.
On 11/10/25 23:54, Tim Chen wrote:
> There had been 4 RFC postings of this patch set. We've incorporated
> the feedbacks and comments and now would like to post this patch set
> for consideration of inclusion to mainline. The patches are based on
> the original patch proposed by Peter[1].
>
[snip]
> The following tunables control under /sys/kernel/debug/sched/ control
> the behavior of cache aware scheduling:
>
> 1. llc_aggr_tolerance Controls how aggressive we aggregate tasks to
> their preferred LLC, based on a process's RSS size and number of running
> threads. Processes that have smaller memory footprint and fewer number
> of tasks will benefit better from aggregation. Varies between 0 to 100
> 0: Cache aware scheduling is disabled 1: Process with RSS
> greater than LLC size,
> or running threads more than number of cpu cores/LLC skip
> aggregation
> 100: Aggressive; a process's threads are aggregated regardless of
> RSS or running threads.
> For example, with a 32MB L3 cache 8 cores in L3:
> llc_aggr_tolerance=1 -> process with RSS > 32MB, or nr_running_avg >
> 8 are skipped. llc_aggr_tolerance=99 -> process with RSS > 784GB
> or nr_running_avg > 785 are skipped. 784GB = (1 + (99 - 1) * 256)
> * 32MB.
> 785 = (1 + (99 - 1) * 8).
>
> Currently this knob is a global control. Considering that different workloads have
> different requirements for task consolidation, it would be ideal to introduce
> per process control for this knob via prctl in the future.
>
> 2. llc_overload_pct, llc_imb_pct
> We'll always try to move a task to its preferred LLC if an LLC's average core
> utilization is below llc_overload_pct (default to 50%). Otherwise, the utilization
> of preferred LLC has to be not more than llc_imb_pct (default to 20%) to move a task
> to it. This is to prevent overloading on the preferred LLC.
>
> 3. llc_epoch_period
> Controls how often the scheduler collect LLC occupancy of a process (default to 10 msec)
>
> 4. llc_epoch_affinity_timeout
> Detect that if a process has not run for llc_epoch_affinity_timeout (default to 50 msec),
> it loses its cache preference.
How are these default values arrived at? Is it based on some theory or
based on the results of the runs?
>
> Test results:
> The first test platform is a 2 socket Intel Sapphire Rapids with 30
> cores per socket. The DRAM interleaving is enabled in the BIOS so it
> essential has one NUMA node with two last level caches. There are 60
> CPUs associated with each last level cache.
>
> The second test platform is a AMD Milan. There are 2 Nodes and 64 CPUs
> per node. Each node has 8 CCXs and each CCX has 8 CPUs.
>
> The third test platform is a AMD Genoa. There are 4 Nodes and 32 CPUs per node.
> Each node has 2 CCXs and each CCX has 16 CPUs.
>
> [TL;DR]
> Sappire Rapids:
> hackbench shows significant improvement when there is 1 group
> with different number of fd pairs(threads) within this process.
> schbench shows overall wakeup latency improvement.
> ChaCha20-xiangshan shows ~10% throughput improvement. Other
> micro-workloads did not show much difference.
>
> Milan:
> No obvious difference is observed so far.
>
> Genoa:
> ChaCha20-xiangshan shows 44% throughput improvement.
>
> [Sapphire Rapids details]
>
> [hackbench]
> Hackbench show overall improvement when there is only 1
> group, with different number of fd(pairs). This is the
> expected behavior because this test scenario would benefit
> from cache aware load balance most. Other number of groups
> shows not much difference(using default fd = 20).
>
> groups baseline sched_cache
> Min 1 37.5960 ( 0.00%) 26.4340 ( 29.69%)
> Min 3 38.7050 ( 0.00%) 38.6920 ( 0.03%)
> Min 5 39.4550 ( 0.00%) 38.6280 ( 2.10%)
> Min 7 51.4270 ( 0.00%) 50.6790 ( 1.45%)
> Min 12 62.8540 ( 0.00%) 63.6590 ( -1.28%)
> Min 16 74.0160 ( 0.00%) 74.7480 ( -0.99%)
> Amean 1 38.4768 ( 0.00%) 26.7146 * 30.57%*
> Amean 3 39.0750 ( 0.00%) 39.5586 ( -1.24%)
> Amean 5 41.5178 ( 0.00%) 41.2766 ( 0.58%)
> Amean 7 52.1164 ( 0.00%) 51.5152 ( 1.15%)
> Amean 12 63.9052 ( 0.00%) 64.0420 ( -0.21%)
> Amean 16 74.5812 ( 0.00%) 75.4318 ( -1.14%)
> BAmean-99 1 38.2027 ( 0.00%) 26.5500 ( 30.50%)
> BAmean-99 3 38.8725 ( 0.00%) 39.2225 ( -0.90%)
> BAmean-99 5 41.1898 ( 0.00%) 41.0037 ( 0.45%)
> BAmean-99 7 51.8645 ( 0.00%) 51.4453 ( 0.81%)
> BAmean-99 12 63.6317 ( 0.00%) 63.9307 ( -0.47%)
> BAmean-99 16 74.4528 ( 0.00%) 75.2113 ( -1.02%)
>
> [schbench]
> Wakeup Latencies 99.0th improvement is observed.
>
> threads baseline sched_cache change
> 1 13.80(1.10) 14.80(2.86) -7.25%
> 2 12.00(1.00) 8.00(2.12) +33.33%
> 4 9.00(0.00) 5.60(0.89) +37.78%
> 8 9.00(0.00) 6.40(1.14) +28.89%
> 16 9.20(0.45) 6.20(0.84) +32.61%
> 32 9.60(0.55) 7.00(0.71) +27.08%
> 64 10.80(0.45) 8.40(0.55) +22.22%
> 128 12.60(0.55) 11.40(0.55) +9.52%
> 239 14.00(0.00) 14.20(0.45) -1.43%
>
> [stream]
> No much difference is observed.
> baseline sc
> GB/sec copy-2 35.00 ( 0.00%) 34.79 ( -0.60%)
> GB/sec scale-2 24.04 ( 0.00%) 23.90 ( -0.58%)
> GB/sec add-2 28.98 ( 0.00%) 28.92 ( -0.22%)
> GB/sec triad-2 28.32 ( 0.00%) 28.31 ( -0.04%)
>
> [netperf]
> No much difference is observed(consider the stdev).
>
> nr_pairs netperf netperf
>
> Hmean 60 1023.44 ( 0.00%) 1021.87 ( -0.15%)
> BHmean-99 60 1023.78 ( 0.00%) 1022.22 ( -0.15%)
> Hmean 120 792.09 ( 0.00%) 793.75 ( 0.21%)
> BHmean-99 120 792.36 ( 0.00%) 794.04 ( 0.21%)
> Hmean 180 513.42 ( 0.00%) 513.53 ( 0.02%)
> BHmean-99 180 513.81 ( 0.00%) 513.80 ( -0.00%)
> Hmean 240 387.09 ( 0.00%) 387.33 ( 0.06%)
> BHmean-99 240 387.18 ( 0.00%) 387.45 ( 0.07%)
> Hmean 300 316.04 ( 0.00%) 315.68 ( -0.12%)
> BHmean-99 300 316.12 ( 0.00%) 315.77 ( -0.11%)
> Hmean 360 496.38 ( 0.00%) 455.49 ( -8.24%)
> BHmean-99 360 499.88 ( 0.00%) 458.17 ( -8.34%)
> Hmean 420 497.32 ( 0.00%) 501.84 ( 0.91%)
> BHmean-99 420 499.90 ( 0.00%) 504.56 ( 0.93%)
> Hmean 480 417.62 ( 0.00%) 432.25 ( 3.50%)
> BHmean-99 480 419.96 ( 0.00%) 434.43 ( 3.45%)
>
> In above case of 360 pairs, although there is a performance
> drop of 8.24%, the corresponding:
> HCoeffVar 360 23.78 ( 0.00%) 29.52 ( -24.15%)
> shows that the regression is within the run-to-run variance.
>
> [Milan details]
>
> default settings:
> [hackbench]
>
> Min 1 50.8170 ( 0.00%) 51.1890 ( -0.73%)
> Min 3 59.3610 ( 0.00%) 58.6080 ( 1.27%)
> Min 5 94.9760 ( 0.00%) 96.0210 ( -1.10%)
> Min 7 123.3270 ( 0.00%) 124.1680 ( -0.68%)
> Min 12 179.2000 ( 0.00%) 181.8390 ( -1.47%)
> Min 16 238.8680 ( 0.00%) 242.6390 ( -1.58%)
> Amean 1 51.6614 ( 0.00%) 51.3630 ( 0.58%)
> Amean 3 60.1886 ( 0.00%) 59.4542 ( 1.22%)
> Amean 5 95.7602 ( 0.00%) 96.8338 ( -1.12%)
> Amean 7 124.0332 ( 0.00%) 124.4406 ( -0.33%)
> Amean 12 181.0324 ( 0.00%) 182.9220 ( -1.04%)
> Amean 16 239.5556 ( 0.00%) 243.3556 * -1.59%*
> BAmean-99 1 51.5335 ( 0.00%) 51.3338 ( 0.39%)
> BAmean-99 3 59.7848 ( 0.00%) 59.0958 ( 1.15%)
> BAmean-99 5 95.6698 ( 0.00%) 96.5450 ( -0.91%)
> BAmean-99 7 123.8478 ( 0.00%) 124.3760 ( -0.43%)
> BAmean-99 12 180.8035 ( 0.00%) 182.5135 ( -0.95%)
> BAmean-99 16 239.1933 ( 0.00%) 243.0570 ( -1.62%)
>
> [schbench]
>
> threads baseline sched_cache change
> 1 12.00(2.00) 11.00(0.71) +8.33%
> 2 12.40(0.89) 13.80(0.84) -11.29%
> 4 14.20(0.45) 14.80(0.45) -4.23%
> 8 16.00(0.00) 15.80(0.45) +1.25%
> 16 16.00(0.00) 16.00(0.71) 0.00%
> 32 19.40(0.55) 18.60(0.55) +4.12%
> 63 22.20(0.45) 23.20(0.45) -4.50%
>
> [stream]
> No obvious difference is found.
> export STREAM_SIZE=$((128000000))
>
> baseline sched_cache
> GB/sec copy-16 726.48 ( 0.00%) 715.60 ( -1.50%)
> GB/sec scale-16 577.71 ( 0.00%) 577.03 ( -0.12%)
> GB/sec add-16 678.85 ( 0.00%) 672.87 ( -0.88%)
> GB/sec triad-16 735.52 ( 0.00%) 729.05 ( -0.88%)
>
>
> [netperf]
> No much difference is observed.
>
> nr_pairs baseline sched_cache
> Hmean 32 755.98 ( 0.00%) 755.17 ( -0.11%)
> BHmean-99 32 756.42 ( 0.00%) 755.40 ( -0.13%)
> Hmean 64 677.38 ( 0.00%) 669.75 ( -1.13%)
> BHmean-99 64 677.50 ( 0.00%) 669.86 ( -1.13%)
> Hmean 96 498.52 ( 0.00%) 496.73 ( -0.36%)
> BHmean-99 96 498.69 ( 0.00%) 496.93 ( -0.35%)
> Hmean 128 604.38 ( 0.00%) 604.22 ( -0.03%)
> BHmean-99 128 604.87 ( 0.00%) 604.87 ( 0.00%)
> Hmean 160 471.67 ( 0.00%) 468.29 ( -0.72%)
> BHmean-99 160 474.34 ( 0.00%) 471.05 ( -0.69%)
> Hmean 192 381.18 ( 0.00%) 384.88 ( 0.97%)
> BHmean-99 192 383.30 ( 0.00%) 386.82 ( 0.92%)
> Hmean 224 327.79 ( 0.00%) 326.05 ( -0.53%)
> BHmean-99 224 329.85 ( 0.00%) 327.87 ( -0.60%)
> Hmean 256 284.61 ( 0.00%) 300.52 ( 5.59%)
> BHmean-99 256 286.41 ( 0.00%) 302.06 ( 5.47%)
>
> [Genoa details]
> [ChaCha20-xiangshan]
> ChaCha20-xiangshan is a simple benchmark using a static build of an
> 8-thread Verilator of XiangShan(RISC-V). The README file can be
> found here[2]. The score depends on how aggressive the user set the
> /sys/kernel/debug/sched/llc_aggr_tolerance. Using the default values,
> there is no much difference observed. While setting the
> /sys/kernel/debug/sched/llc_aggr_tolerance to 100, 44% improvment is
> observed.
>
> baseline:
> Host time spent: 50,868ms
>
> sched_cache:
> Host time spent: 28,349ms
>
> The time has been reduced by 44%.
Milan showed no improvement across all benchmarks, which could be due to the
CCX topology (8 CCXs × 8 CPUs) where the LLC domain is too small for this
optimization to be effective. Moreover there could be overhead due to additional
computations.
ChaCha20-xiangshan improvement in Genoa when llc_aggr_tolerance is set to 100 seems
due to having relatively lesser thread count. Please provide the numbers
with default values too. Would like to know numbers on varying loads.
In Power 10 and Power 11, the LLC size is 4 threads which is even smaller. Not
expecting improvements here but will run some workloads and share the data.
Not gone through the entire series yet but are the situations like say in two
NUMA system, if a task's preferred LLC is on the wrong NUMA node for its memory,
which takes precedence?
Also, what about the workloads that don't share data like stress-ng? It will
be good to make sure that most other workloads don't suffer. As mentioned,
per process knob for llc_aggr_tolerance could help.
Thanks,
Madadi Vineeth Reddy
>
> Thanks to everyone who participated and provided valuable suggestions for
> the previous versions. Your comments and tests on the latest version are
> also greatly appreciated in advance.
>
> Tim
>
> [1] https://lore.kernel.org/lkml/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
>
> [2] https://github.com/yu-chen-surf/chacha20-xiangshan/blob/master/README.eng.md
>
> RFC v4:
> [3] https://lore.kernel.org/all/cover.1754712565.git.tim.c.chen@linux.intel.com/
>
> RFC v3
> [4] https://lore.kernel.org/all/cover.1750268218.git.tim.c.chen@linux.intel.com/
>
> RFC v2:
> [5] https://lore.kernel.org/lkml/cover.1745199017.git.yu.c.chen@intel.com/
>
>
> Chen Yu (7):
> sched/fair: Record per-LLC utilization to guide cache-aware scheduling
> decisions
> sched/fair: Introduce helper functions to enforce LLC migration policy
> sched/fair: Introduce a static key to enable cache aware only for
> multi LLCs
> sched/fair: Exclude processes with many threads from cache-aware
> scheduling
> sched/fair: Disable cache aware scheduling for processes with high
> thread counts
> sched/fair: Avoid cache-aware scheduling for memory-heavy processes
> sched/fair: Add user control to adjust the tolerance of cache-aware
> scheduling
>
> Peter Zijlstra (Intel) (1):
> sched/fair: Add infrastructure for cache-aware load balancing
>
> Tim Chen (11):
> sched/fair: Add LLC index mapping for CPUs
> sched/fair: Assign preferred LLC ID to processes
> sched/fair: Track LLC-preferred tasks per runqueue
> sched/fair: Introduce per runqueue task LLC preference counter
> sched/fair: Count tasks prefering each LLC in a sched group
> sched/fair: Prioritize tasks preferring destination LLC during
> balancing
> sched/fair: Identify busiest sched_group for LLC-aware load balancing
> sched/fair: Add migrate_llc_task migration type for cache-aware
> balancing
> sched/fair: Handle moving single tasks to/from their preferred LLC
> sched/fair: Consider LLC preference when selecting tasks for load
> balancing
> sched/fair: Respect LLC preference in task migration and detach
>
> include/linux/cacheinfo.h | 21 +-
> include/linux/mm_types.h | 45 ++
> include/linux/sched.h | 5 +
> include/linux/sched/topology.h | 4 +
> include/linux/threads.h | 10 +
> init/Kconfig | 20 +
> init/init_task.c | 3 +
> kernel/fork.c | 6 +
> kernel/sched/core.c | 18 +
> kernel/sched/debug.c | 56 ++
> kernel/sched/fair.c | 1022 +++++++++++++++++++++++++++++++-
> kernel/sched/features.h | 1 +
> kernel/sched/sched.h | 27 +
> kernel/sched/topology.c | 61 +-
> 14 files changed, 1283 insertions(+), 16 deletions(-)
>
Powered by blists - more mailing lists