Message-ID: <5b82842ff20995cd50b422dad844664089dcd0c7.camel@linux.intel.com>
Date: Tue, 14 Oct 2025 14:48:09 -0700
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, K
Prateek Nayak <kprateek.nayak@....com>, "Gautham R . Shenoy"
<gautham.shenoy@....com>, Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin
Schneider <vschneid@...hat.com>, Hillf Danton <hdanton@...a.com>,
Shrikanth Hegde <sshegde@...ux.ibm.com>, Jianyong Wu
<jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>, Tingyin Duan
<tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>, Len Brown
<len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>, Zhao Liu
<zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Chen Yu
<yu.c.chen@...el.com>, Libo Chen <libo.chen@...cle.com>, Adam Li
<adamli@...amperecomputing.com>, Tim Chen <tim.c.chen@...el.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 00/19] Cache Aware Scheduling
On Tue, 2025-10-14 at 17:43 +0530, Madadi Vineeth Reddy wrote:
> Hi Tim,
> Thanks for the patch.
>
> On 11/10/25 23:54, Tim Chen wrote:
> > There have been 4 RFC postings of this patch set. We've incorporated
> > the feedback and comments and would now like to post this patch set
> > for consideration for inclusion in mainline. The patches are based on
> > the original patch proposed by Peter[1].
> >
>
> [snip]
>
> > The following tunables under /sys/kernel/debug/sched/ control
> > the behavior of cache aware scheduling:
> >
> > 1. llc_aggr_tolerance
> > Controls how aggressively we aggregate tasks to their preferred LLC,
> > based on a process's RSS size and number of running threads. Processes
> > with a smaller memory footprint and fewer tasks benefit more from
> > aggregation. Varies between 0 and 100.
> >   0:   Cache aware scheduling is disabled.
> >   1:   Processes with RSS greater than the LLC size, or with more running
> >        threads than the number of CPU cores per LLC, skip aggregation.
> >   100: Aggressive; a process's threads are aggregated regardless of
> >        RSS or running threads.
> > For example, with a 32MB L3 cache and 8 cores per L3:
> >   llc_aggr_tolerance=1  -> processes with RSS > 32MB or
> >                            nr_running_avg > 8 are skipped.
> >   llc_aggr_tolerance=99 -> processes with RSS > 784GB or
> >                            nr_running_avg > 785 are skipped.
> >   784GB = (1 + (99 - 1) * 256) * 32MB.
> >   785 = (1 + (99 - 1) * 8).
> >
> > Currently this knob is a global control. Considering that different workloads have
> > different requirements for task consolidation, it would be ideal to introduce
> > per process control for this knob via prctl in the future.
> >
> > 2. llc_overload_pct, llc_imb_pct
> > We'll always try to move a task to its preferred LLC if an LLC's average core
> > utilization is below llc_overload_pct (default to 50%). Otherwise, the utilization
> > of preferred LLC has to be not more than llc_imb_pct (default to 20%) to move a task
> > to it. This is to prevent overloading on the preferred LLC.
> >
> > 3. llc_epoch_period
> > Controls how often the scheduler collects the LLC occupancy of a process (default 10 msec).
> >
> > 4. llc_epoch_affinity_timeout
> > If a process has not run for llc_epoch_affinity_timeout (default 50 msec),
> > it loses its cache preference.
>
> How are these default values arrived at? Is it based on some theory or
> based on the results of the runs?
Right now the default value of llc_aggr_tolerance is fairly conservative;
we chose it so that it does not cause regressions in the workloads we tested.
The llc_overload_pct and llc_imb_pct defaults come from our experiments with
Len's Yogini micro-benchmark, where they gave good aggregation without
overloading the LLC.
llc_epoch_period and llc_epoch_affinity_timeout are from Peter's
original patch; they seem to work fairly well, so we left them as is.
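
To make the llc_aggr_tolerance arithmetic from the cover letter concrete, here is
a small Python sketch of the RSS cutoff only. It is not the kernel code: the
helper name rss_limit_mb() and the constant RSS_SCALE are made up for
illustration, and the handling of 0 and 100 simply mirrors the wording of the
cover letter (0 disables aggregation, 100 ignores RSS).

```python
import math

RSS_SCALE = 256  # multiplier taken from the cover letter's 784GB example


def rss_limit_mb(tolerance, llc_mb):
    """RSS cutoff in MB above which a process would skip aggregation.

    Hypothetical helper; it only reproduces the cover letter's example.
    """
    if not 0 <= tolerance <= 100:
        raise ValueError("llc_aggr_tolerance must be within 0..100")
    if tolerance == 0:
        return 0          # disabled: every process exceeds the cutoff
    if tolerance == 100:
        return math.inf   # aggregate regardless of RSS
    return (1 + (tolerance - 1) * RSS_SCALE) * llc_mb


print(rss_limit_mb(1, 32))            # 32 (MB)
print(rss_limit_mb(99, 32) // 1024)   # 784 (GB)
```

The thread-count cutoff is left out on purpose; the cover letter only gives the
two data points (8 and 785) for it.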
>
> >
> > Test results:
> > The first test platform is a 2 socket Intel Sapphire Rapids with 30
> > cores per socket. DRAM interleaving is enabled in the BIOS, so it
> > essentially has one NUMA node with two last level caches. There are 60
> > CPUs associated with each last level cache.
> >
> > The second test platform is an AMD Milan. There are 2 nodes and 64 CPUs
> > per node. Each node has 8 CCXs and each CCX has 8 CPUs.
> >
> > The third test platform is an AMD Genoa. There are 4 nodes and 32 CPUs per node.
> > Each node has 2 CCXs and each CCX has 16 CPUs.
> >
> > [TL;DR]
> > Sapphire Rapids:
> > hackbench shows significant improvement when there is 1 group
> > with different number of fd pairs(threads) within this process.
> > schbench shows overall wakeup latency improvement.
> > ChaCha20-xiangshan shows ~10% throughput improvement. Other
> > micro-workloads did not show much difference.
> >
> > Milan:
> > No obvious difference is observed so far.
> >
> > Genoa:
> > ChaCha20-xiangshan shows 44% throughput improvement.
> >
> > [Sapphire Rapids details]
> >
> > [hackbench]
> > Hackbench shows an overall improvement when there is only 1
> > group, with different numbers of fd pairs. This is the
> > expected behavior because this test scenario benefits most
> > from cache aware load balancing. Other group counts
> > show little difference (using the default fd = 20).
> >
> > groups baseline sched_cache
> > Min 1 37.5960 ( 0.00%) 26.4340 ( 29.69%)
> > Min 3 38.7050 ( 0.00%) 38.6920 ( 0.03%)
> > Min 5 39.4550 ( 0.00%) 38.6280 ( 2.10%)
> > Min 7 51.4270 ( 0.00%) 50.6790 ( 1.45%)
> > Min 12 62.8540 ( 0.00%) 63.6590 ( -1.28%)
> > Min 16 74.0160 ( 0.00%) 74.7480 ( -0.99%)
> > Amean 1 38.4768 ( 0.00%) 26.7146 * 30.57%*
> > Amean 3 39.0750 ( 0.00%) 39.5586 ( -1.24%)
> > Amean 5 41.5178 ( 0.00%) 41.2766 ( 0.58%)
> > Amean 7 52.1164 ( 0.00%) 51.5152 ( 1.15%)
> > Amean 12 63.9052 ( 0.00%) 64.0420 ( -0.21%)
> > Amean 16 74.5812 ( 0.00%) 75.4318 ( -1.14%)
> > BAmean-99 1 38.2027 ( 0.00%) 26.5500 ( 30.50%)
> > BAmean-99 3 38.8725 ( 0.00%) 39.2225 ( -0.90%)
> > BAmean-99 5 41.1898 ( 0.00%) 41.0037 ( 0.45%)
> > BAmean-99 7 51.8645 ( 0.00%) 51.4453 ( 0.81%)
> > BAmean-99 12 63.6317 ( 0.00%) 63.9307 ( -0.47%)
> > BAmean-99 16 74.4528 ( 0.00%) 75.2113 ( -1.02%)
> >
> > [schbench]
> > An improvement in the 99.0th percentile wakeup latency is observed.
> >
> > threads baseline sched_cache change
> > 1 13.80(1.10) 14.80(2.86) -7.25%
> > 2 12.00(1.00) 8.00(2.12) +33.33%
> > 4 9.00(0.00) 5.60(0.89) +37.78%
> > 8 9.00(0.00) 6.40(1.14) +28.89%
> > 16 9.20(0.45) 6.20(0.84) +32.61%
> > 32 9.60(0.55) 7.00(0.71) +27.08%
> > 64 10.80(0.45) 8.40(0.55) +22.22%
> > 128 12.60(0.55) 11.40(0.55) +9.52%
> > 239 14.00(0.00) 14.20(0.45) -1.43%
> >
> > [stream]
> > Not much difference is observed.
> > baseline sched_cache
> > GB/sec copy-2 35.00 ( 0.00%) 34.79 ( -0.60%)
> > GB/sec scale-2 24.04 ( 0.00%) 23.90 ( -0.58%)
> > GB/sec add-2 28.98 ( 0.00%) 28.92 ( -0.22%)
> > GB/sec triad-2 28.32 ( 0.00%) 28.31 ( -0.04%)
> >
> > [netperf]
> > Not much difference is observed (considering the stdev).
> >
> > nr_pairs baseline sched_cache
> >
> > Hmean 60 1023.44 ( 0.00%) 1021.87 ( -0.15%)
> > BHmean-99 60 1023.78 ( 0.00%) 1022.22 ( -0.15%)
> > Hmean 120 792.09 ( 0.00%) 793.75 ( 0.21%)
> > BHmean-99 120 792.36 ( 0.00%) 794.04 ( 0.21%)
> > Hmean 180 513.42 ( 0.00%) 513.53 ( 0.02%)
> > BHmean-99 180 513.81 ( 0.00%) 513.80 ( -0.00%)
> > Hmean 240 387.09 ( 0.00%) 387.33 ( 0.06%)
> > BHmean-99 240 387.18 ( 0.00%) 387.45 ( 0.07%)
> > Hmean 300 316.04 ( 0.00%) 315.68 ( -0.12%)
> > BHmean-99 300 316.12 ( 0.00%) 315.77 ( -0.11%)
> > Hmean 360 496.38 ( 0.00%) 455.49 ( -8.24%)
> > BHmean-99 360 499.88 ( 0.00%) 458.17 ( -8.34%)
> > Hmean 420 497.32 ( 0.00%) 501.84 ( 0.91%)
> > BHmean-99 420 499.90 ( 0.00%) 504.56 ( 0.93%)
> > Hmean 480 417.62 ( 0.00%) 432.25 ( 3.50%)
> > BHmean-99 480 419.96 ( 0.00%) 434.43 ( 3.45%)
> >
> > In the above case of 360 pairs, although there is a performance
> > drop of 8.24%, the corresponding:
> > HCoeffVar 360 23.78 ( 0.00%) 29.52 ( -24.15%)
> > shows that the regression is within the run-to-run variance.
> >
> > [Milan details]
> >
> > default settings:
> > [hackbench]
> >
> > Min 1 50.8170 ( 0.00%) 51.1890 ( -0.73%)
> > Min 3 59.3610 ( 0.00%) 58.6080 ( 1.27%)
> > Min 5 94.9760 ( 0.00%) 96.0210 ( -1.10%)
> > Min 7 123.3270 ( 0.00%) 124.1680 ( -0.68%)
> > Min 12 179.2000 ( 0.00%) 181.8390 ( -1.47%)
> > Min 16 238.8680 ( 0.00%) 242.6390 ( -1.58%)
> > Amean 1 51.6614 ( 0.00%) 51.3630 ( 0.58%)
> > Amean 3 60.1886 ( 0.00%) 59.4542 ( 1.22%)
> > Amean 5 95.7602 ( 0.00%) 96.8338 ( -1.12%)
> > Amean 7 124.0332 ( 0.00%) 124.4406 ( -0.33%)
> > Amean 12 181.0324 ( 0.00%) 182.9220 ( -1.04%)
> > Amean 16 239.5556 ( 0.00%) 243.3556 * -1.59%*
> > BAmean-99 1 51.5335 ( 0.00%) 51.3338 ( 0.39%)
> > BAmean-99 3 59.7848 ( 0.00%) 59.0958 ( 1.15%)
> > BAmean-99 5 95.6698 ( 0.00%) 96.5450 ( -0.91%)
> > BAmean-99 7 123.8478 ( 0.00%) 124.3760 ( -0.43%)
> > BAmean-99 12 180.8035 ( 0.00%) 182.5135 ( -0.95%)
> > BAmean-99 16 239.1933 ( 0.00%) 243.0570 ( -1.62%)
> >
> > [schbench]
> >
> > threads baseline sched_cache change
> > 1 12.00(2.00) 11.00(0.71) +8.33%
> > 2 12.40(0.89) 13.80(0.84) -11.29%
> > 4 14.20(0.45) 14.80(0.45) -4.23%
> > 8 16.00(0.00) 15.80(0.45) +1.25%
> > 16 16.00(0.00) 16.00(0.71) 0.00%
> > 32 19.40(0.55) 18.60(0.55) +4.12%
> > 63 22.20(0.45) 23.20(0.45) -4.50%
> >
> > [stream]
> > No obvious difference is found.
> > export STREAM_SIZE=$((128000000))
> >
> > baseline sched_cache
> > GB/sec copy-16 726.48 ( 0.00%) 715.60 ( -1.50%)
> > GB/sec scale-16 577.71 ( 0.00%) 577.03 ( -0.12%)
> > GB/sec add-16 678.85 ( 0.00%) 672.87 ( -0.88%)
> > GB/sec triad-16 735.52 ( 0.00%) 729.05 ( -0.88%)
> >
> >
> > [netperf]
> > Not much difference is observed.
> >
> > nr_pairs baseline sched_cache
> > Hmean 32 755.98 ( 0.00%) 755.17 ( -0.11%)
> > BHmean-99 32 756.42 ( 0.00%) 755.40 ( -0.13%)
> > Hmean 64 677.38 ( 0.00%) 669.75 ( -1.13%)
> > BHmean-99 64 677.50 ( 0.00%) 669.86 ( -1.13%)
> > Hmean 96 498.52 ( 0.00%) 496.73 ( -0.36%)
> > BHmean-99 96 498.69 ( 0.00%) 496.93 ( -0.35%)
> > Hmean 128 604.38 ( 0.00%) 604.22 ( -0.03%)
> > BHmean-99 128 604.87 ( 0.00%) 604.87 ( 0.00%)
> > Hmean 160 471.67 ( 0.00%) 468.29 ( -0.72%)
> > BHmean-99 160 474.34 ( 0.00%) 471.05 ( -0.69%)
> > Hmean 192 381.18 ( 0.00%) 384.88 ( 0.97%)
> > BHmean-99 192 383.30 ( 0.00%) 386.82 ( 0.92%)
> > Hmean 224 327.79 ( 0.00%) 326.05 ( -0.53%)
> > BHmean-99 224 329.85 ( 0.00%) 327.87 ( -0.60%)
> > Hmean 256 284.61 ( 0.00%) 300.52 ( 5.59%)
> > BHmean-99 256 286.41 ( 0.00%) 302.06 ( 5.47%)
> >
> > [Genoa details]
> > [ChaCha20-xiangshan]
> > ChaCha20-xiangshan is a simple benchmark using a static build of an
> > 8-thread Verilator of XiangShan (RISC-V). The README file can be
> > found here[2]. The score depends on how aggressively the user sets
> > /sys/kernel/debug/sched/llc_aggr_tolerance. With the default values,
> > not much difference is observed. With
> > /sys/kernel/debug/sched/llc_aggr_tolerance set to 100, a 44% improvement is
> > observed.
> >
> > baseline:
> > Host time spent: 50,868ms
> >
> > sched_cache:
> > Host time spent: 28,349ms
> >
> > The time has been reduced by 44%.
>
> Milan showed no improvement across all benchmarks, which could be due to the
> CCX topology (8 CCXs × 8 CPUs) where the LLC domain is too small for this
> optimization to be effective. Moreover there could be overhead due to additional
> computations.
>
> The ChaCha20-xiangshan improvement on Genoa when llc_aggr_tolerance is set to 100 seems
> to be due to its relatively low thread count. Please provide the numbers
> with default values too. I would also like to see numbers for varying loads.
I'll ask Chen Yu who did the Xiangshan experiments if he has those numbers.
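
For reference, the 44% figure follows directly from the two host times quoted
above for llc_aggr_tolerance=100. A quick Python check (the variable names are
mine, the two inputs are the numbers from the cover letter):

```python
baseline_ms = 50868      # "Host time spent" on the baseline kernel
sched_cache_ms = 28349   # "Host time spent" with sched_cache

# Relative reduction in run time, in percent.
reduction_pct = (baseline_ms - sched_cache_ms) / baseline_ms * 100

# The same result expressed as a speedup factor.
speedup = baseline_ms / sched_cache_ms

print(f"time reduced by {reduction_pct:.1f}%")   # time reduced by 44.3%
print(f"speedup {speedup:.2f}x")                 # speedup 1.79x
```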
>
> In Power 10 and Power 11, the LLC size is 4 threads which is even smaller. Not
> expecting improvements here but will run some workloads and share the data.
>
> I have not gone through the entire series yet, but in a situation such as a
> two-node NUMA system, if a task's preferred LLC is on the wrong NUMA node for
> its memory, which takes precedence?
We take the preferred NUMA node into consideration, but we do not force the task
to go to the preferred node.
As I remember, we initially limited the consideration to only the LLCs in the
preferred node. But we encountered regressions in hackbench and schbench:
when the preferred node had no occupancy, the preferred LLC
was set to -1 (no preference), which resulted in extra task migrations.
Also, the preferred node for hackbench and schbench was volatile
because they have a small memory footprint. Chen Yu, please chime in if there
were other reasons you remember.
We'll need to revisit this part of the code to take care of this
corner case. I think ideally we should move tasks to the least loaded LLC
in the preferred node (even if no LLCs in the preferred node have occupancy),
as long as the preferred NUMA node doesn't change too often.
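
A rough Python sketch of that idea, only to illustrate the policy. Everything
here is hypothetical: the Llc record, pick_llc() and the node_is_stable flag are
made-up names, not code from the series, and load is simplified to a single
utilization percentage per LLC.

```python
from dataclasses import dataclass


@dataclass
class Llc:
    llc_id: int
    node: int         # NUMA node this LLC belongs to
    util_pct: float   # simplified average utilization of the LLC


def pick_llc(llcs, preferred_node, node_is_stable):
    """Return the id of the least loaded LLC in the preferred node.

    Returns -1 (no preference) if the preferred node changes too often
    or if the node has no LLC at all. Occupancy is deliberately ignored,
    so a node without any occupancy still yields a preference.
    """
    if not node_is_stable:
        return -1
    candidates = [llc for llc in llcs if llc.node == preferred_node]
    if not candidates:
        return -1
    return min(candidates, key=lambda llc: llc.util_pct).llc_id


llcs = [Llc(0, 0, 70.0), Llc(1, 0, 30.0), Llc(2, 1, 10.0)]
print(pick_llc(llcs, preferred_node=0, node_is_stable=True))   # 1
print(pick_llc(llcs, preferred_node=0, node_is_stable=False))  # -1
```

How "too often" should be defined for node_is_stable is exactly the part that
still needs to be worked out.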
>
> Also, what about the workloads that don't share data like stress-ng?
>
We can test those. Ideally the controls that prevent over-aggregation to the
preferred LLC would keep stress-ng happy.
> It will
> be good to make sure that most other workloads don't suffer. As mentioned,
> per process knob for llc_aggr_tolerance could help.
Agreed. We are planning to add a per-process knob in the next version. One thought is to use
prctl. Any other suggestions are welcome.
Tim
>
> Thanks,
> Madadi Vineeth Reddy
>
> >
> > Thanks to everyone who participated and provided valuable suggestions for
> > the previous versions. Your comments and tests on the latest version are
> > also greatly appreciated in advance.
> >
> > Tim
> >
> > [1] https://lore.kernel.org/lkml/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> >
> > [2] https://github.com/yu-chen-surf/chacha20-xiangshan/blob/master/README.eng.md
> >
> > RFC v4:
> > [3] https://lore.kernel.org/all/cover.1754712565.git.tim.c.chen@linux.intel.com/
> >
> > RFC v3:
> > [4] https://lore.kernel.org/all/cover.1750268218.git.tim.c.chen@linux.intel.com/
> >
> > RFC v2:
> > [5] https://lore.kernel.org/lkml/cover.1745199017.git.yu.c.chen@intel.com/
> >
> >
> > Chen Yu (7):
> > sched/fair: Record per-LLC utilization to guide cache-aware scheduling
> > decisions
> > sched/fair: Introduce helper functions to enforce LLC migration policy
> > sched/fair: Introduce a static key to enable cache aware only for
> > multi LLCs
> > sched/fair: Exclude processes with many threads from cache-aware
> > scheduling
> > sched/fair: Disable cache aware scheduling for processes with high
> > thread counts
> > sched/fair: Avoid cache-aware scheduling for memory-heavy processes
> > sched/fair: Add user control to adjust the tolerance of cache-aware
> > scheduling
> >
> > Peter Zijlstra (Intel) (1):
> > sched/fair: Add infrastructure for cache-aware load balancing
> >
> > Tim Chen (11):
> > sched/fair: Add LLC index mapping for CPUs
> > sched/fair: Assign preferred LLC ID to processes
> > sched/fair: Track LLC-preferred tasks per runqueue
> > sched/fair: Introduce per runqueue task LLC preference counter
> > sched/fair: Count tasks prefering each LLC in a sched group
> > sched/fair: Prioritize tasks preferring destination LLC during
> > balancing
> > sched/fair: Identify busiest sched_group for LLC-aware load balancing
> > sched/fair: Add migrate_llc_task migration type for cache-aware
> > balancing
> > sched/fair: Handle moving single tasks to/from their preferred LLC
> > sched/fair: Consider LLC preference when selecting tasks for load
> > balancing
> > sched/fair: Respect LLC preference in task migration and detach
> >
> > include/linux/cacheinfo.h | 21 +-
> > include/linux/mm_types.h | 45 ++
> > include/linux/sched.h | 5 +
> > include/linux/sched/topology.h | 4 +
> > include/linux/threads.h | 10 +
> > init/Kconfig | 20 +
> > init/init_task.c | 3 +
> > kernel/fork.c | 6 +
> > kernel/sched/core.c | 18 +
> > kernel/sched/debug.c | 56 ++
> > kernel/sched/fair.c | 1022 +++++++++++++++++++++++++++++++-
> > kernel/sched/features.h | 1 +
> > kernel/sched/sched.h | 27 +
> > kernel/sched/topology.c | 61 +-
> > 14 files changed, 1283 insertions(+), 16 deletions(-)
> >
>