Message-ID: <aa8d2c46-811f-4470-ad30-b92d436abc3d@intel.com>
Date: Fri, 19 Dec 2025 21:21:32 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Vern Hao <haoxing990@...il.com>
CC: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, "Hillf
Danton" <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
"Jianyong Wu" <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>, Len
Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>, Zhao Liu
<zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Adam Li
<adamli@...amperecomputing.com>, Aaron Lu <ziqianlu@...edance.com>, Tim Chen
<tim.c.chen@...el.com>, <linux-kernel@...r.kernel.org>, Tim Chen
<tim.c.chen@...ux.intel.com>, Peter Zijlstra <peterz@...radead.org>, "K
Prateek Nayak" <kprateek.nayak@....com>, Ingo Molnar <mingo@...hat.com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>, Vincent Guittot
<vincent.guittot@...aro.org>
Subject: Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the
parameters of cache-aware scheduling
On 12/19/2025 12:14 PM, Vern Hao wrote:
>
> On 2025/12/4 07:07, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@...el.com>
>>
>> Introduce a set of debugfs knobs to control the enabling of
>> and parameters for cache-aware load balancing.
>>
>> (1) llc_enabled
>> llc_enabled acts as the primary switch - users can toggle it to
>> enable or disable cache-aware load balancing.
>>
>> (2) llc_aggr_tolerance
>> With sched_cache enabled, the scheduler uses a process's RSS as a
>> proxy for its LLC footprint to determine if aggregating tasks on the
>> preferred LLC could cause cache contention. If RSS exceeds the LLC
>> size, aggregation is skipped. Some workloads with large RSS but small
>> actual memory footprints may still benefit from aggregation. Since
>> the kernel cannot efficiently track per-task cache usage (resctrl is
>> user-space only), userspace can provide a more accurate hint.
>>
>> Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
>> users control how strictly RSS limits aggregation. Values range from
>> 0 to 100:
>>
>> - 0: Cache-aware scheduling is disabled.
>> - 1: Strict; tasks with RSS larger than LLC size are skipped.
>> - 100: Aggressive; tasks are aggregated regardless of RSS.
>>
>> For example, with a 32MB L3 cache:
>>
>> - llc_aggr_tolerance=1 -> tasks with RSS > 32MB are skipped.
>> - llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
>> (784GB = (1 + (99 - 1) * 256) * 32MB).
>>
>> Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance also controls
>> how strictly the number of active threads is considered when doing
>> cache-aware load balancing. The SMT count is also taken into account:
>> high SMT counts reduce the aggregation capacity, preventing excessive
>> task aggregation on SMT-heavy systems like Power10/Power11.
>>
>> For example, with 8 cores/16 CPUs in an L3:
>>
>> - llc_aggr_tolerance=1 -> tasks with nr_running > 8 are skipped.
>> - llc_aggr_tolerance=99 -> tasks with nr_running > 785 are skipped
>> (785 = 1 + (99 - 1) * 8).
>>
>> (3) llc_epoch_period/llc_epoch_affinity_timeout
>> In addition, llc_epoch_period and llc_epoch_affinity_timeout are
>> made tunable.
>>
>> Suggested-by: K Prateek Nayak <kprateek.nayak@....com>
>> Suggested-by: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
>> Suggested-by: Shrikanth Hegde <sshegde@...ux.ibm.com>
>> Suggested-by: Tingyin Duan <tingyin.duan@...il.com>
>> Co-developed-by: Tim Chen <tim.c.chen@...ux.intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
>> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
>> ---
>>
>> Notes:
>> v1->v2: Remove the smt_nr check in fits_llc_capacity().
>> (Aaron Lu)
>>
>> include/linux/sched.h | 4 ++-
>> kernel/sched/debug.c | 62 ++++++++++++++++++++++++++++++++++++++++
>> kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++++++-----
>> kernel/sched/sched.h | 5 ++++
>> kernel/sched/topology.c | 54 +++++++++++++++++++++++++++++++++--
>> 5 files changed, 178 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 466ba8b7398c..95bf080bbbf0 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2436,9 +2436,11 @@ extern void migrate_enable(void);
>> DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
>> #ifdef CONFIG_SCHED_CACHE
>> +DECLARE_STATIC_KEY_FALSE(sched_cache_on);
>> +
>> static inline bool sched_cache_enabled(void)
>> {
>> - return false;
>> + return static_branch_unlikely(&sched_cache_on);
>> }
>> #endif
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 02e16b70a790..cde324672103 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = {
>> .release = single_release,
>> };
>> +#ifdef CONFIG_SCHED_CACHE
>> +#define SCHED_CACHE_CREATE_CONTROL(name, max) \
>> +static ssize_t sched_cache_write_##name(struct file *filp, \
>> + const char __user *ubuf, \
>> + size_t cnt, loff_t *ppos) \
>> +{ \
>> + char buf[16]; \
>> + unsigned int val; \
>> + if (cnt > 15) \
>> + cnt = 15; \
>> + if (copy_from_user(&buf, ubuf, cnt)) \
>> + return -EFAULT; \
>> + buf[cnt] = '\0'; \
>> + if (kstrtouint(buf, 10, &val)) \
>> + return -EINVAL; \
>> + if (val > (max)) \
>> + return -EINVAL; \
>> + llc_##name = val; \
>> + if (!strcmp(#name, "enabled")) \
>> + sched_cache_set(false); \
>> + *ppos += cnt; \
>> + return cnt; \
>> +} \
>> +static int sched_cache_show_##name(struct seq_file *m, void *v) \
>> +{ \
>> + seq_printf(m, "%d\n", llc_##name); \
>> + return 0; \
>> +} \
>> +static int sched_cache_open_##name(struct inode *inode, \
>> + struct file *filp) \
>> +{ \
>> + return single_open(filp, sched_cache_show_##name, NULL); \
>> +} \
>> +static const struct file_operations sched_cache_fops_##name = { \
>> + .open = sched_cache_open_##name, \
>> + .write = sched_cache_write_##name, \
>> + .read = seq_read, \
>> + .llseek = seq_lseek, \
>> + .release = single_release, \
>> +}
>> +
>> +SCHED_CACHE_CREATE_CONTROL(overload_pct, 100);
>> +SCHED_CACHE_CREATE_CONTROL(imb_pct, 100);
>> +SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100);
>> +SCHED_CACHE_CREATE_CONTROL(enabled, 1);
>> +#endif /* SCHED_CACHE */
>> +
>> static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
>> size_t cnt, loff_t *ppos)
>> {
>> @@ -523,6 +570,21 @@ static __init int sched_init_debug(void)
>> debugfs_create_u32("hot_threshold_ms", 0644, numa,
>> &sysctl_numa_balancing_hot_threshold);
>> #endif /* CONFIG_NUMA_BALANCING */
>> +#ifdef CONFIG_SCHED_CACHE
>> + debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL,
>> + &sched_cache_fops_overload_pct);
>> + debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL,
>> + &sched_cache_fops_imb_pct);
>> + debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL,
>> + &sched_cache_fops_aggr_tolerance);
>> + debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
>> + &sched_cache_fops_enabled);
>> + debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
>> + &llc_epoch_period);
>> + debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
>> + &llc_epoch_affinity_timeout);
>> +#endif
>> +
>> debugfs_create_file("debug", 0444, debugfs_sched, NULL,
>> &sched_debug_fops);
>> debugfs_fair_server_init();
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 424ec601cfdf..a2e2d6742481 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1207,6 +1207,9 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>> __read_mostly unsigned int llc_overload_pct = 50;
>> __read_mostly unsigned int llc_imb_pct = 20;
>> +__read_mostly unsigned int llc_aggr_tolerance = 1;
>> +__read_mostly unsigned int llc_epoch_period = EPOCH_PERIOD;
>> +__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
>> static int llc_id(int cpu)
>> {
>> @@ -1223,11 +1226,22 @@ static int llc_id(int cpu)
>> return llc;
>> }
>> +static inline int get_sched_cache_scale(int mul)
>> +{
>> + if (!llc_aggr_tolerance)
>> + return 0;
>> +
>> + if (llc_aggr_tolerance == 100)
> the range of llc_aggr_tolerance is [0, 100], so is there a little bug
> here? Maybe check if (llc_aggr_tolerance >= 100)
llc_aggr_tolerance is not supposed to exceed 100: in
sched_cache_write_aggr_tolerance(), if the input value is
higher than the max, the write returns -EINVAL.
I did a double check on this:
root@vm:/sys/kernel/debug/sched# echo 100 > llc_aggr_tolerance
root@vm:/sys/kernel/debug/sched# echo 101 > llc_aggr_tolerance
bash: echo: write error: Invalid argument
>
> and if llc_aggr_tolerance = 0, the function returns 0, which means
> exceed_llc_capacity & exceed_llc_nr are always true; it may be
> inconsistent to have this value set while llc_enabled=1 is set.
>
If llc_aggr_tolerance is 0, cache-aware scheduling is supposed to be
disabled - that is, exceed_llc_capacity() always returns true, so the
process is not eligible for cache-aware scheduling.
thanks,
Chenyu