Message-ID: <aa8d2c46-811f-4470-ad30-b92d436abc3d@intel.com>
Date: Fri, 19 Dec 2025 21:21:32 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Vern Hao <haoxing990@...il.com>
CC: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, "Hillf
Danton" <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
"Jianyong Wu" <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>, Len
Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>, Zhao Liu
<zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Adam Li
<adamli@...amperecomputing.com>, Aaron Lu <ziqianlu@...edance.com>, Tim Chen
<tim.c.chen@...el.com>, <linux-kernel@...r.kernel.org>, Tim Chen
<tim.c.chen@...ux.intel.com>, Peter Zijlstra <peterz@...radead.org>, "K
Prateek Nayak" <kprateek.nayak@....com>, Ingo Molnar <mingo@...hat.com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>, Vincent Guittot
<vincent.guittot@...aro.org>
Subject: Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the
parameters of cache-aware scheduling
On 12/19/2025 12:14 PM, Vern Hao wrote:
>
> On 2025/12/4 07:07, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@...el.com>
>>
>> Introduce a set of debugfs knobs to control the enabling of
>> and parameters for cache-aware load balancing.
>>
>> (1) llc_enabled
>> llc_enabled acts as the primary switch - users can toggle it to
>> enable or disable cache-aware load balancing.
>>
>> (2) llc_aggr_tolerance
>> With sched_cache enabled, the scheduler uses a process's RSS as a
>> proxy for its LLC footprint to determine if aggregating tasks on the
>> preferred LLC could cause cache contention. If RSS exceeds the LLC
>> size, aggregation is skipped. Some workloads with large RSS but small
>> actual memory footprints may still benefit from aggregation. Since
>> the kernel cannot efficiently track per-task cache usage (resctrl is
>> user-space only), userspace can provide a more accurate hint.
>>
>> Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
>> users control how strictly RSS limits aggregation. Values range from
>> 0 to 100:
>>
>> - 0: Cache-aware scheduling is disabled.
>> - 1: Strict; tasks with RSS larger than LLC size are skipped.
>> - 100: Aggressive; tasks are aggregated regardless of RSS.
>>
>> For example, with a 32MB L3 cache:
>>
>> - llc_aggr_tolerance=1 -> tasks with RSS > 32MB are skipped.
>> - llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
>> (784GB = (1 + (99 - 1) * 256) * 32MB).
>>
>> Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance also controls
>> how strictly the number of active threads is considered when doing
>> cache-aware load balancing. The SMT count is also taken into account:
>> high SMT counts reduce the aggregation capacity, preventing excessive
>> task aggregation on SMT-heavy systems like Power10/Power11.
>>
>> For example, with 8 cores/16 CPUs in an L3:
>>
>> - llc_aggr_tolerance=1 -> tasks with nr_running > 8 are skipped.
>> - llc_aggr_tolerance=99 -> tasks with nr_running > 785 are skipped
>> (785 = 1 + (99 - 1) * 8).
>>
>> (3) llc_epoch_period/llc_epoch_affinity_timeout
>> In addition, llc_epoch_period and llc_epoch_affinity_timeout are
>> made tunable.
>>
>> Suggested-by: K Prateek Nayak <kprateek.nayak@....com>
>> Suggested-by: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
>> Suggested-by: Shrikanth Hegde <sshegde@...ux.ibm.com>
>> Suggested-by: Tingyin Duan <tingyin.duan@...il.com>
>> Co-developed-by: Tim Chen <tim.c.chen@...ux.intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
>> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
>> ---
>>
>> Notes:
>> v1->v2: Remove the smt_nr check in fits_llc_capacity().
>> (Aaron Lu)
>>
>> include/linux/sched.h | 4 ++-
>> kernel/sched/debug.c | 62 ++++++++++++++++++++++++++++++++++++++++
>> kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++++++-----
>> kernel/sched/sched.h | 5 ++++
>> kernel/sched/topology.c | 54 +++++++++++++++++++++++++++++++++--
>> 5 files changed, 178 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 466ba8b7398c..95bf080bbbf0 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2436,9 +2436,11 @@ extern void migrate_enable(void);
>> DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
>> #ifdef CONFIG_SCHED_CACHE
>> +DECLARE_STATIC_KEY_FALSE(sched_cache_on);
>> +
>> static inline bool sched_cache_enabled(void)
>> {
>> - return false;
>> + return static_branch_unlikely(&sched_cache_on);
>> }
>> #endif
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 02e16b70a790..cde324672103 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = {
>> .release = single_release,
>> };
>> +#ifdef CONFIG_SCHED_CACHE
>> +#define SCHED_CACHE_CREATE_CONTROL(name, max) \
>> +static ssize_t sched_cache_write_##name(struct file *filp, \
>> + const char __user *ubuf, \
>> + size_t cnt, loff_t *ppos) \
>> +{ \
>> + char buf[16]; \
>> + unsigned int val; \
>> + if (cnt > 15) \
>> + cnt = 15; \
>> + if (copy_from_user(&buf, ubuf, cnt)) \
>> + return -EFAULT; \
>> + buf[cnt] = '\0'; \
>> + if (kstrtouint(buf, 10, &val)) \
>> + return -EINVAL; \
>> + if (val > (max)) \
>> + return -EINVAL; \
>> + llc_##name = val; \
>> + if (!strcmp(#name, "enabled")) \
>> + sched_cache_set(false); \
>> + *ppos += cnt; \
>> + return cnt; \
>> +} \
>> +static int sched_cache_show_##name(struct seq_file *m, void *v) \
>> +{ \
>> + seq_printf(m, "%d\n", llc_##name); \
>> + return 0; \
>> +} \
>> +static int sched_cache_open_##name(struct inode *inode, \
>> + struct file *filp) \
>> +{ \
>> + return single_open(filp, sched_cache_show_##name, NULL); \
>> +} \
>> +static const struct file_operations sched_cache_fops_##name = { \
>> + .open = sched_cache_open_##name, \
>> + .write = sched_cache_write_##name, \
>> + .read = seq_read, \
>> + .llseek = seq_lseek, \
>> + .release = single_release, \
>> +}
>> +
>> +SCHED_CACHE_CREATE_CONTROL(overload_pct, 100);
>> +SCHED_CACHE_CREATE_CONTROL(imb_pct, 100);
>> +SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100);
>> +SCHED_CACHE_CREATE_CONTROL(enabled, 1);
>> +#endif /* SCHED_CACHE */
>> +
>> static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
>> size_t cnt, loff_t *ppos)
>> {
>> @@ -523,6 +570,21 @@ static __init int sched_init_debug(void)
>> debugfs_create_u32("hot_threshold_ms", 0644, numa,
>> &sysctl_numa_balancing_hot_threshold);
>> #endif /* CONFIG_NUMA_BALANCING */
>> +#ifdef CONFIG_SCHED_CACHE
>> + debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL,
>> + &sched_cache_fops_overload_pct);
>> + debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL,
>> + &sched_cache_fops_imb_pct);
>> + debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL,
>> + &sched_cache_fops_aggr_tolerance);
>> + debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
>> + &sched_cache_fops_enabled);
>> + debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
>> + &llc_epoch_period);
>> + debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
>> + &llc_epoch_affinity_timeout);
>> +#endif
>> +
>> debugfs_create_file("debug", 0444, debugfs_sched, NULL,
>> &sched_debug_fops);
>> debugfs_fair_server_init();
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 424ec601cfdf..a2e2d6742481 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1207,6 +1207,9 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>> __read_mostly unsigned int llc_overload_pct = 50;
>> __read_mostly unsigned int llc_imb_pct = 20;
>> +__read_mostly unsigned int llc_aggr_tolerance = 1;
>> +__read_mostly unsigned int llc_epoch_period = EPOCH_PERIOD;
>> +__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
>> static int llc_id(int cpu)
>> {
>> @@ -1223,11 +1226,22 @@ static int llc_id(int cpu)
>> return llc;
>> }
>> +static inline int get_sched_cache_scale(int mul)
>> +{
>> + if (!llc_aggr_tolerance)
>> + return 0;
>> +
>> + if (llc_aggr_tolerance == 100)
> the range of llc_aggr_tolerance is [0, 100], so is there a little bug
> here? Maybe check if (llc_aggr_tolerance >= 100)
llc_aggr_tolerance is not supposed to exceed 100: in
sched_cache_write_aggr_tolerance(), if the input value is
higher than the max, the write returns -EINVAL.
I did a double check on this:
root@vm:/sys/kernel/debug/sched# echo 100 > llc_aggr_tolerance
root@vm:/sys/kernel/debug/sched# echo 101 > llc_aggr_tolerance
bash: echo: write error: Invalid argument
>
> and if llc_aggr_tolerance = 0, the function returns 0, which means
> exceed_llc_capacity & exceed_llc_nr are always true; it may be
> inconsistent to have this value set while llc_enabled=1 is set.
>
If llc_aggr_tolerance is 0, cache-aware scheduling is supposed to be
disabled - that is, exceed_llc_capacity() always returns true, so the
process is not eligible for cache-aware scheduling.
thanks,
Chenyu