Message-ID: <c047e50b-13f4-4234-8590-0f82314bcb8f@intel.com>
Date: Thu, 23 Oct 2025 14:55:51 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, Tim Chen
<tim.c.chen@...ux.intel.com>
CC: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, "K
Prateek Nayak" <kprateek.nayak@....com>, "Gautham R . Shenoy"
<gautham.shenoy@....com>, Vincent Guittot <vincent.guittot@...aro.org>, "Juri
Lelli" <juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, "Mel
Gorman" <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, "Hillf
Danton" <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
"Jianyong Wu" <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>, Len
Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>, Zhao Liu
<zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Adam Li
<adamli@...amperecomputing.com>, Tim Chen <tim.c.chen@...el.com>,
<linux-kernel@...r.kernel.org>, <haoxing990@...il.com>
Subject: Re: [PATCH 17/19] sched/fair: Disable cache aware scheduling for
processes with high thread counts
On 10/23/2025 1:21 AM, Madadi Vineeth Reddy wrote:
> On 11/10/25 23:54, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@...el.com>
>>
>> If the number of active threads within the process
>> exceeds the number of cores in the LLC (that is, the
>> number of CPUs divided by the number of SMT siblings),
>> do not enable cache-aware scheduling. With too many
>> threads present, there is a risk of cache contention
>> within the preferred LLC.
>>
>> Reported-by: K Prateek Nayak <kprateek.nayak@....com>
>> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
>> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
>> ---
>> kernel/sched/fair.c | 27 +++++++++++++++++++++++++--
>> 1 file changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 79d109f8a09f..6b8eace79eee 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1240,6 +1240,18 @@ static inline int pref_llc_idx(struct task_struct *p)
>> return llc_idx(p->preferred_llc);
>> }
>>
>> +static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>> +{
>> + int smt_nr = 1;
>> +
>> +#ifdef CONFIG_SCHED_SMT
>> + if (sched_smt_active())
>> + smt_nr = cpumask_weight(cpu_smt_mask(cpu));
>> +#endif
>> +
>> + return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
>
> On Power10 and Power11, which have SMT8 and an LLC size of 4, this would
> disable cache-aware scheduling even for a single thread.
>
Using smt_nr was mainly due to concerns about introducing regressions
on Power, as discussed in v3:
https://lore.kernel.org/all/8f6c7c69-b6b3-4c82-8db3-96757f09245f@linux.ibm.com/
and
https://lore.kernel.org/all/ddb9d558-d114-41db-9d4b-296fc2ecdbb4@linux.ibm.com/
It seems that aggregating tasks on an LLC with many SMT siblings per
core and a small LLC size would risk cache contention. Additionally,
with patch [19/19], users can tune
/sys/kernel/debug/sched/llc_aggr_tolerance to adjust the threshold:

	return ((mm->nr_running_avg * smt_nr) >
		(scale * per_cpu(sd_llc_size, cpu)));
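
To make the arithmetic concrete, here is a minimal userspace sketch of
that check, not the kernel implementation: smt_nr stands in for
cpumask_weight(cpu_smt_mask(cpu)), llc_size for per_cpu(sd_llc_size,
cpu), and scale for the llc_aggr_tolerance tunable (left at 1 here for
illustration). The Power10 values (SMT8, 4 CPUs per LLC) come from
your report; the x86 values are assumptions.

	#include <stdbool.h>
	#include <stdio.h>

	/* Illustrative stand-in for the kernel check. */
	static bool exceed_llc_nr(int nr_running_avg, int smt_nr,
				  int llc_size, int scale)
	{
		return (nr_running_avg * smt_nr) > (scale * llc_size);
	}

	int main(void)
	{
		/* Power10: SMT8, LLC spans 4 CPUs: 1 * 8 > 4. */
		printf("Power10, 1 thread:  %s\n",
		       exceed_llc_nr(1, 8, 4, 1) ? "disabled" : "enabled");
		/* Assumed x86 box: SMT2, LLC spans 32 CPUs: 16 * 2 > 32 is false. */
		printf("x86, 16 threads:    %s\n",
		       exceed_llc_nr(16, 2, 32, 1) ? "disabled" : "enabled");
		return 0;
	}

Raising scale via llc_aggr_tolerance relaxes the threshold, so an SMT8
platform with a small LLC could opt back in to aggregation.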
> Also, llc_overload_pct already ensures the load on the preferred LLC doesn't
> exceed a certain capacity. Why is this exceed_llc_nr() check needed? Won't the
> existing overload_pct naturally prevent excessive task aggregation by blocking
> migrations when the destination LLC reaches ~50% utilization?
>
exceed_llc_nr() is used because some short-duration tasks can generate
low utilization yet still cause cache contention (for some reason,
util_avg cannot track that properly), schbench being one example.
Therefore, we inhibit task aggregation when the number of active
threads is large.
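
As a back-of-the-envelope illustration of that gap (hypothetical
numbers, not measured schbench data):

	#include <stdio.h>

	/*
	 * Short-duration tasks wake, run briefly, and sleep, so each
	 * contributes little utilization even when many of them are
	 * active and keep touching cache lines in the LLC.
	 */
	int main(void)
	{
		int nr_threads = 64;       /* active threads in the process */
		double run_us = 50.0;      /* each runs ~50us per wakeup ... */
		double period_us = 1000.0; /* ... once per 1ms */

		double util_pct = run_us / period_us * 100.0;

		printf("per-thread utilization ~%.0f%%\n", util_pct);
		printf("yet %d threads still share the LLC\n", nr_threads);
		/*
		 * A ~50%-utilization gate (llc_overload_pct) would not
		 * trigger here, while a thread-count gate such as
		 * exceed_llc_nr() would.
		 */
		return 0;
	}
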
thanks,
Chenyu