Message-ID: <94c8c1af-d9a5-411b-bc54-a7b28d6cff29@intel.com>
Date: Fri, 19 Dec 2025 20:55:46 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: K Prateek Nayak <kprateek.nayak@....com>, Vern Hao <haoxing990@...il.com>
CC: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, "Hillf
Danton" <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
"Jianyong Wu" <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>, Len
Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>, Zhao Liu
<zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Adam Li
<adamli@...amperecomputing.com>, Aaron Lu <ziqianlu@...edance.com>, Tim Chen
<tim.c.chen@...el.com>, <linux-kernel@...r.kernel.org>, Tim Chen
<tim.c.chen@...ux.intel.com>, Peter Zijlstra <peterz@...radead.org>, "Vincent
Guittot" <vincent.guittot@...aro.org>, "Gautham R . Shenoy"
<gautham.shenoy@....com>, Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for
memory-heavy processes
On 12/19/2025 11:14 AM, K Prateek Nayak wrote:
> Hello Vern,
>
> On 12/18/2025 3:12 PM, Vern Hao wrote:
>>
>> On 2025/12/18 16:32, Chen, Yu C wrote:
>>> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>>>
>>>> On 2025/12/4 07:07, Tim Chen wrote:
>>>>> From: Chen Yu <yu.c.chen@...el.com>
>>>>>
>>>>> Prateek and Tingyin reported that memory-intensive workloads (such as
>>>>> stream) can saturate memory bandwidth and caches on the preferred LLC
>>>>> when sched_cache aggregates too many threads.
>>>>>
>>>>> To mitigate this, estimate a process's memory footprint by comparing
>>>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>>>> exceeds the LLC size, skip cache-aware scheduling.
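
To recap the mechanism for anyone skimming the thread, the bailout is
conceptually along the lines of the sketch below. This is a simplified
illustration rather than the patch code itself: get_mm_counter() and the
MM_ANONPAGES/MM_SHMEMPAGES counters are existing mm APIs, while the
function signature and the LLC-size argument are only stand-ins for how
the series actually looks up the preferred LLC's capacity.

#include <linux/mm.h>

/*
 * Sketch only: skip cache-aware aggregation when a process's RSS
 * (anonymous + shared pages) no longer fits in the preferred LLC.
 * llc_size_pages is assumed to be supplied by the caller.
 */
static bool exceed_llc_capacity(struct mm_struct *mm,
				unsigned long llc_size_pages)
{
	unsigned long rss = get_mm_counter(mm, MM_ANONPAGES) +
			    get_mm_counter(mm, MM_SHMEMPAGES);

	return rss > llc_size_pages;
}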
>>>> Restricting RSS prevents many applications from benefiting from this optimization. I believe this restriction should be lifted. For memory-intensive workloads, the optimization may simply yield no gains, but it certainly shouldn't make performance worse. We need to further refine this logic.
>>>
>>> Memory-intensive workloads may trigger performance regressions when
>>> memory bandwidth (from L3 cache to memory controller) is saturated due
>> RSS size and bandwidth saturation are not necessarily linked. In my view, the optimization should be robust enough that it doesn't cause a noticeable drop in performance, no matter how large the RSS is.
>
> Easier said than done. I agree RSS size is not a clear indication of
> bandwidth saturation. With NUMA Balancing enabled, we can use the
> hinting faults to estimate the working set and make decisions, but for
> systems that do not have NUMA, short of programming some performance
> counters, there is no real way to estimate the working set.
>
> Hinting faults are known to cause overheads, so enabling them on systems
> without NUMA would add noticeable overhead with no real benefit.
>
>> We need to have a deeper discussion on this.
>
> What do you have in mind?
>
> From where I stand, having the RSS-based bailout for now won't make
> things worse for these tasks with huge memory reserves, and when we can
> all agree on some generic method to estimate the working set of a task,
> we can always add it into exceed_llc_capacity().
>
Prateek, thanks very much for the practical callouts - using RSS seems to be
the best trade-off we can go with for now. Vern, I get your point about the
gap between RSS and the actual memory footprint. However, detecting the
working set accurately and generically in kernel space does not seem
feasible - even with NUMA fault statistics sampling. One reliable way I can
think of to detect the working set is in user space, via resctrl (Intel RDT,
AMD QoS, Arm MPAM). So maybe we can leverage that information later to
implement fine-grained control on a per-process or per-task basis.
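
As a rough illustration (everything below is an example, not a finalized
interface): assuming resctrl is mounted at /sys/fs/resctrl on a machine
with LLC occupancy monitoring, and the target PID has been added to the
"tasks" file of a monitoring group named "wss", user space could sample
that process's LLC occupancy roughly like this:

#include <stdio.h>

int main(void)
{
	/* Illustrative path: default resctrl mount, mon group "wss",
	 * L3 domain 0. The file reports occupancy in bytes. */
	const char *path =
		"/sys/fs/resctrl/mon_groups/wss/mon_data/mon_L3_00/llc_occupancy";
	unsigned long long occ_bytes;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu", &occ_bytes) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("L3 occupancy: %llu bytes\n", occ_bytes);
	return 0;
}

A reading like this, compared against the LLC size, could later feed a
per-process or per-task knob instead of (or alongside) the RSS heuristic.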
thanks,
Chenyu