linux-kernel - Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <d5de9bfc-a274-4af3-8051-17d386f52990@gmail.com>
Date: Mon, 22 Dec 2025 10:19:35 +0800
From: Vern Hao <haoxing990@...il.com>
To: K Prateek Nayak <kprateek.nayak@....com>, "Chen, Yu C"
 <yu.c.chen@...el.com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
 Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
 Hillf Danton <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
 Jianyong Wu <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
 Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>,
 Len Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>,
 Zhao Liu <zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>,
 Adam Li <adamli@...amperecomputing.com>, Aaron Lu <ziqianlu@...edance.com>,
 Tim Chen <tim.c.chen@...el.com>, linux-kernel@...r.kernel.org,
 Tim Chen <tim.c.chen@...ux.intel.com>, Peter Zijlstra
 <peterz@...radead.org>, Vincent Guittot <vincent.guittot@...aro.org>,
 "Gautham R . Shenoy" <gautham.shenoy@....com>, Ingo Molnar
 <mingo@...hat.com>, Vern Hao <haoxing990@...il.com>
Subject: Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for
 memory-heavy processes

On 2025/12/19 11:14, K Prateek Nayak wrote:
> Hello Vern,
>
> On 12/18/2025 3:12 PM, Vern Hao wrote:
>> On 2025/12/18 16:32, Chen, Yu C wrote:
>>> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>>> On 2025/12/4 07:07, Tim Chen wrote:
>>>>> From: Chen Yu <yu.c.chen@...el.com>
>>>>>
>>>>> Prateek and Tingyin reported that memory-intensive workloads (such as
>>>>> stream) can saturate memory bandwidth and caches on the preferred LLC
>>>>> when sched_cache aggregates too many threads.
>>>>>
>>>>> To mitigate this, estimate a process's memory footprint by comparing
>>>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>>>> exceeds the LLC size, skip cache-aware scheduling.
>>>> Restricting RSS prevents many applications from benefiting from this optimization. I believe this restriction should be lifted. For memory- intensive workloads, the optimization may simply yield no gains, but it certainly shouldn't make performance worse. We need to further refine this logic.
>>> Memory-intensive workloads may trigger performance regressions when
>>> memory bandwidth(from L3 cache to memory controller) is saturated due
>> RSS size and bandwidth saturation are not necessarily linked, In my view, the optimization should be robust enough that it doesn't cause a noticeable drop in performance, no matter how large the RSS is.
> Easier said than done. I agree RSS size is not a clear indication of
> bandwidth saturation. With NUMA Balancing enabled, we can use the
> hinting faults to estimate the working set and make decisions but for
> systems that do not have NUMA, short of programming some performance
> counters, there is no real way to estimate the working set.
I see the challenge, but the reality is that many production workloads 
have large memory footprints and deserve to see performance gains as 
well. In my testing with Chen Yu on STREAM, it's intriguing that the 
performance is fine without |llc_enable| but drops significantly once 
it's turned on.I sincerely hope this situation can be optimized; 
otherwise, we won't be able to utilize these optimizations in 
large-memory scenarios.
>
> Hinting faults are known to cause overheads so enabling them without
> NUMA can cause noticeable overheads with no real benefits.
>
>> We need to have a more profound discussion on this.
> What do you have in mind?
I am wondering if we could address this through alternative approaches, 
such as reducing the migration frequency or preventing excessive task 
stacking within a single LLC. Of course, defining the right metrics to 
evaluate these conditions remains a significant challenge.
>
>  From where I stand, having the RSS based bailout for now won't make
> things worse for these tasks with huge memory reserves and when we can
> all agree on some generic method to estimate the working set of a task,
> we can always add it into exceed_llc_capacity().
>