Message-ID: <6da4333d-2f64-4b4b-8b51-8d0ca937b946@gmail.com>
Date: Mon, 22 Dec 2025 10:49:11 +0800
From: Vern Hao <haoxing990@...il.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>,
K Prateek Nayak <kprateek.nayak@....com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
Jianyong Wu <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>,
Len Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>,
Zhao Liu <zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>,
Adam Li <adamli@...amperecomputing.com>, Aaron Lu <ziqianlu@...edance.com>,
Tim Chen <tim.c.chen@...el.com>, linux-kernel@...r.kernel.org,
Tim Chen <tim.c.chen@...ux.intel.com>, Peter Zijlstra
<peterz@...radead.org>, Vincent Guittot <vincent.guittot@...aro.org>,
"Gautham R . Shenoy" <gautham.shenoy@....com>, Ingo Molnar
<mingo@...hat.com>, Vern Hao <haoxing990@...il.com>
Subject: Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for
memory-heavy processes
On 2025/12/19 20:55, Chen, Yu C wrote:
> On 12/19/2025 11:14 AM, K Prateek Nayak wrote:
>> Hello Vern,
>>
>> On 12/18/2025 3:12 PM, Vern Hao wrote:
>>>
>>> On 2025/12/18 16:32, Chen, Yu C wrote:
>>>> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>>>>
>>>>> On 2025/12/4 07:07, Tim Chen wrote:
>>>>>> From: Chen Yu <yu.c.chen@...el.com>
>>>>>>
>>>>>> Prateek and Tingyin reported that memory-intensive workloads
>>>>>> (such as stream) can saturate memory bandwidth and caches on the
>>>>>> preferred LLC when sched_cache aggregates too many threads.
>>>>>>
>>>>>> To mitigate this, estimate a process's memory footprint by comparing
>>>>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>>>>> exceeds the LLC size, skip cache-aware scheduling.
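(For readers skimming the thread: a minimal sketch of the check described
above. The series wires this into the exceed_llc_capacity() helper that
comes up later in the thread; apart from the standard mm counters, the
names below are made up purely for illustration.)

/*
 * Illustrative sketch only, not the actual patch: skip cache-aware
 * aggregation when the process's RSS (anon + shmem, as described above)
 * cannot fit in the preferred LLC. get_mm_counter(), MM_ANONPAGES and
 * MM_SHMEMPAGES are existing mm counters; llc_size is assumed to be the
 * LLC capacity in bytes.
 */
static bool rss_exceeds_llc(struct mm_struct *mm, unsigned long llc_size)
{
	unsigned long rss_pages;

	rss_pages = get_mm_counter(mm, MM_ANONPAGES) +
		    get_mm_counter(mm, MM_SHMEMPAGES);

	return (rss_pages << PAGE_SHIFT) > llc_size;
}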
>>>>> Restricting RSS prevents many applications from benefiting from
>>>>> this optimization. I believe this restriction should be lifted.
>>>>> For memory-intensive workloads, the optimization may simply yield
>>>>> no gains, but it certainly shouldn't make performance worse. We
>>>>> need to further refine this logic.
>>>>
>>>> Memory-intensive workloads may trigger performance regressions when
>>>> memory bandwidth (from L3 cache to memory controller) is saturated due
>>> RSS size and bandwidth saturation are not necessarily linked. In my
>>> view, the optimization should be robust enough that it doesn't cause
>>> a noticeable drop in performance, no matter how large the RSS is.
>>
>> Easier said than done. I agree RSS size is not a clear indication of
>> bandwidth saturation. With NUMA Balancing enabled, we can use the
>> hinting faults to estimate the working set and make decisions, but for
>> systems that do not have NUMA, short of programming some performance
>> counters, there is no real way to estimate the working set.
>>
>> Hinting faults are known to add overhead, so enabling them on systems
>> without NUMA incurs a noticeable cost with no real benefit.
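(Side note to make the NUMA-only nature concrete: an estimate along those
lines would have to lean on hinting-fault statistics that only exist under
CONFIG_NUMA_BALANCING, roughly as in the sketch below. The PAGE_SIZE
scaling is an assumption for illustration, not anything from this series.)

/*
 * Rough illustration only: treat the task's decayed hinting-fault total
 * as a proxy for its working set. p->numa_faults and
 * p->total_numa_faults exist only under CONFIG_NUMA_BALANCING, which is
 * why this cannot be a generic answer.
 */
static bool numa_faults_exceed_llc(struct task_struct *p,
				   unsigned long llc_size)
{
	if (!p->numa_faults)
		return false;	/* no samples collected yet */

	return (p->total_numa_faults * PAGE_SIZE) > llc_size;
}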
>>
>>> We need to have a more profound discussion on this.
>>
>> What do you have in mind?
>>
>> From where I stand, having the RSS-based bailout for now won't make
>> things worse for these tasks with huge memory reserves, and once we
>> can all agree on some generic method to estimate the working set of a
>> task, we can always add it into exceed_llc_capacity().
>>
>
> Prateek, thanks very much for the practical callouts - using RSS seems
> to be the best trade-off we can go with for now. Vern, I get your point
> about the gap between RSS and the actual memory footprint. However,
> detecting the working set doesn’t seem to be accurate or generic in
> kernel space, even with NUMA fault statistics sampling. One reliable
> way I can think of to detect the working set is in user space, via
> resctrl (Intel RDT, AMD QoS, Arm MPAM). So maybe we can leverage that
> information to implement fine-grained control on a per-process or
> per-task basis later.
OK, I agree, thanks.
>
> thanks,
> Chenyu
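(Appendix, since resctrl came up: once monitoring is available, reading a
group's LLC occupancy from user space is just a file read. A minimal
sketch, assuming resctrl is mounted at /sys/fs/resctrl, that llc_occupancy
monitoring is supported by the hardware, and that the L3 domain directory
is mon_L3_00, which varies per system:)

/* Minimal user-space sketch of the resctrl route mentioned above. */
#include <stdio.h>

int main(void)
{
	const char *path =
		"/sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy";
	unsigned long long bytes;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu", &bytes) != 1) {
		fprintf(stderr, "failed to parse %s\n", path);
		fclose(f);
		return 1;
	}
	fclose(f);

	/* Occupancy is reported in bytes for the group owning mon_data. */
	printf("LLC occupancy: %llu bytes\n", bytes);
	return 0;
}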