Message-ID: <94c8c1af-d9a5-411b-bc54-a7b28d6cff29@intel.com>
Date: Fri, 19 Dec 2025 20:55:46 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: K Prateek Nayak <kprateek.nayak@....com>, Vern Hao <haoxing990@...il.com>
CC: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, "Hillf
Danton" <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
"Jianyong Wu" <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>, Len
Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>, Zhao Liu
<zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Adam Li
<adamli@...amperecomputing.com>, Aaron Lu <ziqianlu@...edance.com>, Tim Chen
<tim.c.chen@...el.com>, <linux-kernel@...r.kernel.org>, Tim Chen
<tim.c.chen@...ux.intel.com>, Peter Zijlstra <peterz@...radead.org>, "Vincent
Guittot" <vincent.guittot@...aro.org>, "Gautham R . Shenoy"
<gautham.shenoy@....com>, Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for
memory-heavy processes
On 12/19/2025 11:14 AM, K Prateek Nayak wrote:
> Hello Vern,
>
> On 12/18/2025 3:12 PM, Vern Hao wrote:
>>
>> On 2025/12/18 16:32, Chen, Yu C wrote:
>>> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>>>
>>>> On 2025/12/4 07:07, Tim Chen wrote:
>>>>> From: Chen Yu <yu.c.chen@...el.com>
>>>>>
>>>>> Prateek and Tingyin reported that memory-intensive workloads (such as
>>>>> stream) can saturate memory bandwidth and caches on the preferred LLC
>>>>> when sched_cache aggregates too many threads.
>>>>>
>>>>> To mitigate this, estimate a process's memory footprint by comparing
>>>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>>>> exceeds the LLC size, skip cache-aware scheduling.
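
To recap the mechanism for anyone skimming the thread, the bailout is
conceptually along the lines of the sketch below. This is a simplified
illustration rather than the patch code itself: get_mm_counter() and the
MM_ANONPAGES/MM_SHMEMPAGES counters are existing mm APIs, while the
function signature and the LLC-size argument are only stand-ins for how
the series actually looks up the preferred LLC's capacity.

#include <linux/mm.h>

/*
 * Sketch only: skip cache-aware aggregation when a process's RSS
 * (anonymous + shared pages) no longer fits in the preferred LLC.
 * llc_size_pages is assumed to be supplied by the caller.
 */
static bool exceed_llc_capacity(struct mm_struct *mm,
				unsigned long llc_size_pages)
{
	unsigned long rss = get_mm_counter(mm, MM_ANONPAGES) +
			    get_mm_counter(mm, MM_SHMEMPAGES);

	return rss > llc_size_pages;
}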
>>>> Restricting RSS prevents many applications from benefiting from this optimization. I believe this restriction should be lifted. For memory-intensive workloads, the optimization may simply yield no gains, but it certainly shouldn't make performance worse. We need to further refine this logic.
>>>
>>> Memory-intensive workloads may trigger performance regressions when
>>> memory bandwidth (from L3 cache to memory controller) is saturated due
>> RSS size and bandwidth saturation are not necessarily linked. In my view, the optimization should be robust enough that it doesn't cause a noticeable drop in performance, no matter how large the RSS is.
>
> Easier said than done. I agree RSS size is not a clear indication of
> bandwidth saturation. With NUMA Balancing enabled, we can use the
> hinting faults to estimate the working set and make decisions, but for
> systems that do not have NUMA, short of programming some performance
> counters, there is no real way to estimate the working set.
>
> Hinting faults are known to cause overheads, so enabling them on systems
> without NUMA would add noticeable overhead with no real benefit.
>
>> We need to have a deeper discussion on this.
>
> What do you have in mind?
>
> From where I stand, having the RSS-based bailout for now won't make
> things worse for these tasks with huge memory reserves, and when we can
> all agree on some generic method to estimate the working set of a task,
> we can always add it into exceed_llc_capacity().
>
Prateek, thanks very much for the practical callouts - using RSS seems to be
the best trade-off we can go with for now. Vern, I get your point about the
gap between RSS and the actual memory footprint. However, detecting the
working set accurately and generically in kernel space does not seem
feasible - even with NUMA fault statistics sampling. One reliable way I can
think of to detect the working set is in user space, via resctrl (Intel RDT,
AMD QoS, Arm MPAM). So maybe we can leverage that information later to
implement fine-grained control on a per-process or per-task basis.
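
As a rough illustration (everything below is an example, not a finalized
interface): assuming resctrl is mounted at /sys/fs/resctrl on a machine
with LLC occupancy monitoring, and the target PID has been added to the
"tasks" file of a monitoring group named "wss", user space could sample
that process's LLC occupancy roughly like this:

#include <stdio.h>

int main(void)
{
	/* Illustrative path: default resctrl mount, mon group "wss",
	 * L3 domain 0. The file reports occupancy in bytes. */
	const char *path =
		"/sys/fs/resctrl/mon_groups/wss/mon_data/mon_L3_00/llc_occupancy";
	unsigned long long occ_bytes;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu", &occ_bytes) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("L3 occupancy: %llu bytes\n", occ_bytes);
	return 0;
}

A reading like this, compared against the LLC size, could later feed a
per-process or per-task knob instead of (or alongside) the RSS heuristic.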
thanks,
Chenyu