Message-ID: <6da4333d-2f64-4b4b-8b51-8d0ca937b946@gmail.com>
Date: Mon, 22 Dec 2025 10:49:11 +0800
From: Vern Hao <haoxing990@...il.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>,
K Prateek Nayak <kprateek.nayak@....com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
Jianyong Wu <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>,
Len Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>,
Zhao Liu <zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>,
Adam Li <adamli@...amperecomputing.com>, Aaron Lu <ziqianlu@...edance.com>,
Tim Chen <tim.c.chen@...el.com>, linux-kernel@...r.kernel.org,
Tim Chen <tim.c.chen@...ux.intel.com>, Peter Zijlstra
<peterz@...radead.org>, Vincent Guittot <vincent.guittot@...aro.org>,
"Gautham R . Shenoy" <gautham.shenoy@....com>, Ingo Molnar
<mingo@...hat.com>, Vern Hao <haoxing990@...il.com>
Subject: Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for
memory-heavy processes
On 2025/12/19 20:55, Chen, Yu C wrote:
> On 12/19/2025 11:14 AM, K Prateek Nayak wrote:
>> Hello Vern,
>>
>> On 12/18/2025 3:12 PM, Vern Hao wrote:
>>>
>>> On 2025/12/18 16:32, Chen, Yu C wrote:
>>>> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>>>>
>>>>> On 2025/12/4 07:07, Tim Chen wrote:
>>>>>> From: Chen Yu <yu.c.chen@...el.com>
>>>>>>
>>>>>> Prateek and Tingyin reported that memory-intensive workloads
>>>>>> (such as stream) can saturate memory bandwidth and caches on the
>>>>>> preferred LLC when sched_cache aggregates too many threads.
>>>>>>
>>>>>> To mitigate this, estimate a process's memory footprint by comparing
>>>>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>>>>> exceeds the LLC size, skip cache-aware scheduling.
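(For readers skimming the thread: a minimal sketch of the check described
above. The series wires this into the exceed_llc_capacity() helper that
comes up later in the thread; apart from the standard mm counters, the
names below are made up purely for illustration.)

/*
 * Illustrative sketch only, not the actual patch: skip cache-aware
 * aggregation when the process's RSS (anon + shmem, as described above)
 * cannot fit in the preferred LLC. get_mm_counter(), MM_ANONPAGES and
 * MM_SHMEMPAGES are existing mm counters; llc_size is assumed to be the
 * LLC capacity in bytes.
 */
static bool rss_exceeds_llc(struct mm_struct *mm, unsigned long llc_size)
{
	unsigned long rss_pages;

	rss_pages = get_mm_counter(mm, MM_ANONPAGES) +
		    get_mm_counter(mm, MM_SHMEMPAGES);

	return (rss_pages << PAGE_SHIFT) > llc_size;
}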
>>>>> Restricting RSS prevents many applications from benefiting from
>>>>> this optimization. I believe this restriction should be lifted.
>>>>> For memory-intensive workloads, the optimization may simply yield
>>>>> no gains, but it certainly shouldn't make performance worse. We
>>>>> need to further refine this logic.
>>>>
>>>> Memory-intensive workloads may trigger performance regressions when
>>>> memory bandwidth (from L3 cache to memory controller) is saturated due
>>> RSS size and bandwidth saturation are not necessarily linked. In my
>>> view, the optimization should be robust enough that it doesn't cause
>>> a noticeable drop in performance, no matter how large the RSS is.
>>
>> Easier said than done. I agree RSS size is not a clear indication of
>> bandwidth saturation. With NUMA Balancing enabled, we can use the
>> hinting faults to estimate the working set and make decisions, but for
>> systems that do not have NUMA, short of programming some performance
>> counters, there is no real way to estimate the working set.
>>
>> Hinting faults are known to add overhead, so enabling them on systems
>> without NUMA incurs a noticeable cost with no real benefit.
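(Side note to make the NUMA-only nature concrete: an estimate along those
lines would have to lean on hinting-fault statistics that only exist under
CONFIG_NUMA_BALANCING, roughly as in the sketch below. The PAGE_SIZE
scaling is an assumption for illustration, not anything from this series.)

/*
 * Rough illustration only: treat the task's decayed hinting-fault total
 * as a proxy for its working set. p->numa_faults and
 * p->total_numa_faults exist only under CONFIG_NUMA_BALANCING, which is
 * why this cannot be a generic answer.
 */
static bool numa_faults_exceed_llc(struct task_struct *p,
				   unsigned long llc_size)
{
	if (!p->numa_faults)
		return false;	/* no samples collected yet */

	return (p->total_numa_faults * PAGE_SIZE) > llc_size;
}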
>>
>>> We need to have a more profound discussion on this.
>>
>> What do you have in mind?
>>
>> From where I stand, having the RSS-based bailout for now won't make
>> things worse for these tasks with huge memory reserves, and once we
>> can all agree on some generic method to estimate the working set of a
>> task, we can always add it into exceed_llc_capacity().
>>
>
> Prateek, thanks very much for the practical callouts - using RSS seems
> to be the best trade-off we can go with for now. Vern, I get your point
> about the gap between RSS and the actual memory footprint. However,
> detecting the working set doesn’t seem to be accurate or generic in
> kernel space, even with NUMA fault statistics sampling. One reliable
> way I can think of to detect the working set is in user space, via
> resctrl (Intel RDT, AMD QoS, Arm MPAM). So maybe we can leverage that
> information to implement fine-grained control on a per-process or
> per-task basis later.
OK, I agree, thanks.
>
> thanks,
> Chenyu
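(Appendix, since resctrl came up: once monitoring is available, reading a
group's LLC occupancy from user space is just a file read. A minimal
sketch, assuming resctrl is mounted at /sys/fs/resctrl, that llc_occupancy
monitoring is supported by the hardware, and that the L3 domain directory
is mon_L3_00, which varies per system:)

/* Minimal user-space sketch of the resctrl route mentioned above. */
#include <stdio.h>

int main(void)
{
	const char *path =
		"/sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy";
	unsigned long long bytes;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu", &bytes) != 1) {
		fprintf(stderr, "failed to parse %s\n", path);
		fclose(f);
		return 1;
	}
	fclose(f);

	/* Occupancy is reported in bytes for the group owning mon_data. */
	printf("LLC occupancy: %llu bytes\n", bytes);
	return 0;
}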