Message-ID: <2924fe29-e813-4dc5-8a1b-86890029e372@gmail.com>
Date: Thu, 18 Dec 2025 17:42:53 +0800
From: Vern Hao <haoxing990@...il.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
 Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
 Hillf Danton <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
 Jianyong Wu <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
 Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>,
 Len Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>,
 Zhao Liu <zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>,
 Adam Li <adamli@...amperecomputing.com>, Aaron Lu <ziqianlu@...edance.com>,
 Tim Chen <tim.c.chen@...el.com>, linux-kernel@...r.kernel.org,
 Tim Chen <tim.c.chen@...ux.intel.com>, Peter Zijlstra
 <peterz@...radead.org>, K Prateek Nayak <kprateek.nayak@....com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 "Gautham R . Shenoy" <gautham.shenoy@....com>, Ingo Molnar
 <mingo@...hat.com>, Vern Hao <haoxing990@...il.com>
Subject: Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for
 memory-heavy processes


On 2025/12/18 16:32, Chen, Yu C wrote:
> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>
>> On 2025/12/4 07:07, Tim Chen wrote:
>>> From: Chen Yu <yu.c.chen@...el.com>
>>>
>>> Prateek and Tingyin reported that memory-intensive workloads (such as
>>> stream) can saturate memory bandwidth and caches on the preferred LLC
>>> when sched_cache aggregates too many threads.
>>>
>>> To mitigate this, estimate a process's memory footprint by comparing
>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>> exceeds the LLC size, skip cache-aware scheduling.
Restricting based on RSS prevents many applications from benefiting from
this optimization. I believe this restriction should be lifted: for
memory-intensive workloads, the optimization may simply yield no gains,
but it certainly should not make performance worse. We need to further
refine this logic.
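
For reference, the check described in the quoted commit message boils down
to comparing a process's anonymous plus shared RSS against the LLC size.
A minimal sketch of that comparison (illustrative only; the helper name is
made up and the actual RSS hunk of the patch is not quoted in full here):

    static bool rss_exceeds_llc(struct mm_struct *mm, struct cacheinfo *ci)
    {
            /* RSS counters are in pages; ci->size is the cache size in bytes. */
            unsigned long rss_pages = get_mm_counter(mm, MM_ANONPAGES) +
                                      get_mm_counter(mm, MM_SHMEMPAGES);

            return (rss_pages << PAGE_SHIFT) > ci->size;
    }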
>
> Memory-intensive workloads may trigger performance regressions when
> memory bandwidth (from the L3 cache to the memory controller) is
> saturated due to task aggregation on a single LLC. We have seen this
> issue in stream benchmark runs in a previous version.

RSS size and bandwidth saturation are not necessarily linked. In my
view, the optimization should be robust enough that it does not cause a
noticeable drop in performance, no matter how large the RSS is. We need
to have a more in-depth discussion on this.
>
> Patch 23 introduces a debugfs knob, llc_aggr_tolerance, that lets
> userspace tune the scale factor. This allows memory-intensive workloads
> to perform task aggregation when their footprint is small and the
> administrator considers it safe. As you noted in another patch,
> fine-grained control would improve flexibility, and this can be
> addressed in future iterations.
>
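
As a purely hypothetical illustration of how such a scale factor could be
applied (patch 23 defines the real knob and its semantics, which may well
differ; the names below are invented), the LLC-size threshold could be
scaled before the RSS comparison:

    /* Hypothetical: interpret llc_aggr_tolerance as a percentage of the LLC size. */
    static unsigned int llc_aggr_tolerance = 100;

    static bool exceeds_scaled_llc(unsigned long rss_bytes, unsigned long llc_bytes)
    {
            unsigned long limit = llc_bytes / 100 * llc_aggr_tolerance;

            return rss_bytes > limit;
    }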
>>> Note that RSS is only an approximation of the memory footprint.
>>> By default, the comparison is strict, but a later patch will allow
>>> users to provide a hint to adjust this threshold.
>>>
>>> According to the test from Adam, some systems do not have a shared L3
>>> but have a shared L2 within clusters. In this case, the L2 becomes the
>>> LLC[1].
>>>
>>> Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739-b00e28a09cb6@...amperecomputing.com/
>>>
>>> Co-developed-by: Tim Chen <tim.c.chen@...ux.intel.com>
>>> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
>>> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
>>> ---
>>>
>>> Notes:
>>>      v1->v2: Assigned curr_cpu in task_cache_work() before checking
>>>              exceed_llc_capacity(mm, curr_cpu) to avoid out-of-bounds
>>>              access. (lkp/0day)
>>>
>>>   include/linux/cacheinfo.h | 21 ++++++++++-------
>>>   kernel/sched/fair.c       | 49 +++++++++++++++++++++++++++++++++++----
>>>   2 files changed, 57 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
>>> index c8f4f0a0b874..82d0d59ca0e1 100644
>>> --- a/include/linux/cacheinfo.h
>>> +++ b/include/linux/cacheinfo.h
>>> @@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu,
>>>   const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf);
>>> -/*
>>> - * Get the cacheinfo structure for the cache associated with @cpu at
>>> - * level @level.
>>> - * cpuhp lock must be held.
>>> - */
>>> -static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
>>> +static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, int level)
>>>   {
>>>       struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
>>>       int i;
>>> -    lockdep_assert_cpus_held();
>>> -
>>>       for (i = 0; i < ci->num_leaves; i++) {
>>>           if (ci->info_list[i].level == level) {
>>>               if (ci->info_list[i].attributes & CACHE_ID)
>>> @@ -136,6 +129,18 @@ static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
>>>       return NULL;
>>>   }
>>> +/*
>>> + * Get the cacheinfo structure for the cache associated with @cpu at
>>> + * level @level.
>>> + * cpuhp lock must be held.
>>> + */
>>> +static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
>>> +{
>>> +    lockdep_assert_cpus_held();
>>> +
>>> +    return _get_cpu_cacheinfo_level(cpu, level);
>>> +}
>>> +
>>>   /*
>>>    * Get the id of the cache associated with @cpu at level @level.
>>>    * cpuhp lock must be held.
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 6afa3f9a4e9b..424ec601cfdf 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1223,6 +1223,38 @@ static int llc_id(int cpu)
>>>       return llc;
>>>   }
>>> +static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>>> +{
>>> +    struct cacheinfo *ci;
>>> +    unsigned long rss;
>>> +    unsigned int llc;
>>> +
>>> +    /*
>>> +     * get_cpu_cacheinfo_level() can not be used
>>> +     * because it requires the cpu_hotplug_lock
>>> +     * to be held. Use _get_cpu_cacheinfo_level()
>>> +     * directly because the 'cpu' can not be
>>> +     * offlined at the moment.
>>> +     */
>>> +    ci = _get_cpu_cacheinfo_level(cpu, 3);
>>> +    if (!ci) {
>>> +        /*
>>> +         * On system without L3 but with shared L2,
>>> +         * L2 becomes the LLC.
>>> +         */
>>> +        ci = _get_cpu_cacheinfo_level(cpu, 2);
>>> +        if (!ci)
>>> +            return true;
>>> +    }
>> Is it necessary to look this up leaf by leaf every time just to get the
>> LLC size? Could a static variable be set up instead when the sched
>> domains are built?
>
> I suppose you suggested introducing a per-CPU variable, like
> per_cpu(sd_llc_bytes, cpu), or something similar to
> struct cpuinfo_x86.x86_cache_size. I am not sure if the community would
> endorse introducing this variable, given that sched_cache would be its
> only user. We can leave this as an open question.
>
> thanks,
> Chenyu
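
For what it is worth, a hypothetical sketch of the per-CPU idea mentioned
above (nothing like this exists in the posted series; the variable name and
update point are invented) could record the LLC size once while the cpuhp
lock is held, e.g. when sched domains are rebuilt, so hot paths would not
need to walk the cacheinfo leaves:

    /* Hypothetical cache of the LLC size, filled while the cpuhp lock is held. */
    static DEFINE_PER_CPU(unsigned int, sd_llc_bytes);

    static void update_sd_llc_bytes(int cpu)
    {
            struct cacheinfo *ci = get_cpu_cacheinfo_level(cpu, 3);

            /* No shared L3: fall back to a shared L2 acting as the LLC. */
            if (!ci)
                    ci = get_cpu_cacheinfo_level(cpu, 2);

            per_cpu(sd_llc_bytes, cpu) = ci ? ci->size : 0;
    }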
