Message-ID: <f98d7afc-ee82-4552-b15a-c86315f6ced0@intel.com>
Date: Wed, 24 Dec 2025 15:51:16 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Yangyu Chen <cyy@...self.name>, Tim Chen <tim.c.chen@...ux.intel.com>
CC: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, "K
 Prateek Nayak" <kprateek.nayak@....com>, "Gautham R . Shenoy"
	<gautham.shenoy@....com>, Vincent Guittot <vincent.guittot@...aro.org>, "Juri
 Lelli" <juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, "Mel
 Gorman" <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, "Madadi
 Vineeth Reddy" <vineethr@...ux.ibm.com>, Hillf Danton <hdanton@...a.com>,
	Shrikanth Hegde <sshegde@...ux.ibm.com>, Jianyong Wu
	<jianyong.wu@...look.com>, Tingyin Duan <tingyin.duan@...il.com>, Vern Hao
	<vernhao@...cent.com>, Vern Hao <haoxing990@...il.com>, Len Brown
	<len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>, Zhao Liu
	<zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Adam Li
	<adamli@...amperecomputing.com>, Aaron Lu <ziqianlu@...edance.com>, Tim Chen
	<tim.c.chen@...el.com>, <linux-kernel@...r.kernel.org>, Qais Yousef
	<qyousef@...alina.io>
Subject: Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the
 parameters of cache-aware scheduling

On 12/24/2025 11:28 AM, Yangyu Chen wrote:
> 
> 
>> On 24 Dec 2025, at 00:44, Yangyu Chen <cyy@...self.name> wrote:
>>
>>> On 23 Dec 2025, at 20:12, Yangyu Chen <cyy@...self.name> wrote:
>>>
>>>> On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@...ux.intel.com> wrote:
>>>>
>>>> From: Chen Yu <yu.c.chen@...el.com>
>>>>
>>>> Introduce a set of debugfs knobs to control the enabling of
>>>> and parameters for cache-aware load balancing.
>>>>
>>>> (1) llc_enabled
>>>> llc_enabled acts as the primary switch - users can toggle it to
>>>> enable or disable cache-aware load balancing.
>>>>
>>>> (2) llc_aggr_tolerance
>>>> With sched_cache enabled, the scheduler uses a process's RSS as a
>>>> proxy for its LLC footprint to determine if aggregating tasks on the
>>>> preferred LLC could cause cache contention. If RSS exceeds the LLC
>>>> size, aggregation is skipped. Some workloads with large RSS but small
>>>> actual memory footprints may still benefit from aggregation. Since
>>>> the kernel cannot efficiently track per-task cache usage (resctrl is
>>>> user-space only), userspace can provide a more accurate hint.
>>>>
>>>> Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
>>>> users control how strictly RSS limits aggregation. Values range from
>>>> 0 to 100:
>>>>
>>>> - 0: Cache-aware scheduling is disabled.
>>>> - 1: Strict; tasks with RSS larger than LLC size are skipped.
>>>> - 100: Aggressive; tasks are aggregated regardless of RSS.
>>>>
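
To make these values concrete, here is a minimal sketch of how such a
tolerance could gate aggregation, assuming the knob linearly scales the
LLC-size cutoff; llc_aggr_allowed() and its parameters are illustrative
names, not the actual patch code:

static bool llc_aggr_allowed(unsigned long rss, unsigned long llc_size,
			     unsigned int tolerance)
{
	if (tolerance == 0)	/* 0: cache-aware scheduling disabled */
		return false;
	if (tolerance >= 100)	/* 100: aggregate regardless of RSS */
		return true;
	/* in between: tolerate an RSS up to tolerance times the LLC size */
	return rss <= (unsigned long long)llc_size * tolerance;
}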
>>>
>>> Hi Chen Yu and Tim Chen,
>>>
>>> Maybe we should have something like prctl(PR_LLC_AGGR_TOLERANCE, 100).
>>>
>>> I have tested this version of the patch on my EPYC Milan 7V13 (7763 variant) physical machine, with a 32 MB LLC for each 8-core CCX. I found that I need to set "llc_aggr_tolerance" to 100; otherwise I can't get cache-aware scheduling to work on a Verilated [1] XiangShan [2] running chacha20 [3], as I mentioned before [4].
>>>
>>
>> In addition, I have investigated why this happens, and I finally
>> realized it is because the workload shows 35596 kB of RssAnon on
>> my EPYC Milan machine, slightly exceeding the LLC size (32 MB). I
>> tested it on an EPYC Genoa cloud server with the correct core /
>> cache hierarchy in its ACPI table, which shows 31700 kB of RssAnon
>> and thus fits in the LLC. I have no idea why my result shows a
>> higher RssAnon, since both machines run Debian Trixie with the
>> exact same kernel and the same executable. But it reminds me that
>> we should have a userspace API for this.
>>
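
For reference, the RssAnon figures quoted above come from the standard
/proc/<pid>/status field, which a userspace tool can read directly; a
small self-contained example:

#include <stdio.h>

/* Return the RssAnon value (in kB) of the current process, or -1. */
static long read_rss_anon_kb(void)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "RssAnon: %ld kB", &kb) == 1)
			break;
	}
	fclose(f);
	return kb;
}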
> 
> In addition, while profiling the Verilator workload, I found that
> scheduling tasks onto SMT siblings results in poor performance.
> Thus, I think we should separate the RSS-size control from the SMT
> scaling.
> 

Thanks for the investigation. Could you elaborate a little more on
what you mean by scheduled to SMTs? Do you mean that if every CPU
(SMT thread) in the LLC has one running task, the performance is
impacted? I thought we had exceed_llc_nr() to check the SMT
occupancy and avoid this?
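
Conceptually, the check I am referring to looks something like the
sketch below; this is illustrative only, not the actual
exceed_llc_nr() code, and the avoid_smt variant is the stricter
behavior your workload seems to want:

static bool llc_nr_exceeded(int nr_running, int nr_cores, int nr_threads,
			    bool avoid_smt)
{
	/*
	 * Stop aggregating once the preferred LLC is saturated. With
	 * avoid_smt, saturation means one running task per physical
	 * core, so aggregation never forces tasks onto SMT siblings.
	 */
	int limit = avoid_smt ? nr_cores : nr_threads;

	return nr_running >= limit;
}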

> It's notable that the RSS is not the actual memory footprint. It
> would be better if we could use the l2_miss or l3_miss events to
> estimate the L3 hit rate. This is just future work.
> 

Yes, in user space we can collect PMU events and memory bandwidth via
resctrl to decide whether to set task attributes.
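
For example, the resctrl monitoring interface exposes per-domain LLC
occupancy, which a userspace daemon could poll before deciding to set
a task attribute (this assumes resctrl is mounted at /sys/fs/resctrl
and the CPU supports occupancy monitoring):

#include <stdio.h>

/* Read the LLC occupancy (bytes) of the default resctrl group on L3
 * cache domain 0, or return -1 on failure. */
static long long read_llc_occupancy(void)
{
	long long bytes = -1;
	FILE *f = fopen("/sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy",
			"r");

	if (!f)
		return -1;
	if (fscanf(f, "%lld", &bytes) != 1)
		bytes = -1;
	fclose(f);
	return bytes;
}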

> I'm willing to provide a patch for such a prctl. But I'm busy these
> days, maybe I can have the time to do that after one week.
> 

Sure. We haven't decided yet which interface to leverage. Also, Qais
is working on a QoS interface [1] - maybe we can build on his work.

[1] 
https://lore.kernel.org/all/20240820163512.1096301-11-qyousef@layalina.io/
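
For the record, usage of the prctl you proposed might look like the
snippet below; PR_LLC_AGGR_TOLERANCE and its request number are
hypothetical, since no such constant exists in any kernel yet:

#include <sys/prctl.h>

/* hypothetical request number, not a real kernel ABI */
#define PR_LLC_AGGR_TOLERANCE	78

int main(void)
{
	/* opt this process in to aggressive aggregation (tolerance 100) */
	return prctl(PR_LLC_AGGR_TOLERANCE, 100, 0, 0, 0) ? 1 : 0;
}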

thanks,
Chenyu



