Message-ID: <tencent_02C10A9A6B173BD6E18AC7B0844C34DAF00A@qq.com>
Date: Wed, 24 Dec 2025 20:15:22 +0800
From: Yangyu Chen <cyy@...self.name>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Tim Chen <tim.c.chen@...ux.intel.com>,
 Peter Zijlstra <peterz@...radead.org>,
 Ingo Molnar <mingo@...hat.com>,
 K Prateek Nayak <kprateek.nayak@....com>,
 "Gautham R . Shenoy" <gautham.shenoy@....com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Juri Lelli <juri.lelli@...hat.com>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>,
 Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>,
 Valentin Schneider <vschneid@...hat.com>,
 Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
 Hillf Danton <hdanton@...a.com>,
 Shrikanth Hegde <sshegde@...ux.ibm.com>,
 Jianyong Wu <jianyong.wu@...look.com>,
 Tingyin Duan <tingyin.duan@...il.com>,
 Vern Hao <vernhao@...cent.com>,
 Vern Hao <haoxing990@...il.com>,
 Len Brown <len.brown@...el.com>,
 Aubrey Li <aubrey.li@...el.com>,
 Zhao Liu <zhao1.liu@...el.com>,
 Chen Yu <yu.chen.surf@...il.com>,
 Adam Li <adamli@...amperecomputing.com>,
 Aaron Lu <ziqianlu@...edance.com>,
 Tim Chen <tim.c.chen@...el.com>,
 linux-kernel@...r.kernel.org,
 Qais Yousef <qyousef@...alina.io>
Subject: Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the
 parameters of cache-aware scheduling



> On 24 Dec 2025, at 15:51, Chen, Yu C <yu.c.chen@...el.com> wrote:
> 
> On 12/24/2025 11:28 AM, Yangyu Chen wrote:
>>> On 24 Dec 2025, at 00:44, Yangyu Chen <cyy@...self.name> wrote:
>>> 
>>>> On 23 Dec 2025, at 20:12, Yangyu Chen <cyy@...self.name> wrote:
>>>> 
>>>>> On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@...ux.intel.com> wrote:
>>>>> 
>>>>> From: Chen Yu <yu.c.chen@...el.com>
>>>>> 
>>>>> Introduce a set of debugfs knobs to enable cache-aware load
>>>>> balancing and to control its parameters.
>>>>> 
>>>>> (1) llc_enabled
>>>>> llc_enabled acts as the primary switch - users can toggle it to
>>>>> enable or disable cache-aware load balancing.
>>>>> 
>>>>> (2) llc_aggr_tolerance
>>>>> With sched_cache enabled, the scheduler uses a process's RSS as a
>>>>> proxy for its LLC footprint to determine if aggregating tasks on the
>>>>> preferred LLC could cause cache contention. If RSS exceeds the LLC
>>>>> size, aggregation is skipped. Some workloads with large RSS but small
>>>>> actual memory footprints may still benefit from aggregation. Since
>>>>> the kernel cannot efficiently track per-task cache usage (resctrl is
>>>>> user-space only), userspace can provide a more accurate hint.
>>>>> 
>>>>> Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
>>>>> users control how strictly RSS limits aggregation. Values range from
>>>>> 0 to 100:
>>>>> 
>>>>> - 0: Cache-aware scheduling is disabled.
>>>>> - 1: Strict; tasks with RSS larger than LLC size are skipped.
>>>>> - 100: Aggressive; tasks are aggregated regardless of RSS.
>>>>> 
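(For concreteness, an untested userspace sketch of poking the two knobs
described above. I'm assuming llc_enabled sits under
/sys/kernel/debug/sched/ next to llc_aggr_tolerance, which the changelog
implies but doesn't spell out:)

/* Untested sketch: flip the sched_cache debugfs knobs from userspace. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_knob(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0 || write(fd, val, strlen(val)) < 0) {
                perror(path);
                if (fd >= 0)
                        close(fd);
                return -1;
        }
        close(fd);
        return 0;
}

int main(void)
{
        /* Primary switch: turn cache-aware load balancing on. */
        write_knob("/sys/kernel/debug/sched/llc_enabled", "1");
        /* 100 == aggregate regardless of RSS, as described above. */
        write_knob("/sys/kernel/debug/sched/llc_aggr_tolerance", "100");
        return 0;
}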
>>>> 
>>>> Hi Chen Yu and Tim Chen,
>>>> 
>>>> Maybe we should have something like prctl(PR_LLC_AGGR_TOLERANCE, 100).
>>>> 
>>>> I have tested this version of the patch on my EPYC Milan 7V13 (7763 variant) physical machine, which has a 32M LLC for each 8-core CCX. I found that I need to tune "llc_aggr_tolerance" to 100, otherwise I can't get cache-aware scheduling to work for Verilated [1] XiangShan [2] running chacha20 [3], as I mentioned before [4].
>>>> 
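To make the prctl suggestion concrete, the call site could look roughly
like the sketch below. PR_LLC_AGGR_TOLERANCE is entirely hypothetical -
it does not exist in any kernel, and the option number is a placeholder:

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_LLC_AGGR_TOLERANCE
#define PR_LLC_AGGR_TOLERANCE 1000      /* hypothetical, not a real prctl */
#endif

int main(void)
{
        /* 100 == aggregate this task's threads regardless of RSS. */
        if (prctl(PR_LLC_AGGR_TOLERANCE, 100, 0, 0, 0))
                perror("prctl(PR_LLC_AGGR_TOLERANCE)");  /* EINVAL today */
        return 0;
}

The point is that a launcher (or the workload itself) could raise its own
tolerance instead of relaxing the global debugfs knob for every task on
the system.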
>>> 
>>> In addition, I have investigated why this happens. It turns out
>>> the workload shows 35596 kB RssAnon on my EPYC Milan machine,
>>> slightly exceeding the LLC size (32M). I have also tested it on an
>>> EPYC Genoa cloud server with the correct core/cache hierarchy in
>>> the ACPI table, where it shows 31700 kB RssAnon and therefore fits
>>> in the LLC. I have no idea why my result shows higher RssAnon,
>>> since both machines run Debian Trixie with the exact same kernel
>>> and the same executable. But it reminds me that we should have a
>>> userspace API for this.
>>> 
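For reference, the check I'm doing by hand looks roughly like the sketch
below from userspace: read RssAnon from /proc/<pid>/status and compare it
against the L3 size from sysfs. It assumes cpu0's index3 cache is the
LLC, which holds on these machines but isn't guaranteed everywhere:

#include <stdio.h>

static long rss_anon_kb(const char *pid)
{
        char path[64], line[128];
        long kb = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%s/status", pid);
        f = fopen(path, "r");
        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "RssAnon: %ld kB", &kb) == 1)
                        break;
        fclose(f);
        return kb;
}

static long llc_kb(void)
{
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/cache/index3/size", "r");
        long kb = -1;

        if (f && fscanf(f, "%ldK", &kb) != 1)
                kb = -1;
        if (f)
                fclose(f);
        return kb;
}

int main(int argc, char **argv)
{
        long rss = rss_anon_kb(argc > 1 ? argv[1] : "self");
        long llc = llc_kb();

        printf("RssAnon %ld kB vs LLC %ld kB -> %s\n", rss, llc,
               (rss > 0 && llc > 0 && rss <= llc) ? "fits" : "exceeds or unknown");
        return 0;
}
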
>> In addition, while profiling the Verilator workload, I found that
>> scheduling it onto SMT siblings results in poor performance. Thus, I
>> think we should separate the control for RSS size from the control
>> for SMT scaling.
> 
> Thanks for the investigation. Could you elaborate a little more on
> what you mean by scheduled to SMTs? Do you mean that if every CPU (SMT)
> in the LLC has one running task, then the performance is impacted? I
> thought we have exceed_llc_nr() to check the SMT count and avoid this?

With Verilator, the number of threads used by the RTL simulator is
specified at compile time and cannot be changed at runtime, since the
design is statically partitioned across threads. So I didn't mean
that performance drops when another thread gets scheduled onto an SMT
sibling within the LLC; I mean that users can allow Verilator to use
more threads than the LLC has cores. That said, I have tested your
case: with a recent version of XiangShan + Verilator + LLVM 21 and an
8-thread emulator, I observe 41% (30% for 1 thread) and 62% (39% for
1 thread) performance degradation on Raptor Lake and EPYC Milan
respectively when another 8 threads run a simple loop. But I think
that's only one data point. Both Raptor Lake and Zen 5 statically
partition the ROB in the CPU backend, and such workloads also suffer
a lot of data cache misses because they have a very large instruction
footprint. SMT performance is not easy to characterize across
different microarchitectures and workloads, but one thing is for
sure: I never came across a case where a 16-thread emulator on an
EPYC machine scheduled onto 1 CCX with 2 SMT threads per core beats
2 CCXs with only 1 thread per core. That's why I think we should
split this into two user controls, one for RSS and one for the
number of threads.
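
To illustrate the second control, below is the kind of check I'd expect
a launcher (or the kernel, behind a separate knob) to make: compare the
emulator's thread count against the number of physical cores sharing
one LLC. Untested sketch; as before, treating index3 as the LLC is an
assumption:

#include <stdio.h>
#include <stdlib.h>

/* Count the CPUs in a sysfs cpulist such as "0-7,64-71". */
static int count_cpus(const char *path)
{
        FILE *f = fopen(path, "r");
        int n = 0, a, b, c;

        if (!f)
                return -1;
        while (fscanf(f, "%d", &a) == 1) {
                b = a;
                c = fgetc(f);
                if (c == '-') {
                        if (fscanf(f, "%d", &b) != 1)
                                b = a;
                        fgetc(f);       /* consume ',' or newline */
                }
                n += b - a + 1;
        }
        fclose(f);
        return n;
}

int main(int argc, char **argv)
{
        int nthreads = argc > 1 ? atoi(argv[1]) : 8;    /* emulator threads */
        int llc_cpus = count_cpus("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list");
        int smt = count_cpus("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");
        int llc_cores = (llc_cpus > 0 && smt > 0) ? llc_cpus / smt : -1;

        printf("%d emulator threads vs %d cores per LLC -> %s\n",
               nthreads, llc_cores,
               (llc_cores > 0 && nthreads <= llc_cores) ?
               "aggregate on one LLC" : "spread, or unknown topology");
        return 0;
}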

Thanks,
Yangyu Chen

> 
>> It's notable that RSS is not the actual memory footprint. It would
>> be better if we could use the l2_miss or l3_miss events to estimate
>> the L3 hit rate. Just for future work.
> 
> Yes, in user space, we can collect PMU events / memory bandwidth via
> resctrl to decide whether to set task attributes.
> 
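For example, the plumbing for such a measurement could look like the
sketch below: count LL-cache read accesses and misses for the calling
task via perf_event_open() and report the miss ratio. Whether these
events map cleanly onto L3 footprint on every microarchitecture is
exactly the open question; this only shows the mechanism:

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_llc_counter(uint64_t result)    /* ..._ACCESS or ..._MISS */
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HW_CACHE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_LL |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (result << 16);
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        /* pid 0 = calling task, cpu -1 = any CPU, no group, no flags */
        return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
        int acc = open_llc_counter(PERF_COUNT_HW_CACHE_RESULT_ACCESS);
        int mis = open_llc_counter(PERF_COUNT_HW_CACHE_RESULT_MISS);
        static volatile char buf[64 << 20];     /* 64 MB > the 32M LLC */
        uint64_t a = 0, m = 0;
        size_t i;

        if (acc < 0 || mis < 0) {
                perror("perf_event_open");
                return 1;
        }
        ioctl(acc, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(mis, PERF_EVENT_IOC_ENABLE, 0);

        for (i = 0; i < sizeof(buf); i += 64)   /* touch one line at a time */
                buf[i]++;

        ioctl(acc, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(mis, PERF_EVENT_IOC_DISABLE, 0);
        read(acc, &a, sizeof(a));
        read(mis, &m, sizeof(m));
        printf("LLC: %llu misses / %llu read accesses\n",
               (unsigned long long)m, (unsigned long long)a);
        return 0;
}
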
>> I'm willing to provide a patch for such a prctl, but I'm busy these
>> days; I may have the time to do that after a week or so.
> 
> Sure. We haven't yet decided which interface we can leverage.
> Also, Qais is working on a QoS interface [1] - maybe we can build
> on his work.
> 
> [1] https://lore.kernel.org/all/20240820163512.1096301-11-qyousef@layalina.io/
> 
> thanks,
> Chenyu


