Message-ID: <03fccf9d-50b7-4a7a-a7c2-21dcc06f235a@gmail.com>
Date: Mon, 20 Oct 2025 17:41:58 +0800
From: Vern Hao <haoxing990@...il.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>, Peter Zijlstra <peterz@...radead.org>
Cc: Tim Chen <tim.c.chen@...ux.intel.com>, Ingo Molnar <mingo@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>, Vern Hao
<vernhao@...cent.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
Jianyong Wu <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Len Brown <len.brown@...el.com>,
Aubrey Li <aubrey.li@...el.com>, Zhao Liu <zhao1.liu@...el.com>,
Chen Yu <yu.chen.surf@...il.com>, Adam Li <adamli@...amperecomputing.com>,
Tim Chen <tim.c.chen@...el.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 06/19] sched/fair: Assign preferred LLC ID to processes
On 2025/10/17 12:50, Chen, Yu C wrote:
> On 10/15/2025 7:15 PM, Peter Zijlstra wrote:
>> On Tue, Oct 14, 2025 at 01:16:16PM +0800, Chen, Yu C wrote:
>>
>>> The question becomes: how can we figure out the threads that share
>>> data? Can the kernel detect this, or get the hint from user space?
>>
>> This needs the PMU, then you can steer using cache-miss ratios. But then
>> people will hate us for using counters.
>>
>>> Yes, the numa_group in NUMA load balancing indicates
>>> that several tasks manipulate the same page, which could be an
>>> indicator. Besides, if task A frequently wakes up task B, does it
>>> mean A and B have the potential to share data? Furthermore, if
>>> task A wakes up B via a pipe, it might also indicate that A has
>>> something to share with B. I just wonder if we can introduce a
>>> structure to gather this information together.
>>
>> The wakeup or pipe relation might be small relative to the working set.
>> Consider a sharded in memory database, where the query comes in through
>> the pipe/socket/wakeup. This query is small, but then it needs to go
>> trawl through its memory to find the answer.
>>
>> Something we *could* look at -- later -- is an interface to create
>> thread groups, such that userspace that is clever enough can communicate
>> this. But then there is the age-old question: will there be sufficient
>> users to justify the maintenance of said interface.
>
> I did not intend to digress too far, but since this issue has been
> brought up, a wild guess came to me - could the "interface to create
> thread groups" here refer to something like the filesystem for memory
> cgroup v2 thread mode? I just heard that some cloud users might split
> the threads of a single process into different thread groups, where
> threads within each group share data with one another (for example,
> when performing K-V hashing operations).
Yes, we have run into similar issues with our internal workloads. The
actual scenario is on AMD virtual machines: a service spawns many
concurrent threads, around 900 in total, with over 600 threads doing
hash or key-value computations, more than 100 threads handling network
transmission, and the rest doing background logging or monitoring.
These threads do not share the same hot L3 data, so concentrating them
together would only exacerbate contention.

Can we differentiate these types of threads? It is clear that the
current configuration approach cannot meet this requirement and will
only cause more L3 cache contention. Could we use cgroups, system
calls, or some other mechanism to make the distinction (the application
may not be willing to modify its code)?
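
To make this concrete, here is a rough userspace sketch (illustration
only, not a proposal; the thread names, cgroup paths and group layout
below are made up, and it assumes the threaded child cgroups were
created beforehand with mkdir + "echo threaded > cgroup.type"). An
external agent walks /proc/<pid>/task, classifies threads by their
comm, and moves them into cgroup v2 threaded cgroups through
cgroup.threads, so the application itself needs no code change:

/* classify-threads.c - illustration only, names and paths are examples */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Write one TID into <cgroup>/cgroup.threads (the cgroup must be threaded). */
static int move_tid(const char *cgroup, const char *tid)
{
	char path[512];
	FILE *f;

	snprintf(path, sizeof(path), "%s/cgroup.threads", cgroup);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", tid);
	return fclose(f);
}

int main(int argc, char **argv)
{
	char path[512], comm[64];
	struct dirent *de;
	DIR *dir;
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	snprintf(path, sizeof(path), "/proc/%s/task", argv[1]);
	dir = opendir(path);
	if (!dir) {
		perror(path);
		return 1;
	}

	while ((de = readdir(dir)) != NULL) {
		if (de->d_name[0] == '.')
			continue;

		/* Classify by thread name; "hash-" and "net-" are hypothetical. */
		snprintf(path, sizeof(path), "/proc/%s/task/%s/comm",
			 argv[1], de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (!fgets(comm, sizeof(comm), f)) {
			fclose(f);
			continue;
		}
		fclose(f);

		if (!strncmp(comm, "hash-", 5))
			move_tid("/sys/fs/cgroup/app/hash", de->d_name);
		else if (!strncmp(comm, "net-", 4))
			move_tid("/sys/fs/cgroup/app/net", de->d_name);
	}
	closedir(dir);
	return 0;
}

Whether matching on comm is reliable enough is a separate question; the
point is only that the grouping could be driven from outside the
application.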
> Using cgroup for this purpose might be a bit overkill, though,
> considering that cgroup itself is designed for resource partitioning
> rather than identifying tasks sharing data. Meanwhile, the hierarchy
> of cgroup could also cause some overhead. If there were a single-layer
> thread partitioning mechanism - similar to the resctrl filesystem -
> wouldn't that allow us to avoid modifying too much user business code
> while minimizing coupling with existing kernel components?
> thanks,
> Chenyu
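
For reference on the resctrl analogy: resctrl already provides a flat,
single-level grouping where a mkdir in the mounted filesystem creates a
group and thread IDs are written into its tasks file. A minimal sketch
of that existing flow (the group name and TID are placeholders; it
assumes resctrl is mounted at /sys/fs/resctrl and root privileges, and
it says nothing about how a cache-aware scheduling variant would
consume such groups):

/* resctrl-group.c - sketch of the existing resctrl task placement */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(void)
{
	FILE *f;

	/* A mkdir inside the resctrl filesystem creates a new resource group. */
	mkdir("/sys/fs/resctrl/hash_workers", 0755);

	/* Tasks are assigned by writing their IDs, one per write. */
	f = fopen("/sys/fs/resctrl/hash_workers/tasks", "w");
	if (!f) {
		perror("tasks");
		return 1;
	}
	fprintf(f, "%d\n", 1234);	/* placeholder TID */
	return fclose(f) ? 1 : 0;
}

A scheduler-facing thread-group interface could look similar on the
surface; nothing like that exists today, so the above only illustrates
the analogy.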