Message-ID: <03fccf9d-50b7-4a7a-a7c2-21dcc06f235a@gmail.com>
Date: Mon, 20 Oct 2025 17:41:58 +0800
From: Vern Hao <haoxing990@...il.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>, Peter Zijlstra <peterz@...radead.org>
Cc: Tim Chen <tim.c.chen@...ux.intel.com>, Ingo Molnar <mingo@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>, Vern Hao
<vernhao@...cent.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
Jianyong Wu <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Len Brown <len.brown@...el.com>,
Aubrey Li <aubrey.li@...el.com>, Zhao Liu <zhao1.liu@...el.com>,
Chen Yu <yu.chen.surf@...il.com>, Adam Li <adamli@...amperecomputing.com>,
Tim Chen <tim.c.chen@...el.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 06/19] sched/fair: Assign preferred LLC ID to processes
On 2025/10/17 12:50, Chen, Yu C wrote:
> On 10/15/2025 7:15 PM, Peter Zijlstra wrote:
>> On Tue, Oct 14, 2025 at 01:16:16PM +0800, Chen, Yu C wrote:
>>
>>> The question becomes: how can we figure out the threads that share
>>> data? Can the kernel detect this, or get the hint from user space?
>>
>> This needs the PMU, then you can steer using cache-miss ratios. But then
>> people will hate us for using counters.
>>
>>> Yes, the numa_group in NUMA load balancing indicates
>>> that several tasks manipulate the same page, which could be an
>>> indicator. Besides, if task A frequently wakes up task B, does it
>>> mean A and B have the potential to share data? Furthermore, if
>>> task A wakes up B via a pipe, it might also indicate that A has
>>> something to share with B. I just wonder if we can introduce a
>>> structure to gather this information together.
>>
>> The wakeup or pipe relation might be small relative to the working set.
>> Consider a sharded in memory database, where the query comes in through
>> the pipe/socket/wakeup. This query is small, but then it needs to go
>> trawl through its memory to find the answer.
>>
>> Something we *could* look at -- later -- is an interface to create
>> thread groups, such that userspace that is clever enough can communicate
>> this. But then there is the age-old question: will there be sufficient
>> users to justify the maintenance of said interface.
>
> I did not intend to digress too far, but since this issue has been
> brought up, a wild guess came to me - could the "interface to create
> thread groups" here refer to something like the filesystem for memory
> cgroup v2 thread mode? I just heard that some cloud users might split
> the threads of a single process into different thread groups, where
> threads within each group share data with one another (for example,
> when performing K-V hashing operations).
Yes, we have run into similar issues with our internal workloads. The
actual scenario is on AMD virtual machines: a service spawns many
concurrent threads, around 900 in total, with over 600 threads doing
hash or key-value computations, more than 100 threads handling network
transmission, and the rest doing background logging or monitoring.
These threads do not share the same hot L3 data, so concentrating them
together would only exacerbate contention.

Can we differentiate these types of threads? It is clear that the
current configuration approach cannot meet this requirement and will
only cause more L3 cache contention. Could we use cgroups, system
calls, or some other mechanism to make the distinction (the application
may not be willing to modify its code)?
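
To make this concrete, here is a rough userspace sketch (illustration
only, not a proposal; the thread names, cgroup paths and group layout
below are made up, and it assumes the threaded child cgroups were
created beforehand with mkdir + "echo threaded > cgroup.type"). An
external agent walks /proc/<pid>/task, classifies threads by their
comm, and moves them into cgroup v2 threaded cgroups through
cgroup.threads, so the application itself needs no code change:

/* classify-threads.c - illustration only, names and paths are examples */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Write one TID into <cgroup>/cgroup.threads (the cgroup must be threaded). */
static int move_tid(const char *cgroup, const char *tid)
{
	char path[512];
	FILE *f;

	snprintf(path, sizeof(path), "%s/cgroup.threads", cgroup);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", tid);
	return fclose(f);
}

int main(int argc, char **argv)
{
	char path[512], comm[64];
	struct dirent *de;
	DIR *dir;
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	snprintf(path, sizeof(path), "/proc/%s/task", argv[1]);
	dir = opendir(path);
	if (!dir) {
		perror(path);
		return 1;
	}

	while ((de = readdir(dir)) != NULL) {
		if (de->d_name[0] == '.')
			continue;

		/* Classify by thread name; "hash-" and "net-" are hypothetical. */
		snprintf(path, sizeof(path), "/proc/%s/task/%s/comm",
			 argv[1], de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (!fgets(comm, sizeof(comm), f)) {
			fclose(f);
			continue;
		}
		fclose(f);

		if (!strncmp(comm, "hash-", 5))
			move_tid("/sys/fs/cgroup/app/hash", de->d_name);
		else if (!strncmp(comm, "net-", 4))
			move_tid("/sys/fs/cgroup/app/net", de->d_name);
	}
	closedir(dir);
	return 0;
}

Whether matching on comm is reliable enough is a separate question; the
point is only that the grouping could be driven from outside the
application.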
> Using cgroup for this purpose might be a bit overkill, though,
> considering that cgroup itself is designed for resource partitioning
> rather than identifying tasks sharing data. Meanwhile, the hierarchy
> of cgroup could also cause some overhead. If there were a single-layer
> thread partitioning mechanism - similar to the resctrl filesystem -
> wouldn't that allow us to avoid modifying too much user business code
> while minimizing coupling with existing kernel components?
> thanks,
> Chenyu
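
For reference on the resctrl analogy: resctrl already provides a flat,
single-level grouping where a mkdir in the mounted filesystem creates a
group and thread IDs are written into its tasks file. A minimal sketch
of that existing flow (the group name and TID are placeholders; it
assumes resctrl is mounted at /sys/fs/resctrl and root privileges, and
it says nothing about how a cache-aware scheduling variant would
consume such groups):

/* resctrl-group.c - sketch of the existing resctrl task placement */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(void)
{
	FILE *f;

	/* A mkdir inside the resctrl filesystem creates a new resource group. */
	mkdir("/sys/fs/resctrl/hash_workers", 0755);

	/* Tasks are assigned by writing their IDs, one per write. */
	f = fopen("/sys/fs/resctrl/hash_workers/tasks", "w");
	if (!f) {
		perror("tasks");
		return 1;
	}
	fprintf(f, "%d\n", 1234);	/* placeholder TID */
	return fclose(f) ? 1 : 0;
}

A scheduler-facing thread-group interface could look similar on the
surface; nothing like that exists today, so the above only illustrates
the analogy.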