Message-ID: <7ec85f26-25fa-4c77-9e46-347261723d95@intel.com>
Date: Mon, 24 Nov 2025 15:58:01 +0800
From: "Guo, Wangyang" <wangyang.guo@...el.com>
To: Ming Lei <ming.lei@...hat.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
 Thomas Gleixner <tglx@...utronix.de>, Keith Busch <kbusch@...nel.org>,
 Jens Axboe <axboe@...com>, Christoph Hellwig <hch@....de>,
 Sagi Grimberg <sagi@...mberg.me>, linux-kernel@...r.kernel.org,
 linux-nvme@...ts.infradead.org, virtualization@...ts.linux-foundation.org,
 linux-block@...r.kernel.org, Tianyou Li <tianyou.li@...el.com>,
 Tim Chen <tim.c.chen@...ux.intel.com>, Dan Liang <dan.liang@...el.com>
Subject: Re: [PATCH RESEND] lib/group_cpus: make group CPU cluster aware

On 11/19/2025 9:52 AM, Ming Lei wrote:
> On Tue, Nov 18, 2025 at 02:29:20PM +0800, Guo, Wangyang wrote:
>> On 11/13/2025 9:38 AM, Ming Lei wrote:
>>> On Wed, Nov 12, 2025 at 11:02:47AM +0800, Guo, Wangyang wrote:
>>>> On 11/11/2025 8:08 PM, Ming Lei wrote:
>>>>> On Tue, Nov 11, 2025 at 01:31:04PM +0800, Guo, Wangyang wrote:
>>>>> They should still share the same L3 cache, so cpus_share_cache() should
>>>>> return true when the IO completes on a CPU that belongs to a different L2
>>>>> than the submission CPU, and remote completion via IPI won't be triggered.
>>>> Yes, the remote IPI is not triggered.
>>>
>>> OK, in my test on AMD Zen 4, NVMe performance can drop to 1/2 - 1/3 if the
>>> remote IPI is triggered when crossing L3, which is understandable.
>>>
>>> I will check whether the topology cluster can cover L3; if so, the patch can
>>> still be simplified a lot by introducing sub-node spreading, changing
>>> build_node_to_cpumask() and adding nr_sub_nodes.
>>
>> Do you mean using clusters as "NUMA" nodes to spread CPUs, instead of
>> two-level NUMA-cluster spreading?
> 
> Yes, I think the change can be minimized by introducing a sub-numa-node to
> cover it. What do you think of this approach?
> 
> However, it is bad to use the cluster as the sub-numa-node by default: a
> cluster is aligned with CPUs sharing an L2 cache, so there can be too many
> clusters on systems where one cluster includes just two CPUs, and the final
> calculated mapping then inevitably crosses clusters because nr_queues is
> less than nr_clusters.
> 
> I'd suggest mapping CPUs sharing an L3 cache into one sub-numa-node.
> 
> For your case, either add a kernel parameter, or add a group_cpus_cluster()
> API for the unusual case, sharing a single code path.

Yes, it will make the change simpler, but whether the cluster or the L3
cache is used as the sub-numa-node, as CPU core counts increase there is
a potential risk that nr_queues < nr_sub_numa_node, which would leave
the CPU spreading with no affinity at all.

I think we need to keep the two-level NUMA/sub-NUMA spreading to avoid
such a regression.
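
Roughly, I'm thinking of a guard like this in the per-node pass (the
helper names are placeholders for the existing one-level spreading and
a new sub-node-aware pass, not the actual patch):

#include <linux/cpumask.h>

/* Placeholders: the existing one-level spreading and a new
 * sub-node-aware pass in lib/group_cpus.c. */
int group_cpus_one_level(unsigned int nr_queues,
			 const struct cpumask *node_cpus,
			 struct cpumask *queue_masks);
int group_cpus_sub_nodes(unsigned int nr_queues,
			 const struct cpumask *node_cpus,
			 struct cpumask *queue_masks);

static int group_cpus_in_node(unsigned int nr_queues,
			      const struct cpumask *node_cpus,
			      unsigned int nr_sub_nodes,
			      struct cpumask *queue_masks)
{
	/*
	 * With fewer queues than sub-nodes, the sub-node pass would
	 * have to merge arbitrary sub-nodes into one queue mask and
	 * lose the affinity it was meant to provide, so fall back to
	 * the plain per-node spreading.  Growing core counts then
	 * never regress below today's behaviour.
	 */
	if (nr_sub_nodes <= 1 || nr_queues < nr_sub_nodes)
		return group_cpus_one_level(nr_queues, node_cpus,
					    queue_masks);

	return group_cpus_sub_nodes(nr_queues, node_cpus, queue_masks);
}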

For most platforms there is no need for second-level spreading, since L2
is not shared between physical cores and the L3 mapping is the same as
the NUMA mapping. I think we can add a platform-specific configuration:
by default, only NUMA spreading is used; if a platform like Intel Xeon E
or some AMD platforms need second-level spreading, we can configure it
to use the L3 or the cluster as the sub-numa-node.
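
Something along these lines, where cpu_l3_shared_mask() is only a
stand-in for whatever per-arch L3 sharing mask we would end up using:

#include <linux/topology.h>

enum group_cpus_sub_level {
	SUB_LEVEL_NONE,		/* default: NUMA-only spreading */
	SUB_LEVEL_L3,		/* sub-numa-node = CPUs sharing an L3 */
	SUB_LEVEL_CLUSTER,	/* sub-numa-node = topology cluster */
};

/* Selected per platform, e.g. via a kernel parameter or arch quirk. */
static enum group_cpus_sub_level sub_level = SUB_LEVEL_NONE;

static const struct cpumask *sub_node_cpumask(int cpu)
{
	switch (sub_level) {
	case SUB_LEVEL_L3:
		/* Stand-in name for a per-arch L3 sharing mask. */
		return cpu_l3_shared_mask(cpu);
	case SUB_LEVEL_CLUSTER:
		return topology_cluster_cpumask(cpu);
	default:
		return cpumask_of_node(cpu_to_node(cpu));
	}
}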

How does that sound to you?

BR
Wangyang
