Message-ID: <e1a3c970-e1c0-4ed8-8b81-1d35f00d9c0b@kernel.org>
Date: Thu, 5 Feb 2026 18:29:28 -0500
From: Chuck Lever <cel@...nel.org>
To: Tejun Heo <tj@...nel.org>
Cc: jiangshanlai@...il.com, linux-kernel@...r.kernel.org,
 Chuck Lever <chuck.lever@...cle.com>
Subject: Re: [PATCH v2] workqueue: Automatic affinity scope fallback for
 single-pod topologies

On 2/5/26 5:10 PM, Tejun Heo wrote:
> Hello, Chuck.
> 
> On Wed, Feb 04, 2026 at 09:49:11PM -0500, Chuck Lever wrote:
>> +static bool __init cpus_share_cluster(int cpu0, int cpu1)
>> +{
>> +	return cpumask_test_cpu(cpu0, topology_cluster_cpumask(cpu1));
>> +}
> 
> Cluster boundary == core boundary for a lot of CPUs. I don't think this is
> going to work.

Fair enough; WQ_AFFN_CLUSTER is not a reliable intermediate level.
On x86 cpu_clustergroup_mask() returns cpu_l2c_shared_mask(), which
is per-core on many chips. The arm64 cpu_clustergroup_mask() has a
similar collapse: when cluster_sibling spans the coregroup, it falls
back to SMT siblings. And the generic fallback in topology.h is
cpumask_of(cpu).

I was hoping it would be a proper intermediate sharding scope.
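
For the archives, the boot-time check I was reaching for amounts to
roughly the following (an untested sketch against the generic
topology helpers; the helper name is made up and not from the posted
patch):

static bool __init cluster_scope_is_useful(void)
{
	int cpu;

	/*
	 * A cluster scope only helps if at least one CPU's cluster
	 * mask is strictly wider than its SMT sibling mask; otherwise
	 * WQ_AFFN_CLUSTER degenerates to per-core or per-CPU pods.
	 */
	for_each_possible_cpu(cpu) {
		if (cpumask_weight(topology_cluster_cpumask(cpu)) >
		    cpumask_weight(topology_sibling_cpumask(cpu)))
			return true;
	}
	return false;
}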


> Here are a couple options:
>
> - Introduce an affinity level which splits CACHE according to some
>   adjustable heuristics.
> 
> - Make the NFS workqueue default to WQ_AFFN_CORE (or maybe switch based on
>   some heuristics) or switch to a per-cpu workqueue.

The issue I see is that the contention isn't confined to a single
workqueue. The NFS-over-RDMA I/O path has at least four unbound
workqueues in its hot path:

 - rpciod (WQ_UNBOUND) in net/sunrpc/sched.c -- core
   RPC task wake-up on every completion
 - xprtiod (WQ_UNBOUND) in net/sunrpc/xprt.c --
   transport cleanup and receive processing
 - nfsiod (WQ_UNBOUND) in fs/nfs/inode.c -- direct
   write and local I/O completion
 - svcrdma_wq (WQ_UNBOUND) in svc_rdma.c -- send
   context and write info release on every RDMA
   completion
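
(From memory, all four are plain alloc_workqueue() calls with no
affinity hint, e.g. rpciod:

	wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);

so each of them simply gets whatever the default scope resolves to.)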

These span three subsystems and maintainers. Other RDMA ULPs (iSER,
SRP target, kSMBd) have their own unbound workqueues with the same
exposure. Tuning each one individually is fragile, and any new
WQ_UNBOUND workqueue added to these paths inherits the degenerate
default behavior.
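
Just to make that concrete: per-workqueue tuning would look roughly
like this, repeated in every subsystem that cares (untested sketch,
helper name is made up; assumes the workqueue_attrs helpers are
reachable from module code, which they may not be):

static int shard_wq_per_cpu(struct workqueue_struct *wq)
{
	struct workqueue_attrs *attrs;
	int ret;

	attrs = alloc_workqueue_attrs();
	if (!attrs)
		return -ENOMEM;

	/*
	 * WQ_AFFN_CPU as a stand-in; per your suggestion, core or a
	 * heuristic choice might be the better default.
	 */
	attrs->affn_scope = WQ_AFFN_CPU;

	cpus_read_lock();
	ret = apply_workqueue_attrs(wq, attrs);
	cpus_read_unlock();

	free_workqueue_attrs(attrs);
	return ret;
}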

On any platform where the topology leaves a large core-to-pod ratio,
pool lock contention is going to be a significant problem: a WQ pool
shared by more than a handful of cores is exposed to it.

I don't have access to the kind of hardware needed to deeply test
sharding ideas, so I'll drop this patch for now and simply set boot
command line options on all my systems.
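
(For the record, the override I mean is the default affinity scope
parameter, e.g.:

	workqueue.default_affinity_scope=cpu

chosen per machine; I'm not claiming one value is right everywhere.)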


-- 
Chuck Lever
