Message-ID: <e1a3c970-e1c0-4ed8-8b81-1d35f00d9c0b@kernel.org>
Date: Thu, 5 Feb 2026 18:29:28 -0500
From: Chuck Lever <cel@...nel.org>
To: Tejun Heo <tj@...nel.org>
Cc: jiangshanlai@...il.com, linux-kernel@...r.kernel.org,
Chuck Lever <chuck.lever@...cle.com>
Subject: Re: [PATCH v2] workqueue: Automatic affinity scope fallback for
single-pod topologies

On 2/5/26 5:10 PM, Tejun Heo wrote:
> Hello, Chuck.
>
> On Wed, Feb 04, 2026 at 09:49:11PM -0500, Chuck Lever wrote:
>> +static bool __init cpus_share_cluster(int cpu0, int cpu1)
>> +{
>> +	return cpumask_test_cpu(cpu0, topology_cluster_cpumask(cpu1));
>> +}
>
> Cluster boundary == core boundary for a lot of CPUs. I don't think this is
> going to work.

Fair enough; WQ_AFFN_CLUSTER is not a reliable intermediate level.
On x86 cpu_clustergroup_mask() returns cpu_l2c_shared_mask(), which
is per-core on many chips. The arm64 cpu_clustergroup_mask() has a
similar collapse: when cluster_sibling spans the coregroup, it falls
back to SMT siblings. And the generic fallback in topology.h is
cpumask_of(cpu).
I was hoping it would be a proper intermediate sharding scope.
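
To spell out the worst case: with the generic fallback in
include/linux/topology.h (quoting from memory), the helper above
reduces to an identity test, so every CPU ends up alone in its own
"cluster":

	/* include/linux/topology.h: arches that define no cluster mask */
	#ifndef topology_cluster_cpumask
	#define topology_cluster_cpumask(cpu)	cpumask_of(cpu)
	#endif

	/* ...which makes the proposed helper equivalent to: */
	static bool __init cpus_share_cluster(int cpu0, int cpu1)
	{
		return cpu0 == cpu1;
	}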

> Here are a couple options:
>
> - Introduce an affinity level which splits CACHE according to some
> adjustable heuristics.
>
> - Make the NFS workqueue default to WQ_AFFN_CORE (or maybe switch based on
> some heuristics) or switch to a per-cpu workqueue.
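
For a single workqueue that's simple enough in principle. A rough,
untested sketch, taking rpciod as the example and using WQ_AFFN_SMT as
the per-core scope; note that alloc_workqueue_attrs() and
apply_workqueue_attrs() don't appear to be exported to modules, so
sunrpc would need extra plumbing (or WQ_SYSFS) to actually do this:

	/*
	 * Rough sketch only: opt one workqueue into a tighter affinity
	 * scope instead of relying on the system default.
	 */
	struct workqueue_attrs *attrs;
	int ret = -ENOMEM;

	attrs = alloc_workqueue_attrs();
	if (attrs) {
		attrs->affn_scope = WQ_AFFN_SMT;	/* per-core pods */
		ret = apply_workqueue_attrs(rpciod_workqueue, attrs);
		free_workqueue_attrs(attrs);
	}
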
The issue I see is that the contention isn't confined to a single
workqueue. In the NFS-over-RDMA I/O paths, at least four unbound
workqueues are in the hot path:

- rpciod (WQ_UNBOUND) in net/sunrpc/sched.c -- core RPC task
  wake-up on every completion
- xprtiod (WQ_UNBOUND) in net/sunrpc/xprt.c -- transport cleanup
  and receive processing
- nfsiod (WQ_UNBOUND) in fs/nfs/inode.c -- direct write and
  local I/O completion
- svcrdma_wq (WQ_UNBOUND) in svc_rdma.c -- send context and
  write info release on every RDMA completion

These span three subsystems and several maintainers. Other RDMA ULPs
(iSER, SRP target, ksmbd) have their own unbound workqueues with the same
exposure. Tuning each one individually is fragile, and any new
WQ_UNBOUND workqueue added to these paths inherits the degenerate
default behavior.
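
As far as I can tell all four are plain unbound allocations that say
nothing about scope, roughly this shape (flags from memory, may not be
exact), so each one just picks up the system default:

	/* net/sunrpc/sched.c */
	wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);

	/* fs/nfs/inode.c -- likewise, nothing selects an affinity scope */
	wq = alloc_workqueue("nfsiod", WQ_UNBOUND, 0);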

Even on platforms that aren't single-pod, a large core-to-pod ratio
means pool lock contention is going to be a significant problem: any
WQ pool shared by more than a handful of cores will see it.

I don't have access to the kind of hardware needed to test sharding
ideas in depth, so I'll drop this patch for now and simply set boot
command line options on all my systems.
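
Concretely, that's the existing default scope parameter, along the
lines of (exact scope value TBD per system):

	workqueue.default_affinity_scope=smt
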
--
Chuck Lever