[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aad84044-a9d3-4806-a966-4770a3634b03@redhat.com>
Date: Sun, 30 Nov 2025 11:04:31 -0500
From: Waiman Long <llong@...hat.com>
To: Anatoly Stepanov <stepanov.anatoly@...wei.com>, peterz@...radead.org,
boqun.feng@...il.com, catalin.marinas@....com, will@...nel.org,
mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com, x86@...nel.org,
hpa@...or.com, arnd@...db.de, dvhart@...radead.org, dave@...olabs.net,
andrealmeid@...lia.com
Cc: linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org,
guohanjun@...wei.com, wangkefeng.wang@...wei.com, weiyongjun1@...wei.com,
yusongping@...wei.com, leijitang@...wei.com, artem.kuzin@...wei.com,
fedorov.nikita@...artners.com, kang.sun@...wei.com
Subject: Re: [RFC PATCH 0/1] Introduce Hierarchical Queued NUMA-aware spinlock
On 11/27/25 10:26 PM, Anatoly Stepanov wrote:
> [Introduction & Motivation]
>
> In a high contention case, existing Linux kernel spinlock implementations can become
> inefficient on modern NUMA-systems due to frequent and expensive
> cross-NUMA cache-line transfers.
>
> This might happen due to following reasons:
> - on "contender enqueue" each lock contender updates a shared lock structure
> - on "MCS handoff" cross-NUMA cache-line transfer occurs when
> two contenders are from different NUMA nodes.
>
> We introduce Hierarchical Queued Spinlock (HQ spinlock), aiming to
> reduce cross-NUMA cache line traffic and thus improving lock/unlock throughput
> for high-contention cases.
>
> This idea might be considered as a combination of cohort-locking
> by Dave Dice and Linux kernel queued spinlock.
>
> [Design and workflow]
>
> The contenders are organized into 2-level scheme where each NUMA-node
> has it's own FIFO contender queue.
>
> NUMA-queues are linked into single-linked list alike structure, while
> maintaining FIFO order between them.
> When there're no contenders left in a NUMA-queue, the queue is removed from
> the list.
>
> Contenders try to enqueue to local NUMA-queue first and if there's
> no such queue - link it to the list.
> As for "qspinlock" only the first contender is spinning globally,
> all others - MCS-spinning.
>
> "Handoff" operation becomes two-staged:
> - local handoff: between contenders in a single NUMA-queue
> - remote handoff: between different NUMA-queues.
>
> If "remote handoff" reaches the end of the NUMA-queue linked list it goes to
> the list head.
>
> To avoid potential starvation issues, we use handoff thresholds.
>
> Key challenge here was keeping the same "qspinlock" structure size
> and "bind" a given lock to related NUMA-queues.
>
> We came up with dynamic lock metadata concept, where we can
> dynamically "bind" a given lock to it's NUMA-related metadata, and
> then "unbind" when the lock is released.
>
> This approach allows to avoid metadata reservation for every lock in advance,
> thus giving the upperbound of metadata instance number to ~ (NR_CPUS x nr_contexts / 2).
> Which corresponds to maximum amount of different locks falling into the slowpath.
>
> [Dynamic mode switching]
>
> HQ-lock supports switching between "default qspinlock" mode to "NUMA-aware lock" mode and backwards.
> Here we introduce simple scheme: when the contention is high enough a lock switches to NUMA-aware mode
> until it is completely released.
>
> If for some reason "NUMA-aware" mode cannot be enabled we fallback to default qspinlock mode.
>
> [Usage]
>
> If someone wants to try the HQ-lock in some subsystem, just
> change lock initialization code from "spin_lock_init()" to "spin_lock_init_hq()".
> The dedicated bit in the lock structure is used to distiguish between the two lock types.
>
> [Performance measurements]
>
> Performance measurements were done on x86 (AMD EPYC) and arm64 (Kunpeng 920)
> platforms with the following scenarious:
> - Locktorture benchmark
> - Memcached + memtier benchmark
> - Ngnix + Wrk benchmark
>
> [Locktorture]
>
> NPS stands for "Nodes per socket"
> +------------------------------+-----------------------+-------+-------+--------+
> | AMD EPYC 9654 |
> +------------------------------+-----------------------+-------+-------+--------+
> | 192 cores (x2 hyper-threads) | | | | |
> | 2 sockets | | | | |
> | Locktorture 60 sec. | NUMA nodes per-socket | | | |
> | Average gain (single lock) | 1 NPS | 2 NPS | 4 NPS | 12 NPS |
> | Total contender threads | | | | |
> | 8 | 19% | 21% | 12% | 12% |
> | 16 | 13% | 18% | 34% | 75% |
> | 32 | 8% | 14% | 25% | 112% |
> | 64 | 11% | 12% | 30% | 152% |
> | 128 | 9% | 17% | 37% | 163% |
> | 256 | 2% | 16% | 40% | 168% |
> | 384 | -1% | 14% | 44% | 186% |
> +------------------------------+-----------------------+-------+-------+--------+
>
> +-----------------+-------+-------+-------+--------+
> | Fairness factor | 1 NPS | 2 NPS | 4 NPS | 12 NPS |
> +-----------------+-------+-------+-------+--------+
> | 8 | 0.54 | 0.57 | 0.57 | 0.55 |
> | 16 | 0.52 | 0.53 | 0.60 | 0.58 |
> | 32 | 0.53 | 0.53 | 0.53 | 0.61 |
> | 64 | 0.52 | 0.56 | 0.54 | 0.56 |
> | 128 | 0.51 | 0.54 | 0.54 | 0.53 |
> | 256 | 0.52 | 0.52 | 0.52 | 0.52 |
> | 384 | 0.51 | 0.51 | 0.51 | 0.51 |
> +-----------------+-------+-------+-------+--------+
>
> +-------------------------+--------------+
> | Kunpeng 920 (arm64) | |
> +-------------------------+--------------+
> | 96 cores (no MT) | |
> | 2 sockets, 4 NUMA nodes | |
> | Locktorture 60 sec. | |
> | | |
> | Total contender threads | Average gain |
> | 8 | 93% |
> | 16 | 142% |
> | 32 | 129% |
> | 64 | 152% |
> | 96 | 158% |
> +-------------------------+--------------+
>
> [Memcached]
>
> +---------------------------------+-----------------+-------------------+
> | AMD EPYC 9654 | | |
> +---------------------------------+-----------------+-------------------+
> | 192 cores (x2 hyper-threads) | | |
> | 2 sockets, NPS=4 | | |
> | | | |
> | Memtier+memcached 1:1 R/W ratio | | |
> | Workers | Throughput gain | Latency reduction |
> | 32 | 1% | -1% |
> | 64 | 1% | -1% |
> | 128 | 3% | -4% |
> | 256 | 7% | -6% |
> | 384 | 10% | -8% |
> +---------------------------------+-----------------+-------------------+
>
> +---------------------------------+-----------------+-------------------+
> | Kunpeng 920 (arm64) | | |
> +---------------------------------+-----------------+-------------------+
> | 96 cores (no MT) | | |
> | 2 sockets, 4 NUMA nodes | | |
> | | | |
> | Memtier+memcached 1:1 R/W ratio | | |
> | Workers | Throughput gain | Latency reduction |
> | 32 | 4% | -3% |
> | 64 | 6% | -6% |
> | 80 | 8% | -7% |
> | 96 | 8% | -8% |
> +---------------------------------+-----------------+-------------------+
>
> [Nginx]
>
> +-----------------------------------------------------------------------+-----------------+
> | Kunpeng 920 (arm64) | |
> +-----------------------------------------------------------------------+-----------------+
> | 96 cores (no MT) | |
> | 2 sockets, 4 NUMA nodes | |
> | | |
> | Nginx + WRK benchmark, single file (lockref spinlock contention case) | |
> | Workers | Throughput gain |
> | 32 | 1% |
> | 64 | 68% |
> | 80 | 72% |
> | 96 | 78% |
> +-----------------------------------------------------------------------+-----------------+
> Despite, the test is a single-file test, it can be related to real-life cases, when some
> html-pages are accessed much more frequently than others (index.html, etc.)
>
> [Low contention remarks]
>
> With the simple contention detection scheme presented in the patch we observe
> some degradation, compared to "qspinlock" in low-contention scenraios (< 8 contenders).
>
> We're working on more sophisticated and effective contention-detection method, that
> addresses the issue.
>
> Anatoly Stepanov, Nikita Fedorov (1):
> HQspinlock - (NUMA-aware) Hierarchical Queued spinlock
>
> arch/arm64/Kconfig | 28 +
> arch/arm64/include/asm/qspinlock.h | 37 ++
> arch/x86/Kconfig | 28 +
> arch/x86/include/asm/qspinlock.h | 38 +-
> include/asm-generic/qspinlock.h | 23 +-
> include/asm-generic/qspinlock_types.h | 52 +-
> include/linux/lockref.h | 2 +-
> include/linux/spinlock.h | 26 +
> include/linux/spinlock_types.h | 26 +
> include/linux/spinlock_types_raw.h | 20 +
> init/main.c | 4 +
> kernel/futex/core.c | 2 +-
> kernel/locking/hqlock_core.h | 832 ++++++++++++++++++++++++++
> kernel/locking/hqlock_meta.h | 477 +++++++++++++++
> kernel/locking/hqlock_proc.h | 88 +++
> kernel/locking/hqlock_types.h | 118 ++++
> kernel/locking/qspinlock.c | 67 ++-
> kernel/locking/qspinlock.h | 4 +-
> kernel/locking/spinlock_debug.c | 20 +
> 19 files changed, 1862 insertions(+), 30 deletions(-)
> create mode 100644 arch/arm64/include/asm/qspinlock.h
> create mode 100644 kernel/locking/hqlock_core.h
> create mode 100644 kernel/locking/hqlock_meta.h
> create mode 100644 kernel/locking/hqlock_proc.h
> create mode 100644 kernel/locking/hqlock_types.h
There was a past attempt to make qspinlock NUMA-aware.
https://lore.kernel.org/lkml/20210514200743.3026725-1-alex.kogan@oracle.com/
I am not sure if you had seen that as you didn't mention about this patch.
That previous patch differs from yours in the following ways:
1) It can coexist with pvspinlock and is activated only if running on
bare metal and on systems with 2 or more NUMA nodes. Your patch is
mutually exhaustive with pvqspinlock.
2) Once the NUMA-aware lock is activated, all the spinlocks will use it
instead of using a separate init call to enable its use.
3) Your patch can dynamically switch between the qspinlock and HQ lock
mode which the previous one doesn't.
There are pros and cons in both cases.
I also have some general comments about this patch.
Your patch is too large to review as a single patch. I would suggest
breaking it up in multiple smaller patches to make it easier to review.
Also the cover letter won't be included if the patch is merged, but the
patch description in that single patch is not informative at all. I
would suggest putting the design and the performance information shown
in this cover letter to the individual patches and with a relatively
brief cover letter.
I have more comments about the actual content of your patch 1 in another
email.
Cheers,
Longman
Powered by blists - more mailing lists