Message-Id: <20251218014531.3793471-4-mathieu.desnoyers@efficios.com>
Date: Wed, 17 Dec 2025 20:45:30 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Boqun Feng <boqun.feng@...il.com>,
Joel Fernandes <joel@...lfernandes.org>,
"Paul E. McKenney" <paulmck@...nel.org>
Cc: linux-kernel@...r.kernel.org,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Nicholas Piggin <npiggin@...il.com>,
Michael Ellerman <mpe@...erman.id.au>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Will Deacon <will@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Alan Stern <stern@...land.harvard.edu>,
John Stultz <jstultz@...gle.com>,
Neeraj Upadhyay <Neeraj.Upadhyay@....com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Frederic Weisbecker <frederic@...nel.org>,
Josh Triplett <josh@...htriplett.org>,
Uladzislau Rezki <urezki@...il.com>,
Steven Rostedt <rostedt@...dmis.org>,
Lai Jiangshan <jiangshanlai@...il.com>,
Zqiang <qiang.zhang1211@...il.com>,
Ingo Molnar <mingo@...hat.com>,
Waiman Long <longman@...hat.com>,
Mark Rutland <mark.rutland@....com>,
Thomas Gleixner <tglx@...utronix.de>,
Vlastimil Babka <vbabka@...e.cz>,
maged.michael@...il.com,
Mateusz Guzik <mjguzik@...il.com>,
Jonas Oberhauser <jonas.oberhauser@...weicloud.com>,
rcu@...r.kernel.org,
linux-mm@...ck.org,
lkmm@...ts.linux.dev
Subject: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
This API provides existence guarantees for objects through Hazard
Pointers [1] (hazptr).

Its main benefit over RCU is that it allows fast reclaim of
HP-protected objects without needing to wait for a grace period.

This implementation provides 8 statically allocated hazard pointer
slots per CPU for the fast path, and relies on an on-stack backup slot
provided by the hazard pointer user as a fallback when no per-CPU slot
is available.
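
Typical usage (a minimal sketch; "struct foo", "gp_foo", use() and the
kfree()-based reclaim are illustrative, not part of this patch):

	struct hazptr_ctx ctx;
	struct foo *p;

	/* Reader: acquire, use, release. */
	p = hazptr_acquire(&ctx, (void * const *)&gp_foo);
	if (p) {
		/* p cannot be reclaimed until hazptr_release(). */
		use(p);
		hazptr_release(&ctx, p);
	}

	/* Updater: unpublish, synchronize, reclaim. */
	struct foo *old = gp_foo;

	WRITE_ONCE(gp_foo, NULL);	/* Store A: unpublish. */
	hazptr_synchronize(old);	/* Wait for readers to release old. */
	kfree(old);

The reader's hazard pointer publication (Store B) followed by its
re-load of the protected pointer (Load A) pairs with the updater's
unpublication (Store A) followed by its scan of all slots (Load B):
either the reader observes the unpublished pointer and retries, or the
updater observes the occupied slot and busy-waits until it is
released.
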
References:
[1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
lock-free objects," in IEEE Transactions on Parallel and
Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
Link: https://lpc.events/event/19/contributions/2082/
Link: https://lore.kernel.org/lkml/j3scdl5iymjlxavomgc6u5ndg3svhab6ga23dr36o4f5mt333w@7xslvq6b6hmv/
Link: https://lpc.events/event/18/contributions/1731/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Nicholas Piggin <npiggin@...il.com>
Cc: Michael Ellerman <mpe@...erman.id.au>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Cc: "Paul E. McKenney" <paulmck@...nel.org>
Cc: Will Deacon <will@...nel.org>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Boqun Feng <boqun.feng@...il.com>
Cc: Alan Stern <stern@...land.harvard.edu>
Cc: John Stultz <jstultz@...gle.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@....com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>
Cc: Frederic Weisbecker <frederic@...nel.org>
Cc: Joel Fernandes <joel@...lfernandes.org>
Cc: Josh Triplett <josh@...htriplett.org>
Cc: Uladzislau Rezki <urezki@...il.com>
Cc: Steven Rostedt <rostedt@...dmis.org>
Cc: Lai Jiangshan <jiangshanlai@...il.com>
Cc: Zqiang <qiang.zhang1211@...il.com>
Cc: Ingo Molnar <mingo@...hat.com>
Cc: Waiman Long <longman@...hat.com>
Cc: Mark Rutland <mark.rutland@....com>
Cc: Thomas Gleixner <tglx@...utronix.de>
Cc: Vlastimil Babka <vbabka@...e.cz>
Cc: maged.michael@...il.com
Cc: Mateusz Guzik <mjguzik@...il.com>
Cc: Jonas Oberhauser <jonas.oberhauser@...weicloud.com>
Cc: rcu@...r.kernel.org
Cc: linux-mm@...ck.org
Cc: lkmm@...ts.linux.dev
---
Changes since v3:
- Rename hazptr_retire to hazptr_release.
- Remove domains.
- Introduce "backup_slot" within hazptr context structure (on stack)
to handle slot overflow.
- Rename hazptr_try_protect to hazptr_acquire.
- Preallocate 8 per-CPU slots, and rely on caller-provided backup
slots (typically on stack) for out-of-slots situations.
Changes since v2:
- Address Peter Zijlstra's comments.
- Address Paul E. McKenney's comments.
Changes since v0:
- Remove slot variable from hp_dereference_allocate().
---
include/linux/hazptr.h | 182 +++++++++++++++++++++++++++++++++++++++++
init/main.c | 2 +
kernel/Makefile | 2 +-
kernel/hazptr.c | 150 +++++++++++++++++++++++++++++++++
4 files changed, 335 insertions(+), 1 deletion(-)
create mode 100644 include/linux/hazptr.h
create mode 100644 kernel/hazptr.c
diff --git a/include/linux/hazptr.h b/include/linux/hazptr.h
new file mode 100644
index 000000000000..70c066ddb0f5
--- /dev/null
+++ b/include/linux/hazptr.h
@@ -0,0 +1,182 @@
+// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
+//
+// SPDX-License-Identifier: LGPL-2.1-or-later
+
+#ifndef _LINUX_HAZPTR_H
+#define _LINUX_HAZPTR_H
+
+/*
+ * hazptr: Hazard Pointers
+ *
+ * This API provides existence guarantees for objects through hazard
+ * pointers.
+ *
+ * Its main benefit over RCU is that it allows fast reclaim of
+ * HP-protected objects without needing to wait for a grace period.
+ *
+ * References:
+ *
+ * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
+ * lock-free objects," in IEEE Transactions on Parallel and
+ * Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
+ */
+
+#include <linux/percpu.h>
+#include <linux/types.h>
+#include <linux/cleanup.h>
+
+/* 8 slots (sizeof(void *) each) fit in a single cache line. */
+#define NR_HAZPTR_PERCPU_SLOTS 8
+
+/*
+ * Hazard pointer slot.
+ */
+struct hazptr_slot {
+	void *addr;
+};
+
+struct hazptr_backup_slot {
+	struct list_head node;
+	struct hazptr_slot slot;
+	/* CPU requesting the backup slot. */
+	int cpu;
+};
+
+struct hazptr_ctx {
+	struct hazptr_slot *slot;
+	/* Backup slot in case all per-CPU slots are used. */
+	struct hazptr_backup_slot backup_slot;
+};
+
+struct hazptr_percpu_slots {
+	struct hazptr_slot slots[NR_HAZPTR_PERCPU_SLOTS];
+} ____cacheline_aligned;
+
+DECLARE_PER_CPU(struct hazptr_percpu_slots, hazptr_percpu_slots);
+
+/*
+ * hazptr_synchronize: Wait until @addr is released from all slots.
+ *
+ * Wait to observe that each slot contains a value that differs from
+ * @addr before returning.
+ * Must be called from preemptible context.
+ */
+void hazptr_synchronize(void *addr);
+
+/*
+ * hazptr_chain_backup_slot: Chain backup slot into overflow list.
+ *
+ * Initialize the backup slot address to NULL, and chain it into the
+ * current CPU's overflow list. Returns the backup slot.
+ */
+struct hazptr_slot *hazptr_chain_backup_slot(struct hazptr_ctx *ctx);
+
+/*
+ * hazptr_unchain_backup_slot: Unchain backup slot from overflow list.
+ */
+void hazptr_unchain_backup_slot(struct hazptr_ctx *ctx);
+
+static inline
+struct hazptr_slot *hazptr_get_free_percpu_slot(void)
+{
+	struct hazptr_percpu_slots *percpu_slots = this_cpu_ptr(&hazptr_percpu_slots);
+	unsigned int idx;
+
+	for (idx = 0; idx < NR_HAZPTR_PERCPU_SLOTS; idx++) {
+		struct hazptr_slot *slot = &percpu_slots->slots[idx];
+
+		if (!READ_ONCE(slot->addr))
+			return slot;
+	}
+	/* All slots are in use. */
+	return NULL;
+}
+
+static inline
+bool hazptr_slot_is_backup(struct hazptr_ctx *ctx, struct hazptr_slot *slot)
+{
+	return slot == &ctx->backup_slot.slot;
+}
+
+/*
+ * hazptr_acquire: Load pointer at address and protect with hazard pointer.
+ *
+ * Load @addr_p, and protect the loaded pointer with a hazard pointer.
+ *
+ * Returns a non-NULL protected address if the loaded pointer is non-NULL.
+ * Returns NULL if the loaded pointer is NULL.
+ *
+ * On success the protected hazptr slot is stored in @ctx->slot.
+ */
+static inline
+void *hazptr_acquire(struct hazptr_ctx *ctx, void * const *addr_p)
+{
+	struct hazptr_slot *slot = NULL;
+	void *addr, *addr2;
+
+	/*
+	 * Load @addr_p to know which address should be protected.
+	 */
+	addr = READ_ONCE(*addr_p);
+	for (;;) {
+		if (!addr)
+			return NULL;
+		guard(preempt)();
+		if (likely(!hazptr_slot_is_backup(ctx, slot))) {
+			slot = hazptr_get_free_percpu_slot();
+			/*
+			 * If all the per-CPU slots are already in use, fall
+			 * back to the backup slot.
+			 */
+			if (unlikely(!slot))
+				slot = hazptr_chain_backup_slot(ctx);
+		}
+		WRITE_ONCE(slot->addr, addr); /* Store B */
+
+		/* Memory ordering: Store B before Load A. */
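+		/* Pairs with "Store A before Load B" in hazptr_synchronize(). */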
+		smp_mb();
+
+		/*
+		 * Re-load @addr_p after storing addr to the hazard pointer
+		 * slot.
+		 */
+		addr2 = READ_ONCE(*addr_p); /* Load A */
+		if (likely(ptr_eq(addr2, addr)))
+			break;
+		/*
+		 * If @addr_p content has changed since the first load,
+		 * release the hazard pointer and try again.
+		 */
+		WRITE_ONCE(slot->addr, NULL);
+		if (!addr2) {
+			if (hazptr_slot_is_backup(ctx, slot))
+				hazptr_unchain_backup_slot(ctx);
+			return NULL;
+		}
+		addr = addr2;
+	}
+	ctx->slot = slot;
+	/*
+	 * Use addr2 loaded from the second READ_ONCE() to preserve
+	 * address dependency ordering.
+	 */
+	return addr2;
+}
+
+/* Release the hazard pointer slot in @ctx protecting @addr. */
+static inline
+void hazptr_release(struct hazptr_ctx *ctx, void *addr)
+{
+	struct hazptr_slot *slot;
+
+	if (!addr)
+		return;
+	slot = ctx->slot;
+	WARN_ON_ONCE(slot->addr != addr);
+	smp_store_release(&slot->addr, NULL);
+	if (unlikely(hazptr_slot_is_backup(ctx, slot)))
+		hazptr_unchain_backup_slot(ctx);
+}
+
+void hazptr_init(void);
+
+#endif /* _LINUX_HAZPTR_H */
diff --git a/init/main.c b/init/main.c
index 07a3116811c5..858eaa87bde7 100644
--- a/init/main.c
+++ b/init/main.c
@@ -104,6 +104,7 @@
 #include <linux/pidfs.h>
 #include <linux/ptdump.h>
 #include <linux/time_namespace.h>
+#include <linux/hazptr.h>
 
 #include <net/net_namespace.h>
 #include <asm/io.h>
@@ -1002,6 +1003,7 @@ void start_kernel(void)
 	workqueue_init_early();
 
 	rcu_init();
+	hazptr_init();
 	kvfree_rcu_init();
 
 	/* Trace events are available after this */
diff --git a/kernel/Makefile b/kernel/Makefile
index 9fe722305c9b..1178907fe0ea 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \
 	    cpu.o exit.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o user.o \
 	    signal.o sys.o umh.o workqueue.o pid.o task_work.o \
-	    extable.o params.o \
+	    extable.o params.o hazptr.o \
 	    kthread.o sys_ni.o nsproxy.o nstree.o nscommon.o \
 	    notifier.o ksysfs.o cred.o reboot.o \
 	    async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
diff --git a/kernel/hazptr.c b/kernel/hazptr.c
new file mode 100644
index 000000000000..2ec288bc1132
--- /dev/null
+++ b/kernel/hazptr.c
@@ -0,0 +1,150 @@
+// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
+//
+// SPDX-License-Identifier: LGPL-2.1-or-later
+
+/*
+ * hazptr: Hazard Pointers
+ */
+
+#include <linux/hazptr.h>
+#include <linux/percpu.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/export.h>
+
+struct overflow_list {
+	raw_spinlock_t lock;	/* Lock protecting overflow list and list generation. */
+	struct list_head head;	/* Overflow list head. */
+	u64 gen;		/* Overflow list generation. */
+};
+
+static DEFINE_PER_CPU(struct overflow_list, percpu_overflow_list);
+
+DEFINE_PER_CPU(struct hazptr_percpu_slots, hazptr_percpu_slots);
+EXPORT_PER_CPU_SYMBOL_GPL(hazptr_percpu_slots);
+
+/*
+ * Iterate piecewise over the overflow list, waiting until "addr" is
+ * no longer present. The raw spinlock is released and re-acquired
+ * between each list item and between busy-wait iterations. The
+ * overflow list generation is checked each time the lock is taken, to
+ * validate that the list has not changed before resuming iteration or
+ * busy-waiting. If the generation has changed, retry the entire list
+ * traversal.
+ */
+static
+void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
+{
+	struct hazptr_backup_slot *backup_slot;
+	u64 snapshot_gen;
+
+	raw_spin_lock(&overflow_list->lock);
+retry:
+	snapshot_gen = overflow_list->gen;
+	list_for_each_entry(backup_slot, &overflow_list->head, node) {
+		/* Busy-wait while the slot contains @addr. */
+		while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
+			raw_spin_unlock(&overflow_list->lock);
+			cpu_relax();
+			raw_spin_lock(&overflow_list->lock);
+			if (overflow_list->gen != snapshot_gen)
+				goto retry;
+		}
+		/*
+		 * Release the raw spinlock between list items, and
+		 * validate the generation after re-acquiring it.
+		 */
+		raw_spin_unlock(&overflow_list->lock);
+		raw_spin_lock(&overflow_list->lock);
+		if (overflow_list->gen != snapshot_gen)
+			goto retry;
+	}
+	raw_spin_unlock(&overflow_list->lock);
+}
+
+static
+void hazptr_synchronize_cpu_slots(int cpu, void *addr)
+{
+	struct hazptr_percpu_slots *percpu_slots = per_cpu_ptr(&hazptr_percpu_slots, cpu);
+	unsigned int idx;
+
+	for (idx = 0; idx < NR_HAZPTR_PERCPU_SLOTS; idx++) {
+		struct hazptr_slot *slot = &percpu_slots->slots[idx];
+
+		/* Busy-wait while the slot contains @addr. */
+		smp_cond_load_acquire(&slot->addr, VAL != addr); /* Load B */
+	}
+}
+
+/*
+ * hazptr_synchronize: Wait until @addr is released from all slots.
+ *
+ * Wait to observe that each slot contains a value that differs from
+ * @addr before returning.
+ * Must be called from preemptible context.
+ */
+void hazptr_synchronize(void *addr)
+{
+	int cpu;
+
+	/*
+	 * Busy-waiting should only be done from preemptible context.
+	 */
+	lockdep_assert_preemption_enabled();
+
+	/*
+	 * Store A precedes hazptr_synchronize(): the caller unpublishes
+	 * @addr (sets it to NULL or to a different value), and thus hides
+	 * it from hazard pointer readers.
+	 */
+	if (!addr)
+		return;
+	/* Memory ordering: Store A before Load B. */
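+	/* Pairs with "Store B before Load A" in hazptr_acquire(). */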
+	smp_mb();
+	/* Scan all CPUs' slots. */
+	for_each_possible_cpu(cpu) {
+		/* Scan the per-CPU slots. */
+		hazptr_synchronize_cpu_slots(cpu, addr);
+		/* Scan the backup slots in the per-CPU overflow list. */
+		hazptr_synchronize_overflow_list(per_cpu_ptr(&percpu_overflow_list, cpu), addr);
+	}
+}
+EXPORT_SYMBOL_GPL(hazptr_synchronize);
+
+struct hazptr_slot *hazptr_chain_backup_slot(struct hazptr_ctx *ctx)
+{
+	struct overflow_list *overflow_list = this_cpu_ptr(&percpu_overflow_list);
+	struct hazptr_slot *slot = &ctx->backup_slot.slot;
+
+	slot->addr = NULL;
+
+	raw_spin_lock(&overflow_list->lock);
+	overflow_list->gen++;
+	list_add(&ctx->backup_slot.node, &overflow_list->head);
+	ctx->backup_slot.cpu = smp_processor_id();
+	raw_spin_unlock(&overflow_list->lock);
+	return slot;
+}
+EXPORT_SYMBOL_GPL(hazptr_chain_backup_slot);
+
+void hazptr_unchain_backup_slot(struct hazptr_ctx *ctx)
+{
+	struct overflow_list *overflow_list = per_cpu_ptr(&percpu_overflow_list, ctx->backup_slot.cpu);
+
+	raw_spin_lock(&overflow_list->lock);
+	overflow_list->gen++;
+	list_del(&ctx->backup_slot.node);
+	raw_spin_unlock(&overflow_list->lock);
+}
+EXPORT_SYMBOL_GPL(hazptr_unchain_backup_slot);
+
+void __init hazptr_init(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct overflow_list *overflow_list = per_cpu_ptr(&percpu_overflow_list, cpu);
+
+		raw_spin_lock_init(&overflow_list->lock);
+		INIT_LIST_HEAD(&overflow_list->head);
+	}
+}
--
2.39.5