Message-ID: <42607ed5-f543-41bd-94da-aa0ee7ec71cd@efficios.com>
Date: Thu, 18 Dec 2025 18:36:00 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Boqun Feng <boqun.feng@...il.com>
Cc: Joel Fernandes <joel@...lfernandes.org>,
"Paul E. McKenney" <paulmck@...nel.org>, linux-kernel@...r.kernel.org,
Nicholas Piggin <npiggin@...il.com>, Michael Ellerman <mpe@...erman.id.au>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
Will Deacon <will@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
Alan Stern <stern@...land.harvard.edu>, John Stultz <jstultz@...gle.com>,
Neeraj Upadhyay <Neeraj.Upadhyay@....com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Frederic Weisbecker <frederic@...nel.org>,
Josh Triplett <josh@...htriplett.org>, Uladzislau Rezki <urezki@...il.com>,
Steven Rostedt <rostedt@...dmis.org>, Lai Jiangshan
<jiangshanlai@...il.com>, Zqiang <qiang.zhang1211@...il.com>,
Ingo Molnar <mingo@...hat.com>, Waiman Long <longman@...hat.com>,
Mark Rutland <mark.rutland@....com>, Thomas Gleixner <tglx@...utronix.de>,
Vlastimil Babka <vbabka@...e.cz>, maged.michael@...il.com,
Mateusz Guzik <mjguzik@...il.com>,
Jonas Oberhauser <jonas.oberhauser@...weicloud.com>, rcu@...r.kernel.org,
linux-mm@...ck.org, lkmm@...ts.linux.dev
Subject: Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers

On 2025-12-18 15:22, Boqun Feng wrote:
[...]
>>> Could you utilize this[1] to see a
>>> comparison of the reader-side performance against RCU/SRCU?
>>
>> Good point! Let's see.
>>
>> On an AMD 2x EPYC 9654 96-Core Processor with 192 cores,
>> hyperthreading disabled,
>> CONFIG_PREEMPT=y,
>> CONFIG_PREEMPT_RCU=y,
>> CONFIG_PREEMPT_HAZPTR=y.
>>
>> scale_type ns
>> -----------------------
>> hazptr-smp-mb 13.1 <- this implementation
>> hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
>> hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
>> rcu 17.0
>> srcu 20.0
>> srcu-fast 1.5
>> rcu-tasks 0.0
>> rcu-trace 1.7
>> refcnt 1148.0
>> rwlock 1190.0
>> rwsem 4199.3
>> lock 41070.6
>> lock-irq 46176.3
>> acqrel 1.1
>>
>> So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
>> appear to beat hazptr read-side performance.
>>
>
> Could you also see the reader-side performance impact when the percpu
> hazard pointer slots are used up? I.e. the worst case.

I've modified the code to populate "(void *)1UL" in the first 7 slots
at boot time; here is the result:

hazptr-smp-mb-7-fail    16.3 ns

So we go from 13.1 ns to 16.3 ns when all but one slot is in use.

And if we pre-populate all 8 slots for each CPU, thus forcing fallback
to the overflow list:

hazptr-smp-mb-8-fail    67.1 ns
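
For reference, the protect fast path measured above behaves roughly like
the sketch below. This is a simplified illustration, not the exact code
from the patch: HAZPTR_NR_SLOTS, hazptr_ctx, hazptr_protect() and
hazptr_protect_overflow() are made-up names; only the slot-then-overflow
structure matches the description above (NULL-pointer handling and slot
release are elided for brevity).

#include <linux/compiler.h>	/* READ_ONCE(), WRITE_ONCE() */
#include <asm/barrier.h>	/* smp_mb() */

#define HAZPTR_NR_SLOTS	8	/* fixed slots, as in the fail-case tests above */

struct hazptr_slot {
	void *addr;
};

struct hazptr_ctx {
	struct hazptr_slot slots[HAZPTR_NR_SLOTS];
};

/* Slow path: queue a backup slot on the per-CPU overflow list (not shown). */
void *hazptr_protect_overflow(struct hazptr_ctx *ctx, void **pobj);

/*
 * Try to protect the object pointed to by *pobj.  Returns the protected
 * pointer, or NULL if the object changed under us (caller retries).
 */
static void *hazptr_protect(struct hazptr_ctx *ctx, void **pobj)
{
	void *obj;
	int i;

	for (i = 0; i < HAZPTR_NR_SLOTS; i++) {
		if (READ_ONCE(ctx->slots[i].addr))
			continue;	/* Slot taken by an outer critical section. */
		obj = READ_ONCE(*pobj);
		WRITE_ONCE(ctx->slots[i].addr, obj);	/* Publish hazard pointer. */
		smp_mb();	/* Order publication before re-check; pairs with the updater's scan. */
		if (READ_ONCE(*pobj) == obj)
			return obj;	/* Protected. */
		WRITE_ONCE(ctx->slots[i].addr, NULL);
		return NULL;
	}
	/*
	 * All slots busy (the "8-fail" case): fall back to the per-CPU
	 * overflow list, which takes a raw spinlock, hence the jump to ~67 ns.
	 */
	return hazptr_protect_overflow(ctx, pobj);
}
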
>
>> [...]
>>
>>>> +/*
>>>> + * Perform piecewise iteration on overflow list waiting until "addr" is
>>>> + * not present. Raw spinlock is released and taken between each list
>>>> + * item and busy loop iteration. The overflow list generation is checked
>>>> + * each time the lock is taken to validate that the list has not changed
>>>> + * before resuming iteration or busy wait. If the generation has
>>>> + * changed, retry the entire list traversal.
>>>> + */
>>>> +static
>>>> +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
>>>> +{
>>>> + struct hazptr_backup_slot *backup_slot;
>>>> + uint64_t snapshot_gen;
>>>> +
>>>> + raw_spin_lock(&overflow_list->lock);
>>>> +retry:
>>>> + snapshot_gen = overflow_list->gen;
>>>> + list_for_each_entry(backup_slot, &overflow_list->head, node) {
>>>> + /* Busy-wait if node is found. */
>>>> + while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
>>>> + raw_spin_unlock(&overflow_list->lock);
>>>> + cpu_relax();
>>>
>>> I think we should prioritize the scan thread solution [2] instead of
>>> busy waiting hazard pointer updaters, because when we have multiple
>>> hazard pointer usages we would want to consolidate the scans from
>>> updater side.
>>
>> I agree that batching scans with a worker thread is a logical next step.
>>
>>> If so, the whole ->gen can be avoided.
>>
>> How would it allow removing the generation trick without causing long
>> raw spinlock latencies?
>>
>
> Because we won't need to busy-wait for the readers to go away, we can
> check whether they are still there in the next scan.
>
> so:
>
> list_for_each_entry(backup_slot, &overflow_list->head, node) {
> /* Busy-wait if node is found. */
> if (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> <mark addr as unable to free and move on>

But then you still iterate over a possibly large list of overflow
nodes with a raw spinlock held. That raw spinlock is taken by the
scheduler on context switch, so this can cause very long scheduler
latencies.

So breaking up the iteration into pieces is not just there to handle
busy-waiting: it also makes sure we don't increase system latency by
holding a raw spinlock (taken with the rq lock held) for longer than
the short time needed to step to the next node.
Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com