Message-ID: <IA1PR11MB617102EB293516D2DC81F28089769@IA1PR11MB6171.namprd11.prod.outlook.com>
Date:   Tue, 9 May 2023 02:45:34 +0000
From:   "Zhuo, Qiuxu" <qiuxu.zhuo@...el.com>
To:     Waiman Long <longman@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>
CC:     Boqun Feng <boqun.feng@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on
 locked_pending bits

Hi Waiman,

Thanks for your review of this patch. 
Please see the comments below.

> From: Waiman Long <longman@...hat.com>
> Sent: Monday, May 8, 2023 11:31 PM
> To: Zhuo, Qiuxu <qiuxu.zhuo@...el.com>; Peter Zijlstra
> <peterz@...radead.org>; Ingo Molnar <mingo@...hat.com>; Will Deacon
> <will@...nel.org>
> Cc: Boqun Feng <boqun.feng@...il.com>; linux-kernel@...r.kernel.org
> Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on
> locked_pending bits
> 
> 
> On 5/8/23 04:15, Qiuxu Zhuo wrote:
> > The 1st spinner (header of the MCS queue) spins on the whole qspinlock
> > variable to check whether the lock is released. For a contended
> > qspinlock, this spinning is a hotspot as each CPU queued in the MCS
> > queue performs the spinning when it becomes the 1st spinner (header of
> > the MCS queue).
> >
> > The granularity among SMT h/w threads in the same core could be "byte"
> > which the Load-Store Unit (LSU) inside the core handles. Making the
> > 1st spinner only spin on locked_pending bits (not the whole qspinlock)
> > can avoid the false dependency between the tail field and the
> > locked_pending field. So this micro-optimization helps the h/w thread
> > (the 1st spinner) stay in a low power state and prevents it from being
> > woken up by other h/w threads in the same core when they perform
> > xchg_tail() to update the tail field. Please see a similar discussion
> > in the link [1].
> >
> > [1]
> > https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org
> >
> > Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@...el.com>
> > ---
> >   kernel/locking/qspinlock.c | 13 +++++++++++++
> >   1 file changed, 13 insertions(+)
> >
> > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > index efebbf19f887..e7b990b28610 100644
> > --- a/kernel/locking/qspinlock.c
> > +++ b/kernel/locking/qspinlock.c
> > @@ -513,7 +513,20 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >   	if ((val = pv_wait_head_or_lock(lock, node)))
> >   		goto locked;
> >
> > +#if _Q_PENDING_BITS == 8
> > +	/*
> > +	 * Spinning on the 2-byte locked_pending instead of the 4-byte qspinlock
> > +	 * variable can avoid the false dependency between the tail field and
> > +	 * the locked_pending field. This helps the h/w thread (the 1st spinner)
> > +	 * stay in a low power state and prevents it from being woken up by other
> > +	 * h/w threads in the same core when they perform xchg_tail() to update
> > +	 * the tail field only.
> > +	 */
> > +	smp_cond_load_acquire(&lock->locked_pending, !VAL);
> > +	val = atomic_read_acquire(&lock->val);
> > +#else
> >   	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
> > +#endif
> >
> >   locked:
> >   	/*
> 
> What hardware can benefit from this change? Do you have any
> micro-benchmark that can show the performance benefit?

i)  I don't have the hardware to measure the benefit directly.
    But I ran a benchmark [1] for the contended spinlock on an Intel Sapphire Rapids
    server (192 h/w threads, 2 sockets), which showed that the 1st spinner's spinning
    was a hotspot (it contributed ~55% of the cache-bouncing traffic measured by
    perf c2c. I don't analyze the cache bouncing here, just note that the spinning
    is a hotspot).

ii) A similar micro-optimization discussion [2] (which looked like it was accepted
    by you 😉) suggested that avoiding the false dependency (between the tail field
    and the locked_pending field) should help the h/w thread (spinner) on some
    arches (e.g., perhaps some ARM64 parts?) stay in a low-power state without
    disruption from its sibling h/w threads in the same core. See the sketch below
    the links for an illustration of the word layout this relies on.

[1] https://github.com/yamahata/ipi_benchmark.git
[2] https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org
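
BTW, to make ii) more concrete, here is a minimal userspace sketch (my own
illustration, not kernel code; names like spin_whole_word and update_tail are
mine). The union mirrors the little-endian qspinlock layout for NR_CPUS < 16K
(u16 locked_pending in the low half, u16 tail in the high half, as in
include/asm-generic/qspinlock_types.h). One thread spins on the whole 4-byte
word as the current code does, while the other keeps exchanging only the tail
halfword, so every tail update still disturbs the spinner (e.g., wakes an SMT
sibling's monitored load) even though the bits it waits on never change:

/*
 * Userspace sketch of the false dependency (illustration only).
 * NOTE: mixed-size atomics aliasing the same storage is not strictly
 * portable ISO C, but it matches what the qspinlock code itself does
 * and works with gcc/clang on x86-64. Build: gcc -std=gnu11 -pthread
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

union lockword {
	_Atomic uint32_t val;			/* models qspinlock.val */
	struct {
		_Atomic uint16_t locked_pending; /* low half (little-endian) */
		_Atomic uint16_t tail;		 /* high half */
	};
};

static union lockword lock = { .val = 1 };	/* starts "locked" */

/* Current scheme: the 1st spinner re-reads the whole 4-byte word. */
static void *spin_whole_word(void *arg)
{
	/* models atomic_cond_read_acquire(&lock->val, !(VAL & mask)) */
	while (atomic_load_explicit(&lock.val, memory_order_acquire) & 0xffff)
		;	/* each tail update disturbs this spinning load */
	return NULL;
}

/* Models xchg_tail(): writes only the tail halfword, never locked_pending. */
static void *update_tail(void *arg)
{
	for (uint16_t i = 1; i != 0; i++)
		atomic_exchange_explicit(&lock.tail, i, memory_order_relaxed);
	/* finally release the "lock" so the spinner can exit */
	atomic_store_explicit(&lock.locked_pending, 0, memory_order_release);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, spin_whole_word, NULL);
	pthread_create(&b, NULL, update_tail, NULL);
	pthread_join(b, NULL);
	pthread_join(a, NULL);
	puts("spinner released");
	return 0;
}

The patched scheme would instead spin only on &lock.locked_pending (i.e.,
smp_cond_load_acquire(&lock->locked_pending, !VAL) in the kernel), so the
spinning load no longer names the bytes that xchg_tail() writes.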

Thanks
-Qiuxu

> Cheers,
> Longman
