linux-kernel - Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on locked

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <3432de9e-6642-10c0-31e5-ac0ce65bea23@redhat.com>
Date:   Mon, 8 May 2023 22:57:39 -0400
From:   Waiman Long <longman@...hat.com>
To:     "Zhuo, Qiuxu" <qiuxu.zhuo@...el.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>
Cc:     Boqun Feng <boqun.feng@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on
 locked_pending bits


On 5/8/23 22:45, Zhuo, Qiuxu wrote:
> Hi Waiman,
>
> Thanks for your review of this patch.
> Please see the comments below.
>
>> From: Waiman Long <longman@...hat.com>
>> Sent: Monday, May 8, 2023 11:31 PM
>> To: Zhuo, Qiuxu <qiuxu.zhuo@...el.com>; Peter Zijlstra
>> <peterz@...radead.org>; Ingo Molnar <mingo@...hat.com>; Will Deacon
>> <will@...nel.org>
>> Cc: Boqun Feng <boqun.feng@...il.com>; linux-kernel@...r.kernel.org
>> Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on
>> locked_pending bits
>>
>>
>> On 5/8/23 04:15, Qiuxu Zhuo wrote:
>>> The 1st spinner (header of the MCS queue) spins on the whole qspinlock
>>> variable to check whether the lock is released. For a contended
>>> qspinlock, this spinning is a hotspot as each CPU queued in the MCS
>>> queue performs the spinning when it becomes the 1st spinner (header of
>> the MCS queue).
>>> The granularity among SMT h/w threads in the same core could be "byte"
>>> which the Load-Store Unit (LSU) inside the core handles. Making the
>>> 1st spinner only spin on locked_pending bits (not the whole qspinlock)
>>> can avoid the false dependency between the tail field and the
>>> locked_pending field. So this micro-optimization helps the h/w thread
>>> (the 1st spinner) stay in a low power state and prevents it from being
>>> woken up by other h/w threads in the same core when they perform
>>> xchg_tail() to update the tail field. Please see a similar discussion in the link
>> [1].
>>> [1]
>>> https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org
>>>
>>> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@...el.com>
>>> ---
>>>    kernel/locking/qspinlock.c | 13 +++++++++++++
>>>    1 file changed, 13 insertions(+)
>>>
>>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>>> index efebbf19f887..e7b990b28610 100644
>>> --- a/kernel/locking/qspinlock.c
>>> +++ b/kernel/locking/qspinlock.c
>>> @@ -513,7 +513,20 @@ void __lockfunc
>> queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>>    	if ((val = pv_wait_head_or_lock(lock, node)))
>>>    		goto locked;
>>>
>>> +#if _Q_PENDING_BITS == 8
>>> +	/*
>>> +	 * Spinning on the 2-byte locked_pending instead of the 4-byte
>> qspinlock
>>> +	 * variable can avoid the false dependency between the tail field and
>>> +	 * the locked_pending field. This helps the h/w thread (the 1st
>> spinner)
>>> +	 * stay in a low power state and prevents it from being woken up by
>> other
>>> +	 * h/w threads in the same core when they perform xchg_tail() to
>> update
>>> +	 * the tail field only.
>>> +	 */
>>> +	smp_cond_load_acquire(&lock->locked_pending, !VAL);
>>> +	val = atomic_read_acquire(&lock->val); #else
>>>    	val = atomic_cond_read_acquire(&lock->val, !(VAL &
>>> _Q_LOCKED_PENDING_MASK));
>>> +#endif
>>>
>>>    locked:
>>>    	/*
>> What hardware can benefit from this change? Do you have any micro-
>> benchmark that can show the performance benefit?
> i)  I don't have the hardware to measure the data.
>      But I run a benchmark [1] for the contended spinlock on an Intel Sapphire Rapids
>      server (192 h/w threads, 2sockets) that showed that the 1st spinner spinning was
>      a hotspot (contributed ~55% cache bouncing traffic measured by the perf C2C.
>       I don't analyze the cache bouncing here, but just say the spinning is a hotspot).
I believe the amount of cacheline bouncing will be the same whether you 
read 32 or 16 bits from the lock word. At least this is my understanding 
of the x86 arch. Please correct me if my assumption is incorrect.

>
> ii) The similar micro-optimization discussion [2] (looked like it was accepted by you 😉) that
>      avoiding the false dependency (between the tail field and the locked_pending field)
>      should help some arches (e.g., some ARM64???) the h/w thread (spinner) stay in a
>      low-power state without the disruption by its sibling h/w threads in the same core.

That is true. However, this patch causes one more read from the lock 
cacheline which isn't necessary for arches that won't benefit from it.  
So I am less incline to accept it unless there is evidence of the 
benefit it can bring.

Cheers,
Longman