Message-ID: <29297e16-2c79-8367-cd6c-efedd0a8d9ed@redhat.com>
Date:   Mon, 8 May 2023 11:31:18 -0400
From:   Waiman Long <longman@...hat.com>
To:     Qiuxu Zhuo <qiuxu.zhuo@...el.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>
Cc:     Boqun Feng <boqun.feng@...il.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on
 locked_pending bits


On 5/8/23 04:15, Qiuxu Zhuo wrote:
> The 1st spinner (the head of the MCS queue) spins on the whole qspinlock
> variable while waiting for the lock to be released. For a contended
> qspinlock, this spinning is a hotspot: each CPU queued in the MCS queue
> performs it once it becomes the 1st spinner (the queue head).
>
> The Load-Store Unit (LSU) within a core may track accesses by SMT h/w
> threads at byte granularity. Making the 1st spinner spin only on the
> locked_pending bits (not the whole qspinlock) thus avoids a false
> dependency between the tail field and the locked_pending field. This
> micro-optimization helps the h/w thread (the 1st spinner) stay in a low
> power state and prevents it from being woken up by other h/w threads in
> the same core when they perform xchg_tail() to update only the tail
> field. See the similar discussion in link [1].
>
> [1] https://lore.kernel.org/r/20230105021952.3090070-1-guoren@kernel.org
>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@...el.com>
> ---
>   kernel/locking/qspinlock.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index efebbf19f887..e7b990b28610 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -513,7 +513,20 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>   	if ((val = pv_wait_head_or_lock(lock, node)))
>   		goto locked;
>   
> +#if _Q_PENDING_BITS == 8
> +	/*
> +	 * Spinning on the 2-byte locked_pending instead of the 4-byte qspinlock
> +	 * variable can avoid the false dependency between the tail field and
> +	 * the locked_pending field. This helps the h/w thread (the 1st spinner)
> +	 * stay in a low power state and prevents it from being woken up by other
> +	 * h/w threads in the same core when they perform xchg_tail() to update
> +	 * the tail field only.
> +	 */
> +	smp_cond_load_acquire(&lock->locked_pending, !VAL);
> +	val = atomic_read_acquire(&lock->val);
> +#else
>   	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
> +#endif
>   
>   locked:
>   	/*

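For reference, with _Q_PENDING_BITS == 8 the lock word is laid out
roughly as below (a simplified little-endian sketch of the union in
include/asm-generic/qspinlock_types.h, for NR_CPUS < 16K):

	typedef struct qspinlock {
		union {
			atomic_t val;			/* whole 4-byte lock word */
			struct {
				u8	locked;		/* lock holder byte */
				u8	pending;	/* pending-waiter byte */
			};
			struct {
				u16	locked_pending;	/* low half: locked + pending */
				u16	tail;		/* high half: MCS tail encoding */
			};
		};
	} arch_spinlock_t;

xchg_tail() writes only the upper 16 bits, so a spinner that monitors
just the 2-byte locked_pending half need not be disturbed when the
tail is updated.
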
What hardware can benefit from this change? Do you have any 
micro-benchmark that can show the performance benefit?
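
Even a quick kthread-based stress test would help here. A minimal
sketch of what I have in mind (hypothetical; the module name, worker
count, and runtime are made up, not taken from this thread):

	#include <linux/kernel.h>
	#include <linux/module.h>
	#include <linux/kthread.h>
	#include <linux/spinlock.h>
	#include <linux/ktime.h>
	#include <linux/atomic.h>
	#include <linux/delay.h>

	static DEFINE_SPINLOCK(bench_lock);
	static atomic64_t bench_ops = ATOMIC64_INIT(0);
	static struct task_struct *workers[8];	/* enough to contend across SMT siblings */

	static int bench_fn(void *unused)
	{
		/* Hammer a single lock so the MCS-queue slowpath is exercised. */
		while (!kthread_should_stop()) {
			spin_lock(&bench_lock);
			spin_unlock(&bench_lock);
			atomic64_inc(&bench_ops);
		}
		return 0;
	}

	static int __init bench_init(void)
	{
		u64 t0, ops0;
		int i;

		for (i = 0; i < ARRAY_SIZE(workers); i++)
			workers[i] = kthread_run(bench_fn, NULL, "qsl-bench/%d", i);

		ops0 = atomic64_read(&bench_ops);
		t0 = ktime_get_ns();
		msleep(5000);	/* measure for 5 seconds */
		pr_info("qsl-bench: %llu lock+unlock in %llu ns\n",
			atomic64_read(&bench_ops) - ops0, ktime_get_ns() - t0);
		return 0;
	}

	static void __exit bench_exit(void)
	{
		int i;

		for (i = 0; i < ARRAY_SIZE(workers); i++)
			kthread_stop(workers[i]);
	}

	module_init(bench_init);
	module_exit(bench_exit);
	MODULE_LICENSE("GPL");

Comparing the ops/s figure with and without the patch on the SMT
hardware you have in mind would make the claimed benefit concrete.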

Cheers,
Longman
