linux-kernel - Re: [PATCH RESEND] x86/paravirt: add backoff mechanism to virt_spin

Message-ID: <31f27919-926f-4cbd-81fa-5a52c453feca@intel.com>
Date: Wed, 16 Jul 2025 17:25:43 +0800
From: "Guo, Wangyang" <wangyang.guo@...el.com>
To: Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
 Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
 x86@...nel.org, "H. Peter Anvin" <hpa@...or.com>,
 linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
 Paolo Bonzini <pbonzini@...hat.com>, Vitaly Kuznetsov <vkuznets@...hat.com>,
 Sean Christopherson <seanjc@...gle.com>
Cc: Tianyou Li <tianyou.li@...el.com>, Tim Chen <tim.c.chen@...ux.intel.com>
Subject: Re: [PATCH RESEND] x86/paravirt: add backoff mechanism to
 virt_spin_lock

Any comments or suggestions on this patch? Are any further updates
or changes needed?

BR
Wangyang

On 7/3/2025 10:23 AM, Wangyang Guo wrote:
> When multiple threads wait for the lock at the same time, once the lock
> owner releases it, all waiters see the lock become available and try to
> acquire it at once, which may cause an expensive CAS storm.
> 
> Introduce binary exponential backoff. As the number of failed try-lock
> attempts grows, it is more likely that a larger number of threads are
> competing for the same lock, so the wait time is increased exponentially.
> 
> The optimization improves the SPEC CPU 2017 502.gcc_r benchmark by ~4% on
> a 288-core VM on the Intel Xeon 6 E-cores platform.
> 
> In the micro-benchmark, the patch gives a significant performance gain in
> high-contention cases. A slight regression is seen in some mid-contention
> cases because the last waiter might take longer to notice that the lock
> is free. As expected, low-contention scenarios are unchanged.
> 
> Micro Bench: https://github.com/guowangy/kernel-lock-bench
> Test Platform: Xeon 8380L
> First Row: critical section length
> First Col: number of CPU cores
> Values: throughput ratio, backoff vs linux-6.15; higher is better
> 
> non-critical-length: 1
>         0     1     2     4     8    16    32    64   128
> 1   1.01  1.00  1.00  1.00  1.01  1.01  1.01  1.01  1.00
> 2   1.02  1.01  1.02  0.97  1.02  1.05  1.01  1.00  1.01
> 4   1.15  1.20  1.14  1.11  1.34  1.26  0.99  0.93  0.98
> 8   1.59  1.71  1.18  1.80  1.95  1.45  1.05  0.99  1.17
> 16  1.04  1.37  1.08  1.31  1.85  1.50  1.24  0.99  1.24
> 32  1.24  1.36  1.23  1.40  1.50  1.86  1.45  1.18  1.48
> 64  1.12  1.24  1.11  1.31  1.34  1.37  2.01  1.60  1.43
> 
> non-critical-length: 32
>         0     1     2     4     8    16    32    64   128
> 1   1.00  1.00  1.00  1.00  1.00  0.99  1.00  1.00  1.01
> 2   1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.99  1.00
> 4   1.12  1.25  1.09  1.07  1.12  1.16  1.13  1.16  1.09
> 8   1.02  1.16  1.03  1.02  1.04  1.07  1.04  0.99  0.98
> 16  0.97  0.95  0.84  0.96  0.99  0.98  0.98  1.01  1.03
> 32  1.05  1.03  0.87  1.05  1.25  1.16  1.25  1.30  1.27
> 64  1.83  1.10  1.07  1.02  1.19  1.18  1.21  1.14  1.13
> 
> non-critical-length: 128
>         0     1     2     4     8    16    32    64   128
> 1   1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
> 2   0.99  1.02  1.00  1.00  1.00  1.00  1.00  1.00  1.00
> 4   0.98  0.99  1.00  1.00  0.99  1.04  0.99  0.99  1.02
> 8   1.08  1.08  1.08  1.07  1.15  1.12  1.03  0.94  1.00
> 16  1.00  1.00  1.00  1.01  1.01  1.01  1.36  1.06  1.02
> 32  1.07  1.08  1.07  1.07  1.09  1.10  1.22  1.36  1.25
> 64  1.03  1.04  1.04  1.06  1.13  1.18  0.82  1.02  1.14
> 
> Reviewed-by: Tianyou Li <tianyou.li@...el.com>
> Reviewed-by: Tim Chen <tim.c.chen@...ux.intel.com>
> Signed-off-by: Wangyang Guo <wangyang.guo@...el.com>
> ---
>   arch/x86/include/asm/qspinlock.h | 28 +++++++++++++++++++++++++---
>   1 file changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
> index 68da67df304d..ac6e1bbd9ba4 100644
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -87,7 +87,7 @@ DECLARE_STATIC_KEY_FALSE(virt_spin_lock_key);
>   #define virt_spin_lock virt_spin_lock
>   static inline bool virt_spin_lock(struct qspinlock *lock)
>   {
> -	int val;
> +	int val, locked;
>   
>   	if (!static_branch_likely(&virt_spin_lock_key))
>   		return false;
> @@ -98,11 +98,33 @@ static inline bool virt_spin_lock(struct qspinlock *lock)
>   	 * horrible lock 'holder' preemption issues.
>   	 */
>   
> +#define MAX_BACKOFF 64
> +	int backoff = 1;
> +
>    __retry:
>   	val = atomic_read(&lock->val);
> +	locked = val;
> +
> +	if (locked || !atomic_try_cmpxchg(&lock->val, &val, _Q_LOCKED_VAL)) {
> +		int spin_count = backoff;
> +
> +		while (spin_count--)
> +			cpu_relax();
> +
> +		/*
> +		 * Here, !locked means the lock was tried but the cmpxchg failed.
> +		 *
> +		 * When multiple threads wait for the lock at the same time,
> +		 * once the owner releases it, all waiters see it available
> +		 * and try to take it, which may cause an expensive CAS storm.
> +		 *
> +		 * Use binary exponential backoff: as failed try-lock attempts
> +		 * accumulate, it is more likely that many threads compete for
> +		 * the same lock, so increase the wait time exponentially.
> +		 */
> +		if (!locked)
> +			backoff = (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
>   
> -	if (val || !atomic_try_cmpxchg(&lock->val, &val, _Q_LOCKED_VAL)) {
> -		cpu_relax();
>   		goto __retry;
>   	}
>
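
For readers who want to try the idea outside the kernel, the retry loop
quoted above can be approximated in userspace with C11 atomics. This is
only a rough sketch, not the patch itself: struct demo_lock, demo_relax(),
demo_next_backoff() and DEMO_MAX_BACKOFF are hypothetical names made up for
illustration, the lock word is a plain 0/1 flag rather than a qspinlock,
and demo_relax() merely stands in for cpu_relax().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_MAX_BACKOFF 64	/* assumption: same cap as MAX_BACKOFF in the patch */

/* Hypothetical lock type: 0 = free, 1 = held. */
struct demo_lock {
	atomic_int val;
};

/* Stand-in for the kernel's cpu_relax(); not a kernel API. */
static inline void demo_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#else
	atomic_signal_fence(memory_order_seq_cst);
#endif
}

/* Double the backoff until the cap is reached, then hold it there. */
static inline int demo_next_backoff(int backoff)
{
	return backoff < DEMO_MAX_BACKOFF ? backoff << 1 : backoff;
}

static inline void demo_lock_acquire(struct demo_lock *l)
{
	int backoff = 1;

	for (;;) {
		int val = atomic_load_explicit(&l->val, memory_order_relaxed);
		bool locked = val != 0;

		/* Test first, and only attempt the CAS when the lock looks free. */
		if (!locked &&
		    atomic_compare_exchange_strong_explicit(&l->val, &val, 1,
							    memory_order_acquire,
							    memory_order_relaxed))
			return;

		/* Wait for the current backoff window before retrying. */
		for (int spin = backoff; spin > 0; spin--)
			demo_relax();

		/* Only a lost CAS race (lock looked free) widens the window. */
		if (!locked)
			backoff = demo_next_backoff(backoff);
	}
}

static inline void demo_lock_release(struct demo_lock *l)
{
	/* Debug check: releasing a lock that is not held is a caller bug. */
	assert(atomic_load_explicit(&l->val, memory_order_relaxed) == 1);
	atomic_store_explicit(&l->val, 0, memory_order_release);
}
```

As in the patch, seeing the lock held does not grow the window; only a
failed compare-and-swap does, since that is the signal that several waiters
raced for a lock that had just been released.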