linux-kernel - Re: [PATCH v2] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <YrlhGqqce0NCQ6hi@hirez.programming.kicks-ass.net>
Date:   Mon, 27 Jun 2022 09:49:46 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Guo Hui <guohui@...ontech.com>
Cc:     longman@...hat.com, jgross@...e.com, srivatsa@...il.mit.edu,
        amakhalov@...are.com, pv-drivers@...are.com, tglx@...utronix.de,
        mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com,
        x86@...nel.org, hpa@...or.com, will@...nel.org,
        boqun.feng@...il.com, virtualization@...ts.linux-foundation.org,
        wangxiaohua@...ontech.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] x86/paravirt: useless assignment instructions cause
 Unixbench full core   performance degradation

On Mon, Jun 27, 2022 at 10:13:50AM +0800, Guo Hui wrote:
> The instructions assigned to the vcpu_is_preempted function parameter
> in the X86 architecture physical machine are redundant instructions,
> causing the multi-core performance of Unixbench to drop by about 4% to 5%.
> The C function is as follows:
> static bool vcpu_is_preempted(long vcpu);
> 
> The parameter 'vcpu' in the function osq_lock
> that calls the function vcpu_is_preempted is assigned as follows:
> 
> The C code is in the function node_cpu:
> cpu = node->cpu - 1;
> 
> The instructions corresponding to the C code are:
> mov 0x14(%rax),%edi
> sub $0x1,%edi
> 
> The above instructions are unnecessary
> in the X86 Native operating environment,
> causing high cache-misses and degrading performance.

The above basically says that argument setup is not patched out and
causes significant pain due to a cache-miss.

> Signed-off-by: Guo Hui <guohui@...ontech.com>
> ---
>  arch/x86/kernel/paravirt-spinlocks.c | 4 ++++
>  kernel/locking/osq_lock.c            | 9 ++++++++-
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
> index 9e1ea99ad..7a55f8407 100644
> --- a/arch/x86/kernel/paravirt-spinlocks.c
> +++ b/arch/x86/kernel/paravirt-spinlocks.c
> @@ -33,6 +33,8 @@ bool pv_is_native_vcpu_is_preempted(void)
>  		__raw_callee_save___native_vcpu_is_preempted;
>  }
>  
> +DECLARE_STATIC_KEY_FALSE(preemted_key);
> +
>  void __init paravirt_set_cap(void)
>  {
>  	if (!pv_is_native_spin_unlock())
> @@ -40,4 +42,6 @@ void __init paravirt_set_cap(void)
>  
>  	if (!pv_is_native_vcpu_is_preempted())
>  		setup_force_cpu_cap(X86_FEATURE_VCPUPREEMPT);
> +	else
> +		static_branch_enable(&preemted_key);
>  }

At least for x86 it makes sense to have the static_key default the other
way around. That is, enable it along with vcpu_is_preempted().

> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad52..a8798e701 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -22,9 +22,16 @@ static inline int encode_cpu(int cpu_nr)
>  	return cpu_nr + 1;
>  }
>  
> +DEFINE_STATIC_KEY_FALSE(preemted_key);
> +
>  static inline int node_cpu(struct optimistic_spin_node *node)
>  {
> -	return node->cpu - 1;
> +	int cpu = 0;
> +
> +	if (!static_branch_unlikely(&preemted_key))
> +		cpu = node->cpu - 1;
> +
> +	return cpu;
>  }

Would not something like:

static inline bool
vcpu_is_preempted_node(struct optimistic_spin_node *node)
{
	if (!static_branch_unlikely(&vcpu_has_preemption))
		return false;

	return vcpu_is_preempted(node_cpu(node->prev));
}

And then use that like:

	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
				  vcpu_is_preempted_node(node)))

Not generate better code still?