lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 23 Jun 2022 17:54:07 -0400
From:   Waiman Long <longman@...hat.com>
To:     Guo Hui <guohui@...ontech.com>, peterz@...radead.org
Cc:     jpoimboe@...nel.org, song@...nel.org, tglx@...utronix.de,
        mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com,
        x86@...nel.org, hpa@...or.com, daniel@...earbox.net,
        will@...nel.org, boqun.feng@...il.com, wangxiaohua@...ontech.com,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86/paravirt: useless assignment instructions cause
 Unixbench full core performance degradation

On 6/23/22 11:50, Guo Hui wrote:
> The instructions assigned to the parameters of
> the vcpu_is_preempted function in the X86 architecture
> physical machine are redundant instructions, which cause the
> multi-core performance of Unixbench to drop by
> about 300 to 500 points. The C function is as follows:
> static bool vcpu_is_preempted(long vcpu);
>
> The parameter assignments in the function osq_lock
> that call the function vcpu_is_preempted are as follows:
> mov 0x14(%rax),%edi
> sub $0x1,%edi
>
> The above instructions are unnecessary
> in the X86 Native operating environment,
> causing high cache-misses and degrading performance.
>
> This patch implements the replacement of the above instructions with
> the nop instruction at system startup.
> When the assignment C code changes,
> the above assignment instructions may also change accordingly.
> In order to flexibly replace the instructions generated by the C code,
> this patch defines the macro ALTERNATIVE_C_CODE
> from the ALTERNATIVE macro. Different from the first parameter
> of the ALTERNATIVE macro, the ALTERNATIVE_C_CODE macro is
> the first parameter is a C code statement and cannot contain
> a return statement. To use the macro, for example,
> replace the instructions generated by the
> C code 'cpu = node->cpu - 1;' with the nop instruction:
>
> OSQ_ALTERNATIVE_C_CODE(cpu = node->cpu - 1, "nop", 0);
>
> The patch effect is as follows two machines,
> Unixbench runs with full core score:
>
> 1. Machine configuration:
> Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
> CPU core: 40
> Memory: 256G
> OS Kernel: 5.19-rc3
>
> Before using the patch:
>
> System Benchmarks Index Values               BASELINE       RESULT    INDEX
> Dhrystone 2 using register variables         116700.0  944608003.9  80943.3
> Double-Precision Whetstone                       55.0     212045.8  38553.8
> Execl Throughput                                 43.0      42772.3   9947.1
> File Copy 1024 bufsize 2000 maxblocks          3960.0     462656.6   1168.3
> File Copy 256 bufsize 500 maxblocks            1655.0     120043.6    725.3
> File Copy 4096 bufsize 8000 maxblocks          5800.0    1544525.5   2663.0
> Pipe Throughput                               12440.0   47277698.5  38004.6
> Pipe-based Context Switching                   4000.0    1894556.8   4736.4
> Process Creation                                126.0      86077.0   6831.5
> Shell Scripts (1 concurrent)                     42.4      70236.3  16565.2
> Shell Scripts (8 concurrent)                      6.0       8978.1  14963.4
> System Call Overhead                          15000.0    4691260.0   3127.5
>                                                                     ========
> System Benchmarks Index Score                                        7980.9
>
> After using the patch:
>
> System Benchmarks Index Values               BASELINE       RESULT    INDEX
> Dhrystone 2 using register variables         116700.0 2253984916.9 193143.5
> Double-Precision Whetstone                       55.0     438940.3  79807.3
> Execl Throughput                                 43.0      10720.3   2493.1
> File Copy 1024 bufsize 2000 maxblocks          3960.0     312233.0    788.5
> File Copy 256 bufsize 500 maxblocks            1655.0      80050.9    483.7
> File Copy 4096 bufsize 8000 maxblocks          5800.0    1036101.7   1786.4
> Pipe Throughput                               12440.0  117700315.3  94614.4
> Pipe-based Context Switching                   4000.0    8421909.8  21054.8
> Process Creation                                126.0      36742.0   2916.0
> Shell Scripts (1 concurrent)                     42.4      52846.2  12463.7
> Shell Scripts (8 concurrent)                      6.0       7058.1  11763.6
> System Call Overhead                          15000.0    6791548.2   4527.7
>                                                                     ========
> System Benchmarks Index Score                                        8260.6
>
> 2. Machine configuration:
> Hygon C86 7185 32-core Processor
> CPU core: 128
> Memory: 256G
> OS Kernel: 5.19-rc3
>
> Before using the patch:
>
> System Benchmarks Index Values               BASELINE       RESULT    INDEX
> Dhrystone 2 using register variables         116700.0 2256283941.6 193340.5
> Double-Precision Whetstone                       55.0     439577.3  79923.2
> Execl Throughput                                 43.0      10013.6   2328.7
> File Copy 1024 bufsize 2000 maxblocks          3960.0     278121.5    702.3
> File Copy 256 bufsize 500 maxblocks            1655.0      71835.5    434.1
> File Copy 4096 bufsize 8000 maxblocks          5800.0     905654.2   1561.5
> Pipe Throughput                               12440.0  117715166.2  94626.3
> Pipe-based Context Switching                   4000.0    7731331.7  19328.3
> Process Creation                                126.0      30157.8   2393.5
> Shell Scripts (1 concurrent)                     42.4      48670.8  11479.0
> Shell Scripts (8 concurrent)                      6.0       6595.6  10992.7
> System Call Overhead                          15000.0    6766475.9   4511.0
>                                                                     ========
> System Benchmarks Index Score                                        7688.7
>
> After using the patch:
>
> System Benchmarks Index Values               BASELINE       RESULT    INDEX
> Dhrystone 2 using register variables         116700.0 2253984916.9 193143.5
> Double-Precision Whetstone                       55.0     438940.3  79807.3
> Execl Throughput                                 43.0      10720.3   2493.1
> File Copy 1024 bufsize 2000 maxblocks          3960.0     312233.0    788.5
> File Copy 256 bufsize 500 maxblocks            1655.0      80050.9    483.7
> File Copy 4096 bufsize 8000 maxblocks          5800.0    1036101.7   1786.4
> Pipe Throughput                               12440.0  117700315.3  94614.4
> Pipe-based Context Switching                   4000.0    8421909.8  21054.8
> Process Creation                                126.0      36742.0   2916.0
> Shell Scripts (1 concurrent)                     42.4      52846.2  12463.7
> Shell Scripts (8 concurrent)                      6.0       7058.1  11763.6
> System Call Overhead                          15000.0    6791548.2   4527.7
>                                                                     ========
> System Benchmarks Index Score                                        8260.6
>
> Signed-off-by: Guo Hui <guohui@...ontech.com>
> ---
>   arch/x86/include/asm/alternative.h | 15 +++++++++++++++
>   arch/x86/kernel/alternative.c      | 11 +++++++++++
>   include/linux/osq_lock.h           |  7 +++++++
>   kernel/locking/osq_lock.c          |  6 +++++-
>   4 files changed, 38 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
> index 9b10c8c76..5979ebe89 100644
> --- a/arch/x86/include/asm/alternative.h
> +++ b/arch/x86/include/asm/alternative.h
> @@ -167,6 +167,21 @@ static inline int alternatives_text_reserved(void *start, void *end)
>   	ALTINSTR_REPLACEMENT(newinstr, 1)				\
>   	".popsection\n"
>   
> +/* alternative c code primitive: */
> +#define ALTERNATIVE_C_CODE(oldinstr, newinstr, feature)			\
> +do {									\
> +	asm volatile("661:\n\t");					\
> +	oldinstr;							\
> +	asm volatile("\n662:\n"						\
> +		alt_end_marker ":\n"					\
> +		".pushsection .altinstructions,\"a\"\n"			\
> +		ALTINSTR_ENTRY(feature, 1)				\
> +		".popsection\n"						\
> +		".pushsection .altinstr_replacement, \"ax\"\n"		\
> +		ALTINSTR_REPLACEMENT(newinstr, 1)			\
> +		".popsection\n");					\
> +} while (0)
> +
>   #define ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2)\
>   	OLDINSTR_2(oldinstr, 1, 2)					\
>   	".pushsection .altinstructions,\"a\"\n"				\
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index e257f6c80..cf77be884 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -251,6 +251,8 @@ static void __init_or_module noinline optimize_nops(u8 *instr, size_t len)
>   	}
>   }
>   
> +extern bool pv_is_native_vcpu_is_preempted(void);
> +
>   /*
>    * Replace instructions with better alternatives for this CPU type. This runs
>    * before SMP is initialized to avoid SMP problems with self modifying code.
> @@ -285,6 +287,15 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
>   
>   		instr = (u8 *)&a->instr_offset + a->instr_offset;
>   		replacement = (u8 *)&a->repl_offset + a->repl_offset;
> +
> +		if (*replacement == 0x90 && a->replacementlen == 1) {
> +#if defined(CONFIG_PARAVIRT_SPINLOCKS)
> +			if (pv_is_native_vcpu_is_preempted())
> +				add_nops(instr, a->instrlen);
> +#endif
> +			continue;
> +		}
> +
This is hacky and it may incorrectly affect other alternatives that 
patches thing to nop.
>   		BUG_ON(a->instrlen > sizeof(insn_buff));
>   		BUG_ON(feature >= (NCAPINTS + NBUGINTS) * 32);
>   
> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index 5581dbd3b..ee960e3aa 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -38,4 +38,11 @@ static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
>   	return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
>   }
>   
> +#ifdef ALTERNATIVE_C_CODE
> +#define OSQ_ALTERNATIVE_C_CODE(oldinstr, newinstr, feature) \
> +		ALTERNATIVE_C_CODE(oldinstr, newinstr, feature)
> +#else
> +#define OSQ_ALTERNATIVE_C_CODE(oldinstr, newinstr, feature) oldinstr
> +#endif
> +
>   #endif
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad52..bbe7d640c 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -24,7 +24,11 @@ static inline int encode_cpu(int cpu_nr)
>   
>   static inline int node_cpu(struct optimistic_spin_node *node)
>   {
> -	return node->cpu - 1;
> +	int cpu = 0;
> +
> +	OSQ_ALTERNATIVE_C_CODE(cpu = node->cpu - 1, "nop", 0);
> +
> +	return cpu;
>   }

Why don't you use static_key to control the alternative and add a 
late_initcall() to set it. I think that will be simpler and you can 
throw away ALTERNATIVE_C_CODE().

Cheers,
Longman

Powered by blists - more mailing lists