[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f6b68466-968c-4a91-655a-23970280a072@redhat.com>
Date: Thu, 23 Jun 2022 17:54:07 -0400
From: Waiman Long <longman@...hat.com>
To: Guo Hui <guohui@...ontech.com>, peterz@...radead.org
Cc: jpoimboe@...nel.org, song@...nel.org, tglx@...utronix.de,
mingo@...hat.com, bp@...en8.de, dave.hansen@...ux.intel.com,
x86@...nel.org, hpa@...or.com, daniel@...earbox.net,
will@...nel.org, boqun.feng@...il.com, wangxiaohua@...ontech.com,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86/paravirt: useless assignment instructions cause
Unixbench full core performance degradation
On 6/23/22 11:50, Guo Hui wrote:
> The instructions assigned to the parameters of
> the vcpu_is_preempted function in the X86 architecture
> physical machine are redundant instructions, which cause the
> multi-core performance of Unixbench to drop by
> about 300 to 500 points. The C function is as follows:
> static bool vcpu_is_preempted(long vcpu);
>
> The parameter assignments in the function osq_lock
> that call the function vcpu_is_preempted are as follows:
> mov 0x14(%rax),%edi
> sub $0x1,%edi
>
> The above instructions are unnecessary
> in the X86 Native operating environment,
> causing high cache-misses and degrading performance.
>
> This patch implements the replacement of the above instructions with
> the nop instruction at system startup.
> When the assignment C code changes,
> the above assignment instructions may also change accordingly.
> In order to flexibly replace the instructions generated by the C code,
> this patch defines the macro ALTERNATIVE_C_CODE
> from the ALTERNATIVE macro. Different from the first parameter
> of the ALTERNATIVE macro, the ALTERNATIVE_C_CODE macro is
> the first parameter is a C code statement and cannot contain
> a return statement. To use the macro, for example,
> replace the instructions generated by the
> C code 'cpu = node->cpu - 1;' with the nop instruction:
>
> OSQ_ALTERNATIVE_C_CODE(cpu = node->cpu - 1, "nop", 0);
>
> The patch effect is as follows two machines,
> Unixbench runs with full core score:
>
> 1. Machine configuration:
> Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
> CPU core: 40
> Memory: 256G
> OS Kernel: 5.19-rc3
>
> Before using the patch:
>
> System Benchmarks Index Values BASELINE RESULT INDEX
> Dhrystone 2 using register variables 116700.0 944608003.9 80943.3
> Double-Precision Whetstone 55.0 212045.8 38553.8
> Execl Throughput 43.0 42772.3 9947.1
> File Copy 1024 bufsize 2000 maxblocks 3960.0 462656.6 1168.3
> File Copy 256 bufsize 500 maxblocks 1655.0 120043.6 725.3
> File Copy 4096 bufsize 8000 maxblocks 5800.0 1544525.5 2663.0
> Pipe Throughput 12440.0 47277698.5 38004.6
> Pipe-based Context Switching 4000.0 1894556.8 4736.4
> Process Creation 126.0 86077.0 6831.5
> Shell Scripts (1 concurrent) 42.4 70236.3 16565.2
> Shell Scripts (8 concurrent) 6.0 8978.1 14963.4
> System Call Overhead 15000.0 4691260.0 3127.5
> ========
> System Benchmarks Index Score 7980.9
>
> After using the patch:
>
> System Benchmarks Index Values BASELINE RESULT INDEX
> Dhrystone 2 using register variables 116700.0 2253984916.9 193143.5
> Double-Precision Whetstone 55.0 438940.3 79807.3
> Execl Throughput 43.0 10720.3 2493.1
> File Copy 1024 bufsize 2000 maxblocks 3960.0 312233.0 788.5
> File Copy 256 bufsize 500 maxblocks 1655.0 80050.9 483.7
> File Copy 4096 bufsize 8000 maxblocks 5800.0 1036101.7 1786.4
> Pipe Throughput 12440.0 117700315.3 94614.4
> Pipe-based Context Switching 4000.0 8421909.8 21054.8
> Process Creation 126.0 36742.0 2916.0
> Shell Scripts (1 concurrent) 42.4 52846.2 12463.7
> Shell Scripts (8 concurrent) 6.0 7058.1 11763.6
> System Call Overhead 15000.0 6791548.2 4527.7
> ========
> System Benchmarks Index Score 8260.6
>
> 2. Machine configuration:
> Hygon C86 7185 32-core Processor
> CPU core: 128
> Memory: 256G
> OS Kernel: 5.19-rc3
>
> Before using the patch:
>
> System Benchmarks Index Values BASELINE RESULT INDEX
> Dhrystone 2 using register variables 116700.0 2256283941.6 193340.5
> Double-Precision Whetstone 55.0 439577.3 79923.2
> Execl Throughput 43.0 10013.6 2328.7
> File Copy 1024 bufsize 2000 maxblocks 3960.0 278121.5 702.3
> File Copy 256 bufsize 500 maxblocks 1655.0 71835.5 434.1
> File Copy 4096 bufsize 8000 maxblocks 5800.0 905654.2 1561.5
> Pipe Throughput 12440.0 117715166.2 94626.3
> Pipe-based Context Switching 4000.0 7731331.7 19328.3
> Process Creation 126.0 30157.8 2393.5
> Shell Scripts (1 concurrent) 42.4 48670.8 11479.0
> Shell Scripts (8 concurrent) 6.0 6595.6 10992.7
> System Call Overhead 15000.0 6766475.9 4511.0
> ========
> System Benchmarks Index Score 7688.7
>
> After using the patch:
>
> System Benchmarks Index Values BASELINE RESULT INDEX
> Dhrystone 2 using register variables 116700.0 2253984916.9 193143.5
> Double-Precision Whetstone 55.0 438940.3 79807.3
> Execl Throughput 43.0 10720.3 2493.1
> File Copy 1024 bufsize 2000 maxblocks 3960.0 312233.0 788.5
> File Copy 256 bufsize 500 maxblocks 1655.0 80050.9 483.7
> File Copy 4096 bufsize 8000 maxblocks 5800.0 1036101.7 1786.4
> Pipe Throughput 12440.0 117700315.3 94614.4
> Pipe-based Context Switching 4000.0 8421909.8 21054.8
> Process Creation 126.0 36742.0 2916.0
> Shell Scripts (1 concurrent) 42.4 52846.2 12463.7
> Shell Scripts (8 concurrent) 6.0 7058.1 11763.6
> System Call Overhead 15000.0 6791548.2 4527.7
> ========
> System Benchmarks Index Score 8260.6
>
> Signed-off-by: Guo Hui <guohui@...ontech.com>
> ---
> arch/x86/include/asm/alternative.h | 15 +++++++++++++++
> arch/x86/kernel/alternative.c | 11 +++++++++++
> include/linux/osq_lock.h | 7 +++++++
> kernel/locking/osq_lock.c | 6 +++++-
> 4 files changed, 38 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
> index 9b10c8c76..5979ebe89 100644
> --- a/arch/x86/include/asm/alternative.h
> +++ b/arch/x86/include/asm/alternative.h
> @@ -167,6 +167,21 @@ static inline int alternatives_text_reserved(void *start, void *end)
> ALTINSTR_REPLACEMENT(newinstr, 1) \
> ".popsection\n"
>
> +/* alternative c code primitive: */
> +#define ALTERNATIVE_C_CODE(oldinstr, newinstr, feature) \
> +do { \
> + asm volatile("661:\n\t"); \
> + oldinstr; \
> + asm volatile("\n662:\n" \
> + alt_end_marker ":\n" \
> + ".pushsection .altinstructions,\"a\"\n" \
> + ALTINSTR_ENTRY(feature, 1) \
> + ".popsection\n" \
> + ".pushsection .altinstr_replacement, \"ax\"\n" \
> + ALTINSTR_REPLACEMENT(newinstr, 1) \
> + ".popsection\n"); \
> +} while (0)
> +
> #define ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2)\
> OLDINSTR_2(oldinstr, 1, 2) \
> ".pushsection .altinstructions,\"a\"\n" \
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index e257f6c80..cf77be884 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -251,6 +251,8 @@ static void __init_or_module noinline optimize_nops(u8 *instr, size_t len)
> }
> }
>
> +extern bool pv_is_native_vcpu_is_preempted(void);
> +
> /*
> * Replace instructions with better alternatives for this CPU type. This runs
> * before SMP is initialized to avoid SMP problems with self modifying code.
> @@ -285,6 +287,15 @@ void __init_or_module noinline apply_alternatives(struct alt_instr *start,
>
> instr = (u8 *)&a->instr_offset + a->instr_offset;
> replacement = (u8 *)&a->repl_offset + a->repl_offset;
> +
> + if (*replacement == 0x90 && a->replacementlen == 1) {
> +#if defined(CONFIG_PARAVIRT_SPINLOCKS)
> + if (pv_is_native_vcpu_is_preempted())
> + add_nops(instr, a->instrlen);
> +#endif
> + continue;
> + }
> +
This is hacky and it may incorrectly affect other alternatives that
patches thing to nop.
> BUG_ON(a->instrlen > sizeof(insn_buff));
> BUG_ON(feature >= (NCAPINTS + NBUGINTS) * 32);
>
> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index 5581dbd3b..ee960e3aa 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -38,4 +38,11 @@ static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
> return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
> }
>
> +#ifdef ALTERNATIVE_C_CODE
> +#define OSQ_ALTERNATIVE_C_CODE(oldinstr, newinstr, feature) \
> + ALTERNATIVE_C_CODE(oldinstr, newinstr, feature)
> +#else
> +#define OSQ_ALTERNATIVE_C_CODE(oldinstr, newinstr, feature) oldinstr
> +#endif
> +
> #endif
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad52..bbe7d640c 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -24,7 +24,11 @@ static inline int encode_cpu(int cpu_nr)
>
> static inline int node_cpu(struct optimistic_spin_node *node)
> {
> - return node->cpu - 1;
> + int cpu = 0;
> +
> + OSQ_ALTERNATIVE_C_CODE(cpu = node->cpu - 1, "nop", 0);
> +
> + return cpu;
> }
Why don't you use static_key to control the alternative and add a
late_initcall() to set it. I think that will be simpler and you can
throw away ALTERNATIVE_C_CODE().
Cheers,
Longman
Powered by blists - more mailing lists