Message-ID: <9f5321f9-0b35-7f85-ee09-247d961e5a79@redhat.com>
Date: Wed, 14 Jun 2023 21:58:53 -0400
From: Waiman Long <longman@...hat.com>
To: Guo Hui <guohui@...ontech.com>, peterz@...radead.org,
mingo@...hat.com, will@...nel.org, boqun.feng@...il.com
Cc: linux-kernel@...r.kernel.org, wangxiaohua@...ontech.com
Subject: Re: [PATCH] locking/osq_lock: Fix false sharing of
optimistic_spin_node in osq_lock
On 6/14/23 21:24, Guo Hui wrote:
> To improve the performance of osq_lock,
> I posted a patch before:
>
> https://lore.kernel.org/lkml/20220628161251.21950-1-guohui@uniontech.com/
>
> The analysis there concluded that the memory access
> in the following code causes the performance degradation:
>
> cpu = node->cpu - 1;
>
> The instructions corresponding to the C code are:
> mov 0x14(%rax),%edi
> sub $0x1,%edi
>
> On x86, these instructions cause a high
> cache-miss rate and degrade performance.
>
> Analyzing these memory accesses further shows that
> the cache misses of the above instructions are caused
> by heavy cache-line false sharing when accessing a
> non-local-CPU variable, as follows:
>
> ---------------------------------
> | struct optimistic_spin_node |
> ---------------------------------
> | next | prev | locked | cpu |
> ---------------------------------
> | cache line |
> ---------------------------------
> | CPU0 |
> ---------------------------------
>
> When a CPU other than CPU0 reads CPU0's
> optimistic_spin_node->cpu while CPU0 is frequently
> modifying other data in the same cache line, the
> reading CPU suffers false sharing. The
> optimistic_spin_node variables are defined as
> cacheline-aligned per-CPU variables, so each one is
> bound to its corresponding CPU core:
>
> DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
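>
> For reference, the helpers in kernel/locking/osq_lock.c that map
> between a CPU number and its per-CPU node look roughly like this
> (simplified from my reading of mainline; the encoded value 0 is
> reserved to mean "no CPU holds the tail"):
>
> static inline int encode_cpu(int cpu_nr)
> {
>         return cpu_nr + 1;      /* 0 means unlocked */
> }
>
> static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> {
>         return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
> }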
>
> Therefore the value of optimistic_spin_node->cpu
> normally never changes, and the false sharing seen
> when reading it is caused by frequent modification
> of the other three fields of optimistic_spin_node.
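>
> As a rough illustration (my sketch, assuming 64-byte cache lines
> and 64-bit pointers, i.e. the common x86-64 case), all four fields
> fit easily in one cache line, so every write to next/prev/locked
> invalidates the line that holds cpu:
>
> struct optimistic_spin_node {
>         struct optimistic_spin_node *next, *prev; /* offsets 0 and 8 */
>         int locked;                               /* offset 16 */
>         int cpu;                                  /* offset 20 */
> };
> _Static_assert(sizeof(struct optimistic_spin_node) <= 64,
>                "all four fields share one 64-byte cache line");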
>
> There are two solutions as follows:
>
> Solution 1:
> Put the cpu field of optimistic_spin_node on a separate cache line.
> The patch is as follows:
>
> struct optimistic_spin_node {
> struct optimistic_spin_node *next, *prev;
> int locked; /* 1 if lock acquired */
> - int cpu; /* encoded CPU # + 1 value */
> + int cpu ____cacheline_aligned; /* encoded CPU # + 1 value */
> };
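>
> A minimal userspace sketch (my illustration, assuming 64-byte cache
> lines and the kernel's ____cacheline_aligned semantics) of what this
> change does to the layout: cpu moves to its own line and the node
> doubles in size:
>
> #include <stddef.h>
>
> struct node_v1 {
>         struct node_v1 *next, *prev;
>         int locked;
>         int cpu __attribute__((aligned(64)));  /* ____cacheline_aligned */
> };
>
> /* cpu now sits alone on the second cache line... */
> _Static_assert(offsetof(struct node_v1, cpu) == 64, "own line");
> /* ...and each per-CPU node grows from one cache line to two. */
> _Static_assert(sizeof(struct node_v1) == 128, "two lines");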
>
> Unixbench full-core performance data is as follows:
> Machine: Hygon X86, 128 cores
>                                         with patch  without patch  change
> Dhrystone 2 using register variables     194923.07       195091    -0.09%
> Double-Precision Whetstone                79885.47        79838.87  +0.06%
> Execl Throughput                           2327.17         2272.1   +2.42%
> File Copy 1024 bufsize 2000 maxblocks       742.1           687.53  +7.94%
> File Copy 256 bufsize 500 maxblocks         462.73          428.03  +8.11%
> File Copy 4096 bufsize 8000 maxblocks      1600.37         1520.53  +5.25%
> Pipe Throughput                           79815.33        79522.13  +0.37%
> Pipe-based Context Switching              28962.9         27987.8   +3.48%
> Process Creation                           3084.4          2999.1   +2.84%
> Shell Scripts 1 concurrent                11687.1         11394.67  +2.57%
> Shell Scripts 8 concurrent                10787.1         10496.17  +2.77%
> System Call Overhead                       4322.77         4322.23  +0.01%
> System Benchmarks Index Score              8079.4          7848.37  +3.0%
>
> Solution 2:
> The core idea of the osq lock is that each lock
> waiter spins on a local-CPU variable to eliminate
> cache-line bouncing, so the same approach is applied
> to the degradation above. From the point of view of
> the current CPU's optimistic_spin_node, the cpu field
> of its predecessor's optimistic_spin_node is a
> non-local-CPU variable. Caching that value in the
> current CPU's optimistic_spin_node eliminates the
> performance degradation caused by accessing it
> remotely, as follows:
>
> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> [... ...]
>
> node->prev = prev;
> node->prev_cpu = prev->cpu; --------------- A
>
> [... ...]
>
> WRITE_ONCE(next->prev, prev);
> WRITE_ONCE(prev->next, next);
> WRITE_ONCE(next->prev_cpu, prev->cpu); -------------- B
>
> [... ...]
> }
>
> static inline int node_cpu(struct optimistic_spin_node *node)
> {
> return node->prev_cpu - 1; ----------------------------- C
> }
>
> While setting the prev field of the current CPU's
> optimistic_spin_node, the patch also caches the
> predecessor's cpu value in the node's prev_cpu field,
> as at code lines A and B above. Since node is a
> per-CPU variable, each node corresponds to one CPU
> core, and that node's cpu field never changes.
> node->prev_cpu can only change when node->prev is
> updated; at all other times node->prev, and therefore
> node->prev_cpu, stays the same. This greatly reduces
> the non-local-CPU variable accesses at code line C
> and improves performance.
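>
> A minimal sketch (illustrative only) of the resulting change in the
> spin loop's access pattern:
>
> /* Before: every poll dereferences the predecessor's node, pulling
>  * in a remote cache line that its owner keeps dirtying. */
> vcpu_is_preempted(node->prev->cpu - 1);
>
> /* After: prev->cpu was cached once at enqueue time (lines A and B),
>  * so each poll only reads the spinner's own per-CPU node. */
> vcpu_is_preempted(node->prev_cpu - 1);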
>
> Unixbench full-core performance data is as follows:
> Machine: Hygon X86, 128 cores
>                                         with patch  without patch  change
> Dhrystone 2 using register variables     194818.7        195091    -0.14%
> Double-Precision Whetstone                79847.57        79838.87  +0.01%
> Execl Throughput                           2372.83         2272.1   +4.43%
> File Copy 1024 bufsize 2000 maxblocks       765             687.53 +11.27%
> File Copy 256 bufsize 500 maxblocks         472.13          428.03 +10.30%
> File Copy 4096 bufsize 8000 maxblocks      1658.13         1520.53  +9.05%
> Pipe Throughput                           79634.17        79522.13  +0.14%
> Pipe-based Context Switching              28584.7         27987.8   +2.13%
> Process Creation                           3020.27         2999.1   +0.71%
> Shell Scripts 1 concurrent                11890.87        11394.67  +4.35%
> Shell Scripts 8 concurrent                10912.9         10496.17  +3.97%
> System Call Overhead                       4320.63         4322.23  -0.04%
> System Benchmarks Index Score              8144.43         7848.37  +4.0%
>
> In summary, solution 2 performs better than
> solution 1, especially in the execl, file copy,
> shell1 and shell8 test cases. Solution 1 still
> allows remote memory accesses across NUMA nodes,
> while solution 2 accesses only local-CPU variables
> in the spin loop, so solution 2 is preferred.
>
> Both solutions also show a large improvement in x86 virtual machines.
>
> This patch implements solution 2.
>
> Signed-off-by: Guo Hui <guohui@...ontech.com>
> ---
> include/linux/osq_lock.h | 1 +
> kernel/locking/osq_lock.c | 6 ++++--
> 2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index 5581dbd3bd34..8a1bb36f4a07 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -10,6 +10,7 @@ struct optimistic_spin_node {
> struct optimistic_spin_node *next, *prev;
> int locked; /* 1 if lock acquired */
> int cpu; /* encoded CPU # + 1 value */
> + int prev_cpu; /* Only for optimizing false sharing */
> };
>
> struct optimistic_spin_queue {
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad52b92..bdcd216b73c4 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -24,7 +24,7 @@ static inline int encode_cpu(int cpu_nr)
>
> static inline int node_cpu(struct optimistic_spin_node *node)
> {
> - return node->cpu - 1;
> + return node->prev_cpu - 1;
> }
>
At first glance, the patch looks OK. The only issue that I have
so far is the "node_cpu" name, which is misleading in this case. You
should rename it to node_prev_cpu() instead, as there is only one
caller.
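
A minimal sketch of that rename (illustrative, not part of the
posted patch):

static inline int node_prev_cpu(struct optimistic_spin_node *node)
{
        /* Decode the cached "CPU # + 1" of the predecessor node. */
        return node->prev_cpu - 1;
}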
I will take a closer look tomorrow.
Cheers,
Longman
> static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> @@ -110,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>
> prev = decode_cpu(old);
> node->prev = prev;
> + node->prev_cpu = prev->cpu;
>
> /*
> * osq_lock() unqueue
> @@ -141,7 +142,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * polling, be careful.
> */
> if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> - vcpu_is_preempted(node_cpu(node->prev))))
> + vcpu_is_preempted(node_cpu(node))))
> return true;
>
> /* unqueue */
> @@ -200,6 +201,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>
> WRITE_ONCE(next->prev, prev);
> WRITE_ONCE(prev->next, next);
> + WRITE_ONCE(next->prev_cpu, prev->cpu);
>
> return false;
> }