Message-ID: <ef3e34be-a30a-402b-a806-aa6ef29b410d@redhat.com>
Date: Wed, 20 Dec 2023 11:58:25 -0500
From: Waiman Long <longman@...hat.com>
To: Zeng Heng <zengheng4@...wei.com>, boqun.feng@...il.com,
 peterz@...radead.org, mingo@...hat.com, will@...nel.org, guohui@...ontech.com
Cc: linux-kernel@...r.kernel.org, xiexiuqi@...wei.com, liwei391@...wei.com
Subject: Re: [PATCH] locking/osq_lock: Avoid false sharing in
 optimistic_spin_node


On 12/20/23 02:02, Zeng Heng wrote:
> Using the UnixBench test suite, perf shows that osq_lock() causes
> extremely high overhead in the File Copy items:
>
> Overhead  Shared Object            Symbol
>    94.25%  [kernel]                 [k] osq_lock
>     0.74%  [kernel]                 [k] rwsem_spin_on_owner
>     0.32%  [kernel]                 [k] filemap_get_read_batch
>
> We analyzed this and found the following:
>
> In the prologue of osq_lock(), the `cpu` member of the percpu struct
> optimistic_spin_node is set to the local cpu id, and after that its
> value never changes. Based on that, we can regard the `cpu` member as
> effectively constant.
>
> Meanwhile, the other members of the percpu struct (next, prev and
> locked) are frequently modified by osq_lock() and osq_unlock(), which
> are called by rwsem, mutex and so on. Those writes invalidate the cache
> line holding the cpu member on other CPUs.
>
> Therefore, we can add padding to split them into different cache lines,
> avoiding cache misses when the next CPU spins and checks the other
> node's cpu member via vcpu_is_preempted().
>
> The UnixBench full-core test results based on v6.6 are below:
> Machine: Intel(R) Xeon(R) Gold 6248 CPU, 40 cores, 80 threads
> The command "./Run -c 80 -i 3" was run over 20 times and the results averaged.
>
> System Benchmarks Index Values           Without Patch   With Patch     Diff
> Dhrystone 2 using register variables         185518.38    185329.56   -0.10%
> Double-Precision Whetstone                    79330.46     79268.22   -0.08%
> Execl Throughput                               9725.14     10390.18    6.84%
> File Copy 1024 bufsize 2000 maxblocks          1658.42      2035.55   22.74%
> File Copy 256 bufsize 500 maxblocks            1086.54      1316.96   21.21%
> File Copy 4096 bufsize 8000 maxblocks          3610.42      4152.79   15.02%
> Pipe Throughput                               69325.18     69913.85    0.85%
> Pipe-based Context Switching                  14026.32     14703.07    4.82%
> Process Creation                               8329.94      8355.31    0.30%
> Shell Scripts (1 concurrent)                  38942.41     41518.39    6.61%
> Shell Scripts (8 concurrent)                  37762.35     40224.49    6.52%
> System Call Overhead                           4064.44      4004.45   -1.48%
>                                                                      ========
> System Benchmarks Index Score                 13634.17     14560.71    6.80%
>
> Signed-off-by: Zeng Heng <zengheng4@...wei.com>
> ---
>   include/linux/osq_lock.h  | 6 +++++-
>   kernel/locking/osq_lock.c | 8 +++++++-
>   2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index 5581dbd3bd34..f33f47ec0c90 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -9,7 +9,11 @@
>   struct optimistic_spin_node {
>   	struct optimistic_spin_node *next, *prev;
>   	int locked; /* 1 if lock acquired */
> -	int cpu; /* encoded CPU # + 1 value */
> +	/*
> +	 * Stores an encoded CPU # + 1 value.
> +	 * Only read by other CPUs, so place it in a separate cache line.
> +	 */
> +	int cpu ____cacheline_aligned;
>   };
>   
>   struct optimistic_spin_queue {
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad52b92..17618d62343f 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -96,7 +96,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>   
>   	node->locked = 0;
>   	node->next = NULL;
> -	node->cpu = curr;
> +	/*
> +	 * Once the cpu member has been initialized, its value never
> +	 * changes. Skipping the redundant store avoids cache misses
> +	 * on other CPUs that spin on and read this member.
> +	 */
> +	if (node->cpu != curr)
> +		node->cpu = curr;
>   
>   	/*
>   	 * We need both ACQUIRE (pairs with corresponding RELEASE in
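
To make the layout point concrete, here is roughly what the change does to
struct optimistic_spin_node, assuming 64-bit pointers and 64-byte cache
lines (offsets sketched by hand, not taken from pahole):

	/* before: all four members share one cache line (24 bytes total) */
	struct optimistic_spin_node {
		struct optimistic_spin_node *next;	/* offset  0 */
		struct optimistic_spin_node *prev;	/* offset  8 */
		int locked;				/* offset 16 */
		int cpu;				/* offset 20 */
	};

	/* after: ____cacheline_aligned pushes cpu onto its own line */
	struct optimistic_spin_node {
		struct optimistic_spin_node *next;	/* offset  0 */
		struct optimistic_spin_node *prev;	/* offset  8 */
		int locked;				/* offset 16 */
		int cpu ____cacheline_aligned;		/* offset 64 */
	};	/* size grows from 24 to 128 bytes per CPU */

So writes to next/prev/locked no longer invalidate the line that other CPUs
read cpu from, at the cost of a larger per-CPU node.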

The contention on prev->cpu is due to the vcpu_is_preempted() call in 
osq_lock(). Could you try the attached patch to see if that helps?
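
For reference, that read sits in the main spinning loop of osq_lock()
(paraphrasing the v6.6 code from memory, so minor details may differ):

	/*
	 * Each waiter spins on its own node->locked, but also polls
	 * whether the CPU owning the previous node has been preempted;
	 * node_cpu(node->prev) is the cross-CPU read of prev->cpu.
	 */
	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
				  vcpu_is_preempted(node_cpu(node->prev))))
		goto unqueue;

Every store to prev's node (prev->next, prev->locked) invalidates the
spinner's cached copy of that line, so either separating cpu or reading it
less often avoids the bouncing.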

Thanks,
Longman


View attachment "0001-locking-osq_lock-Minimize-spinning-on-prev-cpu.patch" of type "text/x-patch" (2048 bytes)
