Date: Tue, 20 Feb 2024 13:16:26 -0500
From: Waiman Long <longman@...hat.com>
To: Guo Hui <guohui@...ontech.com>, peterz@...radead.org, mingo@...hat.com,
 will@...nel.org, boqun.feng@...il.com, David.Laight@...LAB.COM
Cc: linux-kernel@...r.kernel.org
Subject: Re: [PATCH] locking/osq_lock: Optimize osq_lock performance using
 per-NUMA


On 2/20/24 02:30, Guo Hui wrote:
> After extensive testing of osq_lock, we found that its performance
> is closely related to the distance between NUMA nodes: the greater
> the distance between NUMA nodes, the more severe the performance
> degradation of osq_lock. When a group of processes contending for
> the same lock all run on one NUMA node, osq_lock performs best;
> when the group is spread across different NUMA nodes, performance
> gets worse as the distance between the nodes increases.
>
> This patch uses the following approach to improve performance:
> the osq_lock queue is divided by NUMA node, so that each NUMA
> node has its own osq list and each CPU is added to the list of
> the node it belongs to. When the last CPU of a NUMA node
> releases osq_lock, the lock is passed on to the next NUMA node.
>
> As shown in the figure below, the last node on NUMA0 (osq_node1)
> passes the lock to the first node (osq_node3) of the next NUMA
> node, NUMA1.
>
> -----------------------------------------------------------
> |            NUMA0           |            NUMA1           |
> |----------------------------|----------------------------|
> |  osq_node0 ---> osq_node1 -|-> osq_node3 ---> osq_node4 |
> -----------------------------------------------------------
>
> An atomic global variable, osq_lock_node, records the number of
> the NUMA node that may currently take the osq_lock. When
> osq_lock_node holds a given node number, the CPUs on that node
> take the lock in turn, and the CPUs on all other NUMA nodes
> poll and wait.
>
> This solution greatly reduces the performance degradation caused
> by communication between CPUs on different NUMA nodes.
>
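For illustration, the handoff described above might look roughly like
the sketch below. The helpers osq_wait_node_turn() and
osq_pass_to_next_node() are hypothetical names; this is only a sketch
of the idea under the stated design, not the patch code itself:

	/* Sketch: NUMA node currently allowed to take the osq lock. */
	static atomic_t osq_lock_node = ATOMIC_INIT(0);

	/* Called by a waiter whose CPU is on a node without the token. */
	static void osq_wait_node_turn(void)
	{
		int node = cpu_to_node(smp_processor_id());

		while (atomic_read(&osq_lock_node) != node)
			cpu_relax();
	}

	/* Called by the last waiter on a node when it releases the lock. */
	static void osq_pass_to_next_node(void)
	{
		int node = cpu_to_node(smp_processor_id());

		atomic_set(&osq_lock_node, (node + 1) % nr_node_ids);
	}
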
> The effect on the 96-core 4-NUMA ARM64 platform is as follows:
> System Benchmarks Partial Index       with patch  without patch  improvement
> File Copy 1024 bufsize 2000 maxblocks   2060.8      980.3        +110.22%
> File Copy 256 bufsize 500 maxblocks     1346.5      601.9        +123.71%
> File Copy 4096 bufsize 8000 maxblocks   4229.9      2216.1       +90.87%
>
> The effect on the 128-core 8-NUMA X86_64 platform is as follows:
> System Benchmarks Partial Index       with patch  without patch  improvement
> File Copy 1024 bufsize 2000 maxblocks   841.1       553.7        +51.91%
> File Copy 256 bufsize 500 maxblocks     517.4       339.8        +52.27%
> File Copy 4096 bufsize 8000 maxblocks   2058.4      1392.8       +47.79%
That is similar in idea to the numa-aware qspinlock patch series.
> Signed-off-by: Guo Hui <guohui@...ontech.com>
> ---
>   include/linux/osq_lock.h  | 20 +++++++++++--
>   kernel/locking/osq_lock.c | 60 +++++++++++++++++++++++++++++++++------
>   2 files changed, 69 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index ea8fb31379e3..c016c1cf5e8b 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -2,6 +2,8 @@
>   #ifndef __LINUX_OSQ_LOCK_H
>   #define __LINUX_OSQ_LOCK_H
>   
> +#include <linux/nodemask.h>
> +
>   /*
>    * An MCS like lock especially tailored for optimistic spinning for sleeping
>    * lock implementations (mutex, rwsem, etc).
> @@ -11,8 +13,9 @@ struct optimistic_spin_queue {
>   	/*
>   	 * Stores an encoded value of the CPU # of the tail node in the queue.
>   	 * If the queue is empty, then it's set to OSQ_UNLOCKED_VAL.
> +	 * The actual number of NUMA nodes is generally not greater than 32.
>   	 */
> -	atomic_t tail;
> +	atomic_t tail[32];

That is a no-go. You are growing every mutex/rwsem by 124 bytes (the
tail goes from a single atomic_t to an array of 32). If you want to
enable this numa-awareness, you have to do it in a way that does not
increase the size of optimistic_spin_queue. My suggestion is to queue
the optimistic_spin_node structures in a numa-aware way in osq_lock.c
without touching optimistic_spin_queue.
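
For concreteness, one possible shape of that suggestion is sketched
below. The numa_node and numa_next fields are hypothetical additions
to the per-CPU node, not existing kernel code:

	/*
	 * optimistic_spin_queue keeps its single atomic_t tail, so
	 * mutex/rwsem do not grow; the NUMA bookkeeping moves into the
	 * per-CPU queue nodes that are already private to osq_lock.c.
	 */
	struct optimistic_spin_node {
		struct optimistic_spin_node *next, *prev;
		int locked;    /* 1 if lock acquired */
		int cpu;       /* encoded CPU # + 1 value */
		int numa_node;                           /* hypothetical */
		struct optimistic_spin_node *numa_next;  /* hypothetical */
	};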

Cheers,
Longman

