Message-ID: <03c1eb2e-4eae-49be-94cb-b90894cc00a9@redhat.com>
Date: Tue, 20 Feb 2024 13:16:26 -0500
From: Waiman Long <longman@...hat.com>
To: Guo Hui <guohui@...ontech.com>, peterz@...radead.org, mingo@...hat.com,
will@...nel.org, boqun.feng@...il.com, David.Laight@...LAB.COM
Cc: linux-kernel@...r.kernel.org
Subject: Re: [PATCH] locking/osq_lock: Optimize osq_lock performance using
per-NUMA

On 2/20/24 02:30, Guo Hui wrote:
> After extensive testing of osq_lock,
> we found that the performance of osq_lock is closely related to
> the distance between NUMA nodes. The greater the distance
> between NUMA nodes, the more serious the performance degradation of
> osq_lock. When a group of processes that need to compete for
> the same lock are on the same NUMA node, the performance of osq_lock
> is the best. When the group of processes is distributed on
> different NUMA nodes, as the distance between NUMA nodes increases,
> the performance of osq_lock becomes worse.
>
> This patch uses the following solutions to improve performance:
> Divide the osq_lock linked list according to NUMA nodes.
> Each NUMA node corresponds to an osq linked list.
> Each CPU is added to the linked list corresponding to
> its respective NUMA node. When the last CPU of
> the NUMA node releases osq_lock, osq_lock is passed to
> the next NUMA node.
>
> As shown in the figure below, the last osq_node1 on NUMA0 passes the lock
> to the first node (osq_node3) of the next NUMA1 node.
>
> -----------------------------------------------------------
> |           NUMA0            |           NUMA1            |
> |----------------------------|----------------------------|
> | osq_node0 ---> osq_node1 --|-> osq_node3 ---> osq_node4 |
> -----------------------------|-----------------------------
>
> Set an atomic type global variable osq_lock_node to
> record the NUMA node number that can currently obtain
> the osq_lock lock. When the osq_lock_node value is
> a certain node number, the CPUs on that node obtain
> the osq_lock lock in turn, and the CPUs on
> other NUMA nodes poll and wait.
>
> This solution greatly reduces the performance degradation caused
> by communication between CPUs on different NUMA nodes.
>
> The effect on the 96-core 4-NUMA ARM64 platform is as follows:
> System Benchmarks Partial Index        with patch   without patch   improvement
> File Copy 1024 bufsize 2000 maxblocks      2060.8           980.3      +110.22%
> File Copy 256 bufsize 500 maxblocks        1346.5           601.9      +123.71%
> File Copy 4096 bufsize 8000 maxblocks      4229.9          2216.1       +90.87%
>
> The effect on the 128-core 8-NUMA X86_64 platform is as follows:
> System Benchmarks Partial Index        with patch   without patch   improvement
> File Copy 1024 bufsize 2000 maxblocks       841.1           553.7       +51.91%
> File Copy 256 bufsize 500 maxblocks         517.4           339.8       +52.27%
> File Copy 4096 bufsize 8000 maxblocks      2058.4          1392.8       +47.79%

That is similar in idea to the NUMA-aware qspinlock patch series.

> Signed-off-by: Guo Hui <guohui@...ontech.com>
> ---
> include/linux/osq_lock.h | 20 +++++++++++--
> kernel/locking/osq_lock.c | 60 +++++++++++++++++++++++++++++++++------
> 2 files changed, 69 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index ea8fb31379e3..c016c1cf5e8b 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -2,6 +2,8 @@
> #ifndef __LINUX_OSQ_LOCK_H
> #define __LINUX_OSQ_LOCK_H
>
> +#include <linux/nodemask.h>
> +
> /*
> * An MCS like lock especially tailored for optimistic spinning for sleeping
> * lock implementations (mutex, rwsem, etc).
> @@ -11,8 +13,9 @@ struct optimistic_spin_queue {
> /*
> * Stores an encoded value of the CPU # of the tail node in the queue.
> * If the queue is empty, then it's set to OSQ_UNLOCKED_VAL.
> + * The actual number of NUMA nodes is generally not greater than 32.
> */
> - atomic_t tail;
> + atomic_t tail[32];

That is a no-go. You are increasing the size of a mutex/rwsem by 128
bytes. If you want to enable this NUMA awareness, you have to do it
without increasing the size of optimistic_spin_queue. My suggestion
is to queue optimistic_spin_node in a NUMA-aware way in osq_lock.c
without touching optimistic_spin_queue.
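
To make the size cost concrete, here is a minimal userspace sketch, not
kernel code. It assumes atomic_t is a struct wrapping a single int; the
struct and function names below are hypothetical stand-ins for the two
layouts of optimistic_spin_queue. With a 4-byte int, the 32-entry array
occupies 128 bytes, so each embedded queue head grows by 124 bytes, which
is roughly the 128 bytes mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's atomic_t (assumption: one int inside a struct). */
typedef struct { int counter; } atomic_t;

/* Hypothetical name: layout before the patch, a single encoded tail. */
struct osq_single_tail {
	atomic_t tail;
};

/* Hypothetical name: layout proposed by the patch, one tail per NUMA node. */
struct osq_per_node_tail {
	atomic_t tail[32];
};

/* Extra bytes that every lock embedding the queue head would pay. */
size_t osq_growth_bytes(void)
{
	return sizeof(struct osq_per_node_tail) - sizeof(struct osq_single_tail);
}

/* Sanity checks on the two layouts; neither struct has padding. */
void osq_layout_check(void)
{
	assert(sizeof(struct osq_single_tail) == sizeof(atomic_t));
	assert(sizeof(struct osq_per_node_tail) == 32 * sizeof(atomic_t));
}
```

The growth is paid by every lock that embeds the queue head, whether or
not that lock is ever contended across nodes. That is why keeping the
per-node bookkeeping inside osq_lock.c looks preferable.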

Cheers,
Longman