linux-kernel - Re: [PATCH] locking/osq_lock: Optimize osq

Message-ID: <1EEDB3CB3E97FFDE+e8c7acce-ec6b-f2b7-9c2d-0d1b471a671f@uniontech.com>
Date: Wed, 21 Feb 2024 10:42:04 +0800
From: Guo Hui <guohui@...ontech.com>
To: Waiman Long <longman@...hat.com>, peterz@...radead.org, mingo@...hat.com,
 will@...nel.org, boqun.feng@...il.com, David.Laight@...LAB.COM
Cc: linux-kernel@...r.kernel.org
Subject: Re: [PATCH] locking/osq_lock: Optimize osq_lock performance using
 per-NUMA

On 2/21/24 2:16 AM, Waiman Long wrote:

>
> On 2/20/24 02:30, Guo Hui wrote:
>> After extensive testing of osq_lock, we found that its performance
>> is closely tied to the distance between NUMA nodes: the greater the
>> distance, the more severe the degradation. When the processes
>> competing for the same lock all run on one NUMA node, osq_lock
>> performs best. When they are spread across different NUMA nodes,
>> performance gets worse as the distance between those nodes increases.
>>
>> This patch improves performance as follows:
>> Split the osq_lock linked list by NUMA node, so that each NUMA node
>> has its own osq linked list. Each CPU is added to the list for
>> its own NUMA node. When the last CPU on a NUMA node releases
>> osq_lock, the lock is passed to the next NUMA node.
>>
>> As shown in the figure below, the last node on NUMA0 (osq_node1) passes
>> the lock to the first node on NUMA1 (osq_node3).
>>
>> -----------------------------------------------------------
>> |            NUMA0           |            NUMA1           |
>> |----------------------------|----------------------------|
>> |  osq_node0 ---> osq_node1 -|-> osq_node3 ---> osq_node4 |
>> -----------------------------|-----------------------------
>>
>> A global atomic variable, osq_lock_node, records the number of the
>> NUMA node that is currently allowed to take osq_lock. While
>> osq_lock_node holds a given node number, the CPUs on that node
>> take the lock in turn, and the CPUs on other NUMA nodes poll
>> and wait.
>>
>> This solution greatly reduces the performance degradation caused
>> by communication between CPUs on different NUMA nodes.
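
To illustrate the handoff rule described above, here is a minimal user-space
sketch. It is a model only, not code from the patch: model_next_node(),
MODEL_NR_NODES and the queued[] array are hypothetical names, and the
ascending wrap-around order and the skipping of nodes with no waiters are
assumptions that the description does not spell out.

```c
#include <assert.h>

#define MODEL_NR_NODES 4	/* hypothetical node count for this model */

/*
 * Hypothetical helper, not taken from the patch: given the node that is
 * currently allowed to take the lock and a per-node count of queued CPUs,
 * return the node that should be allowed to take the lock next.
 *
 * Assumption: nodes are visited in ascending order with wrap-around, and
 * nodes with no queued CPUs are skipped.
 */
static int model_next_node(int cur, const int queued[MODEL_NR_NODES])
{
	assert(cur >= 0 && cur < MODEL_NR_NODES);

	/* The current node keeps the lock until its own queue drains. */
	if (queued[cur] > 0)
		return cur;

	/* Otherwise hand over to the next node that has waiters. */
	for (int step = 1; step < MODEL_NR_NODES; step++) {
		int node = (cur + step) % MODEL_NR_NODES;

		if (queued[node] > 0)
			return node;
	}

	return cur;	/* nobody is waiting on any node */
}
```

With queued = {0, 2, 0, 0} and cur = 0, the model hands the lock to node 1,
which corresponds to the NUMA0 to NUMA1 transition in the figure.
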
>>
>> Results on a 96-core, 4-NUMA-node ARM64 platform:
>> System Benchmarks Partial Index        with patch  without patch  improvement
>> File Copy 1024 bufsize 2000 maxblocks      2060.8          980.3     +110.22%
>> File Copy 256 bufsize 500 maxblocks        1346.5          601.9     +123.71%
>> File Copy 4096 bufsize 8000 maxblocks      4229.9         2216.1      +90.87%
>>
>> Results on a 128-core, 8-NUMA-node X86_64 platform:
>> System Benchmarks Partial Index        with patch  without patch  improvement
>> File Copy 1024 bufsize 2000 maxblocks       841.1          553.7      +51.91%
>> File Copy 256 bufsize 500 maxblocks         517.4          339.8      +52.27%
>> File Copy 4096 bufsize 8000 maxblocks      2058.4         1392.8      +47.79%
> That is similar in idea to the numa-aware qspinlock patch series.
>> Signed-off-by: Guo Hui <guohui@...ontech.com>
>> ---
>>   include/linux/osq_lock.h  | 20 +++++++++++--
>>   kernel/locking/osq_lock.c | 60 +++++++++++++++++++++++++++++++++------
>>   2 files changed, 69 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
>> index ea8fb31379e3..c016c1cf5e8b 100644
>> --- a/include/linux/osq_lock.h
>> +++ b/include/linux/osq_lock.h
>> @@ -2,6 +2,8 @@
>>   #ifndef __LINUX_OSQ_LOCK_H
>>   #define __LINUX_OSQ_LOCK_H
>>
>> +#include <linux/nodemask.h>
>> +
>>   /*
>>    * An MCS like lock especially tailored for optimistic spinning for sleeping
>>    * lock implementations (mutex, rwsem, etc).
>> @@ -11,8 +13,9 @@ struct optimistic_spin_queue {
>>       /*
>>        * Stores an encoded value of the CPU # of the tail node in the queue.
>>        * If the queue is empty, then it's set to OSQ_UNLOCKED_VAL.
>> +     * The actual number of NUMA nodes is generally not greater than 32.
>>        */
>> -    atomic_t tail;
>> +    atomic_t tail[32];
>
> That is a no-go. You are increasing the size of a mutex/rwsem by 128 
> bytes. If you want to enable this numa-awareness, you have to do it in 
> a way without increasing the size of optimistic_spin_queue. My 
> suggestion is to queue optimistic_spin_node in a numa-aware way in 
> osq_lock.c without touching optimistic_spin_queue.
>
> Cheers,
> Longman
>
>
>
Thank you for the suggestion. I will rework the patch along the lines
you describe.
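
For reference, a minimal user-space sketch of the size concern raised above.
The atomic_t here is a stand-in with the same single-int layout as the
kernel type, and osq_single_tail, osq_per_node_tail and osq_growth() are
hypothetical names used only for this comparison. The point is that turning
one tail into tail[32] grows every structure that embeds the queue.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's atomic_t, which wraps a single int. */
typedef struct {
	int counter;
} atomic_t;

/* Layout before the patch: one encoded tail value. */
struct osq_single_tail {
	atomic_t tail;
};

/* Layout proposed by the patch: one tail per NUMA node, up to 32 nodes. */
struct osq_per_node_tail {
	atomic_t tail[32];
};

/* Compile-time checks on the two layouts. */
static_assert(sizeof(struct osq_single_tail) == sizeof(int),
	      "a single tail is one int wide");
static_assert(sizeof(struct osq_per_node_tail) == 32 * sizeof(int),
	      "the per-node array is 32 ints wide");

/* Extra bytes added to every structure that embeds the queue. */
static size_t osq_growth(void)
{
	return sizeof(struct osq_per_node_tail) -
	       sizeof(struct osq_single_tail);
}
```

With a 4-byte int the array variant occupies 128 bytes where the original
occupies 4, and that cost is paid by each mutex and rwsem that embeds the
queue.
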