Message-ID: <1d9b40ac-f5d8-4242-bb12-92b7a50a3d05@zhaoxin.com>
Date: Thu, 19 Sep 2024 17:41:15 +0800
From: yongli-os <yongli-oc@...oxin.com>
To: Waiman Long <longman@...hat.com>, <peterz@...radead.org>,
<mingo@...hat.com>, <will@...nel.org>, <boqun.feng@...il.com>
CC: <linux-kernel@...r.kernel.org>, <yongli@...oxin.com>,
<louisqi@...oxin.com>, <cobechen@...oxin.com>, <jiangbowang@...oxin.com>
Subject: Re: [PATCH 4/4] locking/osq_lock: The numa-aware lock memory prepare,
assign and cleanup.
On 2024/9/15 01:21, Waiman Long wrote:
>
>
> [This email comes from an external sender. Beware of risks.]
>
> On 9/14/24 04:53, yongli-oc wrote:
>> The numa-aware lock kernel memory cache preparation, and a
>> workqueue to turn numa-aware lock back to osq lock.
>> The /proc interface. Enable dynamic switch by
>> echo 1 > /proc/zx_numa_lock/dynamic_enable
>>
>> Signed-off-by: yongli-oc <yongli-oc@...oxin.com>
>> ---
>> kernel/locking/zx_numa.c | 537 +++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 537 insertions(+)
>> create mode 100644 kernel/locking/zx_numa.c
>>
>> diff --git a/kernel/locking/zx_numa.c b/kernel/locking/zx_numa.c
>> new file mode 100644
>> index 000000000000..89df6670a024
>> --- /dev/null
>> +++ b/kernel/locking/zx_numa.c
>> @@ -0,0 +1,537 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Dynamic numa-aware osq lock
>> + * Crossing from numa-aware lock to osq_lock
>> + * Numa lock memory initialize and /proc interface
>> + * Author: LiYong <yongli-oc@...oxin.com>
>> + *
>> + */
>> +#include <linux/cpumask.h>
>> +#include <asm/byteorder.h>
>> +#include <asm/kvm_para.h>
>> +#include <linux/percpu.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/osq_lock.h>
>> +#include <linux/module.h>
>> +#include <linux/proc_fs.h>
>> +#include <linux/seq_file.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/reboot.h>
>> +
>> +#include "numa.h"
>> +#include "numa_osq.h"
>> +
>> +int enable_zx_numa_osq_lock;
>> +struct delayed_work zx_numa_start_work;
>> +struct delayed_work zx_numa_cleanup_work;
>> +
>> +atomic_t numa_count;
>> +struct _numa_buf *zx_numa_entry;
>> +int zx_numa_lock_total = 256;
>> +LIST_HEAD(_zx_numa_head);
>> +LIST_HEAD(_zx_numa_lock_head);
>> +
>> +struct kmem_cache *zx_numa_entry_cachep;
>> +struct kmem_cache *zx_numa_lock_cachep;
>> +int NUMASHIFT;
>> +int NUMACLUSTERS;
>> +static atomic_t lockindex;
>> +int dynamic_enable;
>> +
>> +static const struct numa_cpu_info numa_cpu_list[] = {
>> + /*feature1=1, a numa node includes two clusters*/
>> + //{1, 23, X86_VENDOR_AMD, 0, 1},
>> + {0x5b, 7, X86_VENDOR_CENTAUR, 0, 1},
>> + {0x5b, 7, X86_VENDOR_ZHAOXIN, 0, 1}
>> +};
>
> Why is this zx_*() code specifically for the ZhaoXin and Centaur family
> of CPUs? Are there some special hardware features that are specific to
> these CPUs?
The Zhaoxin CPU is an x86 architecture processor. It has no special
hardware features related to the dynamic numa-aware lock patch.
But since different processors always have different NUMA architecture
features, I listed the Zhaoxin CPUs only.
When I tested the patch, I found the AMD EPYC 7551 behaves much like
the Zhaoxin CPU: on both, one NUMA node consists of two clusters, and
unlocking processes within one cluster is much faster than unlocking
them across the whole NUMA node.
I am not sure whether it fits the AMD CPU or not, so I commented out
the entry for the AMD CPU.
> BTW, your patch series lacks performance data to justify the addition of
> quite a lot of complexity to the core locking code. We are unlikely to
> take this without sufficient justification.
>
In the cover letter, there are performance test results for the AMD
EPYC 7551 and the Zhaoxin KH40000. I listed perf epoll, locktorture
mutex, unixbench and fxmark.
Which tests do you think are most important for lock performance?
I will do more tests in the next submission.
> Another question that I have is whether the base osq_lock() can coexist
> with your x_osq_lock(). A cpu can dynamically switch from using
> osq_lock() to x_osq_lock() and vice versa. What happens if some CPUs
> use osq_lock() while others use x_osq_lock()? Will that cause a
> problem? Have you fully tested this scenario to make sure that nothing
> breaks?
> Cheers,
> Longman
The x_osq_lock uses a 16-bit tail; before turning to the numa-aware
lock, the code is nearly the same as osq_lock. In my opinion, judging
from the Intel instruction set, the 32-bit atomic_xchg and the 16-bit
cmpxchg both carry the LOCK prefix, so the cacheline holding the tail
is accessed exclusively in either case.
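To make the layout concrete, something like this is what I mean (a
rough sketch of how I picture the 32-bit field being split on
little-endian x86; the names are illustrative, not the exact patch
code):

/*
 * Hedged sketch, not the exact patch code: the low 16 bits hold the
 * queue tail used by x_osq_lock(), the high 16 bits hold the
 * numa_enable state. A plain osq_lock()/osq_unlock() still operates
 * on the whole 32-bit word with LOCK-prefixed instructions.
 */
struct optimistic_spin_queue_sketch {
	union {
		atomic_t val;			/* whole 32-bit word */
		struct {
			u16 tail;		/* 16-bit tail, x_osq_lock path */
			u16 numa_enable;	/* OSQTONUMADETECT etc. */
		};
	};
};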
After the dynamic switch is enabled, some processes will enter
x_osq_lock/x_osq_unlock; when such a process finds itself at the queue
tail, it atomically sets numa_enable to OSQTONUMADETECT. If some
processes are still in osq_lock, numa_enable will be cleared by the
atomic_xchg and the old &= 0xffff; it will be set again the next time
x_osq_unlock meets the queue tail.
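A minimal sketch of that set/clear interplay, under the same layout
assumption as above (hedged, not the exact patch code;
OSQTONUMADETECT comes from the patch's headers):

/* Unlocker that finds itself at the queue tail requests detection. */
static void x_osq_unlock_mark_detect(atomic_t *osq, int my_tail)
{
	int old = atomic_read(osq);

	if ((old & 0xffff) == my_tail)	/* I am the queue tail */
		atomic_cmpxchg(osq, old,
			       (OSQTONUMADETECT << 16) | (old & 0xffff));
}
/*
 * A CPU still in plain osq_unlock() overwrites the whole 32-bit word,
 * which is the "old &= 0xffff" effect that clears numa_enable until
 * the next time x_osq_unlock() meets the queue tail.
 */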
After numa_enable is set to OSQTONUMADETECT, x_osq_unlock starts to
record the contention depth (the serial number in the queue tail's
optimistic_spin_node minus that of the current unlocking CPU's node).
If the depth is more than osq_lock_depth, it starts to increase the
locked variable in struct optimistic_spin_node. After the locked
variable is more than osq_keep_times, it starts to turn to the
numa-aware lock.
If some processes are in osq_lock/osq_unlock, the locked variable is
always set to 1, so the count never accumulates.
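Roughly, the detection amounts to the following (a hedged sketch;
"serial" is the extra queue-position field the patch adds to struct
optimistic_spin_node, and osq_lock_depth/osq_keep_times are its
tunables):

static bool unlock_should_turn_numa(struct optimistic_spin_node *me,
				    int tail_serial)
{
	int depth = tail_serial - me->serial;	/* waiters queued behind me */

	if (depth <= osq_lock_depth)
		return false;			/* contention not deep enough */

	me->locked++;				/* one more deep-contention event */
	/* A process in plain osq_lock() keeps writing locked = 1 here. */
	return me->locked > osq_keep_times;
}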
So by the time numa_enable is set to OSQLOCKSTOPPING and the switch to
the numa-aware lock starts, so many lock()/unlock() calls have already
finished that all the processes should read enable_zx_numa_osq_lock as
2 and execute x_osq_lock().
Considering that there is no need to enable/disable the dynamic switch
frequently, I did not add stopping protection here.
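For clarity, the dispatch I have in mind looks roughly like this (a
hedged sketch; the real selection code in the patch may differ):

static bool sketch_lock_entry(struct optimistic_spin_queue *lock)
{
	/* 2: dynamic switch active, everyone must take x_osq_lock(). */
	if (READ_ONCE(enable_zx_numa_osq_lock) == 2)
		return x_osq_lock(lock);
	return osq_lock(lock);			/* plain osq path */
}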
I prefer to use x_osq_lock to replace osq_lock when
CONFIG_LOCK_SPIN_ON_OWNER_NUMA=y.
As far as I know, on x86_64 with the LOCK prefix, the performance of a
32-bit operand is nearly the same as that of a 16-bit operand. From
the test results in the cover letter, with one or two processes the
performance difference is very small. I do not know whether the same
holds on other platforms.
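If it helps the discussion, a userspace micro-test along these lines
could compare the two operand widths (a hedged sketch with
deliberately naive timing, not a rigorous benchmark):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

static long elapsed_ns(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
	uint16_t v16 = 0;
	uint32_t v32 = 0;
	struct timespec a, b;

	/* xchg on x86 is implicitly LOCK-prefixed for both widths. */
	clock_gettime(CLOCK_MONOTONIC, &a);
	for (unsigned long i = 0; i < ITERS; i++)
		__atomic_exchange_n(&v16, (uint16_t)i, __ATOMIC_SEQ_CST);
	clock_gettime(CLOCK_MONOTONIC, &b);
	printf("16-bit xchg: %ld ns\n", elapsed_ns(a, b));

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (unsigned long i = 0; i < ITERS; i++)
		__atomic_exchange_n(&v32, (uint32_t)i, __ATOMIC_SEQ_CST);
	clock_gettime(CLOCK_MONOTONIC, &b);
	printf("32-bit xchg: %ld ns\n", elapsed_ns(a, b));
	return 0;
}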
Best regards,
Li Yong