Message-ID: <1d9b40ac-f5d8-4242-bb12-92b7a50a3d05@zhaoxin.com>
Date: Thu, 19 Sep 2024 17:41:15 +0800
From: yongli-os <yongli-oc@...oxin.com>
To: Waiman Long <longman@...hat.com>, <peterz@...radead.org>,
<mingo@...hat.com>, <will@...nel.org>, <boqun.feng@...il.com>
CC: <linux-kernel@...r.kernel.org>, <yongli@...oxin.com>,
<louisqi@...oxin.com>, <cobechen@...oxin.com>, <jiangbowang@...oxin.com>
Subject: Re: [PATCH 4/4] locking/osq_lock: The numa-aware lock memory prepare,
assign and cleanup.
On 2024/9/15 01:21, Waiman Long wrote:
>
>
> [This email comes from an external sender. Beware of risks.]
>
> On 9/14/24 04:53, yongli-oc wrote:
>> The numa-aware lock kernel memory cache preparation, and a
>> workqueue to turn numa-aware lock back to osq lock.
>> The /proc interface. Enable dynamic switch by
>> echo 1 > /proc/zx_numa_lock/dynamic_enable
>>
>> Signed-off-by: yongli-oc <yongli-oc@...oxin.com>
>> ---
>> kernel/locking/zx_numa.c | 537 +++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 537 insertions(+)
>> create mode 100644 kernel/locking/zx_numa.c
>>
>> diff --git a/kernel/locking/zx_numa.c b/kernel/locking/zx_numa.c
>> new file mode 100644
>> index 000000000000..89df6670a024
>> --- /dev/null
>> +++ b/kernel/locking/zx_numa.c
>> @@ -0,0 +1,537 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Dynamic numa-aware osq lock
>> + * Crossing from numa-aware lock to osq_lock
>> + * Numa lock memory initialize and /proc interface
>> + * Author: LiYong <yongli-oc@...oxin.com>
>> + *
>> + */
>> +#include <linux/cpumask.h>
>> +#include <asm/byteorder.h>
>> +#include <asm/kvm_para.h>
>> +#include <linux/percpu.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/osq_lock.h>
>> +#include <linux/module.h>
>> +#include <linux/proc_fs.h>
>> +#include <linux/seq_file.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/reboot.h>
>> +
>> +#include "numa.h"
>> +#include "numa_osq.h"
>> +
>> +int enable_zx_numa_osq_lock;
>> +struct delayed_work zx_numa_start_work;
>> +struct delayed_work zx_numa_cleanup_work;
>> +
>> +atomic_t numa_count;
>> +struct _numa_buf *zx_numa_entry;
>> +int zx_numa_lock_total = 256;
>> +LIST_HEAD(_zx_numa_head);
>> +LIST_HEAD(_zx_numa_lock_head);
>> +
>> +struct kmem_cache *zx_numa_entry_cachep;
>> +struct kmem_cache *zx_numa_lock_cachep;
>> +int NUMASHIFT;
>> +int NUMACLUSTERS;
>> +static atomic_t lockindex;
>> +int dynamic_enable;
>> +
>> +static const struct numa_cpu_info numa_cpu_list[] = {
>> + /*feature1=1, a numa node includes two clusters*/
>> + //{1, 23, X86_VENDOR_AMD, 0, 1},
>> + {0x5b, 7, X86_VENDOR_CENTAUR, 0, 1},
>> + {0x5b, 7, X86_VENDOR_ZHAOXIN, 0, 1}
>> +};
>
> Why is this zx_*() code specifically for the ZhaoXin and Centaur family
> of CPUs? Are there some special hardware features that are specific to
> these CPUs?
The Zhaoxin CPU is an x86 architecture processor. It has no special
hardware features related to the dynamic numa-aware lock patch.
But since different processors always have different NUMA architecture
features, I listed the Zhaoxin CPUs only.
When I tested the patch, I found the AMD EPYC 7551 behaves much like
the Zhaoxin CPU: on both, one NUMA node consists of two clusters, and
unlocking processes within one cluster is much faster than unlocking
them across the whole NUMA node.
I am not sure whether it fits the AMD CPU or not, so I commented out
the entry for the AMD CPU.
> BTW, your patch series lacks performance data to justify the addition of
> quite a lot of complexity to the core locking code. We are unlikely to
> take this without sufficient justification.
>
In the cover letter, there are performance test results for the AMD
EPYC 7551 and the Zhaoxin KH40000. I listed perf epoll, locktorture
mutex, unixbench and fxmark.
Which tests do you think are most important for lock performance?
I will do more tests in the next submission.
> Another question that I have is whether the base osq_lock() can coexist
> with your x_osq_lock(). A cpu can dynamically switch from using
> osq_lock() to x_osq_lock() and vice versa. What happens if some CPUs
> use osq_lock() while others use x_osq_lock()? Will that cause a
> problem? Have you fully tested this scenario to make sure that nothing
> breaks?
> Cheers,
> Longman
The x_osq_lock uses a 16-bit tail; before turning to the numa-aware
lock, the code is nearly the same as osq_lock. In my opinion, judging
from the Intel instruction set, the 32-bit atomic_xchg and the 16-bit
cmpxchg both carry the LOCK prefix, so the cacheline holding the tail
is accessed exclusively in either case.
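To make the layout concrete, something like this is what I mean (a
rough sketch of how I picture the 32-bit field being split on
little-endian x86; the names are illustrative, not the exact patch
code):

/*
 * Hedged sketch, not the exact patch code: the low 16 bits hold the
 * queue tail used by x_osq_lock(), the high 16 bits hold the
 * numa_enable state. A plain osq_lock()/osq_unlock() still operates
 * on the whole 32-bit word with LOCK-prefixed instructions.
 */
struct optimistic_spin_queue_sketch {
	union {
		atomic_t val;			/* whole 32-bit word */
		struct {
			u16 tail;		/* 16-bit tail, x_osq_lock path */
			u16 numa_enable;	/* OSQTONUMADETECT etc. */
		};
	};
};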
After the dynamic switch is enabled, some processes will enter
x_osq_lock/x_osq_unlock; when such a process finds itself at the queue
tail, it atomically sets numa_enable to OSQTONUMADETECT. If some
processes are still in osq_lock, numa_enable will be cleared by the
atomic_xchg and the old &= 0xffff; it will be set again the next time
x_osq_unlock meets the queue tail.
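A minimal sketch of that set/clear interplay, under the same layout
assumption as above (hedged, not the exact patch code;
OSQTONUMADETECT comes from the patch's headers):

/* Unlocker that finds itself at the queue tail requests detection. */
static void x_osq_unlock_mark_detect(atomic_t *osq, int my_tail)
{
	int old = atomic_read(osq);

	if ((old & 0xffff) == my_tail)	/* I am the queue tail */
		atomic_cmpxchg(osq, old,
			       (OSQTONUMADETECT << 16) | (old & 0xffff));
}
/*
 * A CPU still in plain osq_unlock() overwrites the whole 32-bit word,
 * which is the "old &= 0xffff" effect that clears numa_enable until
 * the next time x_osq_unlock() meets the queue tail.
 */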
After numa_enable is set to OSQTONUMADETECT, x_osq_unlock starts to
record the contention depth (the serial number in the queue tail's
optimistic_spin_node minus that of the current unlocking CPU's node).
If the depth is more than osq_lock_depth, it starts to increase the
locked variable in struct optimistic_spin_node. After the locked
variable is more than osq_keep_times, it starts to turn to the
numa-aware lock.
If some processes are in osq_lock/osq_unlock, the locked variable is
always set to 1, so the count never accumulates.
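Roughly, the detection amounts to the following (a hedged sketch;
"serial" is the extra queue-position field the patch adds to struct
optimistic_spin_node, and osq_lock_depth/osq_keep_times are its
tunables):

static bool unlock_should_turn_numa(struct optimistic_spin_node *me,
				    int tail_serial)
{
	int depth = tail_serial - me->serial;	/* waiters queued behind me */

	if (depth <= osq_lock_depth)
		return false;			/* contention not deep enough */

	me->locked++;				/* one more deep-contention event */
	/* A process in plain osq_lock() keeps writing locked = 1 here. */
	return me->locked > osq_keep_times;
}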
So by the time numa_enable is set to OSQLOCKSTOPPING and the switch to
the numa-aware lock starts, so many lock()/unlock() calls have already
finished that all the processes should read enable_zx_numa_osq_lock as
2 and execute x_osq_lock().
Considering that there is no need to enable/disable the dynamic switch
frequently, I did not add stopping protection here.
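For clarity, the dispatch I have in mind looks roughly like this (a
hedged sketch; the real selection code in the patch may differ):

static bool sketch_lock_entry(struct optimistic_spin_queue *lock)
{
	/* 2: dynamic switch active, everyone must take x_osq_lock(). */
	if (READ_ONCE(enable_zx_numa_osq_lock) == 2)
		return x_osq_lock(lock);
	return osq_lock(lock);			/* plain osq path */
}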
I prefer to use x_osq_lock to replace osq_lock when
CONFIG_LOCK_SPIN_ON_OWNER_NUMA=y.
As far as I know, on x86_64 with the LOCK prefix, the performance of a
32-bit operand is nearly the same as that of a 16-bit operand. From
the test results in the cover letter, with one or two processes the
performance difference is very small. I do not know whether the same
holds on other platforms.
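If it helps the discussion, a userspace micro-test along these lines
could compare the two operand widths (a hedged sketch with
deliberately naive timing, not a rigorous benchmark):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

static long elapsed_ns(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
	uint16_t v16 = 0;
	uint32_t v32 = 0;
	struct timespec a, b;

	/* xchg on x86 is implicitly LOCK-prefixed for both widths. */
	clock_gettime(CLOCK_MONOTONIC, &a);
	for (unsigned long i = 0; i < ITERS; i++)
		__atomic_exchange_n(&v16, (uint16_t)i, __ATOMIC_SEQ_CST);
	clock_gettime(CLOCK_MONOTONIC, &b);
	printf("16-bit xchg: %ld ns\n", elapsed_ns(a, b));

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (unsigned long i = 0; i < ITERS; i++)
		__atomic_exchange_n(&v32, (uint32_t)i, __ATOMIC_SEQ_CST);
	clock_gettime(CLOCK_MONOTONIC, &b);
	printf("32-bit xchg: %ld ns\n", elapsed_ns(a, b));
	return 0;
}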
Best regards,
Li Yong