[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d63dfa33-f194-4368-ab9a-3be27c4cd893@redhat.com>
Date: Mon, 18 Mar 2024 21:59:35 -0400
From: Waiman Long <longman@...hat.com>
To: Guo Hui <guohui@...ontech.com>, peterz@...radead.org, mingo@...hat.com,
will@...nel.org, boqun.feng@...il.com, David.Laight@...LAB.COM
Cc: linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] locking/osq_lock: Optimize osq_lock performance using
per-NUMA
On 3/18/24 01:50, Guo Hui wrote:
> Changes in version v1:
> The queue is divided according to NUMA nodes,
> but the tail of each NUMA node is still stored
> in the structure optimistic_spin_queue.
>
> Changes in version v2:
> 1,On the basis of v1,
> the tail of each NUMA node is separated from
> the structure optimistic_spin_queue,
> and a separate memory space is allocated.
> This memory space is a pre-statically allocated
> fixed-size array osnq_queue[1024], array length is 1024.
> Each array element of osnq_queue is an array of atomic_t type.
> The length of this atomic_t type array is MAX_NUMNODES.
> The total memory size of statically allocated arrays is as follows:
>
> 4 * MAX_NUMNODES * 1024
>
> When MAX_NUMNODES is 64, the memory is 256K,
> and when it is 1024, the memory is 4M.
>
> The relationship between the statically allocated array osnq_queue
> and the structure optimistic_spin_queue is to use the hash value of
> the optimistic_spin_queue structure type pointer as the index of
> the array osnq_queue, and obtain the array element
> corresponding to osq_lock from osnq_queue.
> This array element stores the tail value of
> each NUMA node corresponding to the osq_lock lock.
>
> The form of the osnq_queue array is as follows:
>
> ----------------------------------------------------------
> | | |
> | | MAX_NUMNODES |
> | | |
> | |-------------------------------------|
> | | | | |
> | | atomic_t tail | atomic_t tail | ... |
> | | | | |
> | |-------------------------------------|
> | | | | |
> | | atomic_t tail | atomic_t tail | ... |
> | | | | |
> | |-------------------------------------|
> | | | | |
> | osnq_queue[1024] | atomic_t tail | atomic_t tail | ... |
> | | | | |
> | |-------------------------------------|
> | The hash value ->| | | |
> | of osq_lock is | atomic_t tail | atomic_t tail | ... |
> | the index | | | |
> | |-------------------------------------|
> | | | | |
> | | ... ... | ... ... | ... |
> | | | | |
> |------------------|-------------------------------------|
>
> There is a very small probability that different osq_locks
> with the same hash value will run concurrently on different CPUs.
> After extensive testing, this probability is no greater than 0.01%.
> This situation is also a normal situation.
> In this case, two different osq_locks will share the same
> osnq_queue array element,these two different osq_locks
> share the same NUMA linked list.Put different CPU nodes waiting
> for different osq_locks into the same NUMA linked list,
> which means that CPU nodes with different osq_locks
> share the same lock of the same NUMA queue.
> This is essentially the same method as using zippers
> to resolve hash collisions.
>
> Make an extreme case and set the above osnq_queue array
> to an array element.Then all osq_locks in the kernel
> will share the same queue on different NUMA nodes.
> After verification, the kernel can also run normally.
> However, the performance of some test cases will deteriorate.
> This patch solution greatly reduces the probability of
> shared queues to less than 0.01%,greatly improving
> the kernel osq_lock lock performance.
I don't like the idea of using a hash table and having hash collision.
Is it possible to adopt the numa-aware qspinlock's idea of having a
primary queue of the same numa node and a secondary queue for all the
other nodes? I know it will be more complex because of the need to back
out when rescheduling is needed.
Also I will prefer to make this numa-awareness optional so that it is
determined at boot time if the numa-aware version or the original
version is being used. At the beginning, the numa-aware osq_lock will be
default to off unless it is forced on by a command line or kconfig
option. We want to minimize risk due to the introduction of bug in the
new code.
>
> 2. Achieve fairness in transferring locks between nodes
> and prevent the same NUMA node from holding locks for a long time.
> This method borrows from the qspinlock numa-aware scheme.
>
> The effect on the 96-core 4-NUMA ARM64 platform is as follows:
> System Benchmarks Partial Index with patch without patch promote
> Execl Throughput 7255.8 5632.8 +28.81%
> File Copy 1024 bufsize 2000 maxblocks 1817.2 910.9 +99.50%
> File Copy 256 bufsize 500 maxblocks 1168.1 570.4 +104.79%
> File Copy 4096 bufsize 8000 maxblocks 3321.1 2088.7 +59.00%
>
> The effect on the 128-core 8-NUMA X86_64 platform is as follows:
> System Benchmarks Partial Index with patch without patch promote
> Execl Throughput 3947 3533.6 +11.70%
> File Copy 1024 bufsize 2000 maxblocks 819.1 553 +48.12%
> File Copy 256 bufsize 500 maxblocks 508.5 330.2 +54.00%
> File Copy 4096 bufsize 8000 maxblocks 1982.2 1377.1 +43.94%
>
> Signed-off-by: Guo Hui <guohui@...ontech.com>
> ---
> include/linux/osq_lock.h | 29 ++++++++--
> kernel/locking/osq_lock.c | 111 ++++++++++++++++++++++++++++++++++----
> 2 files changed, 126 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index ea8fb31379e3..3433f13276b8 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -2,6 +2,13 @@
> #ifndef __LINUX_OSQ_LOCK_H
> #define __LINUX_OSQ_LOCK_H
>
> +#include <linux/nodemask.h>
> +#include <linux/hash.h>
> +
> +struct optimistic_spin_numa_queue {
> + atomic_t tail[MAX_NUMNODES]; /* Store the tail of each NUMA queue */
> +};
> +
> /*
> * An MCS like lock especially tailored for optimistic spinning for sleeping
> * lock implementations (mutex, rwsem, etc).
> @@ -9,12 +16,16 @@
>
> struct optimistic_spin_queue {
> /*
> - * Stores an encoded value of the CPU # of the tail node in the queue.
> - * If the queue is empty, then it's set to OSQ_UNLOCKED_VAL.
> + * Stores the hash value of the structure type pointer.
> */
> atomic_t tail;
> };
>
> +#define HASH_BITS_LEN 10
> +
> +/* this value is 2^@...H_BITS_LEN */
> +#define NUMA_QUEUE_SIZE 1024
> +
> #define OSQ_UNLOCKED_VAL (0)
>
> /* Init macro and function. */
> @@ -22,15 +33,25 @@ struct optimistic_spin_queue {
>
> static inline void osq_lock_init(struct optimistic_spin_queue *lock)
> {
> - atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
> + atomic_set(&lock->tail, hash_ptr(lock, HASH_BITS_LEN));
> }
>
> extern bool osq_lock(struct optimistic_spin_queue *lock);
> extern void osq_unlock(struct optimistic_spin_queue *lock);
>
> +extern struct optimistic_spin_numa_queue osnq_queue[NUMA_QUEUE_SIZE];
> +
> static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
> {
> - return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
> + int node;
> + atomic_t *numa_tail = osnq_queue[atomic_read(&lock->tail)].tail;
> +
> + for_each_online_node(node) {
> + if (atomic_read(&numa_tail[node]) != OSQ_UNLOCKED_VAL)
> + return true;
> + }
> +
> + return false;
> }
That greatly increase of cost of calling osq_is_locked().
>
> #endif
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 75a6f6133866..bea6a2784b5e 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -2,6 +2,8 @@
> #include <linux/percpu.h>
> #include <linux/sched.h>
> #include <linux/osq_lock.h>
> +#include <linux/topology.h>
> +#include <linux/random.h>
>
> /*
> * An MCS like lock especially tailored for optimistic spinning for sleeping
> @@ -14,12 +16,48 @@
>
> struct optimistic_spin_node {
> struct optimistic_spin_node *next, *prev;optimistic_spin_queue
> + atomic_t *tail;
Don't use the name "tail" here as it is also used in
optimistic_spin_queue. It is hard to tell which tail is which when
reading the code. Make them unique.
> int locked; /* 1 if lock acquired */
> int cpu; /* encoded CPU # + 1 value */
> + int numa_node;
> };
>
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>
> +/* Use the hash value of the structure optimistic_spin_node type pointer as the index. */
> +struct optimistic_spin_numa_queue osnq_queue[NUMA_QUEUE_SIZE];
> +
> +#define INVALID_NUMA_NODE (-1)
> +
> +/* Per-CPU pseudo-random number seed */
> +static DEFINE_PER_CPU(u32, seed);
> +
> +/*
> + * Controls the probability for intra-node lock hand-off. It can be
> + * tuned and depend, e.g., on the number of CPUs per node. For now,
> + * choose a value that provides reasonable long-term fairness without
> + * sacrificing performance compared to a version that does not have any
> + * fairness guarantees.
> + */
> +#define INTRA_NODE_HANDOFF_PROB_ARG (16)
> +
> +
> +/*
> + * Return false with probability 1 / 2^@..._bits.
> + * Intuitively, the larger @num_bits the less likely false is to be returned.
> + * @num_bits must be a number between 0 and 31.
> + */
> +static bool probably(unsigned int num_bits)
> +{
> + u32 s;
> +
> + s = this_cpu_read(seed);
> + s = next_pseudo_random32(s);
> + this_cpu_write(seed, s);
> +
> + return s & ((1 << num_bits) - 1);
> +}
> +
> /*
> * We use the value 0 to represent "no CPU", thus the encoded value
> * will be the CPU number incremented by 1.
> @@ -58,8 +96,8 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> int curr = encode_cpu(smp_processor_id());
>
> for (;;) {
> - if (atomic_read(&lock->tail) == curr &&
> - atomic_cmpxchg_acquire(&lock->tail, curr, old_cpu) == curr) {
> + if (atomic_read(&node->tail[node->numa_node]) == curr &&
> + atomic_cmpxchg_acquire(&node->tail[node->numa_node], curr, old_cpu) == curr) {
> /*
> * We were the last queued, we moved @lock back. @prev
> * will now observe @lock and will complete its
> @@ -90,6 +128,19 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> }
> }
>
> +static atomic_t osq_lock_node = ATOMIC_INIT(-1);
> +
> +static void osq_wait_numa_node(struct optimistic_spin_node *node)
> +{
> + int old_node;
> +
> + while (!need_resched() &&
> + ((old_node = atomic_cmpxchg_acquire(&osq_lock_node, INVALID_NUMA_NODE,
> + node->numa_node)) != INVALID_NUMA_NODE) &&
> + (node->numa_node != old_node))
> + cpu_relax();
> +}
> +
> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> @@ -100,6 +151,8 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> node->locked = 0;
> node->next = NULL;
> node->cpu = curr;
> + node->numa_node = cpu_to_node(smp_processor_id());
> + node->tail = osnq_queue[atomic_read(&lock->tail)].tail;
>
> /*optimistic_spin_queue
> * We need both ACQUIRE (pairs with corresponding RELEASE in
> @@ -107,9 +160,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * the node fields we just initialised) semantics when updating
> * the lock tail.
> */
> - old = atomic_xchg(&lock->tail, curr);
> - if (old == OSQ_UNLOCKED_VAL)
> + old = atomic_xchg(&node->tail[node->numa_node], curr);
> + if (old == OSQ_UNLOCKED_VAL) {
> + osq_wait_numa_node(node);
> return true;
> + }
>
> prev = decode_cpu(old);
> node->prev = prev;
> @@ -144,8 +199,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * polling, be careful.
> */
> if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> - vcpu_is_preempted(node_cpu(node->prev))))
> + vcpu_is_preempted(node_cpu(node->prev)))) {
> + osq_wait_numa_node(node);
> return true;
> + }
>
> /* unqueue */
> /*
> @@ -170,8 +227,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * in which case we should observe @node->locked becoming
> * true.
> */
> - if (smp_load_acquire(&node->locked))
> + if (smp_load_acquire(&node->locked)) {
> + osq_wait_numa_node(node);
> return true;
> + }
>
> cpu_relax();
>
> @@ -207,29 +266,61 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> return false;
> }
>
> +/*
> + * Pass the lock to the next NUMA node.
> + */
> +static void pass_lock_numa_node(struct optimistic_spin_node *node)
> +{
> + int curr_node = cpu_to_node(smp_processor_id());
> + int numa_node = curr_node;
> + int num_nodes = num_online_nodes();
> +
> + do {
> + numa_node = (numa_node + 1) % num_nodes;
> + if (numa_node == curr_node) {
> + atomic_set(&osq_lock_node, INVALID_NUMA_NODE);
> + return;
> + }
> + } while (atomic_read(&node->tail[numa_node]) == OSQ_UNLOCKED_VAL);
> + atomic_set(&osq_lock_node, numa_node);
> +}
> +
> +static inline void pass_lock_fair(struct optimistic_spin_node *node)
> +{
> + if (!probably(INTRA_NODE_HANDOFF_PROB_ARG))
> + pass_lock_numa_node(node);
> +}
> +
> void osq_unlock(struct optimistic_spin_queue *lock)
> {
> struct optimistic_spin_node *node, *next;
> int curr = encode_cpu(smp_processor_id());
>
> + node = this_cpu_ptr(&osq_node);
> +
> /*
> * Fast path for the uncontended case.
> */
> - if (likely(atomic_cmpxchg_release(&lock->tail, curr,
> - OSQ_UNLOCKED_VAL) == curr))
> + if (likely(atomic_cmpxchg_release(&node->tail[node->numa_node], curr,
> + OSQ_UNLOCKED_VAL) == curr)) {
> + pass_lock_numa_node(node);
> return;
> + }
>
> /*
> * Second most likely case.
> */
> - node = this_cpu_ptr(&osq_node);
> next = xchg(&node->next, NULL);
> if (next) {
> WRITE_ONCE(next->locked, 1);
> + pass_lock_fair(node);
> return;
> }
>
> next = osq_wait_next(lock, node, OSQ_UNLOCKED_VAL);
> - if (next)
> + if (next) {
> WRITE_ONCE(next->locked, 1);
> + pass_lock_fair(node);
> + } else
> + pass_lock_numa_node(node);
> }
With numa-aware qspinlock, most of the action is in the unlock code to
scan the queue.
Cheers,
Longman
Powered by blists - more mailing lists