Date: Thu, 9 Jan 2020 10:38:31 -0500
From: Waiman Long <longman@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
	Will Deacon <will.deacon@....com>
Cc: linux-kernel@...r.kernel.org, Waiman Long <longman@...hat.com>
Subject: [PATCH] locking/osq: Use more optimized spinning for arm64

Arm64 has a more optimized spinning loop (atomic_cond_read_acquire) for
spinlocks that can boost the performance of sibling threads by putting
the current CPU into a shallow sleep state, from which it is woken when
the monitored variable changes or an external event happens.

OSQ has a more complicated spinning loop. Besides the lock value, it also
checks for need_resched() and vcpu_is_preempted(). The need_resched()
check is not a problem, as the flag is only set by the tick interrupt
handler and will be detected by the spinning CPU right after the
interrupt returns. The vcpu_is_preempted() check, however, is a problem,
as a change in the preemption state of the previous node's CPU does not
touch the monitored variable and therefore will not end the sleep state.
For arm64, vcpu_is_preempted() is not defined, so we can just skip the
vcpu_is_preempted() check and use smp_cond_load_relaxed() instead.

On a 2-socket 56-core 224-thread arm64 system, a kernel mutex locking
microbenchmark was run for 10s with and without the patch. The
performance numbers before the patch were:

  Running locktest with mutex [runtime = 10s, load = 1]
  Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
  Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s

After the patch, the numbers were:

  Running locktest with mutex [runtime = 10s, load = 1]
  Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
  Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s

So there was about a 20% performance improvement.

Longer term, we may have to define and use a static_key to indicate that
vcpu_is_preempted() is defined and may actually return true.

Signed-off-by: Waiman Long <longman@...hat.com>
---
 kernel/locking/osq_lock.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 6ef600aa0f47..129e8f56ae71 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -134,6 +134,27 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	 * cmpxchg in an attempt to undo our queueing.
 	 */

+	/*
+	 * If vcpu_is_preempted is not defined, we can skip the check
+	 * and use smp_cond_load_relaxed() instead. For arm64, this
+	 * could lead to the use of the more optimized wfe instruction.
+	 * As need_resched() is set by the interrupt handler, it will
+	 * break out and do the unqueue in a timely manner.
+	 *
+	 * TODO: We may need to add a static_key like vcpu_is_preemptible
+	 * as vcpu_is_preempted() will always return false on
+	 * bare metal even if it is defined.
+	 */
+#ifndef vcpu_is_preempted
+	{
+		int locked = smp_cond_load_relaxed(&node->locked,
+						   VAL || need_resched());
+		if (!locked)
+			goto unqueue;
+		return true;
+	}
+#endif
+
 	while (!READ_ONCE(node->locked)) {
 		/*
 		 * If we need to reschedule bail... so we can block.
--
2.18.1
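
[Editorial note] For context on the primitive the patch switches to: the
generic fallback of smp_cond_load_relaxed() is just a polling loop, while
the arm64 override waits with an exclusive load plus WFE, so the CPU can
doze until the monitored cache line is written or an event/interrupt
arrives. The sketch below is a simplified, OSQ-specific paraphrase of that
generic fallback, not the actual kernel macro; the helper name is made up
and kernel context (READ_ONCE(), cpu_relax(), need_resched()) is assumed.

/*
 * Simplified illustration only -- not the kernel macro. This is roughly
 * what smp_cond_load_relaxed(&node->locked, VAL || need_resched())
 * boils down to with the generic fallback; on arm64 the cpu_relax()
 * busy-wait is replaced by an exclusive-load + WFE wait.
 */
static int osq_wait_locked_generic_sketch(int *locked)
{
	int val;

	for (;;) {
		val = READ_ONCE(*locked);	/* reload the monitored word */
		if (val || need_resched())	/* got the lock, or must bail */
			return val;
		cpu_relax();			/* generic case: plain busy-wait hint */
	}
}

Because the wait condition is re-evaluated after every wake-up, the tick
interrupt that sets need_resched() is enough to get the CPU out of WFE and
into the unqueue path, which is the behaviour the patch relies on.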
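
[Editorial note] On the TODO about a static_key: one possible shape, purely
as a hypothetical sketch (the key, the helper and their wiring do not exist
in the kernel; the name vcpu_is_preemptible is just the one suggested in
the patch comment), would be to gate the choice between the single
monitored wait and the existing polling loop inside osq_lock.c:

/*
 * Hypothetical sketch of the static_key idea from the TODO above; none of
 * this exists in the kernel. Assumes kernel context inside
 * kernel/locking/osq_lock.c (struct optimistic_spin_node, jump labels,
 * smp_cond_load_relaxed(), vcpu_is_preempted()).
 */
DEFINE_STATIC_KEY_FALSE(vcpu_is_preemptible);

/*
 * A paravirt guest that can really report preemption would enable the key
 * once during boot, e.g. static_branch_enable(&vcpu_is_preemptible).
 *
 * Returns true when the previous owner handed the lock over, false when
 * the waiter should bail out and unqueue.
 */
static bool osq_wait_for_handoff(struct optimistic_spin_node *node, int prev_cpu)
{
	/* Bare metal: a single monitored wait (WFE on arm64) is enough. */
	if (!static_branch_unlikely(&vcpu_is_preemptible))
		return smp_cond_load_relaxed(&node->locked,
					     VAL || need_resched());

	/* Paravirt: keep polling so vcpu_is_preempted() is re-evaluated. */
	while (!READ_ONCE(node->locked)) {
		if (need_resched() || vcpu_is_preempted(prev_cpu))
			return false;
		cpu_relax();
	}
	return true;
}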