Date:   Thu,  9 Jan 2020 10:38:31 -0500
From:   Waiman Long <longman@...hat.com>
To:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Will Deacon <will.deacon@....com>
Cc:     linux-kernel@...r.kernel.org, Waiman Long <longman@...hat.com>
Subject: [PATCH] locking/osq: Use more optimized spinning for arm64

Arm64 has a more optimized spinning loop (atomic_cond_read_acquire)
for spinlock that can boost performance of sibling threads by putting
the current cpu to a shallow sleep state that is woken up when the
monitored variable changes or an external event happens.
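
For readers unfamiliar with the conditional-read primitive, a rough
userspace analogue is sketched below using C11 atomics. It is only an
illustration: the names relax_hint() and cond_read_acquire_nonzero()
are made up for this example, and the "yield" hint merely stands in for
the event-based wait that arm64 uses inside the kernel, which plain
userspace code does not reproduce.

```c
#include <stdatomic.h>

/* Hypothetical helper: a polite busy-wait hint, not the kernel's wait. */
static inline void relax_hint(void)
{
#if defined(__aarch64__)
	__asm__ __volatile__("yield" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
	__asm__ __volatile__("pause" ::: "memory");
#endif
}

/*
 * Spin with relaxed loads until *var becomes non-zero, then upgrade to
 * acquire ordering with a fence. Returns the non-zero value observed.
 */
static int cond_read_acquire_nonzero(atomic_int *var)
{
	int val;

	while (!(val = atomic_load_explicit(var, memory_order_relaxed)))
		relax_hint();

	atomic_thread_fence(memory_order_acquire);
	return val;
}
```

A waiter would call cond_read_acquire_nonzero(&flag) while another
thread eventually stores a non-zero value to flag with release ordering.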

OSQ has a more complicated spinning loop. Besides the lock value, it
also checks for need_resched() and vcpu_is_preempted(). The check for
need_resched() is not a problem as it is only set by the tick interrupt
handler. That will be detected by the spinning cpu right after it
returns from the interrupt.

The vcpu_is_preempted() check, however, is a problem as changes to
the state of the previous node will not wake the spinning cpu from its
sleep state. For ARM64, vcpu_is_preempted is not defined and so we can
just skip the vcpu_is_preempted() check and use smp_cond_load_relaxed()
instead.
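
The compile-time selection described above can be sketched in userspace
roughly as follows. This is an assumption-laden illustration, not the
kernel code: struct osq_node_sketch, resched_requested and
wait_locked_or_resched() are hypothetical stand-ins for the OSQ node,
need_resched() and the spinning part of osq_lock().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-cpu OSQ node. */
struct osq_node_sketch {
	atomic_int locked;
};

/* Hypothetical stand-in for need_resched(). */
static atomic_bool resched_requested;

/*
 * Returns true if the lock was handed to us, false if the caller should
 * give up and unqueue. prev_cpu is only used when the platform defines
 * vcpu_is_preempted.
 */
static bool wait_locked_or_resched(struct osq_node_sketch *node, int prev_cpu)
{
#ifndef vcpu_is_preempted
	/* Single wait condition: lock handed over, or resched requested. */
	(void)prev_cpu;
	for (;;) {
		if (atomic_load_explicit(&node->locked, memory_order_relaxed))
			return true;
		if (atomic_load_explicit(&resched_requested,
					 memory_order_relaxed))
			return false;
	}
#else
	/* Original style loop: also bail if the previous vcpu is preempted. */
	while (!atomic_load_explicit(&node->locked, memory_order_relaxed)) {
		if (atomic_load_explicit(&resched_requested,
					 memory_order_relaxed) ||
		    vcpu_is_preempted(prev_cpu))
			return false;
	}
	return true;
#endif
}
```

When vcpu_is_preempted is not defined, the whole wait collapses into one
condition on node->locked plus the resched flag, which is the shape that
a conditional-load primitive can wait on.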

On a 2-socket 56-core 224-thread ARM64 system, a kernel mutex locking
microbenchmark was run for 10s with and without the patch. The
performance numbers before the patch were:

Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s

After the patch, the numbers were:

Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s

So there was about a 20% improvement in the total rate.
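
The arithmetic behind that figure can be checked with a small helper.
The function names below are made up for this example; the inputs are
the total rates and thread count reported above.

```c
/* Relative gain of "after" over "before", in percent. */
static double percent_gain(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Average rate per thread, in the same unit as the total rate. */
static double per_cpu_rate(double total, int threads)
{
	return total / threads;
}
```

With the reported totals, percent_gain(2757.0, 3311.0) comes out at
slightly above 20, and per_cpu_rate() gives roughly 12 and 15 kop/s for
224 threads, in line with the rounded per-cpu rates in the output.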

Longer term, we may have to define and use a static_key to indicate
that vcpu_is_preempted is defined and it may return a value of true.

Signed-off-by: Waiman Long <longman@...hat.com>
---
 kernel/locking/osq_lock.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 6ef600aa0f47..129e8f56ae71 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -134,6 +134,27 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	 * cmpxchg in an attempt to undo our queueing.
 	 */
 
+	/*
+	 * If vcpu_is_preempted is not defined, we can skip the check
+	 * and use smp_cond_load_relaxed() instead. For arm64, this
+	 * could lead to the use of the more optimized wfe instruction.
+	 * As need_resched() is set by interrupt handler, it will break
+	 * out and do the unqueue in a timely manner.
+	 *
+	 * TODO: We may need to add a static_key like vcpu_is_preemptible
+	 *	 as vcpu_is_preempted() will always return false with
+	 *	 bare metal even if it is defined.
+	 */
+#ifndef vcpu_is_preempted
+	{
+		int locked = smp_cond_load_relaxed(&node->locked,
+						   VAL || need_resched());
+		if (!locked)
+			goto unqueue;
+		return true;
+	}
+#endif
+
 	while (!READ_ONCE(node->locked)) {
 		/*
 		 * If we need to reschedule bail... so we can block.
-- 
2.18.1
