linux-kernel - Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor

Date:	Wed, 26 Dec 2012 11:10:08 -0800
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Rik van Riel <riel@...hat.com>
Cc:	Steven Rostedt <rostedt@...dmis.org>, linux-kernel@...r.kernel.org,
	aquini@...hat.com, walken@...gle.com, lwoodman@...hat.com,
	jeremy@...p.org, Jan Beulich <JBeulich@...ell.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Tom Herbert <therbert@...gle.com>
Subject: Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay
 factor

On Fri, 2012-12-21 at 22:50 -0500, Rik van Riel wrote:

> I will try to run this test on a really large SMP system
> in the lab during the break.
> 
> Ideally, the auto-tuning will keep the delay value large
> enough that performance will stay flat even when there are
> 100 CPUs contending over the same lock.
> 
> Maybe it turns out that the maximum allowed delay value
> needs to be larger.  Only one way to find out...
> 

Hi Rik

I ran some tests with your patches, using the following configuration:

tc qdisc add dev eth0 root htb r2q 1000 default 3
(to force contention on the qdisc lock, even with a multiqueue net
device)

and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"

Machine: 2 x Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
(24 threads), with a fast NIC (10Gbps)

Result: a 13% regression (676 Mbits/s -> 595 Mbits/s)

In this workload we have at least two contended spinlocks that need
different delays, because they are not held for the same duration.

This clearly defeats your assumption that a single per-cpu delay is OK:
some cpus keep spinning after the lock has already been released.

We might try hashing the lock address into an array of 16 delays, so
that different spinlocks have a chance of not sharing the same delay.
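
To make the slot selection concrete, here is a minimal userspace sketch
of the idea. It is not the kernel code: demo_hash_32() is a hypothetical
stand-in for hash_32() (the multiplier is an assumption of this sketch),
and struct demo_cpu_delays models the table that one cpu would own.

```c
#include <assert.h>
#include <stdint.h>

#define DELAY_HASH_SHIFT   4
#define NR_DELAY_SLOTS     (1u << DELAY_HASH_SHIFT)
#define MIN_SPINLOCK_DELAY 1

/*
 * Hypothetical stand-in for the kernel's hash_32(): multiply by a large
 * odd constant and keep the top `bits` bits. The multiplier is an
 * assumption of this sketch, not something taken from the patch.
 */
unsigned int demo_hash_32(uint32_t val, unsigned int bits)
{
	assert(bits > 0 && bits <= 32);
	return (uint32_t)(val * UINT32_C(0x9e370001)) >> (32 - bits);
}

/* Map a lock address to one of the 16 delay slots. */
unsigned int demo_delay_slot(const void *lock)
{
	return demo_hash_32((uint32_t)(uintptr_t)lock, DELAY_HASH_SHIFT);
}

/* The delay table owned by one cpu (the real thing is a per-cpu array). */
struct demo_cpu_delays {
	int delay[NR_DELAY_SLOTS];
};

/* Start every slot at the minimum delay. */
void demo_init(struct demo_cpu_delays *d)
{
	for (unsigned int i = 0; i < NR_DELAY_SLOTS; i++)
		d->delay[i] = MIN_SPINLOCK_DELAY;
}

/* Return the tuned delay that this cpu keeps for `lock`. */
int *demo_delay_for(struct demo_cpu_delays *d, const void *lock)
{
	return &d->delay[demo_delay_slot(lock)];
}
```

Two locks can still hash to the same slot; 16 slots only make that
less likely, they do not rule it out.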

With the following patch, I get 982 Mbits/s on the same benchmark, a
45% increase instead of a 13% regression.

 
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..59f98f6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
 #include <linux/gfp.h>
+#include <linux/hash.h>
 
 #include <asm/mtrr.h>
 #include <asm/tlbflush.h>
@@ -113,6 +114,55 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
+ * Wait on a congested ticket spinlock.
+ */
+#define MIN_SPINLOCK_DELAY 1
+#define MAX_SPINLOCK_DELAY 1000
+#define DELAY_HASH_SHIFT 4
+DEFINE_PER_CPU(int [1 << DELAY_HASH_SHIFT], spinlock_delay) = { 
+	MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+	MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+	MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+	MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+	MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+	MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+	MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+	MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+};
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+	unsigned int slot = hash_32((u32)(unsigned long)lock, DELAY_HASH_SHIFT);
+	int delay = __this_cpu_read(spinlock_delay[slot]);
+
+	for (;;) {
+		int loops = delay * (__ticket_t)(inc.tail - inc.head);
+
+		while (loops--)
+			cpu_relax();
+
+		inc.head = ACCESS_ONCE(lock->tickets.head);
+
+		if (inc.head == inc.tail) {
+			/* Decrease the delay, since we may have overslept. */
+			if (delay > MIN_SPINLOCK_DELAY)
+				delay--;
+			break;
+		}
+
+		/*
+		 * The lock is still busy, the delay was not long enough.
+		 * Going through here 2.7 times will, on average, cancel
+		 * out the decrement above. Using a non-integer number
+		 * gets rid of performance artifacts and reduces oversleeping.
+		 */
+		if (delay < MAX_SPINLOCK_DELAY &&
+		    ((inc.head & 3) == 0 || (inc.head & 7) == 1))
+			delay++;
+	}
+	__this_cpu_write(spinlock_delay[slot], delay);
+}
+
+/*
  * this function sends a 'reschedule' IPI to another CPU.
  * it goes straight through and wastes no time serializing
  * anything. Worst case is that we lose a reschedule ...

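For readers following the tuning logic: the "2.7 times" in the comment
corresponds to a test that fires for 3 of every 8 ticket values
(8/3 is about 2.67), and the wait itself scales with the distance to the
head of the queue. A small userspace sketch of both, with hypothetical
names and a 16-bit ticket type assumed for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: tickets are 16 bits wide and wrap around. */
typedef uint16_t demo_ticket_t;

/*
 * Proportional backoff: spin `delay` iterations for every waiter that
 * is still ahead of us in the queue. The cast handles ticket wraparound.
 */
int demo_backoff_loops(int delay, demo_ticket_t head, demo_ticket_t tail)
{
	assert(delay >= 1);
	return delay * (demo_ticket_t)(tail - head);
}

/*
 * True for head values 0, 1 and 4 (mod 8), i.e. 3 out of every 8.
 * On average it takes 8/3 passes to earn one increment.
 */
int demo_should_increment(unsigned int head)
{
	return (head & 3) == 0 || (head & 7) == 1;
}

/* Count how many of the head values 0..n-1 would bump the delay. */
unsigned int demo_count_increments(unsigned int n)
{
	unsigned int hits = 0;

	for (unsigned int head = 0; head < n; head++)
		if (demo_should_increment(head))
			hits++;
	return hits;
}
```

Counting over any multiple of 8 consecutive head values gives the 3/8
ratio, which is where the non-integer average in the comment comes from.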
