linux-kernel - [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v3b)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Sun, 10 Jan 2010 23:30:16 -0500
From:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
To:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
Cc:	Steven Rostedt <rostedt@...dmis.org>,
	Oleg Nesterov <oleg@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
	akpm@...ux-foundation.org, josh@...htriplett.org,
	tglx@...utronix.de, Valdis.Kletnieks@...edu, dhowells@...hat.com,
	laijs@...fujitsu.com, dipankar@...ibm.com
Subject: [RFC PATCH] introduce sys_membarrier(): process-wide memory
	barrier (v3b)

Here is an implementation of a new system call, sys_membarrier(), which
executes a memory barrier on all threads of the current process.
 
It aims at greatly simplifying and enhancing the current signal-based
liburcu userspace RCU synchronize_rcu() implementation.
(found at http://lttng.org/urcu)

Changelog since v1:

- Only perform the IPI in CONFIG_SMP.
- Only perform the IPI if the process has more than one thread.
- Only send IPIs to CPUs involved with threads belonging to our process.
- Adaptative IPI scheme (single vs many IPI with threshold).
- Issue smp_mb() at the beginning and end of the system call.

Changelog since v2:

- Iteration on min(num_online_cpus(), nr threads in the process),
  taking runqueue spinlocks, allocating a cpumask, ipi to many to the
  cpumask. Does not allocate the cpumask if only a single IPI is needed.


Both the signal-based and the sys_membarrier userspace RCU schemes
permit us to remove the memory barrier from the userspace RCU
rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
accelerating them. These memory barriers are replaced by compiler
barriers on the read-side, and all matching memory barriers on the 
write-side are turned into an invokation of a memory barrier on all
active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every process
threads (as we currently do), we diminish the number of unnecessary wake
ups and only issue the memory barriers on active threads. Non-running
threads do not need to execute such barrier anyway, because these are
implied by the scheduler context switches.

To explain the benefit of this scheme, let's introduce two example threads:
 
Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A synchronize_rcu() are
ordering memory accesses with respect to smp_mb() present in 
rcu_read_lock/unlock(), we can change all smp_mb() from
synchronize_rcu() into calls to sys_membarrier() and all smp_mb() from
rcu_read_lock/unlock() into compiler barriers "barrier()".

Before the change, we had, for each smp_mb() pairs:

Thread A                    Thread B
prev mem accesses           prev mem accesses
smp_mb()                    smp_mb()
follow mem accesses         follow mem accesses

After the change, these pairs become:

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory
accesses do not happen concurrently with Thread A accesses (1), or they
do (2).

1) Non-concurrent Thread A vs Thread B accesses:

Thread A                    Thread B
prev mem accesses
sys_membarrier()
follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK,
because at that point, thread A is not particularly interested in
ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses

Thread A                    Thread B
prev mem accesses           prev mem accesses
sys_membarrier()            barrier()
follow mem accesses         follow mem accesses

In this case, thread B accesses, which are ensured to be in program
order thanks to the compiler barrier, will be "upgraded" to full
smp_mb() thanks to the IPIs executing memory barriers on each active
system threads. Each non-running process threads are intrinsically
serialized by the scheduler.

Just tried with a cache-hot kernel compilation using 6/8 CPUs.

Normally:                                              real 2m41.852s
With the sys_membarrier+1 busy-looping thread running: real 5m41.830s

So... 2x slower. That hurts.

So let's try allocating a cpu mask for PeterZ scheme. I prefer to have a
small allocation overhead and benefit from cpumask broadcast if
possible so we scale better. But that all depends on how big the
allocation overhead is.

Impact of allocating a cpumask (time for 10,000,000 sys_membarrier
calls, one thread is doing the sys_membarrier, the others are busy
looping)).  Given that it costs almost half as much to perform the
cpumask allocation than to send a single IPI, as we iterate on the CPUs
until we find more than N match or iterated on all cpus.  If we only have
N match or less, we send single IPIs. If we need more than that, then we
switch to the cpumask allocation and send a broadcast IPI to the cpumask
we construct for the matching CPUs. Let's call it the "adaptative IPI
scheme".

For my Intel Xeon E5405

*This is calibration only, not taking the runqueue locks*

Just doing local mb()+single IPI to T other threads:

T=1: 0m18.801s
T=2: 0m29.086s
T=3: 0m46.841s
T=4: 0m53.758s
T=5: 1m10.856s
T=6: 1m21.142s
T=7: 1m38.362s

Just doing cpumask alloc+IPI-many to T other threads:

T=1: 0m21.778s
T=2: 0m22.741s
T=3: 0m22.185s
T=4: 0m24.660s
T=5: 0m26.855s
T=6: 0m30.841s
T=7: 0m29.551s

So I think the right threshold should be 1 thread (assuming other
architecture will behave like mine). So starting with 2 threads, we
allocate the cpumask before sending IPIs.

*end of calibration*

Resulting adaptative scheme, with runqueue locks:

T=1: 0m20.990s
T=2: 0m22.588s
T=3: 0m27.028s
T=4: 0m29.027s
T=5: 0m32.592s
T=6: 0m36.556s
T=7: 0m33.093s

The expected top pattern, when using 1 CPU for a thread doing sys_membarrier()
in a loop and other threads busy-waiting in user-space on a variable shows that
the thread doing sys_membarrier is doing mostly system calls, and other threads
are mostly running in user-space. Side-note, in this test, it's important to
check that individual threads are not always fully at 100% user-space time (they
range between ~95% and 100%), because when some thread in the test is always at
100% on the same CPU, this means it does not get the IPI at all. (I actually
found out about a bug in my own code while developing it with this test.)

Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 99.7%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
Cpu2  : 99.3%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.7%hi,  0.0%si,  0.0%st
Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 96.0%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  2.6%si,  0.0%st
Cpu6  :  1.3%us, 98.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 96.1%us,  3.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st

The system call number is only assigned for x86_64 in this RFC patch.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
CC: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
CC: mingo@...e.hu
CC: laijs@...fujitsu.com
CC: dipankar@...ibm.com
CC: akpm@...ux-foundation.org
CC: josh@...htriplett.org
CC: dvhltc@...ibm.com
CC: niv@...ibm.com
CC: tglx@...utronix.de
CC: peterz@...radead.org
CC: rostedt@...dmis.org
CC: Valdis.Kletnieks@...edu
CC: dhowells@...hat.com
---
 arch/x86/include/asm/unistd_64.h |    2 
 kernel/sched.c                   |  219 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 221 insertions(+)

Index: linux-2.6-lttng/arch/x86/include/asm/unistd_64.h
===================================================================
--- linux-2.6-lttng.orig/arch/x86/include/asm/unistd_64.h	2010-01-10 22:23:59.000000000 -0500
+++ linux-2.6-lttng/arch/x86/include/asm/unistd_64.h	2010-01-10 22:29:30.000000000 -0500
@@ -661,6 +661,8 @@ __SYSCALL(__NR_pwritev, sys_pwritev)
 __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 #define __NR_perf_event_open			298
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
+#define __NR_membarrier				299
+__SYSCALL(__NR_membarrier, sys_membarrier)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c	2010-01-10 22:23:59.000000000 -0500
+++ linux-2.6-lttng/kernel/sched.c	2010-01-10 23:12:35.000000000 -0500
@@ -119,6 +119,11 @@
  */
 #define RUNTIME_INF	((u64)~0ULL)
 
+/*
+ * IPI vs cpumask broadcast threshold. Threshold of 1 IPI.
+ */
+#define ADAPT_IPI_THRESHOLD	1
+
 static inline int rt_policy(int policy)
 {
 	if (unlikely(policy == SCHED_FIFO || policy == SCHED_RR))
@@ -10822,6 +10827,220 @@ struct cgroup_subsys cpuacct_subsys = {
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
 
+/*
+ * Execute a memory barrier on all CPUs on SMP systems.
+ * Do not rely on implicit barriers in smp_call_function(), just in case they
+ * are ever relaxed in the future.
+ */
+static void membarrier_ipi(void *unused)
+{
+	smp_mb();
+}
+
+/*
+ * Handle out-of-mem by sending per-cpu IPIs instead.
+ */
+static void membarrier_cpus_retry(int this_cpu)
+{
+	struct mm_struct *mm;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		if (unlikely(cpu == this_cpu))
+			continue;
+		spin_lock_irq(&cpu_rq(cpu)->lock);
+		mm = cpu_curr(cpu)->mm;
+		spin_unlock_irq(&cpu_rq(cpu)->lock);
+		if (current->mm == mm)
+			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
+	}
+}
+
+static void membarrier_threads_retry(int this_cpu)
+{
+	struct mm_struct *mm;
+	struct task_struct *t;
+	struct rq *rq;
+	int cpu;
+
+	list_for_each_entry_rcu(t, &current->thread_group, thread_group) {
+		local_irq_disable();
+		rq = __task_rq_lock(t);
+		mm = rq->curr->mm;
+		cpu = rq->cpu;
+		__task_rq_unlock(rq);
+		local_irq_enable();
+		if (cpu == this_cpu)
+			continue;
+		if (current->mm == mm)
+			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
+	}
+}
+
+static void membarrier_cpus(int this_cpu)
+{
+	int cpu, i, cpu_ipi[ADAPT_IPI_THRESHOLD], nr_cpus = 0;
+	cpumask_var_t tmpmask;
+	struct mm_struct *mm;
+
+	/* Get CPU IDs up to threshold */
+	for_each_online_cpu(cpu) {
+		if (unlikely(cpu == this_cpu))
+			continue;
+		spin_lock_irq(&cpu_rq(cpu)->lock);
+		mm = cpu_curr(cpu)->mm;
+		spin_unlock_irq(&cpu_rq(cpu)->lock);
+		if (current->mm == mm) {
+			if (nr_cpus == ADAPT_IPI_THRESHOLD) {
+				nr_cpus++;
+				break;
+			}
+			cpu_ipi[nr_cpus++] = cpu;
+		}
+	}
+	if (likely(nr_cpus <= ADAPT_IPI_THRESHOLD)) {
+		for (i = 0; i < nr_cpus; i++) {
+			smp_call_function_single(cpu_ipi[i],
+						 membarrier_ipi,
+						 NULL, 1);
+		}
+	} else {
+		if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
+			membarrier_cpus_retry(this_cpu);
+			return;
+		}
+		for (i = 0; i < ADAPT_IPI_THRESHOLD; i++)
+			cpumask_set_cpu(cpu_ipi[i], tmpmask);
+		/* Continue previous online cpu iteration */
+		cpumask_set_cpu(cpu, tmpmask);
+		for (;;) {
+			cpu = cpumask_next(cpu, cpu_online_mask);
+			if (unlikely(cpu == this_cpu))
+				continue;
+			if (unlikely(cpu >= nr_cpu_ids))
+				break;
+			spin_lock_irq(&cpu_rq(cpu)->lock);
+			mm = cpu_curr(cpu)->mm;
+			spin_unlock_irq(&cpu_rq(cpu)->lock);
+			if (current->mm == mm)
+				cpumask_set_cpu(cpu, tmpmask);
+		}
+		smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
+		free_cpumask_var(tmpmask);
+	}
+}
+
+static void membarrier_threads(int this_cpu)
+{
+	int cpu, i, cpu_ipi[ADAPT_IPI_THRESHOLD], nr_cpus = 0;
+	cpumask_var_t tmpmask;
+	struct mm_struct *mm;
+	struct task_struct *t;
+	struct rq *rq;
+
+	/* Get CPU IDs up to threshold */
+	list_for_each_entry_rcu(t, &current->thread_group,
+				thread_group) {
+		local_irq_disable();
+		rq = __task_rq_lock(t);
+		mm = rq->curr->mm;
+		cpu = rq->cpu;
+		__task_rq_unlock(rq);
+		local_irq_enable();
+		if (cpu == this_cpu)
+			continue;
+		if (current->mm == mm) {
+			if (nr_cpus == ADAPT_IPI_THRESHOLD) {
+				nr_cpus++;
+				break;
+			}
+			cpu_ipi[nr_cpus++] = cpu;
+		}
+	}
+	if (likely(nr_cpus <= ADAPT_IPI_THRESHOLD)) {
+		for (i = 0; i < nr_cpus; i++) {
+			smp_call_function_single(cpu_ipi[i],
+						 membarrier_ipi,
+						 NULL, 1);
+		}
+	} else {
+		if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
+			membarrier_threads_retry(this_cpu);
+			return;
+		}
+		for (i = 0; i < ADAPT_IPI_THRESHOLD; i++)
+			cpumask_set_cpu(cpu_ipi[i], tmpmask);
+		/* Continue previous thread iteration */
+		cpumask_set_cpu(cpu, tmpmask);
+		list_for_each_entry_continue_rcu(t,
+						 &current->thread_group,
+						 thread_group) {
+			local_irq_disable();
+			rq = __task_rq_lock(t);
+			mm = rq->curr->mm;
+			cpu = rq->cpu;
+			__task_rq_unlock(rq);
+			local_irq_enable();
+			if (cpu == this_cpu)
+				continue;
+			if (current->mm == mm)
+				cpumask_set_cpu(cpu, tmpmask);
+		}
+		smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
+		free_cpumask_var(tmpmask);
+	}
+}
+
+/*
+ * sys_membarrier - issue memory barrier on current process running threads
+ *
+ * Execute a memory barrier on all running threads of the current process.
+ * Upon completion, the caller thread is ensured that all process threads
+ * have passed through a state where memory accesses match program order.
+ * (non-running threads are de facto in such a state)
+ *
+ * We do not use mm_cpumask because there is no guarantee that each architecture
+ * switch_mm issues a smp_mb() before and after mm_cpumask modification upon
+ * scheduling change. Furthermore, leave_mm is also modifying the mm_cpumask (at
+ * least on x86) from the TLB flush IPI handler. So rather than playing tricky
+ * games with lazy TLB flush, let's simply iterate on online cpus/thread group,
+ * whichever is the smallest.
+ */
+SYSCALL_DEFINE0(membarrier)
+{
+#ifdef CONFIG_SMP
+	int this_cpu;
+
+	if (unlikely(thread_group_empty(current)))
+		return 0;
+
+	rcu_read_lock();	/* protect cpu_curr(cpu)-> and rcu list */
+	preempt_disable();
+	/*
+	 * Memory barrier on the caller thread _before_ sending first IPI.
+	 */
+	smp_mb();
+	/*
+	 * We don't need to include ourself in IPI, as we already
+	 * surround our execution with memory barriers.
+	 */
+	this_cpu = smp_processor_id();
+	/* Approximate which is fastest: CPU or thread group iteration ? */
+	if (num_online_cpus() <= atomic_read(&current->mm->mm_users))
+		membarrier_cpus(this_cpu);
+	else
+		membarrier_threads(this_cpu);
+	/*
+	 * Memory barrier on the caller thread _after_ we finished
+	 * waiting for the last IPI.
+	 */
+	smp_mb();
+	preempt_enable();
+	rcu_read_unlock();
+#endif	/* #ifdef CONFIG_SMP */
+	return 0;
+}
+
 #ifndef CONFIG_SMP
 
 int rcu_expedited_torture_stats(char *page)
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/