linux-kernel - [RFC PATCH] percpu system call: fast userspace percpu critical sections

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <1432219487-13364-1-git-send-email-mathieu.desnoyers@efficios.com>
Date:	Thu, 21 May 2015 10:44:47 -0400
From:	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:	Paul Turner <pjt@...gle.com>, Andrew Hunter <ahh@...gle.com>,
	Ben Maurer <bmaurer@...com>
Cc:	linux-kernel@...r.kernel.org,
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Josh Triplett <josh@...htriplett.org>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: [RFC PATCH] percpu system call: fast userspace percpu critical sections

Expose a new system call allowing userspace threads to register
a TLS area used as an ABI between the kernel and userspace to
share information required to create efficient per-cpu critical
sections in user-space.

This ABI consists of a thread-local structure containing:

- a nesting count surrounding the critical section,
- a signal number to be sent to the thread when preempting a thread
  with non-zero nesting count,
- a flag indicating whether the signal has been sent within the
  critical section,
- an integer where to store the current CPU number, updated whenever
  the thread is preempted. This CPU number cache is not strictly
  needed, but performs better than getcpu vdso.

This approach is inspired by Paul Turner and Andrew Hunter's work
on percpu atomics, which lets the kernel handle restart of critical
sections, ref. http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf

What is done differently here compared to percpu atomics: we track
a single nesting counter per thread rather than many ranges of
instruction pointer values. We deliver a signal to user-space and
let the logic of restart be handled in user-space, thus moving
the complexity out of the kernel. The nesting counter approach
allows us to skip the complexity of interacting with signals that
would be otherwise needed with the percpu atomics approach, which
needs to know which instruction pointers are preempted, including
when preemption occurs on a signal handler nested over an instruction
pointer of interest.

Advantages of this approach over percpu atomics:
- kernel code is relatively simple: complexity of restart sections
  is in user-space,
- easy to port to other architectures: just need to reserve a new
  system call,
- for threads which have registered a TLS structure, the fast-path
  at preemption is only a nesting counter check, along with the
  optional store of the current CPU number, rather than comparing
  instruction pointer with possibly many registered ranges,

Caveats of this approach compared to the percpu atomics:
- We need a signal number for this, so it cannot be done without
  designing the application accordingly,
- Handling restart in user-space is currently performed with page
  protection, for which we install a SIGSEGV signal handler. Again,
  this requires designing the application accordingly, especially
  if the application installs its own segmentation fault handler,
- It cannot be used for tracing of processes by injection of code
  into their address space, due to interactions with application
  signal handlers.

The user-space proof of concept code implementing the restart section
can be found here: https://github.com/compudj/percpu-dev

Benchmarking sched_getcpu() vs tls cache approach. Getting the
current CPU number:

- With Linux vdso:            12.7 ns
- With TLS-cached cpu number:  0.3 ns

We will use the TLS-cached cpu number for the following
benchmarks.

On an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, comparison
with a baseline running very few load/stores (no locking,
no getcpu, assuming one thread per CPU with affinity),
against locking scheme based on "lock; cmpxchg", "cmpxchg"
(using restart signal), load-store (using restart signal).
This is performed with 32 threads on a 16-core, hyperthread
system:

                 ns/loop      overhead (ns)
Baseline:          3.7           0.0
lock; cmpxchg:    22.0          18.3
cmpxchg:          11.1           7.4
load-store:        9.4           5.7

Therefore, the load-store scheme has a speedup of 3.2x over the
"lock; cmpxchg" scheme if both are using the tls-cache for the
CPU number. If we use Linux sched_getcpu() for "lock; cmpxchg"
we reach of speedup of 5.4x for load-store+tls-cache vs
"lock; cmpxchg"+vdso-getcpu.

I'm sending this out to trigger discussion, and hopefully to see
Paul and Andrew's patches being posted publicly at some point, so
we can compare our approaches.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
CC: Paul Turner <pjt@...gle.com>
CC: Andrew Hunter <ahh@...gle.com>
CC: Peter Zijlstra <peterz@...radead.org>
CC: Ingo Molnar <mingo@...hat.com>
CC: Ben Maurer <bmaurer@...com>
CC: Steven Rostedt <rostedt@...dmis.org>
CC: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
CC: Josh Triplett <josh@...htriplett.org>
CC: Lai Jiangshan <laijs@...fujitsu.com>
CC: Linus Torvalds <torvalds@...ux-foundation.org>
CC: Andrew Morton <akpm@...ux-foundation.org>
---
 arch/x86/syscalls/syscall_64.tbl  |   1 +
 fs/exec.c                         |   1 +
 include/linux/sched.h             |  18 ++++++
 include/uapi/asm-generic/unistd.h |   4 +-
 init/Kconfig                      |  10 +++
 kernel/Makefile                   |   1 +
 kernel/fork.c                     |   2 +
 kernel/percpu-user.c              | 126 ++++++++++++++++++++++++++++++++++++++
 kernel/sys_ni.c                   |   3 +
 9 files changed, 165 insertions(+), 1 deletion(-)
 create mode 100644 kernel/percpu-user.c

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..0499703 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	common	percpu			sys_percpu
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/exec.c b/fs/exec.c
index c7f9b73..0a2f0b2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1555,6 +1555,7 @@ static int do_execveat_common(int fd, struct filename *filename,
 	/* execve succeeded */
 	current->fs->in_exec = 0;
 	current->in_execve = 0;
+	percpu_user_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current);
 	free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a419b65..9c88bff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1275,6 +1275,8 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+struct thread_percpu_user;
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1710,6 +1712,10 @@ struct task_struct {
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 	unsigned long	task_state_change;
 #endif
+#ifdef CONFIG_PERCPU_USER
+	struct preempt_notifier percpu_user_notifier;
+	struct thread_percpu_user __user *percpu_user;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
@@ -3090,4 +3096,16 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+#ifdef CONFIG_PERCPU_USER
+void percpu_user_fork(struct task_struct *t);
+void percpu_user_execve(struct task_struct *t);
+#else
+static inline void percpu_user_fork(struct task_struct *t)
+{
+}
+static inline void percpu_user_execve(struct task_struct *t)
+{
+}
+#endif
+
 #endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..f4350d9 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
 __SYSCALL(__NR_bpf, sys_bpf)
 #define __NR_execveat 281
 __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_percpu 282
+__SYSCALL(__NR_percpu, sys_percpu)
 
 #undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283
 
 /*
  * All syscalls below here should go away really,
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..73c4070 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1559,6 +1559,16 @@ config PCI_QUIRKS
 	  bugs/quirks. Disable this only if your target machine is
 	  unaffected by PCI quirks.
 
+config PERCPU_USER
+	bool "Enable percpu() system call" if EXPERT
+	default y
+	select PREEMPT_NOTIFIERS
+	help
+	  Enable the percpu() system call which provides a building block
+	  for fast per-cpu critical sections in user-space.
+
+	  If unsure, say Y.
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 1408b33..76919a6 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -96,6 +96,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
 obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
 obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_PERCPU_USER) += percpu-user.o
 
 $(obj)/configs.o: $(obj)/config_data.h
 
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..63aaf5a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1549,6 +1549,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	cgroup_post_fork(p);
 	if (clone_flags & CLONE_THREAD)
 		threadgroup_change_end(current);
+	if (!(clone_flags & CLONE_THREAD))
+		percpu_user_fork(p);
 	perf_event_fork(p);
 
 	trace_task_newtask(p, clone_flags);
diff --git a/kernel/percpu-user.c b/kernel/percpu-user.c
new file mode 100644
index 0000000..be3d439
--- /dev/null
+++ b/kernel/percpu-user.c
@@ -0,0 +1,126 @@
+/*
+ * Copyright (C) 2015 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
+ *
+ * percpu system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/preempt.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+
+struct thread_percpu_user {
+	int32_t nesting;
+	int32_t signal_sent;
+	int32_t signo;
+	int32_t current_cpu;
+};
+
+static void percpu_user_sched_in(struct preempt_notifier *notifier, int cpu)
+{
+	struct thread_percpu_user __user *tpu_user;
+	struct thread_percpu_user tpu;
+	struct task_struct *t = current;
+
+	tpu_user = t->percpu_user;
+	if (tpu_user == NULL)
+		return;
+	if (unlikely(t->flags & PF_EXITING))
+		return;
+	/*
+	 * access_ok() of tpu_user has already been checked by sys_percpu().
+	 */
+	if (__put_user(smp_processor_id(), &tpu_user->current_cpu)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+	if (__copy_from_user(&tpu, tpu_user, sizeof(tpu))) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+	if (!tpu.nesting || tpu.signal_sent)
+		return;
+	if (do_send_sig_info(tpu.signo, SEND_SIG_PRIV, t, 0)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+	tpu.signal_sent = 1;
+	if (__copy_to_user(tpu_user, &tpu, sizeof(tpu))) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+}
+
+static void percpu_user_sched_out(struct preempt_notifier *notifier,
+		struct task_struct *next)
+{
+}
+
+static struct preempt_ops percpu_user_ops = {
+	.sched_in = percpu_user_sched_in,
+	.sched_out = percpu_user_sched_out,
+};
+
+/*
+ * If parent had a percpu-user preempt notifier, we need to setup our own.
+ */
+void percpu_user_fork(struct task_struct *t)
+{
+	struct task_struct *parent = current;
+
+	if (!parent->percpu_user)
+		return;
+	preempt_notifier_init(&t->percpu_user_notifier, &percpu_user_ops);
+	preempt_notifier_register(&t->percpu_user_notifier);
+	t->percpu_user = parent->percpu_user;
+}
+
+void percpu_user_execve(struct task_struct *t)
+{
+	if (!t->percpu_user)
+		return;
+	preempt_notifier_unregister(&t->percpu_user_notifier);
+	t->percpu_user = NULL;
+}
+
+/*
+ * sys_percpu - setup user-space per-cpu critical section for caller thread
+ */
+SYSCALL_DEFINE1(percpu, struct thread_percpu_user __user *, tpu)
+{
+	struct task_struct *t = current;
+
+	if (tpu == NULL) {
+		if (t->percpu_user)
+			preempt_notifier_unregister(&t->percpu_user_notifier);
+		goto set_tpu;
+	}
+	if (!access_ok(VERIFY_WRITE, tpu, sizeof(struct thread_percpu_user)))
+		return -EFAULT;
+	preempt_disable();
+	if (__put_user(smp_processor_id(), &tpu->current_cpu)) {
+		WARN_ON_ONCE(1);
+		preempt_enable();
+		return -EFAULT;
+	}
+	preempt_enable();
+	if (!current->percpu_user) {
+		preempt_notifier_init(&t->percpu_user_notifier,
+				&percpu_user_ops);
+		preempt_notifier_register(&t->percpu_user_notifier);
+	}
+set_tpu:
+	current->percpu_user = tpu;
+	return 0;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..16e2bc8 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -229,3 +229,6 @@ cond_syscall(sys_bpf);
 
 /* execveat */
 cond_syscall(sys_execveat);
+
+/* percpu userspace critical sections */
+cond_syscall(sys_percpu);
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/