Date:	Fri, 22 May 2015 20:12:51 +0000 (UTC)
From:	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:	Paul Turner <pjt@...gle.com>
Cc:	Andrew Hunter <ahh@...gle.com>, Ben Maurer <bmaurer@...com>,
	LKML <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Josh Triplett <josh@...htriplett.org>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [RFC PATCH] percpu system call: fast userspace percpu critical
 sections

----- Original Message -----
> Very nice!
> 
> Some inline notes:
> 
> On Thu, May 21, 2015 at 7:44 AM, Mathieu Desnoyers
> <mathieu.desnoyers@...icios.com> wrote:
> > Expose a new system call allowing userspace threads to register
> > a TLS area used as an ABI between the kernel and userspace to
> > share information required to create efficient per-cpu critical
> > sections in user-space.
> >
> > This ABI consists of a thread-local structure containing:
> >
> > - a nesting count surrounding the critical section,
> > - a signal number to be sent to the thread when preempting a thread
> >   with non-zero nesting count,
> > - a flag indicating whether the signal has been sent within the
> >   critical section,
> > - an integer in which to store the current CPU number, updated whenever
> >   the thread is preempted. This CPU number cache is not strictly
> >   needed, but it performs better than the getcpu vdso.
> 
> While it is fractionally faster, this is actually something we've
> strongly considered dropping.  The complexity of correct TLS
> addressing is non-trivial.

Do you mean that crafting the TLS addressing in assembly is
non-trivial? That constraint is specific to the instruction-pointer
based solution, which my approach sidesteps.

In my implementation, I don't need to perform the TLS addressing
in assembly. By using the nesting counter increment/decrement, along
with data protection and restart through the page fault handler, I
can do the getcpu TLS read directly in C; no special assembly
crafting is needed.
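
Concretely, per-thread registration and the CPU number read look
roughly like this (a minimal sketch against the ABI in the patch
below; the SIGUSR1 choice and the helper names are mine):

#include <signal.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

struct thread_percpu_user {
	int32_t nesting;
	int32_t signal_sent;
	int32_t signo;
	int32_t current_cpu;
};

static __thread struct thread_percpu_user tpu;

static void percpu_register_thread(void)
{
	tpu.signo = SIGUSR1;	/* restart signal; arbitrary choice */
	syscall(323, &tpu);	/* __NR_percpu (x86-64 number from the patch) */
}

static inline int32_t percpu_getcpu(void)
{
	/* Plain C load; the kernel refreshes this field on preemption. */
	return tpu.current_cpu;
}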

A 12 ns -> 0.3 ns improvement for getting the CPU number seems to be
more than a marginal speedup in this context.

> 
> >
> > This approach is inspired by Paul Turner and Andrew Hunter's work
> > on percpu atomics, which lets the kernel handle restart of critical
> > sections, ref.
> > http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
> >
> > What is done differently here compared to percpu atomics: we track
> > a single nesting counter per thread rather than many ranges of
> > instruction pointer values. We deliver a signal to user-space and
> > let the logic of restart be handled in user-space, thus moving
> > the complexity out of the kernel. The nesting counter approach
> > allows us to skip the complexity of interacting with signals that
> > would be otherwise needed with the percpu atomics approach, which
> > needs to know which instruction pointers are preempted, including
> > when preemption occurs on a signal handler nested over an instruction
> > pointer of interest.
> >
> > Advantages of this approach over percpu atomics:
> > - kernel code is relatively simple: complexity of restart sections
> >   is in user-space,
> 
> I think delivering a signal is actually much more complex than the
> address check here.

Signal delivery is indeed a complex mechanism, but it already
exists, so we're not inventing anything new here.


> > - easy to port to other architectures: just need to reserve a new
> >   system call,
> 
> Not having an ABI is a potential nice advantage here.

I'm not sure what you mean by "not having an ABI". In some sense,
both your approach and mine need to add some sort of ABI.

> 
> > - for threads which have registered a TLS structure, the fast-path
> >   at preemption is only a nesting counter check, along with the
> >   optional store of the current CPU number, rather than comparing
> >   instruction pointer with possibly many registered ranges,
> 
> I'd argue this is not actually an advantage:  The preemption is the
> _slow path_.  For a reasonably behaving application this is something
> that should be happening on O(ms) boundaries.  Whereas entry/exit of
> the critical section may be occurring much more frequently.  This is
> also coming at the cost of a slower fast path (nesting updates).

I agree that entry/exit of the critical section is indeed the fast
path we care most about here. However, since not all applications
will use it, other (badly designed) pre-existing applications may
rely on very fast scheduler execution due to frequent blocking; for
those, preemption is effectively a fast path too, and I would not
want to impact them too much.

Now looking more specifically at the critical section entry/exit:
since we're talking about a per-cpu critical section, doing anything
useful implies the following steps (a sketch in C follows the list):

- entering the critical section
- getting the current cpu number
- doing something on per-cpu data (e.g. grabbing a lock)
- exiting the critical section
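
Sketching those steps with the structure above (restart handling
through the signal handler and page protection is omitted, so this
is illustrative only):

#define barrier()	__asm__ __volatile__("" ::: "memory")

static void percpu_counter_inc(uint64_t *counters)
{
	tpu.nesting++;			/* enter: kernel signals us if preempted */
	barrier();			/* compiler barrier around the c.s. body */
	counters[tpu.current_cpu]++;	/* operate on per-cpu data */
	barrier();
	tpu.nesting--;			/* exit critical section */
}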

My benchmarks show that by implementing sched_getcpu() with a TLS
cache rather than a vdso call, we can save about 12 ns. Explicit
nesting counter increment/decrement around the critical section costs
much less than that. If the nesting counter approach lets us use the
getcpu TLS cache, it seems like a win overall.

> 
> It's also worth noting that the range comparison can be done in logN
> (binary search) or even constant time (bound max sequence size and
> mask out bits on address, or use a lookup table).
> 
> That said there are two very specific advantages:
> 1) CONFIG_PREEMPT greatly complicates correct inspection of the user address.
> 2) One region per address space is a difficult model for binaries with
> shared linkage.
> 
> [ Note:  Compatibility of (1) is considered the primary problem with
> our current patch-set.   I'll publish it as is, then try to follow up
> with some alternatives we've been considering to get around this. ]

Ah! That becomes interesting. One thing here is that the nesting
counter approach completely sidesteps the issues you can have with
CONFIG_PREEMPT. If I understand correctly, those issues would be:

- a page fault occurring within the critical section, which gets
  preempted (could this also be an issue on PREEMPT_NONE kernels,
  with swap?),
- a system call issued within the critical section, which gets
  preempted (although this can be controlled by not issuing system
  calls within the restart critical section),
- am I missing anything else? ...

> 
> >
> > Caveats of this approach compared to the percpu atomics:
> > - We need a signal number for this, so it cannot be done without
> >   designing the application accordingly,
> > - Handling restart in user-space is currently performed with page
> >   protection, for which we install a SIGSEGV signal handler. Again,
> >   this requires designing the application accordingly, especially
> >   if the application installs its own segmentation fault handler,
> > - It cannot be used for tracing of processes by injection of code
> >   into their address space, due to interactions with application
> >   signal handlers.
> >
> > The user-space proof of concept code implementing the restart section
> > can be found here: https://github.com/compudj/percpu-dev
> >
> > Benchmarking sched_getcpu() vs tls cache approach. Getting the
> > current CPU number:
> >
> > - With Linux vdso:            12.7 ns
> > - With TLS-cached cpu number:  0.3 ns
> >
> 
> Have you tried also a naked read against the segment limit?  This is
> more comparable to TLS.  However, it's a definite ABI challenge.

I'm not sure what a "naked read against the segment limit" is.
Do you mean bypassing the vdso function and directly reading the
value at a fixed offset from the vdso base?

Thanks a lot for the feedback!

Mathieu

> > We will use the TLS-cached cpu number for the following
> > benchmarks.
> >
> > On an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, comparison
> > with a baseline running very few load/stores (no locking,
> > no getcpu, assuming one thread per CPU with affinity),
> > against locking scheme based on "lock; cmpxchg", "cmpxchg"
> > (using restart signal), load-store (using restart signal).
> > This is performed with 32 threads on a 16-core, hyperthread
> > system:
> >
> >                  ns/loop      overhead (ns)
> > Baseline:          3.7           0.0
> > lock; cmpxchg:    22.0          18.3
> > cmpxchg:          11.1           7.4
> > load-store:        9.4           5.7
> >
> > Therefore, the load-store scheme has a speedup of 3.2x over the
> > "lock; cmpxchg" scheme if both are using the tls-cache for the
> > CPU number. If we use Linux sched_getcpu() for "lock; cmpxchg"
> > we reach a speedup of 5.4x for load-store+tls-cache vs
> > "lock; cmpxchg"+vdso-getcpu.
> >
> > I'm sending this out to trigger discussion, and hopefully to see
> > Paul and Andrew's patches being posted publicly at some point, so
> > we can compare our approaches.
> >
> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> > CC: Paul Turner <pjt@...gle.com>
> > CC: Andrew Hunter <ahh@...gle.com>
> > CC: Peter Zijlstra <peterz@...radead.org>
> > CC: Ingo Molnar <mingo@...hat.com>
> > CC: Ben Maurer <bmaurer@...com>
> > CC: Steven Rostedt <rostedt@...dmis.org>
> > CC: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
> > CC: Josh Triplett <josh@...htriplett.org>
> > CC: Lai Jiangshan <laijs@...fujitsu.com>
> > CC: Linus Torvalds <torvalds@...ux-foundation.org>
> > CC: Andrew Morton <akpm@...ux-foundation.org>
> > ---
> >  arch/x86/syscalls/syscall_64.tbl  |   1 +
> >  fs/exec.c                         |   1 +
> >  include/linux/sched.h             |  18 ++++++
> >  include/uapi/asm-generic/unistd.h |   4 +-
> >  init/Kconfig                      |  10 +++
> >  kernel/Makefile                   |   1 +
> >  kernel/fork.c                     |   2 +
> >  kernel/percpu-user.c              | 126 ++++++++++++++++++++++++++++++++++++++
> >  kernel/sys_ni.c                   |   3 +
> >  9 files changed, 165 insertions(+), 1 deletion(-)
> >  create mode 100644 kernel/percpu-user.c
> >
> > diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> > index 8d656fb..0499703 100644
> > --- a/arch/x86/syscalls/syscall_64.tbl
> > +++ b/arch/x86/syscalls/syscall_64.tbl
> > @@ -329,6 +329,7 @@
> >  320    common  kexec_file_load         sys_kexec_file_load
> >  321    common  bpf                     sys_bpf
> >  322    64      execveat                stub_execveat
> > +323    common  percpu                  sys_percpu
> >
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> > diff --git a/fs/exec.c b/fs/exec.c
> > index c7f9b73..0a2f0b2 100644
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -1555,6 +1555,7 @@ static int do_execveat_common(int fd, struct filename *filename,
> >         /* execve succeeded */
> >         current->fs->in_exec = 0;
> >         current->in_execve = 0;
> > +       percpu_user_execve(current);
> >         acct_update_integrals(current);
> >         task_numa_free(current);
> >         free_bprm(bprm);
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index a419b65..9c88bff 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1275,6 +1275,8 @@ enum perf_event_task_context {
> >         perf_nr_task_contexts,
> >  };
> >
> > +struct thread_percpu_user;
> > +
> >  struct task_struct {
> >         volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */
> >         void *stack;
> > @@ -1710,6 +1712,10 @@ struct task_struct {
> >  #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> >         unsigned long   task_state_change;
> >  #endif
> > +#ifdef CONFIG_PERCPU_USER
> > +       struct preempt_notifier percpu_user_notifier;
> > +       struct thread_percpu_user __user *percpu_user;
> > +#endif
> >  };
> >
> >  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> > @@ -3090,4 +3096,16 @@ static inline unsigned long rlimit_max(unsigned int limit)
> >         return task_rlimit_max(current, limit);
> >  }
> >
> > +#ifdef CONFIG_PERCPU_USER
> > +void percpu_user_fork(struct task_struct *t);
> > +void percpu_user_execve(struct task_struct *t);
> > +#else
> > +static inline void percpu_user_fork(struct task_struct *t)
> > +{
> > +}
> > +static inline void percpu_user_execve(struct task_struct *t)
> > +{
> > +}
> > +#endif
> > +
> >  #endif
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index e016bd9..f4350d9 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
> >  __SYSCALL(__NR_bpf, sys_bpf)
> >  #define __NR_execveat 281
> >  __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
> > +#define __NR_percpu 282
> > +__SYSCALL(__NR_percpu, sys_percpu)
> >
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 282
> > +#define __NR_syscalls 283
> >
> >  /*
> >   * All syscalls below here should go away really,
> > diff --git a/init/Kconfig b/init/Kconfig
> > index f5dbc6d..73c4070 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -1559,6 +1559,16 @@ config PCI_QUIRKS
> >           bugs/quirks. Disable this only if your target machine is
> >           unaffected by PCI quirks.
> >
> > +config PERCPU_USER
> > +       bool "Enable percpu() system call" if EXPERT
> > +       default y
> > +       select PREEMPT_NOTIFIERS
> > +       help
> > +         Enable the percpu() system call which provides a building block
> > +         for fast per-cpu critical sections in user-space.
> > +
> > +         If unsure, say Y.
> > +
> >  config EMBEDDED
> >         bool "Embedded system"
> >         option allnoconfig_y
> > diff --git a/kernel/Makefile b/kernel/Makefile
> > index 1408b33..76919a6 100644
> > --- a/kernel/Makefile
> > +++ b/kernel/Makefile
> > @@ -96,6 +96,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
> >  obj-$(CONFIG_JUMP_LABEL) += jump_label.o
> >  obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
> >  obj-$(CONFIG_TORTURE_TEST) += torture.o
> > +obj-$(CONFIG_PERCPU_USER) += percpu-user.o
> >
> >  $(obj)/configs.o: $(obj)/config_data.h
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index cf65139..63aaf5a 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1549,6 +1549,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> >         cgroup_post_fork(p);
> >         if (clone_flags & CLONE_THREAD)
> >                 threadgroup_change_end(current);
> > +       if (!(clone_flags & CLONE_THREAD))
> > +               percpu_user_fork(p);
> >         perf_event_fork(p);
> >
> >         trace_task_newtask(p, clone_flags);
> > diff --git a/kernel/percpu-user.c b/kernel/percpu-user.c
> > new file mode 100644
> > index 0000000..be3d439
> > --- /dev/null
> > +++ b/kernel/percpu-user.c
> > @@ -0,0 +1,126 @@
> > +/*
> > + * Copyright (C) 2015 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> > + *
> > + * percpu system call
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + */
> > +
> > +#include <linux/preempt.h>
> > +#include <linux/init.h>
> > +#include <linux/sched.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/syscalls.h>
> > +
> > +struct thread_percpu_user {
> > +       int32_t nesting;
> > +       int32_t signal_sent;
> > +       int32_t signo;
> > +       int32_t current_cpu;
> > +};
> > +
> > +static void percpu_user_sched_in(struct preempt_notifier *notifier, int cpu)
> > +{
> > +       struct thread_percpu_user __user *tpu_user;
> > +       struct thread_percpu_user tpu;
> > +       struct task_struct *t = current;
> > +
> > +       tpu_user = t->percpu_user;
> > +       if (tpu_user == NULL)
> > +               return;
> > +       if (unlikely(t->flags & PF_EXITING))
> > +               return;
> > +       /*
> > +        * access_ok() of tpu_user has already been checked by sys_percpu().
> > +        */
> > +       if (__put_user(smp_processor_id(), &tpu_user->current_cpu)) {
> > +               WARN_ON_ONCE(1);
> > +               return;
> > +       }
> > +       if (__copy_from_user(&tpu, tpu_user, sizeof(tpu))) {
> > +               WARN_ON_ONCE(1);
> > +               return;
> > +       }
> > +       if (!tpu.nesting || tpu.signal_sent)
> > +               return;
> > +       if (do_send_sig_info(tpu.signo, SEND_SIG_PRIV, t, 0)) {
> > +               WARN_ON_ONCE(1);
> > +               return;
> > +       }
> > +       tpu.signal_sent = 1;
> > +       if (__copy_to_user(tpu_user, &tpu, sizeof(tpu))) {
> > +               WARN_ON_ONCE(1);
> > +               return;
> > +       }
> > +}
> > +
> > +static void percpu_user_sched_out(struct preempt_notifier *notifier,
> > +               struct task_struct *next)
> > +{
> > +}
> > +
> > +static struct preempt_ops percpu_user_ops = {
> > +       .sched_in = percpu_user_sched_in,
> > +       .sched_out = percpu_user_sched_out,
> > +};
> > +
> > +/*
> > + * If parent had a percpu-user preempt notifier, we need to setup our own.
> > + */
> > +void percpu_user_fork(struct task_struct *t)
> > +{
> > +       struct task_struct *parent = current;
> > +
> > +       if (!parent->percpu_user)
> > +               return;
> > +       preempt_notifier_init(&t->percpu_user_notifier, &percpu_user_ops);
> > +       preempt_notifier_register(&t->percpu_user_notifier);
> > +       t->percpu_user = parent->percpu_user;
> > +}
> > +
> > +void percpu_user_execve(struct task_struct *t)
> > +{
> > +       if (!t->percpu_user)
> > +               return;
> > +       preempt_notifier_unregister(&t->percpu_user_notifier);
> > +       t->percpu_user = NULL;
> > +}
> > +
> > +/*
> > + * sys_percpu - setup user-space per-cpu critical section for caller thread
> > + */
> > +SYSCALL_DEFINE1(percpu, struct thread_percpu_user __user *, tpu)
> > +{
> > +       struct task_struct *t = current;
> > +
> > +       if (tpu == NULL) {
> > +               if (t->percpu_user)
> > +                       preempt_notifier_unregister(&t->percpu_user_notifier);
> > +               goto set_tpu;
> > +       }
> > +       if (!access_ok(VERIFY_WRITE, tpu, sizeof(struct thread_percpu_user)))
> > +               return -EFAULT;
> > +       preempt_disable();
> > +       if (__put_user(smp_processor_id(), &tpu->current_cpu)) {
> > +               WARN_ON_ONCE(1);
> > +               preempt_enable();
> > +               return -EFAULT;
> > +       }
> > +       preempt_enable();
> > +       if (!current->percpu_user) {
> > +               preempt_notifier_init(&t->percpu_user_notifier,
> > +                               &percpu_user_ops);
> > +               preempt_notifier_register(&t->percpu_user_notifier);
> > +       }
> > +set_tpu:
> > +       current->percpu_user = tpu;
> > +       return 0;
> > +}
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 5adcb0a..16e2bc8 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -229,3 +229,6 @@ cond_syscall(sys_bpf);
> >
> >  /* execveat */
> >  cond_syscall(sys_execveat);
> > +
> > +/* percpu userspace critical sections */
> > +cond_syscall(sys_percpu);
> > --
> > 2.1.4
> >
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com