linux-kernel - Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu

Message-ID: <CAKgNAkjrh_OMi+7EUJxqM0-84WUxL0d_vse4neOL93EB-sGKXw@mail.gmail.com>
Date:   Wed, 15 Nov 2017 08:44:56 +0100
From:   "Michael Kerrisk (man-pages)" <mtk.manpages@...il.com>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        "Paul E . McKenney" <paulmck@...ux.vnet.ibm.com>,
        Boqun Feng <boqun.feng@...il.com>,
        Andy Lutomirski <luto@...capital.net>,
        Dave Watson <davejwatson@...com>,
        lkml <linux-kernel@...r.kernel.org>,
        Linux API <linux-api@...r.kernel.org>,
        Paul Turner <pjt@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Russell King <linux@....linux.org.uk>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>,
        "H . Peter Anvin" <hpa@...or.com>, Andrew Hunter <ahh@...gle.com>,
        Andi Kleen <andi@...stfloor.org>, Chris Lameter <cl@...ux.com>,
        Ben Maurer <bmaurer@...com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Josh Triplett <josh@...htriplett.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will.deacon@....com>
Subject: Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

Hi Mathieu,

On 14 November 2017 at 21:03, Mathieu Desnoyers
<mathieu.desnoyers@...icios.com> wrote:
> This new cpu_opv system call executes a vector of operations on behalf
> of user-space on a specific CPU with preemption disabled. It is inspired
> by the readv() and writev() system calls, which take a "struct iovec" array
> as argument.

Do you have a man page for this syscall already?

Thanks,

Michael


> The operations available are: comparison, memcpy, add, or, and, xor,
> left shift, right shift, and mb. The system call receives a CPU number
> from user-space as argument, which is the CPU on which those operations
> need to be performed. All preparation steps such as loading pointers,
> and applying offsets to arrays, need to be performed by user-space
> before invoking the system call. The "comparison" operation can be used
> to check that the data used in the preparation step did not change
> between preparation of system call inputs and operation execution within
> the preempt-off critical section.
>
> The reason why we require all pointer offsets to be calculated by
> user-space beforehand is because we need to use get_user_pages_fast() to
> first pin all pages touched by each operation. This takes care of
> faulting-in the pages. Then, preemption is disabled, and the operations
> are performed atomically with respect to other thread execution on that
> CPU, without generating any page fault.
>
> A maximum limit of 16 operations per cpu_opv syscall invocation is
> enforced, so user-space cannot generate an overly long preempt-off
> critical section. Each operation is also limited to a length of
> PAGE_SIZE bytes, meaning that an operation can touch a maximum of 4
> pages (memcpy: 2 pages for source, 2 pages for destination if addresses
> are not aligned on page boundaries). Moreover, a total limit of 4216
> bytes is applied to the sum of operation lengths.
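
To make the limits above concrete, here is a minimal user-space sketch of
the same arithmetic. It assumes 4 KiB pages, and the SKETCH_* names and
both helper functions are hypothetical (they are not part of the patch).
The page-count formula mirrors cpu_op_range_nr_pages() from the patch; the
length check only looks at data lengths and ignores operation types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT   12                  /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE    (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_VEC_LEN_MAX  16                  /* max operations per call */
#define SKETCH_VEC_DATA_MAX (4096 + 15 * 8)     /* 4216-byte length sum */

/* Number of pages spanned by [addr, addr + len), for len > 0. */
static unsigned long range_nr_pages(unsigned long addr, unsigned long len)
{
	assert(len > 0);
	return ((addr + len - 1) >> SKETCH_PAGE_SHIFT)
		- (addr >> SKETCH_PAGE_SHIFT) + 1;
}

/* Return 1 if the operation lengths respect the limits, 0 otherwise. */
static int vec_lengths_ok(const uint32_t *lens, size_t cnt)
{
	uint32_t sum = 0;
	size_t i;

	if (cnt > SKETCH_VEC_LEN_MAX)
		return 0;
	for (i = 0; i < cnt; i++) {
		if (lens[i] > SKETCH_PAGE_SIZE)
			return 0;	/* a single op is capped at one page */
		sum += lens[i];
	}
	return sum <= SKETCH_VEC_DATA_MAX;
}
```

A page-sized range that starts one byte past a page boundary spans two
pages, which is where the "2 pages for source, 2 pages for destination"
worst case for memcpy comes from.
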
>
> If the thread is not running on the requested CPU, a new
> push_task_to_cpu() is invoked to migrate the task to the requested CPU.
> If the requested CPU is not part of the cpus allowed mask of the thread,
> the system call fails with EINVAL. After the migration has been
> performed, preemption is disabled, and the current CPU number is checked
> again and compared to the requested CPU number. If it still differs, it
> means the scheduler migrated us away from that CPU. Return EAGAIN to
> user-space in that case, and let user-space retry (either requesting the
> same CPU number, or a different one, depending on the user-space
> algorithm constraints).
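
The retry behaviour described above could be handled in user-space roughly
as follows. This is only a sketch: the wrapper type cpu_opv_fn and its
argument list (ops, count, cpu, flags) are assumptions, since no libc
wrapper exists for this RFC, and fake_cpu_opv() is a stand-in used purely
to exercise the loop.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical wrapper type: returns 0 on success, or -1 with errno set.
 * The argument list is an assumption, not taken from a man page.
 */
typedef int (*cpu_opv_fn)(void *ops, int cnt, int cpu, int flags);

/*
 * Retry while the call reports EAGAIN, up to max_tries attempts.
 * Returns 0 on success, -1 on failure with errno left set.
 */
static int cpu_opv_retry(cpu_opv_fn fn, void *ops, int cnt, int cpu,
			 int max_tries)
{
	int i;

	for (i = 0; i < max_tries; i++) {
		if (fn(ops, cnt, cpu, 0) == 0)
			return 0;
		if (errno != EAGAIN)
			return -1;	/* e.g. EINVAL or EFAULT: do not retry */
	}
	errno = EAGAIN;
	return -1;
}

/* Stand-in for the real call, demonstration only: EAGAIN twice, then OK. */
static int fake_calls;

static int fake_cpu_opv(void *ops, int cnt, int cpu, int flags)
{
	(void)ops; (void)cnt; (void)cpu; (void)flags;
	if (++fake_calls <= 2) {
		errno = EAGAIN;
		return -1;
	}
	return 0;
}
```

Whether to retry on the same CPU or pick another one is left to the caller
here; the sketch always retries with the same CPU number.
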
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> CC: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
> CC: Peter Zijlstra <peterz@...radead.org>
> CC: Paul Turner <pjt@...gle.com>
> CC: Thomas Gleixner <tglx@...utronix.de>
> CC: Andrew Hunter <ahh@...gle.com>
> CC: Andy Lutomirski <luto@...capital.net>
> CC: Andi Kleen <andi@...stfloor.org>
> CC: Dave Watson <davejwatson@...com>
> CC: Chris Lameter <cl@...ux.com>
> CC: Ingo Molnar <mingo@...hat.com>
> CC: "H. Peter Anvin" <hpa@...or.com>
> CC: Ben Maurer <bmaurer@...com>
> CC: Steven Rostedt <rostedt@...dmis.org>
> CC: Josh Triplett <josh@...htriplett.org>
> CC: Linus Torvalds <torvalds@...ux-foundation.org>
> CC: Andrew Morton <akpm@...ux-foundation.org>
> CC: Russell King <linux@....linux.org.uk>
> CC: Catalin Marinas <catalin.marinas@....com>
> CC: Will Deacon <will.deacon@....com>
> CC: Michael Kerrisk <mtk.manpages@...il.com>
> CC: Boqun Feng <boqun.feng@...il.com>
> CC: linux-api@...r.kernel.org
> ---
>
> Changes since v1:
> - handle CPU hotplug,
> - cleanup implementation using function pointers: We can use function
>   pointers to implement the operations rather than duplicating all the
>   user-access code.
> - refuse device pages: Performing cpu_opv operations on io map'd pages
>   with preemption disabled could generate long preempt-off critical
>   sections, which leads to unwanted scheduler latency. Return EFAULT if
>   a device page is received as parameter
> - restrict op vector to 4216 bytes length sum: Restrict the operation
>   vector to length sum of:
>   - 4096 bytes (typical page size on most architectures, should be
>     enough for a string, or structures)
>   - 15 * 8 bytes (typical operations on integers or pointers).
>   The goal here is to keep the duration of preempt off critical section
>   short, so we don't add significant scheduler latency.
> - Add INIT_ONSTACK macro: Introduce the
>   CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users
>   correctly initialize the upper bits of CPU_OP_FIELD_u32_u64() on their
>   stack to 0 on 32-bit architectures.
> - Add CPU_MB_OP operation:
>   Use-cases with:
>   - two consecutive stores,
>   - a memcpy followed by a store,
>   require a memory barrier before the final store operation. A typical
>   use-case is a store-release on the final store. Given that this is a
>   slow path, just providing an explicit full barrier instruction should
>   be sufficient.
> - Add expect fault field:
>   The use-case of list_pop brings interesting challenges. With rseq, we
>   can use rseq_cmpnev_storeoffp_load(), and therefore load a pointer,
>   compare it against NULL, add an offset, and load the target "next"
>   pointer from the object, all within a single rseq critical section.
>
>   Life is not so easy for cpu_opv in this use-case, mainly because we
>   need to pin all pages we are going to touch in the preempt-off
>   critical section beforehand. So we need to know the target object (in
>   which we apply an offset to fetch the next pointer) when we pin pages
>   before disabling preemption.
>
>   So the approach is to load the head pointer and compare it against
>   NULL in user-space, before doing the cpu_opv syscall. User-space can
>   then compute the address of the head->next field, *without loading it*.
>
>   The cpu_opv system call will first need to pin all pages associated
>   with input data. This includes the page backing the head->next object,
>   which may have been concurrently deallocated and unmapped. Therefore,
>   in this case, getting -EFAULT when trying to pin those pages may
>   happen: it just means they have been concurrently unmapped. This is
>   an expected situation, and should just return -EAGAIN to user-space,
>   so user-space can distinguish between "should retry" type of
>   situations and actual errors that should be handled with extreme
>   prejudice to the program (e.g. abort()).
>
>   Therefore, add "expect_fault" fields along with op input address
>   pointers, so user-space can identify whether a fault when getting a
>   field should return EAGAIN rather than EFAULT.
> - Add compiler barrier between operations: Adding a compiler barrier
>   between store operations in a cpu_opv sequence can be useful when
>   paired with membarrier system call.
>
>   An algorithm with a paired slow path and fast path can use
>   sys_membarrier on the slow path to replace fast-path memory barriers
>   by compiler barrier.
>
>   Adding an explicit compiler barrier between operations allows
>   cpu_opv to be used as fallback for operations meant to match
>   the membarrier system call.
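
For the list_pop case in the "expect fault" item above, the user-space
preparation step might look like the sketch below. The names struct node,
struct pop_plan and plan_list_pop() are hypothetical; the plan structure
is a simplified stand-in for the compare/memcpy operations that would
actually be passed to cpu_opv. The point it illustrates is that the
address of head->next is computed without dereferencing head.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
	struct node *next;
	int value;
};

/* Simplified stand-in for the operations handed to the system call. */
struct pop_plan {
	uintptr_t expect_head;	/* value the list head is expected to hold */
	uintptr_t next_addr;	/* address of head->next, never loaded here */
	int expect_fault_next;	/* a fault on next_addr means "retry" */
};

/*
 * Load the head, compare it against NULL, and compute the address of
 * head->next without reading it. Returns 1 if a plan was produced,
 * 0 if the list was empty.
 */
static int plan_list_pop(struct node *const *headp, struct pop_plan *plan)
{
	struct node *head = *headp;

	if (!head)
		return 0;
	plan->expect_head = (uintptr_t)head;
	plan->next_addr = (uintptr_t)head + offsetof(struct node, next);
	plan->expect_fault_next = 1;
	return 1;
}
```

Because the object at head may be freed and unmapped between this step and
the system call, the plan marks the next pointer as expected to fault, so
that a failed pin is reported as EAGAIN rather than EFAULT.
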
>
> Changes since v2:
>
> - Fix memory leak by introducing struct cpu_opv_pinned_pages.
>   Suggested by Boqun Feng.
> - Cast argument 1 passed to access_ok from integer to void __user *,
>   fixing sparse warning.
> ---
>  MAINTAINERS                  |   7 +
>  include/uapi/linux/cpu_opv.h | 117 ++++++
>  init/Kconfig                 |  14 +
>  kernel/Makefile              |   1 +
>  kernel/cpu_opv.c             | 968 +++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/core.c          |  37 ++
>  kernel/sched/sched.h         |   2 +
>  kernel/sys_ni.c              |   1 +
>  8 files changed, 1147 insertions(+)
>  create mode 100644 include/uapi/linux/cpu_opv.h
>  create mode 100644 kernel/cpu_opv.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c9f95f8b07ed..45a1bbdaa287 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3675,6 +3675,13 @@ B:       https://bugzilla.kernel.org
>  F:     drivers/cpuidle/*
>  F:     include/linux/cpuidle.h
>
> +CPU NON-PREEMPTIBLE OPERATION VECTOR SUPPORT
> +M:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> +L:     linux-kernel@...r.kernel.org
> +S:     Supported
> +F:     kernel/cpu_opv.c
> +F:     include/uapi/linux/cpu_opv.h
> +
>  CRAMFS FILESYSTEM
>  W:     http://sourceforge.net/projects/cramfs/
>  S:     Orphan / Obsolete
> diff --git a/include/uapi/linux/cpu_opv.h b/include/uapi/linux/cpu_opv.h
> new file mode 100644
> index 000000000000..17f7d46e053b
> --- /dev/null
> +++ b/include/uapi/linux/cpu_opv.h
> @@ -0,0 +1,117 @@
> +#ifndef _UAPI_LINUX_CPU_OPV_H
> +#define _UAPI_LINUX_CPU_OPV_H
> +
> +/*
> + * linux/cpu_opv.h
> + *
> + * CPU preempt-off operation vector system call API
> + *
> + * Copyright (c) 2017 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#ifdef __KERNEL__
> +# include <linux/types.h>
> +#else  /* #ifdef __KERNEL__ */
> +# include <stdint.h>
> +#endif /* #else #ifdef __KERNEL__ */
> +
> +#include <asm/byteorder.h>
> +
> +#ifdef __LP64__
> +# define CPU_OP_FIELD_u32_u64(field)                   uint64_t field
> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)   field = (intptr_t)v
> +#elif defined(__BYTE_ORDER) ? \
> +       __BYTE_ORDER == __BIG_ENDIAN : defined(__BIG_ENDIAN)
> +# define CPU_OP_FIELD_u32_u64(field)   uint32_t field ## _padding, field
> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)   \
> +       field ## _padding = 0, field = (intptr_t)v
> +#else
> +# define CPU_OP_FIELD_u32_u64(field)   uint32_t field, field ## _padding
> +# define CPU_OP_FIELD_u32_u64_INIT_ONSTACK(field, v)   \
> +       field = (intptr_t)v, field ## _padding = 0
> +#endif
> +
> +#define CPU_OP_VEC_LEN_MAX             16
> +#define CPU_OP_ARG_LEN_MAX             24
> +/* Max. data len per operation. */
> +#define CPU_OP_DATA_LEN_MAX            PAGE_SIZE
> +/*
> + * Max. data len for overall vector. We want to restrict the amount of
> + * user-space data touched by the kernel in non-preemptible context so
> + * we do not introduce long scheduler latencies.
> + * This allows one copy of up to 4096 bytes, and 15 operations touching
> + * 8 bytes each.
> + * This limit is applied to the sum of length specified for all
> + * operations in a vector.
> + */
> +#define CPU_OP_VEC_DATA_LEN_MAX                (4096 + 15*8)
> +#define CPU_OP_MAX_PAGES               4       /* Max. pages per op. */
> +
> +enum cpu_op_type {
> +       CPU_COMPARE_EQ_OP,      /* compare */
> +       CPU_COMPARE_NE_OP,      /* compare */
> +       CPU_MEMCPY_OP,          /* memcpy */
> +       CPU_ADD_OP,             /* arithmetic */
> +       CPU_OR_OP,              /* bitwise */
> +       CPU_AND_OP,             /* bitwise */
> +       CPU_XOR_OP,             /* bitwise */
> +       CPU_LSHIFT_OP,          /* shift */
> +       CPU_RSHIFT_OP,          /* shift */
> +       CPU_MB_OP,              /* memory barrier */
> +};
> +
> +/* Vector of operations to perform. Limited to 16. */
> +struct cpu_op {
> +       int32_t op;     /* enum cpu_op_type. */
> +       uint32_t len;   /* data length, in bytes. */
> +       union {
> +               struct {
> +                       CPU_OP_FIELD_u32_u64(a);
> +                       CPU_OP_FIELD_u32_u64(b);
> +                       uint8_t expect_fault_a;
> +                       uint8_t expect_fault_b;
> +               } compare_op;
> +               struct {
> +                       CPU_OP_FIELD_u32_u64(dst);
> +                       CPU_OP_FIELD_u32_u64(src);
> +                       uint8_t expect_fault_dst;
> +                       uint8_t expect_fault_src;
> +               } memcpy_op;
> +               struct {
> +                       CPU_OP_FIELD_u32_u64(p);
> +                       int64_t count;
> +                       uint8_t expect_fault_p;
> +               } arithmetic_op;
> +               struct {
> +                       CPU_OP_FIELD_u32_u64(p);
> +                       uint64_t mask;
> +                       uint8_t expect_fault_p;
> +               } bitwise_op;
> +               struct {
> +                       CPU_OP_FIELD_u32_u64(p);
> +                       uint32_t bits;
> +                       uint8_t expect_fault_p;
> +               } shift_op;
> +               char __padding[CPU_OP_ARG_LEN_MAX];
> +       } u;
> +};
> +
> +#endif /* _UAPI_LINUX_CPU_OPV_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index cbedfb91b40a..e4fbb5dd6a24 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1404,6 +1404,7 @@ config RSEQ
>         bool "Enable rseq() system call" if EXPERT
>         default y
>         depends on HAVE_RSEQ
> +       select CPU_OPV
>         select MEMBARRIER
>         help
>           Enable the restartable sequences system call. It provides a
> @@ -1414,6 +1415,19 @@ config RSEQ
>
>           If unsure, say Y.
>
> +config CPU_OPV
> +       bool "Enable cpu_opv() system call" if EXPERT
> +       default y
> +       help
> +         Enable the CPU preempt-off operation vector system call.
> +         It allows user-space to perform a sequence of operations on
> +         per-cpu data with preemption disabled. Useful as
> +         single-stepping fall-back for restartable sequences, and for
> +         performing more complex operations on per-cpu data that would
> +         not be otherwise possible to do with restartable sequences.
> +
> +         If unsure, say Y.
> +
>  config EMBEDDED
>         bool "Embedded system"
>         option allnoconfig_y
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 3574669dafd9..cac8855196ff 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -113,6 +113,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
>
>  obj-$(CONFIG_HAS_IOMEM) += memremap.o
>  obj-$(CONFIG_RSEQ) += rseq.o
> +obj-$(CONFIG_CPU_OPV) += cpu_opv.o
>
>  $(obj)/configs.o: $(obj)/config_data.h
>
> diff --git a/kernel/cpu_opv.c b/kernel/cpu_opv.c
> new file mode 100644
> index 000000000000..a81837a14b17
> --- /dev/null
> +++ b/kernel/cpu_opv.c
> @@ -0,0 +1,968 @@
> +/*
> + * CPU preempt-off operation vector system call
> + *
> + * It allows user-space to perform a sequence of operations on per-cpu
> + * data with preemption disabled. Useful as single-stepping fall-back
> + * for restartable sequences, and for performing more complex operations
> + * on per-cpu data that would not be otherwise possible to do with
> + * restartable sequences.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * Copyright (C) 2017, EfficiOS Inc.,
> + * Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/uaccess.h>
> +#include <linux/syscalls.h>
> +#include <linux/cpu_opv.h>
> +#include <linux/types.h>
> +#include <linux/mutex.h>
> +#include <linux/pagemap.h>
> +#include <asm/ptrace.h>
> +#include <asm/byteorder.h>
> +
> +#include "sched/sched.h"
> +
> +#define TMP_BUFLEN                     64
> +#define NR_PINNED_PAGES_ON_STACK       8
> +
> +union op_fn_data {
> +       uint8_t _u8;
> +       uint16_t _u16;
> +       uint32_t _u32;
> +       uint64_t _u64;
> +#if (BITS_PER_LONG < 64)
> +       uint32_t _u64_split[2];
> +#endif
> +};
> +
> +struct cpu_opv_pinned_pages {
> +       struct page **pages;
> +       size_t nr;
> +       bool is_kmalloc;
> +};
> +
> +typedef int (*op_fn_t)(union op_fn_data *data, uint64_t v, uint32_t len);
> +
> +static DEFINE_MUTEX(cpu_opv_offline_lock);
> +
> +/*
> + * The cpu_opv system call executes a vector of operations on behalf of
> + * user-space on a specific CPU with preemption disabled. It is inspired
> + * from readv() and writev() system calls which take a "struct iovec"
> + * array as argument.
> + *
> + * The operations available are: comparison, memcpy, add, or, and, xor,
> + * left shift, right shift, and mb. The system call receives a CPU number
> + * from user-space as argument, which is the CPU on which those
> + * operations need to be performed. All preparation steps such as
> + * loading pointers, and applying offsets to arrays, need to be
> + * performed by user-space before invoking the system call. The
> + * "comparison" operation can be used to check that the data used in the
> + * preparation step did not change between preparation of system call
> + * inputs and operation execution within the preempt-off critical
> + * section.
> + *
> + * The reason why we require all pointer offsets to be calculated by
> + * user-space beforehand is because we need to use get_user_pages_fast()
> + * to first pin all pages touched by each operation. This takes care of
> + * faulting-in the pages. Then, preemption is disabled, and the
> + * operations are performed atomically with respect to other thread
> + * execution on that CPU, without generating any page fault.
> + *
> + * A maximum limit of 16 operations per cpu_opv syscall invocation is
> + * enforced, and an overall maximum length sum, so user-space cannot
> + * generate a too long preempt-off critical section. Each operation is
> + * also limited to a length of PAGE_SIZE bytes, meaning that an operation
> + * can touch a maximum of 4 pages (memcpy: 2 pages for source, 2 pages
> + * for destination if addresses are not aligned on page boundaries).
> + *
> + * If the thread is not running on the requested CPU, a new
> + * push_task_to_cpu() is invoked to migrate the task to the requested
> + * CPU.  If the requested CPU is not part of the cpus allowed mask of
> + * the thread, the system call fails with EINVAL. After the migration
> + * has been performed, preemption is disabled, and the current CPU
> + * number is checked again and compared to the requested CPU number. If
> + * it still differs, it means the scheduler migrated us away from that
> + * CPU. Return EAGAIN to user-space in that case, and let user-space
> + * retry (either requesting the same CPU number, or a different one,
> + * depending on the user-space algorithm constraints).
> + */
> +
> +/*
> + * Check operation types and length parameters.
> + */
> +static int cpu_opv_check(struct cpu_op *cpuop, int cpuopcnt)
> +{
> +       int i;
> +       uint32_t sum = 0;
> +
> +       for (i = 0; i < cpuopcnt; i++) {
> +               struct cpu_op *op = &cpuop[i];
> +
> +               switch (op->op) {
> +               case CPU_MB_OP:
> +                       break;
> +               default:
> +                       sum += op->len;
> +               }
> +               switch (op->op) {
> +               case CPU_COMPARE_EQ_OP:
> +               case CPU_COMPARE_NE_OP:
> +               case CPU_MEMCPY_OP:
> +                       if (op->len > CPU_OP_DATA_LEN_MAX)
> +                               return -EINVAL;
> +                       break;
> +               case CPU_ADD_OP:
> +               case CPU_OR_OP:
> +               case CPU_AND_OP:
> +               case CPU_XOR_OP:
> +                       switch (op->len) {
> +                       case 1:
> +                       case 2:
> +                       case 4:
> +                       case 8:
> +                               break;
> +                       default:
> +                               return -EINVAL;
> +                       }
> +                       break;
> +               case CPU_LSHIFT_OP:
> +               case CPU_RSHIFT_OP:
> +                       switch (op->len) {
> +                       case 1:
> +                               if (op->u.shift_op.bits > 7)
> +                                       return -EINVAL;
> +                               break;
> +                       case 2:
> +                               if (op->u.shift_op.bits > 15)
> +                                       return -EINVAL;
> +                               break;
> +                       case 4:
> +                               if (op->u.shift_op.bits > 31)
> +                                       return -EINVAL;
> +                               break;
> +                       case 8:
> +                               if (op->u.shift_op.bits > 63)
> +                                       return -EINVAL;
> +                               break;
> +                       default:
> +                               return -EINVAL;
> +                       }
> +                       break;
> +               case CPU_MB_OP:
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +       if (sum > CPU_OP_VEC_DATA_LEN_MAX)
> +               return -EINVAL;
> +       return 0;
> +}
> +
> +static unsigned long cpu_op_range_nr_pages(unsigned long addr,
> +               unsigned long len)
> +{
> +       return ((addr + len - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1;
> +}
> +
> +static int cpu_op_check_page(struct page *page)
> +{
> +       struct address_space *mapping;
> +
> +       if (is_zone_device_page(page))
> +               return -EFAULT;
> +       page = compound_head(page);
> +       mapping = READ_ONCE(page->mapping);
> +       if (!mapping) {
> +               int shmem_swizzled;
> +
> +               /*
> +                * Check again with page lock held to guard against
> +                * memory pressure making shmem_writepage move the page
> +                * from filecache to swapcache.
> +                */
> +               lock_page(page);
> +               shmem_swizzled = PageSwapCache(page) || page->mapping;
> +               unlock_page(page);
> +               if (shmem_swizzled)
> +                       return -EAGAIN;
> +               return -EFAULT;
> +       }
> +       return 0;
> +}
> +
> +/*
> + * Refusing device pages, the zero page, pages in the gate area, and
> + * special mappings. Inspired from futex.c checks.
> + */
> +static int cpu_op_check_pages(struct page **pages,
> +               unsigned long nr_pages)
> +{
> +       unsigned long i;
> +
> +       for (i = 0; i < nr_pages; i++) {
> +               int ret;
> +
> +               ret = cpu_op_check_page(pages[i]);
> +               if (ret)
> +                       return ret;
> +       }
> +       return 0;
> +}
> +
> +static int cpu_op_pin_pages(unsigned long addr, unsigned long len,
> +               struct cpu_opv_pinned_pages *pin_pages, int write)
> +{
> +       struct page *pages[2];
> +       int ret, nr_pages;
> +
> +       if (!len)
> +               return 0;
> +       nr_pages = cpu_op_range_nr_pages(addr, len);
> +       BUG_ON(nr_pages > 2);
> +       if (!pin_pages->is_kmalloc && pin_pages->nr + nr_pages
> +                       > NR_PINNED_PAGES_ON_STACK) {
> +               struct page **pinned_pages =
> +                       kzalloc(CPU_OP_VEC_LEN_MAX * CPU_OP_MAX_PAGES
> +                               * sizeof(struct page *), GFP_KERNEL);
> +               if (!pinned_pages)
> +                       return -ENOMEM;
> +               memcpy(pinned_pages, pin_pages->pages,
> +                       pin_pages->nr * sizeof(struct page *));
> +               pin_pages->pages = pinned_pages;
> +               pin_pages->is_kmalloc = true;
> +       }
> +again:
> +       ret = get_user_pages_fast(addr, nr_pages, write, pages);
> +       if (ret < nr_pages) {
> +               if (ret > 0)
> +                       put_page(pages[0]);
> +               return -EFAULT;
> +       }
> +       /*
> +        * Refuse device pages, the zero page, pages in the gate area,
> +        * and special mappings.
> +        */
> +       ret = cpu_op_check_pages(pages, nr_pages);
> +       if (ret == -EAGAIN) {
> +               put_page(pages[0]);
> +               if (nr_pages > 1)
> +                       put_page(pages[1]);
> +               goto again;
> +       }
> +       if (ret)
> +               goto error;
> +       pin_pages->pages[pin_pages->nr++] = pages[0];
> +       if (nr_pages > 1)
> +               pin_pages->pages[pin_pages->nr++] = pages[1];
> +       return 0;
> +
> +error:
> +       put_page(pages[0]);
> +       if (nr_pages > 1)
> +               put_page(pages[1]);
> +       return -EFAULT;
> +}
> +
> +static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt,
> +               struct cpu_opv_pinned_pages *pin_pages)
> +{
> +       int ret, i;
> +       bool expect_fault = false;
> +
> +       /* Check access, pin pages. */
> +       for (i = 0; i < cpuopcnt; i++) {
> +               struct cpu_op *op = &cpuop[i];
> +
> +               switch (op->op) {
> +               case CPU_COMPARE_EQ_OP:
> +               case CPU_COMPARE_NE_OP:
> +                       ret = -EFAULT;
> +                       expect_fault = op->u.compare_op.expect_fault_a;
> +                       if (!access_ok(VERIFY_READ,
> +                                       (void __user *)op->u.compare_op.a,
> +                                       op->len))
> +                               goto error;
> +                       ret = cpu_op_pin_pages(
> +                                       (unsigned long)op->u.compare_op.a,
> +                                       op->len, pin_pages, 0);
> +                       if (ret)
> +                               goto error;
> +                       ret = -EFAULT;
> +                       expect_fault = op->u.compare_op.expect_fault_b;
> +                       if (!access_ok(VERIFY_READ,
> +                                       (void __user *)op->u.compare_op.b,
> +                                       op->len))
> +                               goto error;
> +                       ret = cpu_op_pin_pages(
> +                                       (unsigned long)op->u.compare_op.b,
> +                                       op->len, pin_pages, 0);
> +                       if (ret)
> +                               goto error;
> +                       break;
> +               case CPU_MEMCPY_OP:
> +                       ret = -EFAULT;
> +                       expect_fault = op->u.memcpy_op.expect_fault_dst;
> +                       if (!access_ok(VERIFY_WRITE,
> +                                       (void __user *)op->u.memcpy_op.dst,
> +                                       op->len))
> +                               goto error;
> +                       ret = cpu_op_pin_pages(
> +                                       (unsigned long)op->u.memcpy_op.dst,
> +                                       op->len, pin_pages, 1);
> +                       if (ret)
> +                               goto error;
> +                       ret = -EFAULT;
> +                       expect_fault = op->u.memcpy_op.expect_fault_src;
> +                       if (!access_ok(VERIFY_READ,
> +                                       (void __user *)op->u.memcpy_op.src,
> +                                       op->len))
> +                               goto error;
> +                       ret = cpu_op_pin_pages(
> +                                       (unsigned long)op->u.memcpy_op.src,
> +                                       op->len, pin_pages, 0);
> +                       if (ret)
> +                               goto error;
> +                       break;
> +               case CPU_ADD_OP:
> +                       ret = -EFAULT;
> +                       expect_fault = op->u.arithmetic_op.expect_fault_p;
> +                       if (!access_ok(VERIFY_WRITE,
> +                                       (void __user *)op->u.arithmetic_op.p,
> +                                       op->len))
> +                               goto error;
> +                       ret = cpu_op_pin_pages(
> +                                       (unsigned long)op->u.arithmetic_op.p,
> +                                       op->len, pin_pages, 1);
> +                       if (ret)
> +                               goto error;
> +                       break;
> +               case CPU_OR_OP:
> +               case CPU_AND_OP:
> +               case CPU_XOR_OP:
> +                       ret = -EFAULT;
> +                       expect_fault = op->u.bitwise_op.expect_fault_p;
> +                       if (!access_ok(VERIFY_WRITE,
> +                                       (void __user *)op->u.bitwise_op.p,
> +                                       op->len))
> +                               goto error;
> +                       ret = cpu_op_pin_pages(
> +                                       (unsigned long)op->u.bitwise_op.p,
> +                                       op->len, pin_pages, 1);
> +                       if (ret)
> +                               goto error;
> +                       break;
> +               case CPU_LSHIFT_OP:
> +               case CPU_RSHIFT_OP:
> +                       ret = -EFAULT;
> +                       expect_fault = op->u.shift_op.expect_fault_p;
> +                       if (!access_ok(VERIFY_WRITE,
> +                                       (void __user *)op->u.shift_op.p,
> +                                       op->len))
> +                               goto error;
> +                       ret = cpu_op_pin_pages(
> +                                       (unsigned long)op->u.shift_op.p,
> +                                       op->len, pin_pages, 1);
> +                       if (ret)
> +                               goto error;
> +                       break;
> +               case CPU_MB_OP:
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +       return 0;
> +
> +error:
> +       for (i = 0; i < pin_pages->nr; i++)
> +               put_page(pin_pages->pages[i]);
> +       pin_pages->nr = 0;
> +       /*
> +        * If a faulting access is expected, return EAGAIN to user-space.
> +        * This allows user-space to distinguish a fault caused by an
> +        * access which is expected to fault (e.g. due to concurrent
> +        * unmapping of underlying memory) from an unexpected fault from
> +        * which a retry would not recover.
> +        */
> +       if (ret == -EFAULT && expect_fault)
> +               return -EAGAIN;
> +       return ret;
> +}
> +
> +/* Return 0 if same, > 0 if different, < 0 on error. */
> +static int do_cpu_op_compare_iter(void __user *a, void __user *b, uint32_t len)
> +{
> +       char bufa[TMP_BUFLEN], bufb[TMP_BUFLEN];
> +       uint32_t compared = 0;
> +
> +       while (compared != len) {
> +               unsigned long to_compare;
> +
> +               to_compare = min_t(uint32_t, TMP_BUFLEN, len - compared);
> +               if (__copy_from_user_inatomic(bufa, a + compared, to_compare))
> +                       return -EFAULT;
> +               if (__copy_from_user_inatomic(bufb, b + compared, to_compare))
> +                       return -EFAULT;
> +               if (memcmp(bufa, bufb, to_compare))
> +                       return 1;       /* different */
> +               compared += to_compare;
> +       }
> +       return 0;       /* same */
> +}
> +
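
For readers following the bounded-buffer loop above, here is a minimal user-space sketch of the same chunking pattern. It is an illustration only: memcpy() stands in for __copy_from_user_inatomic(), so there is no fault path, the name compare_chunked() is hypothetical, and the TMP_BUFLEN value of 64 is an assumption because the definition is not part of the quoted hunk.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed value; the patch defines TMP_BUFLEN outside the quoted hunk. */
#define TMP_BUFLEN 64

/*
 * User-space model of the chunked comparison.
 * Return 0 if same, 1 if different. No fault handling is modelled.
 */
static int compare_chunked(const void *a, const void *b, uint32_t len)
{
	char bufa[TMP_BUFLEN], bufb[TMP_BUFLEN];
	uint32_t compared = 0;

	while (compared != len) {
		uint32_t to_compare = len - compared;

		/* Never copy more than the on-stack buffers can hold. */
		if (to_compare > TMP_BUFLEN)
			to_compare = TMP_BUFLEN;
		/* memcpy() stands in for __copy_from_user_inatomic(). */
		memcpy(bufa, (const char *)a + compared, to_compare);
		memcpy(bufb, (const char *)b + compared, to_compare);
		if (memcmp(bufa, bufb, to_compare))
			return 1;	/* different */
		compared += to_compare;
	}
	return 0;	/* same */
}
```

The loop stops at the first differing chunk, so the cost of a mismatch depends on where the first difference sits, and stack usage stays fixed at two TMP_BUFLEN buffers regardless of len.
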
> +/* Return 0 if same, > 0 if different, < 0 on error. */
> +static int do_cpu_op_compare(void __user *a, void __user *b, uint32_t len)
> +{
> +       int ret = -EFAULT;
> +       union {
> +               uint8_t _u8;
> +               uint16_t _u16;
> +               uint32_t _u32;
> +               uint64_t _u64;
> +#if (BITS_PER_LONG < 64)
> +               uint32_t _u64_split[2];
> +#endif
> +       } tmp[2];
> +
> +       pagefault_disable();
> +       switch (len) {
> +       case 1:
> +               if (__get_user(tmp[0]._u8, (uint8_t __user *)a))
> +                       goto end;
> +               if (__get_user(tmp[1]._u8, (uint8_t __user *)b))
> +                       goto end;
> +               ret = !!(tmp[0]._u8 != tmp[1]._u8);
> +               break;
> +       case 2:
> +               if (__get_user(tmp[0]._u16, (uint16_t __user *)a))
> +                       goto end;
> +               if (__get_user(tmp[1]._u16, (uint16_t __user *)b))
> +                       goto end;
> +               ret = !!(tmp[0]._u16 != tmp[1]._u16);
> +               break;
> +       case 4:
> +               if (__get_user(tmp[0]._u32, (uint32_t __user *)a))
> +                       goto end;
> +               if (__get_user(tmp[1]._u32, (uint32_t __user *)b))
> +                       goto end;
> +               ret = !!(tmp[0]._u32 != tmp[1]._u32);
> +               break;
> +       case 8:
> +#if (BITS_PER_LONG >= 64)
> +               if (__get_user(tmp[0]._u64, (uint64_t __user *)a))
> +                       goto end;
> +               if (__get_user(tmp[1]._u64, (uint64_t __user *)b))
> +                       goto end;
> +#else
> +               if (__get_user(tmp[0]._u64_split[0], (uint32_t __user *)a))
> +                       goto end;
> +               if (__get_user(tmp[0]._u64_split[1], (uint32_t __user *)a + 1))
> +                       goto end;
> +               if (__get_user(tmp[1]._u64_split[0], (uint32_t __user *)b))
> +                       goto end;
> +               if (__get_user(tmp[1]._u64_split[1], (uint32_t __user *)b + 1))
> +                       goto end;
> +#endif
> +               ret = !!(tmp[0]._u64 != tmp[1]._u64);
> +               break;
> +       default:
> +               pagefault_enable();
> +               return do_cpu_op_compare_iter(a, b, len);
> +       }
> +end:
> +       pagefault_enable();
> +       return ret;
> +}
> +
> +/* Return 0 on success, < 0 on error. */
> +static int do_cpu_op_memcpy_iter(void __user *dst, void __user *src,
> +               uint32_t len)
> +{
> +       char buf[TMP_BUFLEN];
> +       uint32_t copied = 0;
> +
> +       while (copied != len) {
> +               unsigned long to_copy;
> +
> +               to_copy = min_t(uint32_t, TMP_BUFLEN, len - copied);
> +               if (__copy_from_user_inatomic(buf, src + copied, to_copy))
> +                       return -EFAULT;
> +               if (__copy_to_user_inatomic(dst + copied, buf, to_copy))
> +                       return -EFAULT;
> +               copied += to_copy;
> +       }
> +       return 0;
> +}
> +
> +/* Return 0 on success, < 0 on error. */
> +static int do_cpu_op_memcpy(void __user *dst, void __user *src, uint32_t len)
> +{
> +       int ret = -EFAULT;
> +       union {
> +               uint8_t _u8;
> +               uint16_t _u16;
> +               uint32_t _u32;
> +               uint64_t _u64;
> +#if (BITS_PER_LONG < 64)
> +               uint32_t _u64_split[2];
> +#endif
> +       } tmp;
> +
> +       pagefault_disable();
> +       switch (len) {
> +       case 1:
> +               if (__get_user(tmp._u8, (uint8_t __user *)src))
> +                       goto end;
> +               if (__put_user(tmp._u8, (uint8_t __user *)dst))
> +                       goto end;
> +               break;
> +       case 2:
> +               if (__get_user(tmp._u16, (uint16_t __user *)src))
> +                       goto end;
> +               if (__put_user(tmp._u16, (uint16_t __user *)dst))
> +                       goto end;
> +               break;
> +       case 4:
> +               if (__get_user(tmp._u32, (uint32_t __user *)src))
> +                       goto end;
> +               if (__put_user(tmp._u32, (uint32_t __user *)dst))
> +                       goto end;
> +               break;
> +       case 8:
> +#if (BITS_PER_LONG >= 64)
> +               if (__get_user(tmp._u64, (uint64_t __user *)src))
> +                       goto end;
> +               if (__put_user(tmp._u64, (uint64_t __user *)dst))
> +                       goto end;
> +#else
> +               if (__get_user(tmp._u64_split[0], (uint32_t __user *)src))
> +                       goto end;
> +               if (__get_user(tmp._u64_split[1], (uint32_t __user *)src + 1))
> +                       goto end;
> +               if (__put_user(tmp._u64_split[0], (uint32_t __user *)dst))
> +                       goto end;
> +               if (__put_user(tmp._u64_split[1], (uint32_t __user *)dst + 1))
> +                       goto end;
> +#endif
> +               break;
> +       default:
> +               pagefault_enable();
> +               return do_cpu_op_memcpy_iter(dst, src, len);
> +       }
> +       ret = 0;
> +end:
> +       pagefault_enable();
> +       return ret;
> +}
> +
> +static int op_add_fn(union op_fn_data *data, uint64_t count, uint32_t len)
> +{
> +       int ret = 0;
> +
> +       switch (len) {
> +       case 1:
> +               data->_u8 += (uint8_t)count;
> +               break;
> +       case 2:
> +               data->_u16 += (uint16_t)count;
> +               break;
> +       case 4:
> +               data->_u32 += (uint32_t)count;
> +               break;
> +       case 8:
> +               data->_u64 += (uint64_t)count;
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               break;
> +       }
> +       return ret;
> +}
> +
> +static int op_or_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
> +{
> +       int ret = 0;
> +
> +       switch (len) {
> +       case 1:
> +               data->_u8 |= (uint8_t)mask;
> +               break;
> +       case 2:
> +               data->_u16 |= (uint16_t)mask;
> +               break;
> +       case 4:
> +               data->_u32 |= (uint32_t)mask;
> +               break;
> +       case 8:
> +               data->_u64 |= (uint64_t)mask;
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               break;
> +       }
> +       return ret;
> +}
> +
> +static int op_and_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
> +{
> +       int ret = 0;
> +
> +       switch (len) {
> +       case 1:
> +               data->_u8 &= (uint8_t)mask;
> +               break;
> +       case 2:
> +               data->_u16 &= (uint16_t)mask;
> +               break;
> +       case 4:
> +               data->_u32 &= (uint32_t)mask;
> +               break;
> +       case 8:
> +               data->_u64 &= (uint64_t)mask;
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               break;
> +       }
> +       return ret;
> +}
> +
> +static int op_xor_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
> +{
> +       int ret = 0;
> +
> +       switch (len) {
> +       case 1:
> +               data->_u8 ^= (uint8_t)mask;
> +               break;
> +       case 2:
> +               data->_u16 ^= (uint16_t)mask;
> +               break;
> +       case 4:
> +               data->_u32 ^= (uint32_t)mask;
> +               break;
> +       case 8:
> +               data->_u64 ^= (uint64_t)mask;
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               break;
> +       }
> +       return ret;
> +}
> +
> +static int op_lshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len)
> +{
> +       int ret = 0;
> +
> +       switch (len) {
> +       case 1:
> +               data->_u8 <<= (uint8_t)bits;
> +               break;
> +       case 2:
> +               data->_u16 <<= (uint16_t)bits;
> +               break;
> +       case 4:
> +               data->_u32 <<= (uint32_t)bits;
> +               break;
> +       case 8:
> +               data->_u64 <<= (uint64_t)bits;
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               break;
> +       }
> +       return ret;
> +}
> +
> +static int op_rshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len)
> +{
> +       int ret = 0;
> +
> +       switch (len) {
> +       case 1:
> +               data->_u8 >>= (uint8_t)bits;
> +               break;
> +       case 2:
> +               data->_u16 >>= (uint16_t)bits;
> +               break;
> +       case 4:
> +               data->_u32 >>= (uint32_t)bits;
> +               break;
> +       case 8:
> +               data->_u64 >>= (uint64_t)bits;
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               break;
> +       }
> +       return ret;
> +}
> +
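
The op_*_fn helpers above truncate the 64-bit argument to the operand width selected by len, so arithmetic wraps at that width. The following user-space sketch shows that behaviour, plus a conservative guard on the shift count. The names model_op_data, model_add() and model_lshift() are hypothetical, and the guard is an assumption of this sketch: the quoted hunk does not show whether the shift count is range-checked elsewhere in the patch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for union op_fn_data, which is defined elsewhere. */
union model_op_data {
	uint8_t _u8;
	uint16_t _u16;
	uint32_t _u32;
	uint64_t _u64;
};

/* Add count to the operand, truncated to len bytes (wraps at that width). */
static int model_add(union model_op_data *data, uint64_t count, uint32_t len)
{
	switch (len) {
	case 1: data->_u8 += (uint8_t)count; break;
	case 2: data->_u16 += (uint16_t)count; break;
	case 4: data->_u32 += (uint32_t)count; break;
	case 8: data->_u64 += count; break;
	default: return -EINVAL;
	}
	return 0;
}

/* Left shift with a conservative range check on the shift count. */
static int model_lshift(union model_op_data *data, uint64_t bits, uint32_t len)
{
	/*
	 * Assumption of this sketch: reject counts at or beyond the operand
	 * width, since over-wide shifts are undefined behaviour in C.
	 */
	if (len == 0 || len > 8 || bits >= 8 * (uint64_t)len)
		return -EINVAL;
	switch (len) {
	case 1: data->_u8 = (uint8_t)(data->_u8 << bits); break;
	case 2: data->_u16 = (uint16_t)(data->_u16 << bits); break;
	case 4: data->_u32 <<= bits; break;
	case 8: data->_u64 <<= bits; break;
	default: return -EINVAL;
	}
	return 0;
}
```

For example, adding 10 to a one-byte operand holding 250 leaves 4 in it, because the sum wraps modulo 256.
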
> +/* Return 0 on success, < 0 on error. */
> +static int do_cpu_op_fn(op_fn_t op_fn, void __user *p, uint64_t v,
> +               uint32_t len)
> +{
> +       int ret = -EFAULT;
> +       union op_fn_data tmp;
> +
> +       pagefault_disable();
> +       switch (len) {
> +       case 1:
> +               if (__get_user(tmp._u8, (uint8_t __user *)p))
> +                       goto end;
> +               if (op_fn(&tmp, v, len))
> +                       goto end;
> +               if (__put_user(tmp._u8, (uint8_t __user *)p))
> +                       goto end;
> +               break;
> +       case 2:
> +               if (__get_user(tmp._u16, (uint16_t __user *)p))
> +                       goto end;
> +               if (op_fn(&tmp, v, len))
> +                       goto end;
> +               if (__put_user(tmp._u16, (uint16_t __user *)p))
> +                       goto end;
> +               break;
> +       case 4:
> +               if (__get_user(tmp._u32, (uint32_t __user *)p))
> +                       goto end;
> +               if (op_fn(&tmp, v, len))
> +                       goto end;
> +               if (__put_user(tmp._u32, (uint32_t __user *)p))
> +                       goto end;
> +               break;
> +       case 8:
> +#if (BITS_PER_LONG >= 64)
> +               if (__get_user(tmp._u64, (uint64_t __user *)p))
> +                       goto end;
> +#else
> +               if (__get_user(tmp._u64_split[0], (uint32_t __user *)p))
> +                       goto end;
> +               if (__get_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
> +                       goto end;
> +#endif
> +               if (op_fn(&tmp, v, len))
> +                       goto end;
> +#if (BITS_PER_LONG >= 64)
> +               if (__put_user(tmp._u64, (uint64_t __user *)p))
> +                       goto end;
> +#else
> +               if (__put_user(tmp._u64_split[0], (uint32_t __user *)p))
> +                       goto end;
> +               if (__put_user(tmp._u64_split[1], (uint32_t __user *)p + 1))
> +                       goto end;
> +#endif
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               goto end;
> +       }
> +       ret = 0;
> +end:
> +       pagefault_enable();
> +       return ret;
> +}
> +
> +static int __do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt)
> +{
> +       int i, ret;
> +
> +       for (i = 0; i < cpuopcnt; i++) {
> +               struct cpu_op *op = &cpuop[i];
> +
> +               /* Guarantee a compiler barrier between each operation. */
> +               barrier();
> +
> +               switch (op->op) {
> +               case CPU_COMPARE_EQ_OP:
> +                       ret = do_cpu_op_compare(
> +                                       (void __user *)op->u.compare_op.a,
> +                                       (void __user *)op->u.compare_op.b,
> +                                       op->len);
> +                       /* Stop execution on error. */
> +                       if (ret < 0)
> +                               return ret;
> +                       /*
> +                        * Stop execution, return op index + 1 if comparison
> +                        * differs.
> +                        */
> +                       if (ret > 0)
> +                               return i + 1;
> +                       break;
> +               case CPU_COMPARE_NE_OP:
> +                       ret = do_cpu_op_compare(
> +                                       (void __user *)op->u.compare_op.a,
> +                                       (void __user *)op->u.compare_op.b,
> +                                       op->len);
> +                       /* Stop execution on error. */
> +                       if (ret < 0)
> +                               return ret;
> +                       /*
> +                        * Stop execution, return op index + 1 if comparison
> +                        * is identical.
> +                        */
> +                       if (ret == 0)
> +                               return i + 1;
> +                       break;
> +               case CPU_MEMCPY_OP:
> +                       ret = do_cpu_op_memcpy(
> +                                       (void __user *)op->u.memcpy_op.dst,
> +                                       (void __user *)op->u.memcpy_op.src,
> +                                       op->len);
> +                       /* Stop execution on error. */
> +                       if (ret)
> +                               return ret;
> +                       break;
> +               case CPU_ADD_OP:
> +                       ret = do_cpu_op_fn(op_add_fn,
> +                                       (void __user *)op->u.arithmetic_op.p,
> +                                       op->u.arithmetic_op.count, op->len);
> +                       /* Stop execution on error. */
> +                       if (ret)
> +                               return ret;
> +                       break;
> +               case CPU_OR_OP:
> +                       ret = do_cpu_op_fn(op_or_fn,
> +                                       (void __user *)op->u.bitwise_op.p,
> +                                       op->u.bitwise_op.mask, op->len);
> +                       /* Stop execution on error. */
> +                       if (ret)
> +                               return ret;
> +                       break;
> +               case CPU_AND_OP:
> +                       ret = do_cpu_op_fn(op_and_fn,
> +                                       (void __user *)op->u.bitwise_op.p,
> +                                       op->u.bitwise_op.mask, op->len);
> +                       /* Stop execution on error. */
> +                       if (ret)
> +                               return ret;
> +                       break;
> +               case CPU_XOR_OP:
> +                       ret = do_cpu_op_fn(op_xor_fn,
> +                                       (void __user *)op->u.bitwise_op.p,
> +                                       op->u.bitwise_op.mask, op->len);
> +                       /* Stop execution on error. */
> +                       if (ret)
> +                               return ret;
> +                       break;
> +               case CPU_LSHIFT_OP:
> +                       ret = do_cpu_op_fn(op_lshift_fn,
> +                                       (void __user *)op->u.shift_op.p,
> +                                       op->u.shift_op.bits, op->len);
> +                       /* Stop execution on error. */
> +                       if (ret)
> +                               return ret;
> +                       break;
> +               case CPU_RSHIFT_OP:
> +                       ret = do_cpu_op_fn(op_rshift_fn,
> +                                       (void __user *)op->u.shift_op.p,
> +                                       op->u.shift_op.bits, op->len);
> +                       /* Stop execution on error. */
> +                       if (ret)
> +                               return ret;
> +                       break;
> +               case CPU_MB_OP:
> +                       smp_mb();
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
> +       return 0;
> +}
> +
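
The return convention of __do_cpu_opv() above (0 when every operation ran, op index + 1 when a comparison stopped the vector, negative on error) can be modelled in user space as follows. This is a sketch under simplifying assumptions: only 32-bit operands, plain pointers instead of user addresses, and hypothetical names (model_op, model_run) that do not appear in the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum model_op_type { MODEL_COMPARE_EQ, MODEL_COMPARE_NE, MODEL_ADD };

struct model_op {
	enum model_op_type op;
	uint32_t *a;	/* compare: first operand; add: target */
	uint32_t *b;	/* compare: second operand; unused for add */
	uint32_t count;	/* add: increment */
};

/*
 * Return 0 if all operations ran, i + 1 if the comparison at index i
 * stopped the vector, < 0 on error.
 */
static int model_run(const struct model_op *ops, int n)
{
	for (int i = 0; i < n; i++) {
		const struct model_op *op = &ops[i];

		switch (op->op) {
		case MODEL_COMPARE_EQ:
			if (*op->a != *op->b)
				return i + 1;	/* values differ: stop */
			break;
		case MODEL_COMPARE_NE:
			if (*op->a == *op->b)
				return i + 1;	/* values identical: stop */
			break;
		case MODEL_ADD:
			*op->a += op->count;
			break;
		default:
			return -EINVAL;
		}
	}
	return 0;
}
```

As in the quoted loop, there is no rollback: any operation that ran before a failed comparison or an error has already taken effect, so a caller normally places its comparisons ahead of the stores they guard.
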
> +static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt, int cpu)
> +{
> +       int ret;
> +
> +       if (cpu != raw_smp_processor_id()) {
> +               ret = push_task_to_cpu(current, cpu);
> +               if (ret)
> +                       goto check_online;
> +       }
> +       preempt_disable();
> +       if (cpu != smp_processor_id()) {
> +               ret = -EAGAIN;
> +               goto end;
> +       }
> +       ret = __do_cpu_opv(cpuop, cpuopcnt);
> +end:
> +       preempt_enable();
> +       return ret;
> +
> +check_online:
> +       if (!cpu_possible(cpu))
> +               return -EINVAL;
> +       get_online_cpus();
> +       if (cpu_online(cpu)) {
> +               ret = -EAGAIN;
> +               goto put_online_cpus;
> +       }
> +       /*
> +        * CPU is offline. Perform operation from the current CPU with
> +        * cpu_online read lock held, preventing that CPU from coming online,
> +        * and with mutex held, providing mutual exclusion against other
> +        * CPUs also finding out about an offline CPU.
> +        */
> +       mutex_lock(&cpu_opv_offline_lock);
> +       ret = __do_cpu_opv(cpuop, cpuopcnt);
> +       mutex_unlock(&cpu_opv_offline_lock);
> +put_online_cpus:
> +       put_online_cpus();
> +       return ret;
> +}
> +
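
Since do_cpu_opv() above can return -EAGAIN (wrong CPU after the migration attempt, or the target CPU being online when the push failed), a caller would be expected to retry. A minimal sketch of such a bounded retry loop follows. Everything named here is hypothetical: model_cpu_opv_fn stands in for whatever wrapper user space would put around the system call, and it is assumed to return a negative errno value directly rather than setting errno.

```c
#include <errno.h>

/*
 * Hypothetical callback standing in for a cpu_opv() wrapper. Returns 0,
 * a positive op index + 1, or a negative errno value.
 */
typedef int (*model_cpu_opv_fn)(void *ctx, int cpu);

/* Retry while the call reports -EAGAIN, at most max_tries times. */
static int model_cpu_opv_retry(model_cpu_opv_fn fn, void *ctx, int cpu,
			       int max_tries)
{
	int ret = -EAGAIN;

	for (int i = 0; i < max_tries && ret == -EAGAIN; i++)
		ret = fn(ctx, cpu);
	return ret;
}

/* Demo stub: reports -EAGAIN until *ctx (an int countdown) reaches zero. */
static int model_flaky_call(void *ctx, int cpu)
{
	int *remaining = ctx;

	(void)cpu;
	if (*remaining > 0) {
		(*remaining)--;
		return -EAGAIN;
	}
	return 0;
}
```

A real caller would probably re-read the current CPU number before each attempt, and would treat any other negative value as final. The bound on retries is a design choice of this sketch, not something the patch prescribes.
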
> +/*
> + * cpu_opv - execute operation vector on a given CPU with preempt off.
> + *
> + * Userspace should pass current CPU number as parameter. May fail with
> + * -EAGAIN if currently executing on the wrong CPU.
> + */
> +SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
> +               int, cpu, int, flags)
> +{
> +       struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX];
> +       struct page *pinned_pages_on_stack[NR_PINNED_PAGES_ON_STACK];
> +       struct cpu_opv_pinned_pages pin_pages = {
> +               .pages = pinned_pages_on_stack,
> +               .nr = 0,
> +               .is_kmalloc = false,
> +       };
> +       int ret, i;
> +
> +       if (unlikely(flags))
> +               return -EINVAL;
> +       if (unlikely(cpu < 0))
> +               return -EINVAL;
> +       if (cpuopcnt < 0 || cpuopcnt > CPU_OP_VEC_LEN_MAX)
> +               return -EINVAL;
> +       if (copy_from_user(cpuopv, ucpuopv, cpuopcnt * sizeof(struct cpu_op)))
> +               return -EFAULT;
> +       ret = cpu_opv_check(cpuopv, cpuopcnt);
> +       if (ret)
> +               return ret;
> +       ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &pin_pages);
> +       if (ret)
> +               goto end;
> +       ret = do_cpu_opv(cpuopv, cpuopcnt, cpu);
> +       for (i = 0; i < pin_pages.nr; i++)
> +               put_page(pin_pages.pages[i]);
> +end:
> +       if (pin_pages.is_kmalloc)
> +               kfree(pin_pages.pages);
> +       return ret;
> +}
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6bba05f47e51..e547f93a46c2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1052,6 +1052,43 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
>                 set_curr_task(rq, p);
>  }
>
> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
> +{
> +       struct rq_flags rf;
> +       struct rq *rq;
> +       int ret = 0;
> +
> +       rq = task_rq_lock(p, &rf);
> +       update_rq_clock(rq);
> +
> +       if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       if (task_cpu(p) == dest_cpu)
> +               goto out;
> +
> +       if (task_running(rq, p) || p->state == TASK_WAKING) {
> +               struct migration_arg arg = { p, dest_cpu };
> +               /* Need help from migration thread: drop lock and wait. */
> +               task_rq_unlock(rq, p, &rf);
> +               stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
> +               tlb_migrate_finish(p->mm);
> +               return 0;
> +       } else if (task_on_rq_queued(p)) {
> +               /*
> +                * OK, since we're going to drop the lock immediately
> +                * afterwards anyway.
> +                */
> +               rq = move_queued_task(rq, &rf, p, dest_cpu);
> +       }
> +out:
> +       task_rq_unlock(rq, p, &rf);
> +
> +       return ret;
> +}
> +
>  /*
>   * Change a given task's CPU affinity. Migrate the thread to a
>   * proper CPU and schedule it away if the CPU it's executing on
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3b448ba82225..cab256c1720a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1209,6 +1209,8 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
>  #endif
>  }
>
> +int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);
> +
>  /*
>   * Tunables that become constants when CONFIG_SCHED_DEBUG is off:
>   */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index bfa1ee1bf669..59e622296dc3 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -262,3 +262,4 @@ cond_syscall(sys_pkey_free);
>
>  /* restartable sequence */
>  cond_syscall(sys_rseq);
> +cond_syscall(sys_cpu_opv);
> --
> 2.11.0
>
>
>



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/