[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4F2306BB.5090907@jp.fujitsu.com>
Date: Fri, 27 Jan 2012 15:19:07 -0500
From: KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
To: gorcunov@...nvz.org
CC: linux-kernel@...r.kernel.org, akpm@...ux-foundation.org,
ebiederm@...ssion.com, xemul@...nvz.org, xemul@...allels.com,
avagin@...nvz.org, kosaki.motohiro@...il.com, mingo@...e.hu,
hpa@...or.com, tglx@...utronix.de, glommer@...allels.com,
andi@...stfloor.org, tj@...nel.org, matthltc@...ibm.com,
penberg@...nel.org, eric.dumazet@...il.com, segoon@...nwall.com,
adobriyan@...il.com, Valdis.Kletnieks@...edu
Subject: Re: [RFC c/r 2/4] [RFC] syscalls, x86: Add __NR_kcmp syscall v7
On 1/27/2012 12:53 PM, Cyrill Gorcunov wrote:
> While doing the checkpoint-restore in the userspace one need to determine
> whether various kernel objects (like mm_struct-s of file_struct-s) are shared
> between tasks and restore this state.
>
> The 2nd step can be solved by using appropriate CLONE_ flags and the unshare
> syscall, while there's currently no ways for solving the 1st one.
>
> One of the ways for checking whether two tasks share e.g. mm_struct is to
> provide some mm_struct ID of a task to its proc file, but showing such
> info considered to be not that good for security reasons.
>
> Thus after some debates we end up in conclusion that using that named
> 'comparision' syscall might be the best candidate. So here is it --
> __NR_kcmp.
>
> It takes up to 5 agruments - the pids of the two tasks (which
> characteristics should be compared), the comparision type and
> (in case of comparision of files) two file descriptors.
>
> Lookups for pids are done in the caller's PID namespace only.
>
> At moment only x86 is supported and tested.
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@...nvz.org>
> CC: "Eric W. Biederman" <ebiederm@...ssion.com>
> CC: Pavel Emelyanov <xemul@...allels.com>
> CC: Andrey Vagin <avagin@...nvz.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@...il.com>
> CC: Ingo Molnar <mingo@...e.hu>
> CC: H. Peter Anvin <hpa@...or.com>
> CC: Thomas Gleixner <tglx@...utronix.de>
> CC: Glauber Costa <glommer@...allels.com>
> CC: Andi Kleen <andi@...stfloor.org>
> CC: Tejun Heo <tj@...nel.org>
> CC: Matt Helsley <matthltc@...ibm.com>
> CC: Pekka Enberg <penberg@...nel.org>
> CC: Eric Dumazet <eric.dumazet@...il.com>
> CC: Vasiliy Kulikov <segoon@...nwall.com>
> CC: Andrew Morton <akpm@...ux-foundation.org>
> CC: Alexey Dobriyan <adobriyan@...il.com>
> CC: Valdis.Kletnieks@...edu
> ---
> arch/x86/syscalls/syscall_32.tbl | 1
> arch/x86/syscalls/syscall_64.tbl | 1
> include/linux/kcmp.h | 17 ++
> include/linux/syscalls.h | 2
> kernel/Makefile | 1
> kernel/kcmp.c | 181 +++++++++++++++++++++++++++++++
> tools/testing/selftests/kcmp/Makefile | 35 +++++
> tools/testing/selftests/kcmp/kcmp_test.c | 126 +++++++++++++++++++++
> tools/testing/selftests/run_tests | 2
> 9 files changed, 365 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
> ===================================================================
> --- linux-2.6.git.orig/arch/x86/syscalls/syscall_32.tbl
> +++ linux-2.6.git/arch/x86/syscalls/syscall_32.tbl
> @@ -355,3 +355,4 @@
> 346 i386 setns sys_setns
> 347 i386 process_vm_readv sys_process_vm_readv compat_sys_process_vm_readv
> 348 i386 process_vm_writev sys_process_vm_writev compat_sys_process_vm_writev
> +349 i386 kcmp sys_kcmp
> Index: linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
> ===================================================================
> --- linux-2.6.git.orig/arch/x86/syscalls/syscall_64.tbl
> +++ linux-2.6.git/arch/x86/syscalls/syscall_64.tbl
> @@ -318,3 +318,4 @@
> 309 64 getcpu sys_getcpu
> 310 64 process_vm_readv sys_process_vm_readv
> 311 64 process_vm_writev sys_process_vm_writev
> +312 64 kcmp sys_kcmp
> Index: linux-2.6.git/include/linux/kcmp.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/include/linux/kcmp.h
> @@ -0,0 +1,17 @@
> +#ifndef _LINUX_KCMP_H
> +#define _LINUX_KCMP_H
> +
> +/* Comparision type */
> +enum kcmp_type {
> + KCMP_FILE,
> + KCMP_VM,
> + KCMP_FILES,
> + KCMP_FS,
> + KCMP_SIGHAND,
> + KCMP_IO,
> + KCMP_SYSVSEM,
> +
> + KCMP_TYPES,
> +};
> +
> +#endif /* _LINUX_KCMP_H */
> Index: linux-2.6.git/include/linux/syscalls.h
> ===================================================================
> --- linux-2.6.git.orig/include/linux/syscalls.h
> +++ linux-2.6.git/include/linux/syscalls.h
> @@ -857,4 +857,6 @@ asmlinkage long sys_process_vm_writev(pi
> unsigned long riovcnt,
> unsigned long flags);
>
> +asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
> + unsigned long idx1, unsigned long idx2);
> #endif
> Index: linux-2.6.git/kernel/Makefile
> ===================================================================
> --- linux-2.6.git.orig/kernel/Makefile
> +++ linux-2.6.git/kernel/Makefile
> @@ -25,6 +25,7 @@ endif
> obj-y += sched/
> obj-y += power/
>
> +obj-$(CONFIG_X86) += kcmp.o
> obj-$(CONFIG_FREEZER) += freezer.o
> obj-$(CONFIG_PROFILING) += profile.o
> obj-$(CONFIG_SYSCTL_SYSCALL_CHECK) += sysctl_check.o
> Index: linux-2.6.git/kernel/kcmp.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/kernel/kcmp.c
> @@ -0,0 +1,181 @@
> +#include <linux/kernel.h>
> +#include <linux/syscalls.h>
> +#include <linux/fdtable.h>
> +#include <linux/string.h>
> +#include <linux/random.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/cache.h>
> +#include <linux/bug.h>
> +#include <linux/err.h>
> +#include <linux/kcmp.h>
> +
> +#include <asm/unistd.h>
> +
> +/*
> + * We don't expose real in-memory order of objects for security
> + * reasons, still the comparision results should be suitable for
> + * sorting. Thus, we obfuscate kernel pointers values (using random
> + * cookies obtaned at early boot stage) and compare the production
> + * instead.
> + */
> +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> +
> +static long kptr_obfuscate(long v, int type)
> +{
> + return (v ^ cookies[type][0]) * cookies[type][1];
> +}
> +
> +/*
> + * 0 - equal
> + * 1 - less than
> + * 2 - greater than
> + * 3 - not equal but ordering unavailable (reserved for future)
> + */
> +static int kcmp_ptr(long v1, long v2, enum kcmp_type type)
> +{
> + long ret;
> +
> + ret = kptr_obfuscate(v1, type) - kptr_obfuscate(v2, type);
> +
> + return (ret < 0) | ((ret > 0) << 1);
> +}
> +
> +/* The caller must have pinned the task */
> +static struct file *
> +get_file_raw_ptr(struct task_struct *task, unsigned int idx)
> +{
> + struct fdtable *fdt;
> + struct file *file;
> +
> + spin_lock(&task->files->file_lock);
> + fdt = files_fdtable(task->files);
> + if (idx < fdt->max_fds)
> + file = fdt->fd[idx];
> + else
> + file = NULL;
> + spin_unlock(&task->files->file_lock);
> +
> + return file;
> +}
> +
> +SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
> + unsigned long, idx1, unsigned long, idx2)
> +{
> + struct task_struct *task1, *task2;
> + int ret;
> +
> + rcu_read_lock();
> +
> + /*
> + * Tasks are looked up in caller's
> + * PID namespace only.
> + */
> +
> + task1 = find_task_by_vpid(pid1);
> + if (!task1) {
> + rcu_read_unlock();
> + return -ESRCH;
> + }
> +
> + task2 = find_task_by_vpid(pid2);
> + if (!task2) {
> + put_task_struct(task1);
> + rcu_read_unlock();
> + return -ESRCH;
> + }
> +
> + get_task_struct(task1);
> + get_task_struct(task2);
> +
> + rcu_read_unlock();
> +
> + /*
> + * One should have enough rights to inspect tasks details.
> + */
> + if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
> + !ptrace_may_access(task2, PTRACE_MODE_READ)) {
> + ret = -EACCES;
> + goto err;
> + }
> +
> + /*
> + * Note for all cases but the KCMP_FILE we
> + * don't take any locks in a sake of speed.
> + */
> +
> + switch (type) {
> + case KCMP_FILE: {
> + struct file *filp1, *filp2;
> +
> + filp1 = get_file_raw_ptr(task1, idx1);
> + filp2 = get_file_raw_ptr(task2, idx2);
> +
> + if (filp1 && filp2)
> + ret = kcmp_ptr((long)filp1, (long)filp2, KCMP_FILE);
> + else
> + ret = -ENOENT;
If my remember is correct, Andrew pointed out EINVAL is better than ENOENT.
> + break;
> + }
> + case KCMP_VM:
> + ret = kcmp_ptr((long)task1->mm,
> + (long)task2->mm,
> + KCMP_VM);
> + break;
> + case KCMP_FILES:
> + ret = kcmp_ptr((long)task1->files,
> + (long)task2->files,
> + KCMP_FILES);
> + break;
> + case KCMP_FS:
> + ret = kcmp_ptr((long)task1->fs,
> + (long)task2->fs,
> + KCMP_FS);
> + break;
> + case KCMP_SIGHAND:
> + ret = kcmp_ptr((long)task1->sighand,
> + (long)task2->sighand,
> + KCMP_SIGHAND);
> + break;
> + case KCMP_IO:
> + ret = kcmp_ptr((long)task1->io_context,
> + (long)task2->io_context,
> + KCMP_IO);
> + break;
> + case KCMP_SYSVSEM:
> +#ifdef CONFIG_SYSVIPC
> + ret = kcmp_ptr((long)task1->sysvsem.undo_list,
> + (long)task2->sysvsem.undo_list,
> + KCMP_SYSVSEM);
> +#else
> + ret = -EINVAL;
ENOTSUP is better, I think. because of, EINVAL implicitly mean _caller_ is wrong.
but in this case, it is not bad. only the kernel doesn't have enough feature.
> + goto err;
you don't need err label at all.
> +#endif
> + break;
> + default:
> + ret = -EINVAL;
> + goto err;
> + }
> +
> +err:
> + put_task_struct(task1);
> + put_task_struct(task2);
> +
> + return ret;
> +}
> +
> +static __init int kcmp_cookie_init(void)
> +{
> + int i, j;
> +
> + for (i = 0; i < KCMP_TYPES; i++) {
> + for (j = 0; j < 2; j++) {
> + get_random_bytes(&cookies[i][j],
> + sizeof(cookies[i][j]));
> + }
> + cookies[i][1] |= (~(~0UL >> 1) | 1);
> + }
> +
> + return 0;
> +}
> +late_initcall(kcmp_cookie_init);
> Index: linux-2.6.git/tools/testing/selftests/kcmp/Makefile
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/tools/testing/selftests/kcmp/Makefile
> @@ -0,0 +1,35 @@
> +ifeq ($(strip $(V)),)
> + E = @echo
> + Q = @
> +else
> + E = @\#
> + Q =
> +endif
> +export E Q
> +
> +uname_M := $(shell uname -m 2>/dev/null || echo not)
> +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
> +ifeq ($(ARCH),i386)
> + ARCH := X86
> + CFLAGS := -DCONFIG_X86_32
> +endif
> +ifeq ($(ARCH),x86_64)
> + ARCH := X86
> + CFLAGS := -DCONFIG_X86_64
> +endif
> +
> +CFLAGS += -I../../../../arch/x86/include/generated/
> +CFLAGS += -I../../../../include/
> +
> +all:
> +ifeq ($(ARCH),X86)
> + $(E) " CC run_test"
> + $(Q) gcc $(CFLAGS) kcmp_test.c -o run_test
> +else
> + $(E) "Not an x86 target, can't build breakpoints selftests"
> +endif
> +
> +clean:
> + $(E) " CLEAN"
> + $(Q) rm -fr ./run_test
> + $(Q) rm -fr ./test-file
> Index: linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
> @@ -0,0 +1,126 @@
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <signal.h>
> +#include <limits.h>
> +#include <unistd.h>
> +#include <errno.h>
> +#include <string.h>
> +#include <fcntl.h>
> +
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <sys/wait.h>
> +
> +#ifdef CONFIG_X86_64
> +#include "asm/unistd_64.h"
> +#else
> +#include "asm/unistd_32.h"
> +#endif
> +
> +#include "linux/kcmp.h"
> +
> +#ifdef CONFIG_X86_64
> +static long syscall5(int nr, unsigned long arg0, unsigned long arg1,
> + unsigned long arg2, unsigned long arg3,
> + unsigned long arg4)
> +{
> + long ret;
> + asm volatile(
> + "movl %1, %%eax \n"
> + "movq %2, %%rdi \n"
> + "movq %3, %%rsi \n"
> + "movq %4, %%rdx \n"
> + "movq %5, %%r10 \n"
> + "movq %6, %%r8 \n"
> + "syscall \n"
> + "movq %%rax, %0 \n"
> + : "=r"(ret)
> + : "g" ((int)nr), "g" (arg0), "g" (arg1), "g" (arg2),
> + "g" (arg3), "g" (arg4)
> + : "rax", "rdi", "rsi", "rdx", "r10", "r8", "memory");
> + return ret;
> +}
> +#else
> +static long syscall5(int nr, unsigned long arg0, unsigned long arg1,
> + unsigned long arg2, unsigned long arg3,
> + unsigned long arg4)
> +{
> + long ret;
> + asm volatile(
> + "movl %1, %%eax \n"
> + "movl %2, %%ebx \n"
> + "movl %3, %%ecx \n"
> + "movl %4, %%edx \n"
> + "movl %5, %%esi \n"
> + "movl %6, %%edi \n"
> + "intl $0x80 \n"
> + "movq %%eax, %0 \n"
> + : "=r"(ret)
> + : "g" ((int)nr), "g" (arg0), "g" (arg1), "g" (arg2),
> + "g" (arg3), "g" (arg4)
> + : "eax", "ebx", "ecx", "edx", "esi", "edi", "memory");
> + return ret;
> +}
> +#endif
> +
> +static long sys_kcmp(int pid1, int pid2, int type, int fd1, int fd2)
> +{
> + return syscall5(__NR_kcmp, (long)pid1, (long)pid2,
> + (long)type, (long)fd1, (long)fd2);
> +}
> +
> +int main(int argc, char **argv)
> +{
> + int pid1, pid2;
> + int fd1, fd2;
> + int status;
> +
> + fd1 = open("test-file", O_RDWR | O_CREAT | O_TRUNC, 0644);
> + pid1 = getpid();
> +
> + if (fd1 < 0) {
> + perror("Can't create file");
> + exit(1);
> + }
> +
> + pid2 = fork();
> + if (pid2 < 0) {
> + perror("fork failed");
> + exit(1);
> + }
> +
> + if (!pid2) {
> + int pid2 = getpid();
> +
> + fd2 = open("test-file", O_RDWR, 0644);
> + if (fd2 < 0) {
> + perror("Can't open file");
> + exit(1);
> + }
> +
> + printf("pid1: %6d pid2: %6d FD: %2d FILES: %2d VM: %2d FS: %2d "
> + "SIGHAND: %2d IO: %2d SYSVSEM: %2d INV: %2d\n",
> + pid1, pid2,
> + sys_kcmp(pid1, pid2, KCMP_FILE, fd1, fd2),
> + sys_kcmp(pid1, pid2, KCMP_FILES, 0, 0),
> + sys_kcmp(pid1, pid2, KCMP_VM, 0, 0),
> + sys_kcmp(pid1, pid2, KCMP_FS, 0, 0),
> + sys_kcmp(pid1, pid2, KCMP_SIGHAND, 0, 0),
> + sys_kcmp(pid1, pid2, KCMP_IO, 0, 0),
> + sys_kcmp(pid1, pid2, KCMP_SYSVSEM, 0, 0),.
The best practice of auto test is
AssertFooBar(expected_value, actual_value);
and, just only print "correct or not". Only you know the correct value.
> +
> + /* This one should fail */
> + sys_kcmp(pid1, pid2, KCMP_TYPES + 1, 0, 0));
> +
> + /* This one should return same fd */
> + printf("pid1: %6d pid2: %6d SAME-FD: %2d\n",
> + pid1, pid2,
> + sys_kcmp(pid1, pid2, KCMP_FILE, fd1, fd1));
> +
> + exit(0);
> + }
> +
> + waitpid(pid2, &status, P_ALL);
> +
> + return 0;
> +}
> Index: linux-2.6.git/tools/testing/selftests/run_tests
> ===================================================================
> --- linux-2.6.git.orig/tools/testing/selftests/run_tests
> +++ linux-2.6.git/tools/testing/selftests/run_tests
> @@ -1,6 +1,6 @@
> #!/bin/bash
>
> -TARGETS=breakpoints
> +TARGETS="breakpoints kcmp"
>
> for TARGET in $TARGETS
> do
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists