Message-ID: <BANLkTi=vXXJzti_jn7kCA8jS9Q6uOej+yw@mail.gmail.com>
Date:	Thu, 2 Jun 2011 13:14:54 -0500
From:	Will Drewry <wad@...omium.org>
To:	paulmck@...ux.vnet.ibm.com
Cc:	linux-kernel@...r.kernel.org, kees.cook@...onical.com,
	torvalds@...ux-foundation.org, tglx@...utronix.de, mingo@...e.hu,
	rostedt@...dmis.org, jmorris@...ei.org,
	Peter Zijlstra <peterz@...radead.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	linux-security-module@...r.kernel.org
Subject: Re: [PATCH v3 03/13] seccomp_filters: new mode with configurable
 syscall filters

On Thu, Jun 2, 2011 at 12:36 PM, Paul E. McKenney
<paulmck@...ux.vnet.ibm.com> wrote:
> On Tue, May 31, 2011 at 10:10:35PM -0500, Will Drewry wrote:
>> This change adds a new seccomp mode which specifies the allowed system
>> calls dynamically.  When in the new mode (2), all system calls are
>> checked against process-defined filters - first by system call number,
>> then by a filter string.  If an entry exists for a given system call and
>> all filter predicates evaluate to true, then the task may proceed.
>> Otherwise, the task is killed.
>
> A few questions below -- I can't say that I understand the RCU usage.
>
>                                                        Thanx, Paul
>
>> Filter string parsing and evaluation is handled by the ftrace filter
> >> engine.  Related patches tweak the perf filter trace and free calls,
> >> allowing them to be shared. Filters inherit their understanding of
>> types and arguments for each system call from the CONFIG_FTRACE_SYSCALLS
>> subsystem which already populates this information in syscall_metadata
>> associated enter_event (and exit_event) structures. If
>> CONFIG_FTRACE_SYSCALLS is not compiled in, only filter strings of "1"
>> will be allowed.
>>
>> The net result is a process may have its system calls filtered using the
> >> ftrace filter engine's inherent understanding of system calls.  The set
>> of filters is specified through the PR_SET_SECCOMP_FILTER argument in
>> prctl(). For example, a filterset for a process, like pdftotext, that
>> should only process read-only input could (roughly) look like:
>>   sprintf(rdonly, "flags == %u", O_RDONLY|O_LARGEFILE);
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_open, rdonly);
>>   prctl(PR_SET_SECCOMP_FILTER, __NR__llseek, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_brk, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_close, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_exit_group, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_fstat64, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_mmap2, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_munmap, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_read, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "(fd == 1 || fd == 2)");
>>   prctl(PR_SET_SECCOMP, 2);
>>
>> Subsequent calls to PR_SET_SECCOMP_FILTER for the same system call will
>> be &&'d together to ensure that attack surface may only be reduced:
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
>>
>> With the earlier example, the active filter becomes:
>>   "(fd == 1 || fd == 2) && fd != 2"
>>
>> The patch also adds PR_CLEAR_SECCOMP_FILTER and PR_GET_SECCOMP_FILTER.
>> The latter returns the current filter for a system call to userspace:
>>
>>   prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf, bufsize);
>>
>> while the former clears any filters for a given system call changing it
> >> back to a default deny:
>>
>>   prctl(PR_CLEAR_SECCOMP_FILTER, __NR_write);
>>
>> v3: - always block execve calls (as per linus torvalds)
>>     - add __NR_seccomp_execve(_32) to seccomp-supporting arches
>>     - ensure compat tasks can't reach ftrace:syscalls
>>     - dropped new defines for seccomp modes.
>>     - two level array instead of hlists (sugg. by olof johansson)
>>     - added generic Kconfig entry that is not connected.
>>     - dropped internal seccomp.h
>>     - move prctl helpers to seccomp_filter
>>     - killed seccomp_t typedef (as per checkpatch)
>> v2: - changed to use the existing syscall number ABI.
>>     - prctl changes to minimize parsing in the kernel:
>>       prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
>>       prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
>>       prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
>>       prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
>>     - defined PR_SECCOMP_MODE_STRICT and ..._FILTER
>>     - added flags
>>     - provide a default fail syscall_nr_to_meta in ftrace
>>     - provides fallback for unhooked system calls
>>     - use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
>>     - added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
>>     - moved to a hlist and 4 bit hash of linked lists
>>     - added support to operate without CONFIG_FTRACE_SYSCALLS
>>     - moved Kconfig support next to SECCOMP
>>     - made Kconfig entries dependent on EXPERIMENTAL
>>     - added macros to avoid ifdefs from kernel/fork.c
>>     - added compat task/filter matching
>>     - drop seccomp.h inclusion in sched.h and drop seccomp_t
>>     - added Filtering to "show" output
>>     - added on_exec state dup'ing when enabling after a fast-path accept.
>>
>> Signed-off-by: Will Drewry <wad@...omium.org>
>> ---
>>  include/linux/prctl.h   |    5 +
>>  include/linux/sched.h   |    2 +-
>>  include/linux/seccomp.h |   98 ++++++-
>>  include/trace/syscall.h |    7 +
>>  kernel/Makefile         |    3 +
>>  kernel/fork.c           |    3 +
>>  kernel/seccomp.c        |   38 ++-
>>  kernel/seccomp_filter.c |  784 +++++++++++++++++++++++++++++++++++++++++++++++
>>  kernel/sys.c            |   13 +-
>>  security/Kconfig        |   17 +
>>  10 files changed, 954 insertions(+), 16 deletions(-)
>>  create mode 100644 kernel/seccomp_filter.c
>>
>> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
>> index a3baeb2..44723ce 100644
>> --- a/include/linux/prctl.h
>> +++ b/include/linux/prctl.h
>> @@ -64,6 +64,11 @@
>>  #define PR_GET_SECCOMP       21
>>  #define PR_SET_SECCOMP       22
>>
>> +/* Get/set process seccomp filters */
>> +#define PR_GET_SECCOMP_FILTER        35
>> +#define PR_SET_SECCOMP_FILTER        36
>> +#define PR_CLEAR_SECCOMP_FILTER      37
>> +
>>  /* Get/set the capability bounding set (as per security/commoncap.c) */
>>  #define PR_CAPBSET_READ 23
>>  #define PR_CAPBSET_DROP 24
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 18d63ce..3f0bc8d 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1374,7 +1374,7 @@ struct task_struct {
>>       uid_t loginuid;
>>       unsigned int sessionid;
>>  #endif
>> -     seccomp_t seccomp;
>> +     struct seccomp_struct seccomp;
>>
>>  /* Thread group tracking */
>>       u32 parent_exec_id;
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index 167c333..f4434ca 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -1,13 +1,33 @@
>>  #ifndef _LINUX_SECCOMP_H
>>  #define _LINUX_SECCOMP_H
>>
>> +struct seq_file;
>>
>>  #ifdef CONFIG_SECCOMP
>>
>> +#include <linux/errno.h>
>>  #include <linux/thread_info.h>
>> +#include <linux/types.h>
>>  #include <asm/seccomp.h>
>>
>> -typedef struct { int mode; } seccomp_t;
>> +struct seccomp_filters;
>> +/**
>> + * struct seccomp_struct - the state of a seccomp'ed process
>> + *
>> + * @mode:
>> + *     if this is 1, the process is under standard seccomp rules
>> + *             is 2, the process is only allowed to make system calls where
>> + *                   associated filters evaluate successfully.
>> + * @filters: Metadata for filters if using CONFIG_SECCOMP_FILTER.
>> + *           filters assignment/use should be RCU-protected and its contents
>> + *           should never be modified when attached to a seccomp_struct.
>> + */
>> +struct seccomp_struct {
>> +     uint16_t mode;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     struct seccomp_filters *filters;
>> +#endif
>> +};
>>
>>  extern void __secure_computing(int);
>>  static inline void secure_computing(int this_syscall)
>> @@ -16,15 +36,14 @@ static inline void secure_computing(int this_syscall)
>>               __secure_computing(this_syscall);
>>  }
>>
>> -extern long prctl_get_seccomp(void);
>>  extern long prctl_set_seccomp(unsigned long);
>> +extern long prctl_get_seccomp(void);
>>
>>  #else /* CONFIG_SECCOMP */
>>
>>  #include <linux/errno.h>
>>
>> -typedef struct { } seccomp_t;
>> -
>> +struct seccomp_struct { };
>>  #define secure_computing(x) do { } while (0)
>>
>>  static inline long prctl_get_seccomp(void)
>> @@ -32,11 +51,80 @@ static inline long prctl_get_seccomp(void)
>>       return -EINVAL;
>>  }
>>
>> -static inline long prctl_set_seccomp(unsigned long arg2)
> >> +static inline long prctl_set_seccomp(unsigned long a2)
>>  {
>>       return -EINVAL;
>>  }
>>
>>  #endif /* CONFIG_SECCOMP */
>>
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +
>> +#define inherit_tsk_seccomp(_child, _orig) do { \
>> +     _child->seccomp.mode = _orig->seccomp.mode; \
>> +     _child->seccomp.filters = get_seccomp_filters(_orig->seccomp.filters); \
>> +     } while (0)
>> +#define put_tsk_seccomp(_tsk) put_seccomp_filters(_tsk->seccomp.filters)
>> +
>> +extern int seccomp_show_filters(struct seccomp_filters *filters,
>> +                             struct seq_file *);
>> +extern long seccomp_set_filter(int, char *);
>> +extern long seccomp_clear_filter(int);
>> +extern long seccomp_get_filter(int, char *, unsigned long);
>> +
>> +extern long prctl_set_seccomp_filter(unsigned long, char __user *);
>> +extern long prctl_get_seccomp_filter(unsigned long, char __user *,
>> +                                  unsigned long);
>> +extern long prctl_clear_seccomp_filter(unsigned long);
>> +
>> +extern struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *);
>> +extern void put_seccomp_filters(struct seccomp_filters *);
>> +
>> +extern int seccomp_test_filters(int);
>> +extern void seccomp_filter_log_failure(int);
>> +
>> +#else  /* CONFIG_SECCOMP_FILTER */
>> +
>> +struct seccomp_filters { };
>> +#define inherit_tsk_seccomp(_child, _orig) do { } while (0)
>> +#define put_tsk_seccomp(_tsk) do { } while (0)
>> +
>> +static inline int seccomp_show_filters(struct seccomp_filters *filters,
>> +                                    struct seq_file *m)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +static inline long seccomp_set_filter(int syscall_nr, char *filter)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +static inline long seccomp_clear_filter(int syscall_nr)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +static inline long seccomp_get_filter(int syscall_nr,
>> +                                   char *buf, unsigned long available)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +static inline long prctl_clear_seccomp_filter(unsigned long a2)
>> +{
>> +     return -ENOSYS;
>> +}
>> +
>> +static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3,
>> +                                         unsigned long a4)
>> +{
>> +     return -ENOSYS;
>> +}
>> +#endif  /* CONFIG_SECCOMP_FILTER */
>>  #endif /* _LINUX_SECCOMP_H */
>> diff --git a/include/trace/syscall.h b/include/trace/syscall.h
>> index 242ae04..e061ad0 100644
>> --- a/include/trace/syscall.h
>> +++ b/include/trace/syscall.h
>> @@ -35,6 +35,8 @@ struct syscall_metadata {
>>  extern unsigned long arch_syscall_addr(int nr);
>>  extern int init_syscall_trace(struct ftrace_event_call *call);
>>
>> +extern struct syscall_metadata *syscall_nr_to_meta(int);
>> +
>>  extern int reg_event_syscall_enter(struct ftrace_event_call *call);
>>  extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
>>  extern int reg_event_syscall_exit(struct ftrace_event_call *call);
>> @@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
>>                                     struct trace_event *event);
>>  enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
>>                                    struct trace_event *event);
>> +#else
>> +static inline struct syscall_metadata *syscall_nr_to_meta(int nr)
>> +{
>> +     return NULL;
>> +}
>>  #endif
>>
>>  #ifdef CONFIG_PERF_EVENTS
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index 85cbfb3..84e7dfb 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
>>  obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
>>  obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
>>  obj-$(CONFIG_SECCOMP) += seccomp.o
>> +ifeq ($(CONFIG_SECCOMP_FILTER),y)
>> +obj-$(CONFIG_SECCOMP) += seccomp_filter.o
>> +endif
>>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>>  obj-$(CONFIG_TREE_RCU) += rcutree.o
>>  obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index e7548de..6f835e0 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -34,6 +34,7 @@
>>  #include <linux/cgroup.h>
>>  #include <linux/security.h>
>>  #include <linux/hugetlb.h>
>> +#include <linux/seccomp.h>
>>  #include <linux/swap.h>
>>  #include <linux/syscalls.h>
>>  #include <linux/jiffies.h>
>> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
>>       free_thread_info(tsk->stack);
>>       rt_mutex_debug_task_free(tsk);
>>       ftrace_graph_exit_task(tsk);
>> +     put_tsk_seccomp(tsk);
>>       free_task_struct(tsk);
>>  }
>>  EXPORT_SYMBOL(free_task);
>> @@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
>>       if (err)
>>               goto out;
>>
>> +     inherit_tsk_seccomp(tsk, orig);
>>       setup_thread_stack(tsk, orig);
>>       clear_user_return_notifier(tsk);
>>       clear_tsk_need_resched(tsk);
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 57d4b13..0a942be 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -2,16 +2,20 @@
>>   * linux/kernel/seccomp.c
>>   *
>>   * Copyright 2004-2005  Andrea Arcangeli <andrea@...share.com>
>> + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@...omium.org>
>>   *
>>   * This defines a simple but solid secure-computing mode.
>>   */
>>
>>  #include <linux/seccomp.h>
>>  #include <linux/sched.h>
>> +#include <linux/slab.h>
>>  #include <linux/compat.h>
>> +#include <linux/unistd.h>
>> +#include <linux/ftrace_event.h>
>>
>> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
>>  /* #define SECCOMP_DEBUG 1 */
>> -#define NR_SECCOMP_MODES 1
>>
>>  /*
>>   * Secure computing mode 1 allows only read/write/exit/sigreturn.
>> @@ -32,10 +36,9 @@ static int mode1_syscalls_32[] = {
>>
>>  void __secure_computing(int this_syscall)
>>  {
>> -     int mode = current->seccomp.mode;
>>       int * syscall;
>>
>> -     switch (mode) {
>> +     switch (current->seccomp.mode) {
>>       case 1:
>>               syscall = mode1_syscalls;
>>  #ifdef CONFIG_COMPAT
>> @@ -47,6 +50,17 @@ void __secure_computing(int this_syscall)
>>                               return;
>>               } while (*++syscall);
>>               break;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     case 2:
>> +             if (this_syscall >= NR_syscalls || this_syscall < 0)
>> +                     break;
>> +
>> +             if (!seccomp_test_filters(this_syscall))
>> +                     return;
>> +
>> +             seccomp_filter_log_failure(this_syscall);
>> +             break;
>> +#endif
>>       default:
>>               BUG();
>>       }
>> @@ -71,16 +85,22 @@ long prctl_set_seccomp(unsigned long seccomp_mode)
>>       if (unlikely(current->seccomp.mode))
>>               goto out;
>>
>> -     ret = -EINVAL;
>> -     if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
>> -             current->seccomp.mode = seccomp_mode;
>> -             set_thread_flag(TIF_SECCOMP);
>> +     ret = 0;
>> +     switch (seccomp_mode) {
>> +     case 1:
>>  #ifdef TIF_NOTSC
>>               disable_TSC();
>>  #endif
>> -             ret = 0;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     case 2:
>> +#endif
>> +             current->seccomp.mode = seccomp_mode;
>> +             set_thread_flag(TIF_SECCOMP);
>> +             break;
>> +     default:
>> +             ret = -EINVAL;
>>       }
>>
>> - out:
>> +out:
>>       return ret;
>>  }
>> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
>> new file mode 100644
>> index 0000000..9782f25
>> --- /dev/null
>> +++ b/kernel/seccomp_filter.c
>> @@ -0,0 +1,784 @@
>> +/* filter engine-based seccomp system call filtering
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, write to the Free Software
>> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>> + *
>> + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@...omium.org>
>> + */
>> +
>> +#include <linux/compat.h>
>> +#include <linux/err.h>
>> +#include <linux/errno.h>
>> +#include <linux/ftrace_event.h>
>> +#include <linux/seccomp.h>
>> +#include <linux/seq_file.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/uaccess.h>
>> +
>> +#include <asm/syscall.h>
>> +#include <trace/syscall.h>
>> +
>> +
>> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
>> +
>> +#define SECCOMP_FILTER_ALLOW "1"
>> +#define SECCOMP_ACTION_DENY 0xffff
>> +#define SECCOMP_ACTION_ALLOW 0xfffe
>> +
>> +/**
>> + * struct seccomp_filters - container for seccomp filterset
>> + *
>> + * @syscalls: array of 16-bit indices into @event_filters by syscall_nr
>> + *            May also be SECCOMP_ACTION_DENY or SECCOMP_ACTION_ALLOW
>> + * @event_filters: array of pointers to ftrace event objects
>> + * @count: size of @event_filters
>> + * @flags: anonymous struct to wrap filters-specific flags
>> + * @usage: reference count to simplify use.
>> + */
>> +struct seccomp_filters {
>> +     uint16_t syscalls[NR_syscalls];
>> +     struct event_filter **event_filters;
>> +     uint16_t count;
>> +     struct {
>> +             uint32_t compat:1,
>> +                      __reserved:31;
>> +     } flags;
>> +     atomic_t usage;
>> +};
>> +
>> +/* Handle ftrace symbol non-existence */
>> +#ifdef CONFIG_FTRACE_SYSCALLS
>> +#define create_event_filter(_ef_pptr, _event_type, _str) \
>> +     ftrace_parse_filter(_ef_pptr, _event_type, _str)
>> +#define get_filter_string(_ef) ftrace_get_filter_string(_ef)
>> +#define free_event_filter(_f) ftrace_free_filter(_f)
>> +
>> +#else
>> +
>> +#define create_event_filter(_ef_pptr, _event_type, _str) (-ENOSYS)
>> +#define get_filter_string(_ef) (NULL)
>> +#define free_event_filter(_f) do { } while (0)
>> +#endif
>> +
>> +/**
>> + * seccomp_filters_new - allocates a new filters object
>> + * @count: count to allocate for the event_filters array
>> + *
>> + * Returns ERR_PTR on error or an allocated object.
>> + */
>> +static struct seccomp_filters *seccomp_filters_new(uint16_t count)
>> +{
>> +     struct seccomp_filters *f;
>> +
>> +     if (count >= SECCOMP_ACTION_ALLOW)
>> +             return ERR_PTR(-EINVAL);
>> +
>> +     f = kzalloc(sizeof(struct seccomp_filters), GFP_KERNEL);
>> +     if (!f)
>> +             return ERR_PTR(-ENOMEM);
>> +
>> +     /* Lazy SECCOMP_ACTION_DENY assignment. */
>> +     memset(f->syscalls, 0xff, sizeof(f->syscalls));
>> +     atomic_set(&f->usage, 1);
>> +
>> +     f->event_filters = NULL;
>> +     f->count = count;
>> +     if (!count)
>> +             return f;
>> +
>> +     f->event_filters = kzalloc(count * sizeof(struct event_filter *),
>> +                                GFP_KERNEL);
>> +     if (!f->event_filters) {
>> +             kfree(f);
>> +             f = ERR_PTR(-ENOMEM);
>> +     }
>> +     return f;
>> +}
>> +
>> +/**
>> + * seccomp_filters_free - cleans up the filter list and frees the table
>> + * @filters: NULL or live object to be completely destructed.
>> + */
>> +static void seccomp_filters_free(struct seccomp_filters *filters)
>> +{
>> +     uint16_t count = 0;
>> +     if (!filters)
>> +             return;
>> +     while (count < filters->count) {
>> +             struct event_filter *f = filters->event_filters[count];
>> +             free_event_filter(f);
>> +             count++;
>> +     }
>> +     kfree(filters->event_filters);
>> +     kfree(filters);
>> +}
>> +
>> +static void __put_seccomp_filters(struct seccomp_filters *orig)
>> +{
>> +     WARN_ON(atomic_read(&orig->usage));
>> +     seccomp_filters_free(orig);
>> +}
>> +
>> +#define seccomp_filter_allow(_id) ((_id) == SECCOMP_ACTION_ALLOW)
>> +#define seccomp_filter_deny(_id) ((_id) == SECCOMP_ACTION_DENY)
>> +#define seccomp_filter_dynamic(_id) \
>> +     (!seccomp_filter_allow(_id) && !seccomp_filter_deny(_id))
>> +static inline uint16_t seccomp_filter_id(const struct seccomp_filters *f,
>> +                                      int syscall_nr)
>> +{
>> +     if (!f)
>> +             return SECCOMP_ACTION_DENY;
>> +     return f->syscalls[syscall_nr];
>> +}
>> +
>> +static inline struct event_filter *seccomp_dynamic_filter(
>> +             const struct seccomp_filters *filters, uint16_t id)
>> +{
>> +     if (!seccomp_filter_dynamic(id))
>> +             return NULL;
>> +     return filters->event_filters[id];
>> +}
>> +
>> +static inline void set_seccomp_filter_id(struct seccomp_filters *filters,
>> +                                      int syscall_nr, uint16_t id)
>> +{
>> +     filters->syscalls[syscall_nr] = id;
>> +}
>> +
>> +static inline void set_seccomp_filter(struct seccomp_filters *filters,
>> +                                   int syscall_nr, uint16_t id,
>> +                                   struct event_filter *dynamic_filter)
>> +{
>> +     filters->syscalls[syscall_nr] = id;
>> +     if (seccomp_filter_dynamic(id))
>> +             filters->event_filters[id] = dynamic_filter;
>> +}
>> +
>> +static struct event_filter *alloc_event_filter(int syscall_nr,
>> +                                            const char *filter_string)
>> +{
>> +     struct syscall_metadata *data;
>> +     struct event_filter *filter = NULL;
>> +     int err;
>> +
>> +     data = syscall_nr_to_meta(syscall_nr);
>> +     /* Argument-based filtering only works on ftrace-hooked syscalls. */
>> +     err = -ENOSYS;
>> +     if (!data)
>> +             goto fail;
>> +     err = create_event_filter(&filter,
>> +                               data->enter_event->event.type,
>> +                               filter_string);
>> +     if (err)
>> +             goto fail;
>> +
>> +     return filter;
>> +fail:
>> +     kfree(filter);
>> +     return ERR_PTR(err);
>> +}
>> +
>> +/**
>> + * seccomp_filters_copy - copies filters from src to dst.
>> + *
>> + * @dst: seccomp_filters to populate.
>> + * @src: table to read from.
>> + * @skip: specifies an entry, by system call, to skip.
>> + *
>> + * Returns non-zero on failure.
>> + * Both the source and the destination should have no simultaneous
>> + * writers, and dst should be exclusive to the caller.
>> + * If @skip is < 0, it is ignored.
>> + */
>> +static int seccomp_filters_copy(struct seccomp_filters *dst,
>> +                             const struct seccomp_filters *src,
>> +                             int skip)
>> +{
>> +     int id = 0, ret = 0, nr;
>> +     memcpy(&dst->flags, &src->flags, sizeof(src->flags));
>> +     memcpy(dst->syscalls, src->syscalls, sizeof(dst->syscalls));
>> +     if (!src->count)
>> +             goto done;
>> +     for (nr = 0; nr < NR_syscalls; ++nr) {
>> +             struct event_filter *filter;
>> +             const char *str;
>> +             uint16_t src_id = seccomp_filter_id(src, nr);
>> +             if (nr == skip) {
>> +                     set_seccomp_filter(dst, nr, SECCOMP_ACTION_DENY,
>> +                                        NULL);
>> +                     continue;
>> +             }
>> +             if (!seccomp_filter_dynamic(src_id))
>> +                     continue;
>> +             if (id >= dst->count) {
>> +                     ret = -EINVAL;
>> +                     goto done;
>> +             }
>> +             str = get_filter_string(seccomp_dynamic_filter(src, src_id));
>> +             filter = alloc_event_filter(nr, str);
>> +             if (IS_ERR(filter)) {
>> +                     ret = PTR_ERR(filter);
>> +                     goto done;
>> +             }
>> +             set_seccomp_filter(dst, nr, id, filter);
>> +             id++;
>> +     }
>> +
>> +done:
>> +     return ret;
>> +}
>> +
>> +/**
>> + * seccomp_extend_filter - appends more text to a syscall_nr's filter
>> + * @filters: unattached filter object to operate on
>> + * @syscall_nr: syscall number to update filters for
>> + * @filter_string: string to append to the existing filter
>> + *
>> + * The new string will be &&'d to the original filter string to ensure that it
>> + * always matches the existing predicates or less:
>> + *   (old_filter) && @filter_string
> >> + * Returns 0 on success and a negative errno on failure.
>> + */
>> +static int seccomp_extend_filter(struct seccomp_filters *filters,
>> +                              int syscall_nr, char *filter_string)
>> +{
>> +     struct event_filter *filter;
>> +     uint16_t id = seccomp_filter_id(filters, syscall_nr);
>> +     char *merged = NULL;
>> +     int ret = -EINVAL, expected;
>> +
>> +     /* No extending with a "1". */
>> +     if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string))
>> +             goto out;
>> +
>> +     filter = seccomp_dynamic_filter(filters, id);
>> +     ret = -ENOENT;
>> +     if (!filter)
>> +             goto out;
>> +
>> +     merged = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
>> +     ret = -ENOMEM;
>> +     if (!merged)
>> +             goto out;
>> +
>> +     expected = snprintf(merged, SECCOMP_MAX_FILTER_LENGTH, "(%s) && %s",
>> +                         get_filter_string(filter), filter_string);
>> +     ret = -E2BIG;
>> +     if (expected >= SECCOMP_MAX_FILTER_LENGTH || expected < 0)
>> +             goto out;
>> +
>> +     /* Free the old filter */
>> +     free_event_filter(filter);
>> +     set_seccomp_filter(filters, syscall_nr, id, NULL);
>> +
>> +     /* Replace it */
>> +     filter = alloc_event_filter(syscall_nr, merged);
>> +     if (IS_ERR(filter)) {
>> +             ret = PTR_ERR(filter);
>> +             goto out;
>> +     }
>> +     set_seccomp_filter(filters, syscall_nr, id, filter);
>> +     ret = 0;
>> +
>> +out:
>> +     kfree(merged);
>> +     return ret;
>> +}
>> +
>> +/**
>> + * seccomp_add_filter - adds a filter for an unfiltered syscall
>> + * @filters: filters object to add a filter/action to
>> + * @syscall_nr: system call number to add a filter for
>> + * @filter_string: the filter string to apply
>> + *
>> + * Returns 0 on success and non-zero otherwise.
>> + */
>> +static int seccomp_add_filter(struct seccomp_filters *filters, int syscall_nr,
>> +                           char *filter_string)
>> +{
>> +     struct event_filter *filter;
>> +     int ret = 0;
>> +
>> +     if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string)) {
>> +             set_seccomp_filter(filters, syscall_nr,
>> +                                SECCOMP_ACTION_ALLOW, NULL);
>> +             goto out;
>> +     }
>> +
>> +     filter = alloc_event_filter(syscall_nr, filter_string);
>> +     if (IS_ERR(filter)) {
>> +             ret = PTR_ERR(filter);
>> +             goto out;
>> +     }
>> +     /* Always add to the last slot available since additions are
>> +      * are only done one at a time.
>> +      */
>> +     set_seccomp_filter(filters, syscall_nr, filters->count - 1, filter);
>> +out:
>> +     return ret;
>> +}
>> +
>> +/* Wrap optional ftrace syscall support. Returns 1 on match or 0 otherwise. */
>> +static int filter_match_current(struct event_filter *event_filter)
>> +{
>> +     int err = 0;
>> +#ifdef CONFIG_FTRACE_SYSCALLS
>> +     uint8_t syscall_state[64];
>> +
>> +     memset(syscall_state, 0, sizeof(syscall_state));
>> +
>> +     /* The generic tracing entry can remain zeroed. */
>> +     err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
>> +                                      NULL);
>> +     if (err)
>> +             return 0;
>> +
>> +     err = filter_match_preds(event_filter, syscall_state);
>> +#endif
>> +     return err;
>> +}
>> +
>> +static const char *syscall_nr_to_name(int syscall)
>> +{
>> +     const char *syscall_name = "unknown";
>> +     struct syscall_metadata *data = syscall_nr_to_meta(syscall);
>> +     if (data)
>> +             syscall_name = data->name;
>> +     return syscall_name;
>> +}
>> +
>> +static void filters_set_compat(struct seccomp_filters *filters)
>> +{
>> +#ifdef CONFIG_COMPAT
>> +     if (is_compat_task())
>> +             filters->flags.compat = 1;
>> +#endif
>> +}
>> +
>> +static inline int filters_compat_mismatch(struct seccomp_filters *filters)
>> +{
>> +     int ret = 0;
>> +     if (!filters)
>> +             return 0;
>> +#ifdef CONFIG_COMPAT
>> +     if (!!(is_compat_task()) == filters->flags.compat)
>> +             ret = 1;
>> +#endif
>> +     return ret;
>> +}
>> +
>> +static inline int syscall_is_execve(int syscall)
>> +{
>> +     int nr = __NR_execve;
>> +#ifdef CONFIG_COMPAT
>> +     if (is_compat_task())
>> +             nr = __NR_seccomp_execve_32;
>> +#endif
>> +     return syscall == nr;
>> +}
>> +
>> +#ifndef KSTK_EIP
>> +#define KSTK_EIP(x) 0L
>> +#endif
>> +
>> +void seccomp_filter_log_failure(int syscall)
>> +{
>> +     pr_info("%s[%d]: system call %d (%s) blocked at 0x%lx\n",
>> +             current->comm, task_pid_nr(current), syscall,
>> +             syscall_nr_to_name(syscall), KSTK_EIP(current));
>> +}
>> +
>> +/* put_seccomp_state - decrements the reference count of @orig and may free. */
>> +void put_seccomp_filters(struct seccomp_filters *orig)
>> +{
>> +     if (!orig)
>> +             return;
>> +
>> +     if (atomic_dec_and_test(&orig->usage))
>> +             __put_seccomp_filters(orig);
>> +}
>> +
>> +/* get_seccomp_state - increments the reference count of @orig */
>> +struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *orig)
>
> Nit: the name does not match the comment.

Will fix it here and above. Thanks!

>> +{
>> +     if (!orig)
>> +             return NULL;
>> +     atomic_inc(&orig->usage);
>> +     return orig;
>
> This is called in an RCU read-side critical section.  What exactly is
> RCU protecting?  I would expect an rcu_dereference() or one of the
> RCU list-traversal primitives somewhere, either here or at the caller.

Ah, I spaced on rcu_dereference().  The goal was to make the
assignment and replacement of the seccomp_filters pointer
RCU-protected (in seccomp_state) so there's no concern over it being
partially replaced on platforms where pointer assignments are non-atomic
- such as via /proc/<pid>/seccomp_filters access or a call via the
exported symbols.  Object lifetime is managed by reference counting so
that I don't have to worry about extending the RCU read-side critical
section by much or deal with pre-allocations.

I'll add rcu_dereference() to all of the get_seccomp_filters() call
sites where it makes sense, so that the pointer is read safely.  Just
to make sure: does it make sense to keep RCU-protecting that specific
pointer?
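
To make the intent above concrete, here is a rough userspace analogue
in C11, not kernel code and not the patch itself.  The names
(struct filters, task_filters, filters_get_current, filters_replace)
are made up for illustration.  An acquire load stands in for
rcu_dereference(), an atomic exchange stands in for
rcu_assign_pointer(), and there is no grace period at all, so it only
shows how the reference count keeps the old object alive for a reader
that already pinned it.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct seccomp_filters; only the refcount matters. */
struct filters {
	atomic_int usage;
	int count;
};

/* Stand-in for current->seccomp.filters: a pointer that is published atomically. */
static _Atomic(struct filters *) task_filters;

static struct filters *filters_new(int count)
{
	struct filters *f = malloc(sizeof(*f));

	if (!f)
		return NULL;
	atomic_init(&f->usage, 1);	/* the reference held by the task pointer */
	f->count = count;
	return f;
}

/*
 * Reader side: load the pointer once, then pin the object with a reference.
 * In the kernel the load would be rcu_dereference() under rcu_read_lock().
 * Without a grace period this is only safe when no writer can free the
 * object between the load and the increment.
 */
static struct filters *filters_get_current(void)
{
	struct filters *f = atomic_load_explicit(&task_filters,
						 memory_order_acquire);

	if (f)
		atomic_fetch_add(&f->usage, 1);
	return f;
}

/* Drops one reference and frees on the last one.  Returns 1 if freed. */
static int filters_put(struct filters *f)
{
	if (!f)
		return 0;
	if (atomic_fetch_sub(&f->usage, 1) == 1) {
		free(f);
		return 1;
	}
	return 0;
}

/*
 * Writer side: publish the replacement, then drop the task's old reference.
 * Kernel code would use rcu_assign_pointer() and wait for a grace period
 * (synchronize_rcu()) before the put; this sketch skips that step.
 */
static int filters_replace(struct filters *repl)
{
	struct filters *old = atomic_exchange_explicit(&task_filters, repl,
						       memory_order_acq_rel);

	return filters_put(old);
}
```

A reader that called filters_get_current() before a replacement keeps
a valid object until its own filters_put(), which is the property the
reference count is meant to provide.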

>> +}
>> +
>> +/**
>> + * seccomp_test_filters - tests 'current' against the given syscall
>> + * @state: seccomp_state of current to use.
>> + * @syscall: number of the system call to test
>> + *
>> + * Returns 0 on ok and non-zero on error/failure.
>> + */
>> +int seccomp_test_filters(int syscall)
>> +{
>> +     uint16_t id;
>> +     struct event_filter *filter;
>> +     struct seccomp_filters *filters;
>> +     int ret = -EACCES;
>> +
>> +     rcu_read_lock();
>> +     filters = get_seccomp_filters(current->seccomp.filters);
>> +     rcu_read_unlock();
>> +
>> +     if (!filters)
>> +             goto out;
>> +
>> +     if (filters_compat_mismatch(filters)) {
>> +             pr_info("%s[%d]: seccomp_filter compat() mismatch.\n",
>> +                     current->comm, task_pid_nr(current));
>> +             goto out;
>> +     }
>> +
>> +     /* execve is never allowed. */
>> +     if (syscall_is_execve(syscall))
>> +             goto out;
>> +
>> +     ret = 0;
>> +     id = seccomp_filter_id(filters, syscall);
>> +     if (seccomp_filter_allow(id))
>> +             goto out;
>> +
>> +     ret = -EACCES;
>> +     if (!seccomp_filter_dynamic(id))
>> +             goto out;
>> +
>> +     filter = seccomp_dynamic_filter(filters, id);
>> +     if (filter && filter_match_current(filter))
>> +             ret = 0;
>> +out:
>> +     put_seccomp_filters(filters);
>> +     return ret;
>> +}
>> +
>> +/**
>> + * seccomp_show_filters - prints the current filter state to a seq_file
>> + * @filters: properly get()'d filters object
>> + * @m: the prepared seq_file to receive the data
>> + *
>> + * Returns 0 on a successful write.
>> + */
>> +int seccomp_show_filters(struct seccomp_filters *filters, struct seq_file *m)
>> +{
>> +     int syscall;
>> +     seq_printf(m, "Mode: %d\n", current->seccomp.mode);
>> +     if (!filters)
>> +             goto out;
>> +
>> +     for (syscall = 0; syscall < NR_syscalls; ++syscall) {
>> +             uint16_t id = seccomp_filter_id(filters, syscall);
>> +             const char *filter_string = SECCOMP_FILTER_ALLOW;
>> +             if (seccomp_filter_deny(id))
>> +                     continue;
>> +             seq_printf(m, "%d (%s): ",
>> +                           syscall,
>> +                           syscall_nr_to_name(syscall));
>> +             if (seccomp_filter_dynamic(id))
>> +                     filter_string = get_filter_string(
>> +                                       seccomp_dynamic_filter(filters, id));
>> +             seq_printf(m, "%s\n", filter_string);
>> +     }
>> +out:
>> +     return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(seccomp_show_filters);
>> +
>> +/**
>> + * seccomp_get_filter - copies the filter_string into "buf"
>> + * @syscall_nr: system call number to look up
>> + * @buf: destination buffer
>> + * @bufsize: available space in the buffer.
>> + *
>> + * Context: User context only. This function may sleep on allocation and
>> + *          operates on current. current must be attempting a system call
>> + *          when this is called.
>> + *
>> + * Looks up the filter for the given system call number on current.  If found,
>> + * the string length of the NUL-terminated buffer is returned and < 0 is
>> + * returned on error. The NUL byte is not included in the length.
>> + */
>> +long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize)
>> +{
>> +     struct seccomp_filters *filters;
>> +     struct event_filter *filter;
>> +     long ret = -EINVAL;
>> +     uint16_t id;
>> +
>> +     if (bufsize > SECCOMP_MAX_FILTER_LENGTH)
>> +             bufsize = SECCOMP_MAX_FILTER_LENGTH;
>> +
>> +     rcu_read_lock();
>> +     filters = get_seccomp_filters(current->seccomp.filters);
>> +     rcu_read_unlock();
>> +
>> +     if (!filters)
>> +             goto out;
>> +
>> +     ret = -ENOENT;
>> +     id = seccomp_filter_id(filters, syscall_nr);
>> +     if (seccomp_filter_deny(id))
>> +             goto out;
>> +
>> +     if (seccomp_filter_allow(id)) {
>> +             ret = strlcpy(buf, SECCOMP_FILTER_ALLOW, bufsize);
>> +             goto copied;
>> +     }
>> +
>> +     filter = seccomp_dynamic_filter(filters, id);
>> +     if (!filter)
>> +             goto out;
>> +     ret = strlcpy(buf, get_filter_string(filter), bufsize);
>> +
>> +copied:
>> +     if (ret >= bufsize) {
>> +             ret = -ENOSPC;
>> +             goto out;
>> +     }
>> +     /* Zero out any remaining buffer, just in case. */
>> +     memset(buf + ret, 0, bufsize - ret);
>> +out:
>> +     put_seccomp_filters(filters);
>> +     return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(seccomp_get_filter);
>> +
>> +/**
>> + * seccomp_clear_filter: clears the seccomp filter for a syscall.
>> + * @syscall_nr: the system call number to clear filters for.
>> + *
>> + * Context: User context only. This function may sleep on allocation and
>> + *          operates on current. current must be attempting a system call
>> + *          when this is called.
>> + *
>> + * Returns 0 on success.
>> + */
>> +long seccomp_clear_filter(int syscall_nr)
>> +{
>> +     struct seccomp_filters *filters = NULL, *orig_filters;
>> +     uint16_t id;
>> +     int ret = -EINVAL;
>> +
>> +     rcu_read_lock();
>> +     orig_filters = get_seccomp_filters(current->seccomp.filters);
>> +     rcu_read_unlock();
>> +
>> +     if (!orig_filters)
>> +             goto out;
>> +
>> +     if (filters_compat_mismatch(orig_filters))
>> +             goto out;
>> +
>> +     id = seccomp_filter_id(orig_filters, syscall_nr);
>> +     if (seccomp_filter_deny(id))
>> +             goto out;
>> +
>> +     /* Create a new filters object for the task */
>> +     if (seccomp_filter_dynamic(id))
>> +             filters = seccomp_filters_new(orig_filters->count - 1);
>> +     else
>> +             filters = seccomp_filters_new(orig_filters->count);
>> +
>> +     if (IS_ERR(filters)) {
>> +             ret = PTR_ERR(filters);
>> +             goto out;
>> +     }
>> +
>> +     /* Copy, but drop the requested entry. */
>> +     ret = seccomp_filters_copy(filters, orig_filters, syscall_nr);
>> +     if (ret)
>> +             goto out;
>> +     get_seccomp_filters(filters);  /* simplify the out: path */
>> +
>> +     rcu_assign_pointer(current->seccomp.filters, filters);
>
> What prevents two copies of seccomp_clear_filter() from running
> concurrently?

Nothing - the last caller wins the assignment, but the objects
themselves should remain internally consistent across the parallel
calls.  If that's a concern, a per-task writer mutex could be used
just to ensure that simultaneous calls to clear and set are performed
serially.  Would that make more sense?
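
For reference, the writer-mutex idea would look roughly like the
userspace sketch below.  It uses a pthread mutex in place of a kernel
mutex, and struct task_state, writer_lock, NR_SLOTS and clear_entry()
are hypothetical names, not anything from the patch.  The point is
only that the read of the old table, the copy, and the publish all
happen under one lock, so a second writer cannot start from a stale
table and then overwrite the first writer's change.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NR_SLOTS 8	/* hypothetical table size, standing in for NR_syscalls */

struct filters {
	unsigned char allowed[NR_SLOTS];
};

/* Hypothetical per-task state: the published pointer plus a writer-only lock. */
struct task_state {
	pthread_mutex_t writer_lock;
	struct filters *filters;
};

/*
 * Clears one entry by copying the current table, editing the copy, and
 * publishing it.  Readers are not modelled here; only writers take the lock.
 */
static int clear_entry(struct task_state *t, int nr)
{
	struct filters *copy, *old;

	if (nr < 0 || nr >= NR_SLOTS)
		return -1;

	/* Allocate before taking the lock to keep the locked region short. */
	copy = malloc(sizeof(*copy));
	if (!copy)
		return -1;

	pthread_mutex_lock(&t->writer_lock);
	old = t->filters;
	if (!old) {
		pthread_mutex_unlock(&t->writer_lock);
		free(copy);
		return -1;
	}
	memcpy(copy, old, sizeof(*copy));
	copy->allowed[nr] = 0;
	t->filters = copy;	/* kernel code would use rcu_assign_pointer() */
	pthread_mutex_unlock(&t->writer_lock);

	/* Kernel code would wait for a grace period and drop a reference. */
	free(old);
	return 0;
}
```

With the lock held across the whole read-copy-publish sequence, two
clears of different entries both survive instead of the later
assignment discarding the earlier one.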


>> +     synchronize_rcu();
>> +     put_seccomp_filters(orig_filters);  /* for the task */
>> +out:
>> +     put_seccomp_filters(orig_filters);  /* for the get */
>> +     put_seccomp_filters(filters);  /* for the extra get */
>> +     return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(seccomp_clear_filter);
>> +
>> +/**
>> + * seccomp_set_filter: - Adds/extends a seccomp filter for a syscall.
>> + * @syscall_nr: system call number to apply the filter to.
>> + * @filter: ftrace filter string to apply.
>> + *
>> + * Context: User context only. This function may sleep on allocation and
>> + *          operates on current. current must be attempting a system call
>> + *          when this is called.
>> + *
>> + * New filters may be added for system calls when the current task is
>> + * not in a secure computing mode (seccomp).  Otherwise, existing filters may
>> + * be extended.
>> + *
>> + * Returns 0 on success or an errno on failure.
>> + */
>> +long seccomp_set_filter(int syscall_nr, char *filter)
>> +{
>> +     struct seccomp_filters *filters = NULL, *orig_filters = NULL;
>> +     uint16_t id;
>> +     long ret = -EINVAL;
>> +     uint16_t filters_needed;
>> +
>> +     if (!filter)
>> +             goto out;
>> +
>> +     filter = strstrip(filter);
>> +     /* Disallow empty strings. */
>> +     if (filter[0] == 0)
>> +             goto out;
>> +
>> +     rcu_read_lock();
>> +     orig_filters = get_seccomp_filters(current->seccomp.filters);
>> +     rcu_read_unlock();
>> +
>> +     /* After the first call, compatibility mode is selected permanently. */
>> +     ret = -EACCES;
>> +     if (filters_compat_mismatch(orig_filters))
>> +             goto out;
>> +
>> +     filters_needed = orig_filters ? orig_filters->count : 0;
>> +     id = seccomp_filter_id(orig_filters, syscall_nr);
>> +     if (seccomp_filter_deny(id)) {
>> +             /* Don't allow DENYs to be changed when in a seccomp mode */
>> +             ret = -EACCES;
>> +             if (current->seccomp.mode)
>> +                     goto out;
>> +             filters_needed++;
>> +     }
>> +
>> +     filters = seccomp_filters_new(filters_needed);
>> +     if (IS_ERR(filters)) {
>> +             ret = PTR_ERR(filters);
>> +             goto out;
>> +     }
>> +
>> +     filters_set_compat(filters);
>> +     if (orig_filters) {
>> +             ret = seccomp_filters_copy(filters, orig_filters, -1);
>> +             if (ret)
>> +                     goto out;
>> +     }
>> +
>> +     if (seccomp_filter_deny(id))
>> +             ret = seccomp_add_filter(filters, syscall_nr, filter);
>> +     else
>> +             ret = seccomp_extend_filter(filters, syscall_nr, filter);
>> +     if (ret)
>> +             goto out;
>> +     get_seccomp_filters(filters);  /* simplify the error paths */
>> +
>> +     rcu_assign_pointer(current->seccomp.filters, filters);
>
> Again, what prevents two copies of seccomp_set_filter() from running
> concurrently?

Same deal - nothing, but I'd be happy to add a guard if it makes sense.

Thanks!

>> +     synchronize_rcu();
>> +     put_seccomp_filters(orig_filters);  /* for the task */
>> +out:
>> +     put_seccomp_filters(orig_filters);  /* for the get */
>> +     put_seccomp_filters(filters);  /* for get or task, on err */
>> +     return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(seccomp_set_filter);
>> +
>> +long prctl_set_seccomp_filter(unsigned long syscall_nr,
>> +                           char __user *user_filter)
>> +{
>> +     int nr;
>> +     long ret;
>> +     char *filter = NULL;
>> +
>> +     ret = -EINVAL;
>> +     if (syscall_nr >= NR_syscalls)
>> +             goto out;
>> +
>> +     ret = -EFAULT;
>> +     if (!user_filter)
>> +             goto out;
>> +
>> +     filter = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
>> +     ret = -ENOMEM;
>> +     if (!filter)
>> +             goto out;
>> +
>> +     ret = -EFAULT;
>> +     if (strncpy_from_user(filter, user_filter,
>> +                           SECCOMP_MAX_FILTER_LENGTH - 1) < 0)
>> +             goto out;
>> +
>> +     nr = (int) syscall_nr;
>> +     ret = seccomp_set_filter(nr, filter);
>> +
>> +out:
>> +     kfree(filter);
>> +     return ret;
>> +}
>> +
>> +long prctl_clear_seccomp_filter(unsigned long syscall_nr)
>> +{
>> +     int nr = -1;
>> +     long ret;
>> +
>> +     ret = -EINVAL;
>> +     if (syscall_nr >= NR_syscalls)
>> +             goto out;
>> +
>> +     nr = (int) syscall_nr;
>> +     ret = seccomp_clear_filter(nr);
>> +
>> +out:
>> +     return ret;
>> +}
>> +
>> +long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
>> +                           unsigned long available)
>> +{
>> +     int ret, nr;
>> +     unsigned long copied;
>> +     char *buf = NULL;
>> +     ret = -EINVAL;
>> +     if (!available)
>> +             goto out;
>> +     /* Ignore extra buffer space. */
>> +     if (available > SECCOMP_MAX_FILTER_LENGTH)
>> +             available = SECCOMP_MAX_FILTER_LENGTH;
>> +
>> +     ret = -EINVAL;
>> +     if (syscall_nr >= NR_syscalls)
>> +             goto out;
>> +     nr = (int) syscall_nr;
>> +
>> +     ret = -ENOMEM;
>> +     buf = kmalloc(available, GFP_KERNEL);
>> +     if (!buf)
>> +             goto out;
>> +
>> +     ret = seccomp_get_filter(nr, buf, available);
>> +     if (ret < 0)
>> +             goto out;
>> +
>> +     /* Include the NUL byte in the copy. */
>> +     copied = copy_to_user(dst, buf, ret + 1);
>> +     ret = -ENOSPC;
>> +     if (copied)
>> +             goto out;
>> +     ret = 0;
>> +out:
>> +     kfree(buf);
>> +     return ret;
>> +}
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index af468ed..ed60d06 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -1698,13 +1698,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>               case PR_SET_ENDIAN:
>>                       error = SET_ENDIAN(me, arg2);
>>                       break;
>> -
>>               case PR_GET_SECCOMP:
>>                       error = prctl_get_seccomp();
>>                       break;
>>               case PR_SET_SECCOMP:
>>                       error = prctl_set_seccomp(arg2);
>>                       break;
>> +             case PR_SET_SECCOMP_FILTER:
>> +                     error = prctl_set_seccomp_filter(arg2,
>> +                                                      (char __user *) arg3);
>> +                     break;
>> +             case PR_CLEAR_SECCOMP_FILTER:
>> +                     error = prctl_clear_seccomp_filter(arg2);
>> +                     break;
>> +             case PR_GET_SECCOMP_FILTER:
>> +                     error = prctl_get_seccomp_filter(arg2,
>> +                                                      (char __user *) arg3,
>> +                                                      arg4);
>> +                     break;
>>               case PR_GET_TSC:
>>                       error = GET_TSC_CTL(arg2);
>>                       break;
>> diff --git a/security/Kconfig b/security/Kconfig
>> index 95accd4..c76adf2 100644
>> --- a/security/Kconfig
>> +++ b/security/Kconfig
>> @@ -2,6 +2,10 @@
>>  # Security configuration
>>  #
>>
>> +# Make seccomp filter Kconfig switch below available
>> +config HAVE_SECCOMP_FILTER
>> +       bool
>> +
>>  menu "Security options"
>>
>>  config KEYS
>> @@ -82,6 +86,19 @@ config SECURITY_DMESG_RESTRICT
>>
>>         If you are unsure how to answer this question, answer N.
>>
>> +config SECCOMP_FILTER
>> +     bool "Enable seccomp-based system call filtering"
>> +     select SECCOMP
>> +     depends on HAVE_SECCOMP_FILTER && EXPERIMENTAL
>> +     help
>> +       This kernel feature expands CONFIG_SECCOMP to allow computing
>> +       in environments with reduced kernel access dictated by the
>> +       application itself through prctl calls.  If
>> +       CONFIG_FTRACE_SYSCALLS is available, then system call
>> +       argument-based filtering predicates may be used.
>> +
>> +       See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>>  config SECURITY
>>       bool "Enable different security models"
>>       depends on SYSFS
>> --
>> 1.7.0.4
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@...r.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>