[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABqD9hbVkmEiCe9pRv4hEL+i6qsmajZ8FeWomcZjbX_fiAKtGQ@mail.gmail.com>
Date: Tue, 14 Jan 2014 15:19:23 -0600
From: Will Drewry <wad@...omium.org>
To: Andy Lutomirski <luto@...capital.net>
Cc: LKML <linux-kernel@...r.kernel.org>,
Nicolas Schichan <nschichan@...ebox.fr>,
Kees Cook <keescook@...omium.org>,
James Morris <james.l.morris@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Oleg Nesterov <oleg@...hat.com>,
Alexander Viro <viro@...iv.linux.org.uk>,
Julien Tinnes <jln@...omium.org>
Subject: Re: [PATCH 2/2] sys, seccomp: add PR_SECCOMP_EXT and SECCOMP_EXT_ACT_TSYNC
On Tue, Jan 14, 2014 at 3:09 PM, Andy Lutomirski <luto@...capital.net> wrote:
> On Tue, Jan 14, 2014 at 12:59 PM, Will Drewry <wad@...omium.org> wrote:
>> On Tue, Jan 14, 2014 at 2:24 PM, Andy Lutomirski <luto@...capital.net> wrote:
>>> On Tue, Jan 14, 2014 at 10:59 AM, Will Drewry <wad@...omium.org> wrote:
>>>> On Mon, Jan 13, 2014 at 5:36 PM, Andy Lutomirski <luto@...capital.net> wrote:
>>>>> On 01/13/2014 12:30 PM, Will Drewry wrote:
>>>>>> Applying restrictive seccomp filter programs to large or diverse
>>>>>> codebases often requires handling threads which may be started early in
>>>>>> the process lifetime (e.g., by code that is linked in). While it is
>>>>>> possible to apply permissive programs prior to process start up, it is
>>>>>> difficult to further restrict the kernel ABI to those threads after that
>>>>>> point.
>>>>>>
>>>>>> This change adds a new seccomp "extension" for synchronizing thread
>>>>>> group seccomp filters and a prctl() for accessing that functionality.
>>>>>> The need for the added prctl() is due to the lack of reserved arguments
>>>>>> in PR_SET_SECCOMP.
>>>>>>
>>>>>> When prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT_TSYNC, 0, 0) is called, it
>>>>>> will attempt to synchronize all threads in current's threadgroup to its
>>>>>> seccomp filter program. This is possible iff all threads are using a
>>>>>> filter that is an ancestor to the filter current is attempting to
>>>>>> synchronize to. NULL filters (where the task is running as
>>>>>> SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
>>>>>> transitioned into SECCOMP_MODE_FILTER. On success, 0 is returned. On
>>>>>> failure, the pid of one of the failing threads will be returned.
>>>>>>
>>>>>> Suggested-by: Julien Tinnes <jln@...omium.org>
>>>>>> Signed-off-by: Will Drewry <wad@...omium.org>
>>>>>> ---
>>>>>> include/linux/seccomp.h | 7 +++
>>>>>> include/uapi/linux/prctl.h | 6 ++
>>>>>> include/uapi/linux/seccomp.h | 6 ++
>>>>>> kernel/seccomp.c | 128 ++++++++++++++++++++++++++++++++++++++++++
>>>>>> kernel/sys.c | 3 +
>>>>>> 5 files changed, 150 insertions(+)
>>>>>>
>>>>>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>>>>>> index 85c0895..3163db6 100644
>>>>>> --- a/include/linux/seccomp.h
>>>>>> +++ b/include/linux/seccomp.h
>>>>>> @@ -77,6 +77,8 @@ static inline int seccomp_mode(struct seccomp *s)
>>>>>> extern void put_seccomp_filter(struct task_struct *tsk);
>>>>>> extern void get_seccomp_filter(struct task_struct *tsk);
>>>>>> extern u32 seccomp_bpf_load(int off);
>>>>>> +extern long prctl_seccomp_ext(unsigned long, unsigned long,
>>>>>> + unsigned long, unsigned long);
>>>>>> #else /* CONFIG_SECCOMP_FILTER */
>>>>>> static inline void put_seccomp_filter(struct task_struct *tsk)
>>>>>> {
>>>>>> @@ -86,5 +88,10 @@ static inline void get_seccomp_filter(struct task_struct *tsk)
>>>>>> {
>>>>>> return;
>>>>>> }
>>>>>> +static inline long prctl_seccomp_ext(unsigned long arg2, unsigned long arg3,
>>>>>> + unsigned long arg4, unsigned long arg5)
>>>>>> +{
>>>>>> + return -EINVAL;
>>>>>> +}
>>>>>> #endif /* CONFIG_SECCOMP_FILTER */
>>>>>> #endif /* _LINUX_SECCOMP_H */
>>>>>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>>>>>> index 289760f..5dcd5d3 100644
>>>>>> --- a/include/uapi/linux/prctl.h
>>>>>> +++ b/include/uapi/linux/prctl.h
>>>>>> @@ -149,4 +149,10 @@
>>>>>>
>>>>>> #define PR_GET_TID_ADDRESS 40
>>>>>>
>>>>>> +/*
>>>>>> + * Access seccomp extensions
>>>>>> + * See Documentation/prctl/seccomp_filter.txt for more details.
>>>>>> + */
>>>>>> +#define PR_SECCOMP_EXT 41
>>>>>> +
>>>>>> #endif /* _LINUX_PRCTL_H */
>>>>>> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
>>>>>> index ac2dc9f..49b5279 100644
>>>>>> --- a/include/uapi/linux/seccomp.h
>>>>>> +++ b/include/uapi/linux/seccomp.h
>>>>>> @@ -10,6 +10,12 @@
>>>>>> #define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */
>>>>>> #define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */
>>>>>>
>>>>>> +/* Valid extension types as arg2 for prctl(PR_SECCOMP_EXT) */
>>>>>> +#define SECCOMP_EXT_ACT 1
>>>>>> +
>>>>>> +/* Valid extension actions as arg3 to prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT) */
>>>>>> +#define SECCOMP_EXT_ACT_TSYNC 1 /* attempt to synchronize thread filters */
>>>>>> +
>>>>>> /*
>>>>>> * All BPF programs must return a 32-bit value.
>>>>>> * The bottom 16-bits are for optional return data.
>>>>>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>>>>>> index 71512e4..8a0de7b 100644
>>>>>> --- a/kernel/seccomp.c
>>>>>> +++ b/kernel/seccomp.c
>>>>>> @@ -24,6 +24,7 @@
>>>>>> #ifdef CONFIG_SECCOMP_FILTER
>>>>>> #include <asm/syscall.h>
>>>>>> #include <linux/filter.h>
>>>>>> +#include <linux/pid.h>
>>>>>> #include <linux/ptrace.h>
>>>>>> #include <linux/security.h>
>>>>>> #include <linux/slab.h>
>>>>>> @@ -220,6 +221,108 @@ static u32 seccomp_run_filters(int syscall)
>>>>>> return ret;
>>>>>> }
>>>>>>
>>>>>> +/* Returns 1 if the candidate is an ancestor. */
>>>>>> +static int is_ancestor(struct seccomp_filter *candidate,
>>>>>> + struct seccomp_filter *child)
>>>>>> +{
>>>>>> + /* NULL is the root ancestor. */
>>>>>> + if (candidate == NULL)
>>>>>> + return 1;
>>>>>> + for (; child; child = child->prev)
>>>>>> + if (child == candidate)
>>>>>> + return 1;
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * seccomp_sync_threads: sets all threads to use current's filter
>>>>>> + *
>>>>>> + * Returns 0 on success or the pid of a thread which was either not
>>>>>> + * in the correct seccomp mode or it did not have an ancestral
>>>>>> + * seccomp filter. current must be in seccomp.mode=2 already.
>>>>>> + */
>>>>>> +static pid_t seccomp_sync_threads(void)
>>>>>> +{
>>>>>> + struct task_struct *thread, *caller;
>>>>>> + pid_t failed = 0;
>>>>>> + thread = caller = current;
>>>>>> +
>>>>>> + read_lock(&tasklist_lock);
>>>>>> + if (thread_group_empty(caller))
>>>>>> + goto done;
>>>>>> + while_each_thread(caller, thread) {
>>>>>> + task_lock(thread);
>>>>>> + /*
>>>>>> + * All threads must not be in SECCOMP_MODE_STRICT to
>>>>>> + * be eligible for synchronization.
>>>>>> + */
>>>>>> + if ((thread->seccomp.mode == SECCOMP_MODE_DISABLED ||
>>>>>> + thread->seccomp.mode == SECCOMP_MODE_FILTER) &&
>>>>>> + is_ancestor(thread->seccomp.filter,
>>>>>> + caller->seccomp.filter)) {
>>>>>> + /* Get a task reference for the new leaf node. */
>>>>>> + get_seccomp_filter(caller);
>>>>>> + /*
>>>>>> + * Drop the task reference to the shared ancestor since
>>>>>> + * current's path will hold a reference. (This also
>>>>>> + * allows a put before the assignment.)
>>>>>> + */
>>>>>> + put_seccomp_filter(thread);
>>>>>> + thread->seccomp.filter = caller->seccomp.filter;
>>>>>> + /* Opt the other thread into seccomp if needed.
>>>>>> + * As threads are considered to be trust-realm
>>>>>> + * equivalent (see ptrace_may_access), it is safe to
>>>>>> + * allow one thread to transition the other.
>>>>>> + */
>>>>>> + if (thread->seccomp.mode == SECCOMP_MODE_DISABLED) {
>>>>>> + thread->seccomp.mode = SECCOMP_MODE_FILTER;
>>>>>> + /*
>>>>>> + * Don't let an unprivileged task work around
>>>>>> + * the no_new_privs restriction by creating
>>>>>> + * a thread that sets it up, enters seccomp,
>>>>>> + * then dies.
>>>>>> + */
>>>>>> + if (caller->no_new_privs)
>>>>>> + thread->no_new_privs = 1;
>>>>>> + set_tsk_thread_flag(thread, TIF_SECCOMP);
>>>>>
>>>>> no_new_privs is a bitfield, and some of the other bits in there look
>>>>> like things that might not want to be read and written back from another
>>>>> thread.
>>>>
>>>> Ah :/ Good catch!
>>>>
>>>>> Would it be too annoying to require that the other threads already have
>>>>> no_new_privs set?
>>>>
>>>> Hrm, it's pretty painful in the edge cases where you don't control the
>>>> process initialization which might setup threads you need to ensnare.
>>>>
>>>> Would it be crazy to do something like below in sched.h?
>>>> - unsigned no_new_privs:1;
>>>> + unsigned no_new_privs;
>>>
>>> set_bit, etc. would also work. (Although there isn't a 32-bit set_bit
>>> AFAIK, or at least there isn't one that works on 64-bit BE archs.)
>>
>> I wasn't sure if I could use set_bit() in a way that wouldn't get me
>> banned from submitting patches forever :)
>
> You'd at least have to get rid of the bitfield -- set_bit on a
> bitfield would be... bad. :)
>
>>
>>> Also, is 'unsigned' actually safe for this purpose, on all supported
>>> archs/compilers? I'm pretty sure it's okay by C++11 rules, but those
>>> don't apply here. Maybe some day the kernel will move to C11 and life
>>> will be good.
>>>
>>>>
>>>> It feels like a big hammer though, but it also seems weird to wrap those
>>>> bitfields with task_lock. Any suggestions are welcome! I'll think about
>>>> this a bit more and see if there is a good way to do this transition
>>>> safely and cheaply.
>>>
>>> Hmm. I bet you could move no_new_privs somewhere else in task_lock
>>> where there's a bit free. It could also go in 'struct creds', but I
>>> think that's even worse from your perspective.
>>>
>>> Here's another dumb idea: Add an accessor task_no_new_privs(struct
>>> task_struct *) and move no_new_privs into struct seccomp (i.e. make it
>>> a bit in the seccomp mode). It kind of sucks on !CONFIG_SECCOMP, but
>>> it's free if CONFIG_SECCOMP.
>>
>> That'd certainly be fine with me. I was considering adding a
>> "needs_transition" bit to struct seccomp, but moving nnp there could be tidy.
>> I'd need to make sure reading it locklessly still makes sense, but I really
>> don't want to put a lock on the syscall path...
>
> You may be able to cheat a bit: I don't think that reading
> no_new_privs needs to be fast -- it only matters in a small set of
> security-related syscalls and in execve.
Good point.
> Have you considered RCU for the seccomp state?
I was hoping to avoid it and as much locking as possible, but
it may end up being where this needs to be. Especially as Oleg
points out deficiencies in my current locking strategy!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists