linux-kernel - Re: [RFC 1/3] seccomp: add a return code to trap to userspace

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CALCETrWgu5n+SMqrsZQ7MVYPtzs8otuc7hpA5uPH+JNtFrMBkQ@mail.gmail.com>
Date:   Sun, 4 Feb 2018 17:36:33 +0000
From:   Andy Lutomirski <luto@...capital.net>
To:     Tycho Andersen <tycho@...ho.ws>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Linux Containers <containers@...ts.linux-foundation.org>,
        Kees Cook <keescook@...omium.org>,
        Oleg Nesterov <oleg@...hat.com>,
        "Eric W . Biederman" <ebiederm@...ssion.com>,
        "Serge E . Hallyn" <serge@...lyn.com>,
        Christian Brauner <christian.brauner@...ntu.com>,
        Tyler Hicks <tyhicks@...onical.com>,
        Akihiro Suda <suda.akihiro@....ntt.co.jp>
Subject: Re: [RFC 1/3] seccomp: add a return code to trap to userspace

On Sun, Feb 4, 2018 at 10:49 AM, Tycho Andersen <tycho@...ho.ws> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.

Neat!

>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
>
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex. Also worth noting that there
> is one race still present:
>
>   1. a task does a SECCOMP_RET_USER_NOTIF
>   2. the userspace handler reads this notification
>   3. the task dies
>   4. a new task with the same pid starts
>   5. this new task does a SECCOMP_RET_USER_NOTIF, gets the same cookie id
>      that the previous one did
>   6. the userspace handler writes a response

I'm slightly confused.  I thought the id was never reused for a given
struct seccomp_filter.  (Also, shouldn't the id be u64, not u32?)

On very quick reading, I have a question.  What happens if a process
has two seccomp_filters attached, one of them returns
SECCOMP_RET_USER_NOTIF, and the *other* one has a listener?