linux-kernel - Re: [PATCH v6 6/9] kernel: entry: Support Syscall User Dispatch for common syscall entry

Message-ID: <CALCETrUaAy7uU9jjneC9+ft-TtS+SuyWXxMCCE5dmcth3N4rHw@mail.gmail.com>
Date:   Mon, 7 Sep 2020 13:20:23 -0700
From:   Andy Lutomirski <luto@...nel.org>
To:     Christian Brauner <christian.brauner@...ntu.com>
Cc:     Gabriel Krisman Bertazi <krisman@...labora.com>,
        Andrew Lutomirski <luto@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Kees Cook <keescook@...omium.org>, X86 ML <x86@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux API <linux-api@...r.kernel.org>,
        Matthew Wilcox <willy@...radead.org>,
        "open list:KERNEL SELFTEST FRAMEWORK" 
        <linux-kselftest@...r.kernel.org>, Shuah Khan <shuah@...nel.org>,
        kernel@...labora.com
Subject: Re: [PATCH v6 6/9] kernel: entry: Support Syscall User Dispatch for
 common syscall entry

On Mon, Sep 7, 2020 at 7:25 AM Christian Brauner
<christian.brauner@...ntu.com> wrote:
>
> On Mon, Sep 07, 2020 at 07:15:52AM -0700, Andy Lutomirski wrote:
> >
> >
> > > On Sep 7, 2020, at 3:15 AM, Christian Brauner <christian.brauner@...ntu.com> wrote:
> > >
> > > On Fri, Sep 04, 2020 at 04:31:44PM -0400, Gabriel Krisman Bertazi wrote:
> > >> Syscall User Dispatch (SUD) must take precedence over seccomp, since the
> > >> use case is emulation (it can be invoked with a different ABI) such that
> > >> seccomp filtering by syscall number doesn't make sense in the first
> > >> place.  In addition, either the syscall is dispatched back to userspace,
> > >> in which case there is no resource for seccomp to protect, or the
> > >
> > > Tbh, I'm torn here. I'm not a super clever attacker but it feels to me
> > > that this is still at least a clever way to circumvent a seccomp
> > > sandbox.
> > > > If I were confined by a seccomp profile that would cause me to be
> > > > SIGKILLed when I try to do open() I could prctl() myself to do user
> > > dispatch to prevent that from happening, no?
> > >
> >
> > Not really, I think. The idea is that you didn’t actually do open().
> > You did a SYSCALL instruction which meant something else, and the
> > syscall dispatch correctly prevented the kernel from misinterpreting
> > it as open().
>
> Right, for the case where you're e.g. emulating windows syscalls that's
> true. I was thinking when you're running natively on Linux: couldn't I
> first load a seccomp profile "kill me if someone does an open()", then
> I exec() the target binary and that binary is set up to do
> prctl(USER_DISPATCH) first thing. I guess, it's ok because as far as I
> had time to read it this is an all-or-nothing mechanism, i.e. _all_
> system calls are re-routed in contrast to e.g. seccomp where I could do
> this per-syscall. So for user-dispatch it wouldn't make sense to use it
> on Linux per se. Still makes me a little uneasy. :)

There's an escape hatch, so processes using this can still make syscalls.
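
As a rough illustration of the escape hatch, here is a minimal sketch
using the prctl() interface in the form that later landed in mainline
(Linux 5.11), which may differ from the v6 patch discussed in this
thread. The helper names sud_enable() and sud_disable() and the
sud_selector variable are made up for the example; the constants are
the mainline uapi values, defined only if the installed headers lack
them.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Mainline uapi values (Linux 5.11+); only defined here as a fallback
 * for older headers. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#endif
#ifndef SYSCALL_DISPATCH_FILTER_BLOCK
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* Hypothetical name. The kernel reads this byte on every syscall entry
 * of the calling thread, so it must stay valid while dispatch is on.
 * ALLOW means "execute syscalls natively": this is the escape hatch.
 * Writing SYSCALL_DISPATCH_FILTER_BLOCK here would turn syscalls into
 * SIGSYS, so a SIGSYS handler must be installed before doing that. */
static volatile char sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;

/* Enable dispatch for the calling thread with an empty always-allowed
 * code region (offset 0, length 0), so only the selector decides.
 * Returns 0 on success or a negative errno (e.g. -EINVAL on kernels or
 * architectures without the feature). */
int sud_enable(void)
{
    if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
              0UL, 0UL, &sud_selector) == 0)
        return 0;
    return -errno;
}

/* Turn dispatch off again; the remaining arguments must all be zero. */
int sud_disable(void)
{
    if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
              0UL, 0UL, 0UL) == 0)
        return 0;
    return -errno;
}
```

With the selector left at ALLOW, the thread keeps making ordinary
syscalls, and any seccomp filter still applies to each of them.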

Maybe think about it another way: a process using user dispatch should
definitely *not* trigger seccomp user notifiers, errno returns, or
ptrace events, since they'll all do the wrong thing.  IMO RET_KILL is
the same.

Barring some very severe defect, there's no way a program can use user
dispatch to escape seccomp.  A program could use user dispatch to let
itself do:

mov $__NR_open, %rax
syscall

without dying despite the presence of a filter that would kill the
process if it tried to do open(), but this doesn't bypass the filter
at all.  The process could just as easily have done:

mov $__NR_open, %rax
jmp magic_stub(%rip)

without tripping the filter, since no system call actually happens here.
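
To make the "jmp magic_stub" case concrete, here is a small userspace
model of such a stub, written as a sketch rather than as anything
taken from the patch set. The EMU_NR_* numbers, emu_open(), emu_table
and the C-level magic_stub() are all hypothetical names for the
example. The point is only that the emulated "open" is resolved
entirely in userspace, so the kernel (and therefore seccomp) never
sees a system call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical numbers for a foreign (emulated) ABI; these are not
 * Linux syscall numbers. */
enum { EMU_NR_OPEN = 2, EMU_NR_MAX = 8 };

typedef long (*emu_handler_t)(long a0, long a1, long a2);

/* Counts how often the emulated open ran, for demonstration only. */
static int emu_open_calls;

/* Emulated "open": pure userspace bookkeeping, no kernel entry. */
static long emu_open(long path, long flags, long mode)
{
    (void)path;
    (void)flags;
    (void)mode;
    emu_open_calls++;
    return 3; /* pretend handle */
}

/* Dispatch table indexed by emulated syscall number; unset slots
 * are NULL. */
static const emu_handler_t emu_table[EMU_NR_MAX] = {
    [EMU_NR_OPEN] = emu_open,
};

/* Stand-in for the stub reached via "jmp magic_stub(%rip)": look up
 * the handler by number and call it, or fail with -ENOSYS. */
long magic_stub(long nr, long a0, long a1, long a2)
{
    if (nr < 0 || nr >= EMU_NR_MAX || emu_table[nr] == NULL)
        return -ENOSYS;
    return emu_table[nr](a0, a1, a2);
}
```

Calling magic_stub(EMU_NR_OPEN, 0, 0, 0) runs emu_open() and returns
its pretend handle. If emu_open() later needed a real open(), it would
have to issue a native syscall, and the seccomp filter would apply to
that call as usual.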

--Andy