Message-ID: <CALCETrWr_B-quNckFksTP1W-Ww71uQgCrR-o9QWdQ-Gi8p1r9A@mail.gmail.com>
Date:   Sun, 31 May 2020 14:03:48 -0700
From:   Andy Lutomirski <luto@...nel.org>
To:     Andy Lutomirski <luto@...nel.org>
Cc:     Paul Gofman <gofmanp@...il.com>,
        Gabriel Krisman Bertazi <krisman@...labora.com>,
        Linux-MM <linux-mm@...ck.org>,
        LKML <linux-kernel@...r.kernel.org>, kernel@...labora.com,
        Thomas Gleixner <tglx@...utronix.de>,
        Kees Cook <keescook@...omium.org>,
        Will Drewry <wad@...omium.org>,
        "H . Peter Anvin" <hpa@...or.com>,
        Zebediah Figura <zfigura@...eweavers.com>
Subject: Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas

On Sun, May 31, 2020 at 11:57 AM Andy Lutomirski <luto@...nel.org> wrote:
>
>
> What if there was a special filter type that ran a BPF program on each
> syscall, and the program was allowed to access user memory to make its
> decisions, e.g. to look at some list of memory addresses.  But this
> would explicitly *not* be a security feature -- execve() would remove
> the filter, and the filter's outcome would be one of redirecting
> execution or allowing the syscall.  If the "allow" outcome occurs,
> then regular seccomp filters run.  Obviously the exact semantics here
> would need some care.

Let me try to flesh this out a little.

A task could install a syscall emulation filter (maybe using the
seccomp() syscall, maybe using something else).  There would be at
most one such filter per process.  On each syscall, the kernel would
first do the initial syscall fixups (e.g. SYSENTER/SYSCALL32 magic
argument translation) and would then invoke the filter.  The filter is
an eBPF program (sorry Kees) and, as input, it gets access to the
task's register state and to an indication of which type of syscall
entry this was.  This will inherently be rather architecture-specific
-- on x86 the choices could be int80, int80(translated), and syscall64.  (We
could expose SYSCALL32 separately, I suppose, but SYSENTER is such a
mess that I'm not sure this would be productive.)  The program can
access user memory, and it returns one of two results: allow the
syscall or send SIGSYS.  If the program tries to access user memory
and faults, the result is SIGSYS.
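
To make the decision rule above concrete, here is a minimal user-space
model in plain C.  It is not eBPF and not a kernel interface: every
name in it (emu_ctx, emu_user_mem, emu_filter, ...) is hypothetical,
and the "list of allowed address ranges" policy is only one example of
what such a program might look up in user memory.  A user-memory fault
is modeled as a failed table read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; none of this is an existing kernel API. */
enum emu_entry { EMU_INT80, EMU_INT80_TRANSLATED, EMU_SYSCALL64 };
enum emu_verdict { EMU_ALLOW, EMU_SIGSYS };

/* Filter input: (a slice of) register state plus the syscall entry type. */
struct emu_ctx {
    uint64_t ip;            /* instruction pointer at syscall entry */
    uint64_t nr;            /* syscall number */
    enum emu_entry entry;   /* how the syscall was entered */
};

/* One address range from which syscalls are let through: [start, end). */
struct emu_range { uint64_t start, end; };

/* Stand-in for user memory: a table the task keeps in its own address space. */
struct emu_user_mem {
    const struct emu_range *ranges;  /* NULL models an unmapped address */
    size_t count;
};

/* Models a user-memory read that may fault; returns false on a fault. */
static bool emu_read_range(const struct emu_user_mem *mem, size_t i,
                           struct emu_range *out)
{
    if (mem->ranges == NULL || i >= mem->count)
        return false;
    *out = mem->ranges[i];
    return true;
}

/* Two possible results: allow the syscall, or send SIGSYS. */
static enum emu_verdict emu_filter(const struct emu_ctx *ctx,
                                   const struct emu_user_mem *mem)
{
    assert(ctx != NULL && mem != NULL);
    for (size_t i = 0; i < mem->count; i++) {
        struct emu_range r;
        if (!emu_read_range(mem, i, &r))
            return EMU_SIGSYS;      /* fault while reading user memory */
        if (ctx->ip >= r.start && ctx->ip < r.end)
            return EMU_ALLOW;       /* regular seccomp filters would run next */
    }
    return EMU_SIGSYS;              /* caller is outside every listed range */
}
```

The example policy ignores ctx->entry and ctx->nr; a real program could
just as well key its decision on them.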

(I would love to do this with cBPF, but I'm not sure how to pull this
off.  Accessing user memory is handy for making the lookup flexible
enough to detect Windows vs Linux.  It would be *really* nice to
finally settle the unprivileged eBPF subset discussion so that we can
figure out how to make eBPF work here.)

execve() clears the filter.  clone() copies the filter.
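
The lifecycle rules can be modeled the same way.  This is a toy sketch
with hypothetical names (model_task, emu_prog, model_*), not kernel
code.  It also assumes that installing a second filter fails rather
than replaces the first; the proposal only says there is at most one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; these are not kernel structures. */
struct emu_prog { int id; };

struct model_task {
    const struct emu_prog *filter;   /* at most one; NULL means none */
};

/* Install a filter.  Assumption: a second install is rejected. */
static int model_install(struct model_task *t, const struct emu_prog *p)
{
    if (t->filter != NULL)
        return -1;
    t->filter = p;
    return 0;
}

/* clone(): the child starts with a copy of the parent's filter. */
static struct model_task model_clone(const struct model_task *parent)
{
    struct model_task child = { parent->filter };
    return child;
}

/* execve(): the filter is cleared. */
static void model_execve(struct model_task *t)
{
    t->filter = NULL;
}
```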

Does this seem reasonable?  Is the implementation complexity small
enough?  Is the eBPF thing going to be a showstopper?

Using a signal instead of a bespoke thunk simplifies a lot of thorny
details, but it is also slower, enough so that catching all syscalls
might be a performance problem.
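
For reference, the user-space side of the signal approach would
presumably look like an ordinary SIGSYS handler, sketched below with
standard POSIX calls.  The names on_sigsys, install_sigsys_handler and
sigsys_seen are made up for the example, and what the proposed filter
would put in siginfo is not specified here, so the handler only counts
deliveries.  Every trapped syscall pays for a signal delivery plus the
return from the handler, which is where the extra cost comes from.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set from the handler; sig_atomic_t keeps the access async-signal-safe. */
static volatile sig_atomic_t sigsys_seen;

/*
 * A real emulation layer would decode the trapped syscall from the saved
 * register state (the third argument) and emulate it here.  This sketch
 * only records that SIGSYS arrived.
 */
static void on_sigsys(int sig, siginfo_t *info, void *ucontext)
{
    (void)info;
    (void)ucontext;
    if (sig == SIGSYS)
        sigsys_seen = sigsys_seen + 1;
}

/* Install the handler; returns 0 on success, -1 on error (as sigaction). */
static int install_sigsys_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigsys;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSYS, &sa, NULL);
}
```

Calling install_sigsys_handler() and then raise(SIGSYS) exercises the
handler without any filter installed.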
