linux-kernel - Re: [PATCH v17 08/15] seccomp: add system call filtering using BPF


Message-Id: <20120406140503.10b75c5b.akpm@linux-foundation.org>
Date:	Fri, 6 Apr 2012 14:05:03 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Kees Cook <keescook@...omium.org>
Cc:	Will Drewry <wad@...omium.org>, linux-kernel@...r.kernel.org,
	linux-security-module@...r.kernel.org, linux-arch@...r.kernel.org,
	linux-doc@...r.kernel.org, kernel-hardening@...ts.openwall.com,
	netdev@...r.kernel.org, x86@...nel.org, arnd@...db.de,
	davem@...emloft.net, hpa@...or.com, mingo@...hat.com,
	oleg@...hat.com, peterz@...radead.org, rdunlap@...otime.net,
	mcgrathr@...omium.org, tglx@...utronix.de, luto@....edu,
	eparis@...hat.com, serge.hallyn@...onical.com, djm@...drot.org,
	scarybeasts@...il.com, indan@....nu, pmoore@...hat.com,
	corbet@....net, eric.dumazet@...il.com, markus@...omium.org,
	coreyb@...ux.vnet.ibm.com, jmorris@...ei.org
Subject: Re: [PATCH v17 08/15] seccomp: add system call filtering using BPF

On Fri, 6 Apr 2012 13:44:43 -0700
Kees Cook <keescook@...omium.org> wrote:

> On Fri, Apr 6, 2012 at 1:23 PM, Andrew Morton <akpm@...ux-foundation.org> wrote:
> > On Thu, 29 Mar 2012 15:01:53 -0500
> > Will Drewry <wad@...omium.org> wrote:
> >
> >> [This patch depends on luto@....edu's no_new_privs patch:
> >>    https://lkml.org/lkml/2012/1/30/264
> >>  included in this series for ease of consumption.
> >> ]
> >>
> >> This patch adds support for seccomp mode 2.  Mode 2 introduces the
> >> ability for unprivileged processes to install system call filtering
> >> policy expressed in terms of a Berkeley Packet Filter (BPF) program.
> >> This program will be evaluated in the kernel for each system call
> >> the task makes and computes a result based on data in the format
> >> of struct seccomp_data.
> >> ...
> >> +static void seccomp_filter_log_failure(int syscall)
> >> +{
> >> +     int compat = 0;
> >> +#ifdef CONFIG_COMPAT
> >> +     compat = is_compat_task();
> >> +#endif
> >
> > hm, I'm surprised that we don't have a zero-returning implementation of
> > is_compat_task() when CONFIG_COMPAT=n.  Seems silly.  Blames Arnd.
> 
> There is

I can't find it.  The definition in include/linux/compat.h is inside
#ifdef CONFIG_COMPAT.
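
A fallback of the kind asked for above would let callers drop their own
#ifdef.  The following is a userspace sketch only, not the contents of
include/linux/compat.h: the zero-returning stub, the failure_compat_flag()
helper and the commented-out CONFIG_COMPAT define are all illustrative
assumptions.

```c
#include <assert.h>
#include <stdio.h>

/* Uncomment to model a CONFIG_COMPAT=y build (assumption: in that case
 * the real definition comes from elsewhere and is not shown here). */
/* #define CONFIG_COMPAT 1 */

#ifdef CONFIG_COMPAT
int is_compat_task(void);
#else
/* Hypothetical stub: without compat support no task can be a compat task. */
static inline int is_compat_task(void)
{
	return 0;
}
#endif

/* Hypothetical caller: no #ifdef is needed at the call site any more. */
static int failure_compat_flag(void)
{
	return is_compat_task();
}
```

With the stub in place, the three-line #ifdef block in
seccomp_filter_log_failure() could shrink to a single
"int compat = is_compat_task();".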

> >> +static long seccomp_attach_filter(struct sock_fprog *fprog)
> >> +{
> >> +     struct seccomp_filter *filter;
> >> +     unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
> >> +     unsigned long total_insns = fprog->len;
> >> +     long ret;
> >> +
> >> +     if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
> >> +             return -EINVAL;
> >> +
> >> +     for (filter = current->seccomp.filter; filter; filter = filter->prev)
> >> +             total_insns += filter->len + 4;  /* include a 4 instr penalty */
> >
> > So tasks don't share filters?  We copy them by value at fork?  Do we do
> > this at vfork() too?
> 
> The filter chain is shared (and refcounted).

So what's the locking rule for accessing and modifying that
singly-linked list?

> ...
> >> +/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
> >> +void put_seccomp_filter(struct task_struct *tsk)
> >> +{
> >> +     struct seccomp_filter *orig = tsk->seccomp.filter;
> >> +     /* Clean up single-reference branches iteratively. */
> >> +     while (orig && atomic_dec_and_test(&orig->usage)) {
> >> +             struct seccomp_filter *freeme = orig;
> >> +             orig = orig->prev;
> >> +             kfree(freeme);
> >> +     }
> >> +}
> >
> > So if one of the filters in the list has an elevated refcount, we bail
> > out on the remainder of the list.  Seems odd.
> 
> This is so that every filter in the list doesn't need to have its
> refcount raised. As long as the counting up matches the counting
> down, it's fine. This allows for process trees branching the filter
> list at different times still being safe. IIUC, this code was based on
> how namespace refcounting is handled. I spent some time proving to
> myself that it was correctly refcounted a while back. More eyes is
> better, of course. :)

Please ensure that future readers of this code have a description of
how it is supposed to work.
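
As a starting point for such a description, the scheme discussed above can
be modelled in userspace.  This is a sketch under stated assumptions, not
the patch's code: struct filter, attach_filter(), get_filter(),
put_filter(), demo_branching() and the live_filters counter are
hypothetical names, and C11 atomics stand in for the kernel's atomic_t.
The idea it illustrates: a task holds one reference on the head of its
chain only, and each filter holds one reference on its ->prev, so dropping
a reference can stop at the first filter that is still shared.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct filter {
	atomic_int usage;	/* refs from tasks plus refs from newer filters */
	struct filter *prev;	/* older filter in the chain, or NULL */
};

static int live_filters;	/* demo bookkeeping only */

/* Attach: the new filter takes over the caller's reference on the old head. */
static struct filter *attach_filter(struct filter *head)
{
	struct filter *f = malloc(sizeof(*f));

	if (!f)
		abort();
	atomic_init(&f->usage, 1);
	f->prev = head;
	live_filters++;
	return f;
}

/* Fork: the child shares the chain; only the head's count is raised. */
static struct filter *get_filter(struct filter *head)
{
	if (head)
		atomic_fetch_add(&head->usage, 1);
	return head;
}

/* Exit: free filters until one is found that someone else still references. */
static void put_filter(struct filter *head)
{
	while (head && atomic_fetch_sub(&head->usage, 1) == 1) {
		struct filter *freeme = head;

		head = head->prev;
		free(freeme);
		live_filters--;
	}
}

/* Parent installs A, forks, then parent and child each add their own filter. */
static int demo_branching(void)
{
	struct filter *parent = attach_filter(NULL);	/* A: usage 1 */
	struct filter *child = get_filter(parent);	/* A: usage 2 */
	int after_child;

	child = attach_filter(child);	/* B -> A, B now owns the child's ref on A */
	parent = attach_filter(parent);	/* C -> A, C now owns the parent's ref on A */

	put_filter(child);		/* frees B, drops A to 1, stops there */
	after_child = live_filters;	/* A and C remain */
	assert(atomic_load(&parent->prev->usage) == 1);

	put_filter(parent);		/* frees C, then A */
	return after_child * 10 + live_filters;
}
```

In this model, "bailing out" at a filter with an elevated count is the
intended behaviour: the remainder of the chain is still reachable through
another branch, and whoever drops the last reference to that branch
continues the walk.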