netdev - Re: [net-next v2 1/2] bpf, seccomp: Add eBPF filter capabilities

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMp4zn-V9fu6WHNkFaHxyOu7Gdez2U3ZaRNTT578txXSwFDHMw@mail.gmail.com>
Date:   Tue, 20 Feb 2018 23:16:20 -0800
From:   Sargun Dhillon <sargun@...gun.me>
To:     Daniel Borkmann <daniel@...earbox.net>
Cc:     netdev <netdev@...r.kernel.org>,
        Linux Containers <containers@...ts.linux-foundation.org>,
        Alexei Starovoitov <ast@...nel.org>,
        Kees Cook <keescook@...omium.org>,
        Andy Lutomirski <luto@...capital.net>,
        Will Drewry <wad@...omium.org>,
        Jessie Frazelle <me@...sfraz.com>,
        Brian Goff <cpuguy83@...il.com>, tom.hromatka@...cle.com
Subject: Re: [net-next v2 1/2] bpf, seccomp: Add eBPF filter capabilities

On Mon, Feb 19, 2018 at 4:00 PM, Daniel Borkmann <daniel@...earbox.net> wrote:
> On 02/19/2018 05:22 PM, Sargun Dhillon wrote:
>> This introduces the BPF_PROG_TYPE_SECCOMP bpf program type. It is meant
>> to be used for seccomp filters as an alternative to cBPF filters. The
>> program type has relatively limited capabilities in terms of helpers,
>> but that can be extended later on.
>>
>> The eBPF code loading is separated from attachment of the filter, so
>> a privileged user can load the filter, and pass it back to an
>> unprivileged user who can attach it and use it at a later time.
>>
>> In order to attach the filter itself, you need to supply a flag to the
>> seccomp syscall indicating that a eBPF filter is being attached, as
>> opposed to a cBPF one. Verification occurs at program load time,
>> so the user should only receive errors related to attachment.
>>
>> Signed-off-by: Sargun Dhillon <sargun@...gun.me>
> [...]
>> @@ -867,7 +924,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
>>
>>       spin_lock_irq(&current->sighand->siglock);
>>
>> -     if (!seccomp_may_assign_mode(seccomp_mode))
>> +     if (!seccomp_may_assign_mode(filter_mode))
>>               goto out;
>>
>>       ret = seccomp_attach_filter(flags, prepared);
>> @@ -876,7 +933,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
>>       /* Do not free the successfully attached filter. */
>>       prepared = NULL;
>>
>> -     seccomp_assign_mode(current, seccomp_mode);
>> +     seccomp_assign_mode(current, filter_mode);
>>  out:
>>       spin_unlock_irq(&current->sighand->siglock);
>>       if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>> @@ -1040,8 +1097,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
>>       if (IS_ERR(filter))
>>               return PTR_ERR(filter);
>>
>> -     fprog = filter->prog->orig_prog;
>> -     if (!fprog) {
>> +     if (!bpf_prog_was_classic(filter->prog)) {
>
> This is actually a bug, see f8e529ed941b ("seccomp, ptrace: add support for
> dumping seccomp filters") and would cause a NULL ptr deref in case the filter
> was created with bpf_prog_create_from_user() with save_orig as false, so the
> if (!fprog) test for cBPF cannot be removed from here.
>
>>               /* This must be a new non-cBPF filter, since we save
>>                * every cBPF filter's orig_prog above when
>>                * CONFIG_CHECKPOINT_RESTORE is enabled.
>> @@ -1050,6 +1106,7 @@ long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,
>>               goto out;
>>       }
>>
>> +     fprog = filter->prog->orig_prog;
>>       ret = fprog->len;
>
> (See above.)
>
>>       if (!data)
>>               goto out;
>> @@ -1239,6 +1296,58 @@ static int seccomp_actions_logged_handler(struct ctl_table *ro_table, int write,
>>       return 0;
>>  }
>>
>> +#ifdef CONFIG_SECCOMP_FILTER_EXTENDED
>> +static bool seccomp_is_valid_access(int off, int size,
>> +                                 enum bpf_access_type type,
>> +                                 struct bpf_insn_access_aux *info)
>> +{
>> +     if (type != BPF_READ)
>> +             return false;
>> +
>> +     if (off < 0 || off + size > sizeof(struct seccomp_data))
>> +             return false;
>
>         if (off % size != 0)
>                 return false;
>
Won't this break access to the instruction pointer, and args if
sizeof(int) != 4? Don't know any if any architectures fall under that.

>> +     switch (off) {
>> +     case bpf_ctx_range_till(struct seccomp_data, args[0], args[5]):
>> +             return (size == sizeof(__u64));
>> +     case bpf_ctx_range(struct seccomp_data, nr):
>> +             return (size == FIELD_SIZEOF(struct seccomp_data, nr));
>> +     case bpf_ctx_range(struct seccomp_data, arch):
>> +             return (size == FIELD_SIZEOF(struct seccomp_data, arch));
>> +     case bpf_ctx_range(struct seccomp_data, instruction_pointer):
>> +             return (size == FIELD_SIZEOF(struct seccomp_data,
>> +                                          instruction_pointer));
>
>         default:
>                 return false;
>
> [...]
>> +static const struct bpf_func_proto *
>> +seccomp_func_proto(enum bpf_func_id func_id)
>> +{
>> +     switch (func_id) {
>> +     case BPF_FUNC_get_current_uid_gid:
>> +             return &bpf_get_current_uid_gid_proto;
>> +     case BPF_FUNC_ktime_get_ns:
>> +             return &bpf_ktime_get_ns_proto;
>> +     case BPF_FUNC_get_prandom_u32:
>> +             return &bpf_get_prandom_u32_proto;
>> +     case BPF_FUNC_get_current_pid_tgid:
>> +             return &bpf_get_current_pid_tgid_proto;
>
> Do you have a use-case description for the above helpers? Is the prandom/ktime
> one for simulating errors coming from the syscall? And the other two for
> orchestration purposes?
>
My specific use case with uid_guid and pid is for containers, I have a
use case where I can put systemd, or a privileged init system into a
container, pid 1, running as uid 0 will get access to whetever is
needed in order to wire up an init system. If the user forks, or
setuid / setgid to another level, the access is lost, and they become
unprivileged. Depending on the container, different levels of access
are needed by the init, so seccomp-ebpf is a bit better here as
compared to say apparmor.

prandom is for testing.
ktime is for testing and to limit access after some time period
occurs. Example: In the first 30 seconds of the container's life time,
it has privileges to wire up a file system, but this is then shut
down. It's good for 3rd party software, until we have a map mechanism
where you can hook a probe in to see once the program has initialized,
and then you can revoke access to these things.

> One use case this work could enable would be to implement state machines in BPF
> for BPF-seccomp and enabling a more fine-grained / tiny subset of syscalls based
> on the state the prog is in while the rest is all blocked out - as opposed to a
> global white/black-list of syscalls the app can do in general. Getting to such
> an app model would probably be rather challenging at least for complex apps. We'd
> need some sort of scratch buffer for keeping the state for this though, e.g. either
> map with single slot or per thread scratch space. Anyway, just a thought.
>
Yeah, are you thinking a per task space? I'm simply thinking "after
you notice init is completed for PID x, go ahead and revoke access" --
and either stash this in an LRU map, or something more clever.

>> +     default:
>> +             return NULL;
>> +     }
>> +}
>> +
>> +const struct bpf_prog_ops seccomp_prog_ops = {
>> +};
>> +
>> +const struct bpf_verifier_ops seccomp_verifier_ops = {
>> +     .get_func_proto         = seccomp_func_proto,
>> +     .is_valid_access        = seccomp_is_valid_access,
>> +};
>> +#endif /* CONFIG_SECCOMP_FILTER_EXTENDED */
>> +
>>  static struct ctl_path seccomp_sysctl_path[] = {
>>       { .procname = "kernel", },
>>       { .procname = "seccomp", },
>>
>