linux-kernel - Re: [PATCH RFC net-next 00/14] BPF syscall, maps, verifier, samples

Message-ID: <CAGXu5jK8_zPPpKgCzad67QZkggtr-ZTdPc_NyMBwkVkOTgtchg@mail.gmail.com>
Date:	Wed, 2 Jul 2014 09:39:04 -0700
From:	Kees Cook <keescook@...omium.org>
To:	Daniel Borkmann <dborkman@...hat.com>
Cc:	Alexei Starovoitov <ast@...mgrid.com>,
	"David S. Miller" <davem@...emloft.net>,
	Ingo Molnar <mingo@...nel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Chema Gonzalez <chema@...gle.com>,
	Eric Dumazet <edumazet@...gle.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Arnaldo Carvalho de Melo <acme@...radead.org>,
	Jiri Olsa <jolsa@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linux API <linux-api@...r.kernel.org>,
	Network Development <netdev@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH RFC net-next 00/14] BPF syscall, maps, verifier, samples

On Tue, Jul 1, 2014 at 12:18 AM, Daniel Borkmann <dborkman@...hat.com> wrote:
> On 07/01/2014 01:09 AM, Kees Cook wrote:
>>
>> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <ast@...mgrid.com> wrote:
>>>
>>> Hi All,
>>>
>>> this patch set demonstrates the potential of eBPF.
>>>
>>> First patch "net: filter: split filter.c into two files" splits the eBPF
>>> interpreter out of networking into kernel/bpf/. The goal for the BPF
>>> subsystem is to be usable in NET-less configurations. Though the whole
>>> set is marked as RFC, the 1st patch is good to go. A similar version of
>>> the patch was posted a few weeks ago but was deferred, I'm assuming due
>>> to lack of forward visibility. I hope that this patch set shows what
>>> eBPF is capable of and where it's heading.
>>>
>>> Other patches expose the eBPF instruction set to user space and
>>> introduce the concepts of maps and programs accessible via syscall.
>>>
>>> 'maps' is a generic storage of different types for sharing data between
>>> kernel and userspace. Maps are referenced by global id. Root can create
>>> multiple maps of different types where key/value are opaque bytes of
>>> data. It's up to user space and the eBPF program to decide what they
>>> store in the maps.
>>>
>>> eBPF programs are similar to kernel modules. They live in global space
>>> and have a unique prog_id. Each program is a safe run-to-completion set
>>> of instructions. The eBPF verifier statically determines that the
>>> program terminates and is safe to execute. During verification the
>>> program takes a hold of the maps that it intends to use, so selected
>>> maps cannot be removed until the program is unloaded. The program can
>>> be attached to different events. These events can be packets,
>>> tracepoint events and other types in the future. A new event triggers
>>> execution of the program, which may store information about the event
>>> in the maps. Beyond storing data, the programs may call into in-kernel
>>> helper functions which may, for example, dump stack, do trace_printk or
>>> other forms of live kernel debugging. The same program can be attached
>>> to multiple events. Different programs can access the same map:
>>>
>>>    tracepoint  tracepoint  tracepoint    sk_buff    sk_buff
>>>     event A     event B     event C      on eth0    on eth1
>>>      |             |          |            |          |
>>>      |             |          |            |          |
>>>      --> tracing <--      tracing       socket      socket
>>>           prog_1           prog_2       prog_3      prog_4
>>>           |  |               |            |
>>>        |---  -----|  |-------|           map_3
>>>      map_1       map_2
>>>
>>> User space (via syscall) and eBPF programs access maps concurrently.
>>>
>>> The last two patches are sample code. The 1st demonstrates stateful
>>> packet inspection. It counts tcp and udp packets on eth0. It should be
>>> easy to see how this eBPF framework can be used for network analytics.
>>> The 2nd sample does a simple 'drop monitor'. It attaches to the
>>> kfree_skb tracepoint event and counts the number of packet drops at a
>>> particular $pc location. User space periodically summarizes what the
>>> eBPF programs recorded.
>>> In these two samples the eBPF programs are tiny and written in
>>> 'assembler' with macros. More complex programs can be written in C (the
>>> llvm backend is not part of this diff to reduce 'huge' perception).
>>> Since eBPF is fully JITed on x64, the cost of running an eBPF program is
>>> very small even for high frequency events. Here are the numbers
>>> comparing flow_dissector in C vs eBPF:
>>>    x86_64 skb_flow_dissect() same skb (all cached)         -  42 nsec per call
>>>    x86_64 skb_flow_dissect() different skbs (cache misses) - 141 nsec per call
>>> eBPF+jit skb_flow_dissect() same skb (all cached)         -  51 nsec per call
>>> eBPF+jit skb_flow_dissect() different skbs (cache misses) - 135 nsec per call
>>>
>>> Detailed explanation on eBPF verifier and safety is in patch 08/14
>>
>>
>> This is very exciting! Thanks for working on it. :)
>>
>> Between the new eBPF syscall and the new seccomp syscall, I'm really
>> looking forward to using lookup tables for seccomp filters. Under
>> certain types of filters, we'll likely see some non-trivial
>> performance improvements.
>
> Well, if I read this correctly, the eBPF syscall lets you set up maps, etc,
> but the only way to attach eBPF is via setsockopt for network filters right
> now (and via tracing). Seccomp will still make use of classic BPF, so you
> won't be able to use it there.

Currently, yes. But once this is in, and the new seccomp syscall is
in, we can add a SECCOMP_FILTER_EBPF flag to the "flags" field to
instruct seccomp to load an eBPF program instead of a classic BPF one.
I'm excited for the future. :)

-Kees

-- 
Kees Cook
Chrome OS Security