Message-ID: <20131203091655.GB20179@gmail.com>
Date: Tue, 3 Dec 2013 10:16:55 +0100
From: Ingo Molnar <mingo@...nel.org>
To: Alexei Starovoitov <ast@...mgrid.com>
Cc: Steven Rostedt <rostedt@...dmis.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
"H. Peter Anvin" <hpa@...or.com>,
Thomas Gleixner <tglx@...utronix.de>,
Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>,
Tom Zanussi <tom.zanussi@...ux.intel.com>,
Jovi Zhangwei <jovi.zhangwei@...il.com>,
Eric Dumazet <edumazet@...gle.com>,
linux-kernel@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Frédéric Weisbecker <fweisbec@...il.com>,
Arnaldo Carvalho de Melo <acme@...radead.org>,
Tom Zanussi <tzanussi@...il.com>,
Pekka Enberg <penberg@....fi>,
"David S. Miller" <davem@...emloft.net>,
Arjan van de Ven <arjan@...radead.org>,
Christoph Hellwig <hch@...radead.org>
Subject: Re: [RFC PATCH tip 0/5] tracing filters with BPF
* Alexei Starovoitov <ast@...mgrid.com> wrote:
> Hi All,
>
> the following set of patches adds BPF support to trace filters.
>
> Trace filters can be written in C and allow safe read-only access to
> any kernel data structure. Like systemtap but with safety guaranteed
> by kernel.
Very cool! (Added various other folks who might be interested in this
to the Cc: list.)
I have one generic concern:
It would be important to make it easy to extract loaded BPF code from
the kernel in source code equivalent form, which compiles to the same
BPF code.
I.e. I think it would be fundamentally important to make sure that
this is all within the kernel's license domain, to make it very clear
there can be no 'binary only' BPF scripts.
By uploading BPF into a kernel, the person loading it agrees to make
that code available to all users of that system who can access it,
under the same license as the kernel's code (or under a more
permissive license).
The last thing we want is people getting funny ideas and writing
drivers in BPF and hiding the code or making license claims over it
...
I.e. we want to allow flexible plugins technologically, but make sure
people who run into such a plugin can modify and improve it under the
same license as they can modify and improve the kernel itself!
[ People can still 'hide' their sekrit plugins if they want to, by not
distributing them to anyone who'd redistribute it widely. ]
> The user can do:
> cat bpf_program > /sys/kernel/debug/tracing/.../filter
> if tracing event is either static or dynamic via kprobe_events.
>
> The filter program may look like:
> void filter(struct bpf_context *ctx)
> {
> char devname[4] = "eth5";
> struct net_device *dev;
> struct sk_buff *skb = 0;
>
> dev = (struct net_device *)ctx->regs.si;
> if (bpf_memcmp(dev->name, devname, 4) == 0) {
> char fmt[] = "skb %p dev %p eth5\n";
> bpf_trace_printk(fmt, skb, dev, 0, 0);
> }
> }
>
> The kernel will do static analysis of bpf program to make sure that
> it cannot crash the kernel (doesn't have loops, valid
> memory/register accesses, etc). Then kernel will map bpf
> instructions to x86 instructions and let it run in the place of
> trace filter.
>
> To demonstrate performance I did a synthetic test:
> dev = init_net.loopback_dev;
> do_gettimeofday(&start_tv);
> for (i = 0; i < 1000000; i++) {
> struct sk_buff *skb;
> skb = netdev_alloc_skb(dev, 128);
> kfree_skb(skb);
> }
> do_gettimeofday(&end_tv);
> time = end_tv.tv_sec - start_tv.tv_sec;
> time *= USEC_PER_SEC;
> time += (long long)((long)end_tv.tv_usec - (long)start_tv.tv_usec);
>
> printk("1M skb alloc/free %lld (usecs)\n", time);
>
> no tracing
> [ 33.450966] 1M skb alloc/free 145179 (usecs)
>
> echo 1 > enable
> [ 97.186379] 1M skb alloc/free 240419 (usecs)
> (tracing slows down kfree_skb() due to event_buffer_lock/buffer_unlock_commit)
>
> echo 'name==eth5' > filter
> [ 139.644161] 1M skb alloc/free 302552 (usecs)
> (running filter_match_preds() for every skb and discarding
> event_buffer is even slower)
>
> cat bpf_prog > filter
> [ 171.150566] 1M skb alloc/free 199463 (usecs)
> (JITed bpf program is safely checking dev->name == eth5 and discarding)
So, to do the math:
tracing 'all' overhead: 95 nsecs per event
tracing 'eth5 + old filter' overhead: 157 nsecs per event
tracing 'eth5 + BPF filter' overhead: 54 nsecs per event
So via BPF and a fairly trivial filter, we are able to reduce tracing
overhead for real - while old-style filters actually increased it.
In addition to that, this enables arbitrary BPF scripts: full C
programs (or programs written in any other language from which BPF
bytecode can be generated).
Seems like a massive win-win scenario to me ;-)
> echo 0 > enable
> [ 258.073593] 1M skb alloc/free 144919 (usecs)
> (tracing is disabled, performance is back to original)
>
> The C program compiled into BPF and then JITed into x86 is faster
> than filter_match_preds() approach (199-145 msec vs 302-145 msec)
>
> tracing+bpf is a tool for safe read-only access to variables without
> recompiling the kernel and without affecting running programs.
>
> BPF filters can be written manually (see
> tools/bpf/trace/filter_ex1.c) or better compiled from restricted C
> via GCC or LLVM
> Q: What is the difference between existing BPF and extended BPF?
> A:
> Existing BPF insn from uapi/linux/filter.h
> struct sock_filter {
> __u16 code; /* Actual filter code */
> __u8 jt; /* Jump true */
> __u8 jf; /* Jump false */
> __u32 k; /* Generic multiuse field */
> };
>
> Extended BPF insn from linux/bpf.h
> struct bpf_insn {
> __u8 code; /* opcode */
> __u8 a_reg:4; /* dest register*/
> __u8 x_reg:4; /* source register */
> __s16 off; /* signed offset */
> __s32 imm; /* signed immediate constant */
> };
>
> opcode encoding is the same between old BPF and extended BPF.
> Original BPF has two 32-bit registers.
> Extended BPF has ten 64-bit registers.
> That is the main difference.
>
> Old BPF was using jt/jf fields for jump-insn only.
> New BPF combines them into generic 'off' field for jump and non-jump insns.
> k==imm field has the same meaning.
This only affects the internal JIT representation, not the BPF byte
code, right?
> 32 files changed, 3332 insertions(+), 24 deletions(-)
Impressive!
I'm wondering, will the new nftables code in the works make use of the
BPF JIT as well, or is that a separate implementation?
Thanks,
Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/