Message-ID: <531C5EB5.2080507@iogearbox.net>
Date: Sun, 09 Mar 2014 13:29:41 +0100
From: Daniel Borkmann <borkmann@...earbox.net>
To: Alexei Starovoitov <ast@...mgrid.com>
CC: "David S. Miller" <davem@...emloft.net>,
Daniel Borkmann <dborkman@...hat.com>,
Ingo Molnar <mingo@...nel.org>, Will Drewry <wad@...omium.org>,
Steven Rostedt <rostedt@...dmis.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
"H. Peter Anvin" <hpa@...or.com>,
Hagen Paul Pfeifer <hagen@...u.net>,
Jesse Gross <jesse@...ira.com>,
Thomas Gleixner <tglx@...utronix.de>,
Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>,
Tom Zanussi <tom.zanussi@...ux.intel.com>,
Jovi Zhangwei <jovi.zhangwei@...il.com>,
Eric Dumazet <edumazet@...gle.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Frederic Weisbecker <fweisbec@...il.com>,
Arnaldo Carvalho de Melo <acme@...radead.org>,
Pekka Enberg <penberg@....fi>,
Arjan van de Ven <arjan@...radead.org>,
Christoph Hellwig <hch@...radead.org>,
linux-kernel@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH v7 net-next 1/3] filter: add Extended BPF interpreter and converter
On 03/09/2014 12:15 AM, Alexei Starovoitov wrote:
> Extended BPF extends old BPF in the following ways:
> - from 2 to 10 registers
> Original BPF has two registers (A and X) and hidden frame pointer.
> Extended BPF has ten registers and read-only frame pointer.
> - from 32-bit registers to 64-bit registers
> semantics of old 32-bit ALU operations are preserved via 32-bit
> subregisters
> - if (cond) jump_true; else jump_false;
> old BPF insns are replaced with:
> if (cond) jump_true; /* else fallthrough */
> - adds signed > and >= insns
> - 16 4-byte stack slots for register spill-fill replaced with
> up to 512 bytes of multi-use stack space
> - introduces bpf_call insn and register passing convention for zero
> overhead calls from/to other kernel functions (not part of this patch)
> - adds arithmetic right shift insn
> - adds swab32/swab64 insns
> - adds atomic_add insn
> - old tax/txa insns are replaced with 'mov dst,src' insn
>
> Extended BPF is designed to be JITed with one to one mapping, which
> allows GCC/LLVM backends to generate optimized BPF code that performs
> almost as fast as natively compiled code
>
> sk_convert_filter() remaps old style insns into extended:
> 'sock_filter' instructions are remapped on the fly to
> 'sock_filter_ext' extended instructions when
> sysctl net.core.bpf_ext_enable=1
>
> Old filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (bpf_ext_enable) {
> convert to new
> sk_chk_filter() - check old bpf
> use sk_run_filter_ext() - new interpreter
> } else {
> sk_chk_filter() - check old bpf
> if (bpf_jit_enable)
> use old jit
> else
> use sk_run_filter() - old interpreter
> }
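
The decision tree quoted above can be written down as a small C sketch. This is only an illustration: select_filter_path() and the enum are hypothetical names, and the two flags stand in for the bpf_ext_enable and bpf_jit_enable settings mentioned in the patch description.

```c
#include <stdbool.h>

/* Hypothetical labels for the three execution paths in the quoted flow. */
enum filter_path {
    PATH_EXT_INTERP,  /* converted filter, run by sk_run_filter_ext() */
    PATH_OLD_JIT,     /* old BPF, run by the old JIT */
    PATH_OLD_INTERP   /* old BPF, run by sk_run_filter() */
};

/*
 * Hypothetical helper: mirrors the quoted if/else only.
 * In every case the old program is still checked by sk_chk_filter().
 */
static enum filter_path select_filter_path(bool bpf_ext_enable,
                                           bool bpf_jit_enable)
{
    if (bpf_ext_enable)
        return PATH_EXT_INTERP;
    return bpf_jit_enable ? PATH_OLD_JIT : PATH_OLD_INTERP;
}
```

Note that with bpf_ext_enable set, the old JIT is not used in this flow, regardless of bpf_jit_enable.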
>
> sk_run_filter_ext() interpreter is noticeably faster
> than sk_run_filter() for two reasons:
>
> 1. fall-through jumps
> Old BPF jump instructions must take either the 'true' or the 'false'
> branch, which causes a branch-miss penalty.
> Extended BPF jump instructions have one branch and a fall-through,
> which fits CPU branch predictor logic better.
> 'perf stat' shows a drastic difference in branch-misses.
>
> 2. jump-threaded implementation of interpreter vs switch statement
> Instead of a single tablejump at the top of the 'switch' statement, GCC
> will generate multiple tablejump instructions, which helps the CPU
> branch predictor.
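
To make point 2 concrete, here is a minimal sketch of the two dispatch styles on a toy instruction set. The opcodes and struct toy_insn are hypothetical and are not the sock_filter_ext encoding; the threaded variant relies on the GCC "labels as values" extension (also accepted by Clang), so it is not portable ISO C.

```c
#include <stdint.h>

/* Hypothetical toy opcodes; not the sock_filter_ext encoding. */
enum { OP_ADD_K, OP_MUL_K, OP_RET, OP_MAX };

struct toy_insn {
    uint8_t  code;  /* one of the OP_* values above */
    uint32_t k;     /* immediate operand */
};

/* Variant 1: a single shared dispatch point at the top of the switch. */
static uint32_t run_switch(const struct toy_insn *insn, uint32_t acc)
{
    for (;; insn++) {
        switch (insn->code) {
        case OP_ADD_K: acc += insn->k; break;
        case OP_MUL_K: acc *= insn->k; break;
        case OP_RET:   return acc;
        default:       return 0;  /* unknown opcode */
        }
    }
}

/* Variant 2: every handler ends with its own indirect jump. */
static uint32_t run_threaded(const struct toy_insn *insn, uint32_t acc)
{
    static const void *const jumptable[OP_MAX] = {
        [OP_ADD_K] = &&do_add,
        [OP_MUL_K] = &&do_mul,
        [OP_RET]   = &&do_ret,
    };
/* Reject opcodes outside the table, then jump to the handler. */
#define DISPATCH()                         \
    do {                                   \
        if (insn->code >= OP_MAX)          \
            return 0;                      \
        goto *jumptable[insn->code];       \
    } while (0)

    DISPATCH();
do_add:
    acc += insn->k;
    insn++;
    DISPATCH();
do_mul:
    acc *= insn->k;
    insn++;
    DISPATCH();
do_ret:
    return acc;
#undef DISPATCH
}
```

Both functions compute the same result; the difference is only that the second one gives the branch predictor one indirect jump per handler instead of a single shared one.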
>
> Performance of two BPF filters generated by libpcap was measured
> on x86_64, i386 and arm32.
>
> fprog #1 is taken from Documentation/networking/filter.txt:
> tcpdump -i eth0 port 22 -dd
>
> fprog #2 is taken from 'man tcpdump':
> tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
> ((tcp[12]&0xf0)>>2)) != 0)' -dd
>
> Other libpcap programs have similar performance differences.
>
> Raw performance data from BPF micro-benchmark:
> SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
> time in nsec per call, smaller is better
> --x86_64--
>                fprog #1    fprog #1     fprog #2    fprog #2
>                cache-hit   cache-miss   cache-hit   cache-miss
> old BPF        90          101          192         202
> ext BPF        31          71           47          97
> old BPF jit    12          34           17          44
> ext BPF jit    TBD
>
> --i386--
>                fprog #1    fprog #1     fprog #2    fprog #2
>                cache-hit   cache-miss   cache-hit   cache-miss
> old BPF        107         136          227         252
> ext BPF        40          119          69          172
>
> --arm32--
>                fprog #1    fprog #1     fprog #2    fprog #2
>                cache-hit   cache-miss   cache-hit   cache-miss
> old BPF        202         300          475         540
> ext BPF        180         270          330         470
> old BPF jit    26          182          37          202
> new BPF jit    TBD
>
> Tested with trinity BPF fuzzer
>
> Future work:
>
> 0. add bpf/ebpf testsuite to tools/testing/selftests/net/bpf
>
> 1. add extended BPF JIT for x86_64
>
> 2. add inband old/new demux and extended BPF verifier, so that new programs
> can be loaded through old sk_attach_filter() and sk_unattached_filter_create()
> interfaces
>
> 3. tracing filters systemtap-like with extended BPF
>
> 4. OVS with extended BPF
>
> 5. nftables with extended BPF
>
> Signed-off-by: Alexei Starovoitov <ast@...mgrid.com>
> Acked-by: Hagen Paul Pfeifer <hagen@...u.net>
> Reviewed-by: Daniel Borkmann <dborkman@...hat.com>
One more question or possible issue that came to my mind: when someone
attaches a socket filter from user space and bpf_ext_enable=1, the old
filter is transparently converted to the new representation. If user
space (e.g. through checkpoint/restore) then issues a sk_get_filter(),
we call sk_decode_filter() on sk->sk_filter and therefore try to decode
what we stored in insns_ext[] under the assumption that we still have
the old code. Would that actually crash (or leak memory, or just return
garbage), as we access the decodes[] array with filt->code? Would be
great if you could double-check.
The assumption with sk_get_filter() is that it returns the same filter
that was previously attached, so that it can be re-attached at a later
point in time.
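
To illustrate the kind of access I am worried about, here is a minimal, purely hypothetical sketch of a table lookup keyed by an opcode. toy_decodes[], its size and its values are made up and do not reflect the real decodes[] contents; the point is only that an index taken from a differently encoded instruction needs a range check before it is used.

```c
#include <stdint.h>

#define TOY_DECODE_MAX 4  /* hypothetical table size */

/* Hypothetical internal-to-user opcode table; the values are made up. */
static const uint16_t toy_decodes[TOY_DECODE_MAX] = { 0x00, 0x04, 0x15, 0x06 };

/*
 * Returns 0 and stores the decoded opcode in *out, or returns -1 if
 * 'code' is not a valid index into the table.
 */
static int toy_decode(uint16_t code, uint16_t *out)
{
    if (code >= TOY_DECODE_MAX)
        return -1;  /* would otherwise read past the end of the table */
    *out = toy_decodes[code];
    return 0;
}
```

Even with such a check, a code that happens to be in range would still decode to the wrong instruction, so the filter handed back to user space would not be the one that was attached.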
Cheers,
Daniel