Date: Sun, 09 Mar 2014 13:29:41 +0100
From: Daniel Borkmann <borkmann@...earbox.net>
To: Alexei Starovoitov <ast@...mgrid.com>
CC: "David S. Miller" <davem@...emloft.net>,
    Daniel Borkmann <dborkman@...hat.com>,
    Ingo Molnar <mingo@...nel.org>, Will Drewry <wad@...omium.org>,
    Steven Rostedt <rostedt@...dmis.org>,
    Peter Zijlstra <a.p.zijlstra@...llo.nl>,
    "H. Peter Anvin" <hpa@...or.com>,
    Hagen Paul Pfeifer <hagen@...u.net>,
    Jesse Gross <jesse@...ira.com>,
    Thomas Gleixner <tglx@...utronix.de>,
    Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>,
    Tom Zanussi <tom.zanussi@...ux.intel.com>,
    Jovi Zhangwei <jovi.zhangwei@...il.com>,
    Eric Dumazet <edumazet@...gle.com>,
    Linus Torvalds <torvalds@...ux-foundation.org>,
    Andrew Morton <akpm@...ux-foundation.org>,
    Frederic Weisbecker <fweisbec@...il.com>,
    Arnaldo Carvalho de Melo <acme@...radead.org>,
    Pekka Enberg <penberg@....fi>,
    Arjan van de Ven <arjan@...radead.org>,
    Christoph Hellwig <hch@...radead.org>,
    linux-kernel@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH v7 net-next 1/3] filter: add Extended BPF interpreter and converter

On 03/09/2014 12:15 AM, Alexei Starovoitov wrote:
> Extended BPF extends old BPF in the following ways:
> - from 2 to 10 registers
>   Original BPF has two registers (A and X) and hidden frame pointer.
>   Extended BPF has ten registers and read-only frame pointer.
> - from 32-bit registers to 64-bit registers
>   semantics of old 32-bit ALU operations are preserved via 32-bit
>   subregisters
> - if (cond) jump_true; else jump_false;
>   old BPF insns are replaced with:
>   if (cond) jump_true; /* else fallthrough */
> - adds signed > and >= insns
> - 16 4-byte stack slots for register spill-fill replaced with
>   up to 512 bytes of multi-use stack space
> - introduces bpf_call insn and register passing convention for zero
>   overhead calls from/to other kernel functions (not part of this patch)
> - adds arithmetic right shift insn
> - adds swab32/swab64 insns
> - adds atomic_add insn
> - old tax/txa insns are replaced with 'mov dst,src' insn
>
> Extended BPF is designed to be JITed with one to one mapping, which
> allows GCC/LLVM backends to generate optimized BPF code that performs
> almost as fast as natively compiled code
>
> sk_convert_filter() remaps old style insns into extended:
> 'sock_filter' instructions are remapped on the fly to
> 'sock_filter_ext' extended instructions when
> sysctl net.core.bpf_ext_enable=1
>
> Old filter comes through sk_attach_filter() or sk_unattached_filter_create()
>  if (bpf_ext_enable) {
>     convert to new
>     sk_chk_filter() - check old bpf
>     use sk_run_filter_ext() - new interpreter
>  } else {
>     sk_chk_filter() - check old bpf
>     if (bpf_jit_enable)
>         use old jit
>     else
>         use sk_run_filter() - old interpreter
>  }
>
> sk_run_filter_ext() interpreter is noticeably faster
> than sk_run_filter() for two reasons:
>
> 1.fall-through jumps
>   Old BPF jump instructions are forced to go either 'true' or 'false'
>   branch which causes branch-miss penalty.
>   Extended BPF jump instructions have one branch and fall-through,
>   which fit CPU branch predictor logic better.
>   'perf stat' shows drastic difference for branch-misses.
>
> 2.jump-threaded implementation of interpreter vs switch statement
>   Instead of single tablejump at the top of 'switch' statement, GCC will
>   generate multiple tablejump instructions, which helps CPU branch predictor
>
> Performance of two BPF filters generated by libpcap was measured
> on x86_64, i386 and arm32.
>
> fprog #1 is taken from Documentation/networking/filter.txt:
> tcpdump -i eth0 port 22 -dd
>
> fprog #2 is taken from 'man tcpdump':
> tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
>    ((tcp[12]&0xf0)>>2)) != 0)' -dd
>
> Other libpcap programs have similar performance differences.
>
> Raw performance data from BPF micro-benchmark:
> SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
> time in nsec per call, smaller is better
> --x86_64--
>              fprog #1   fprog #1    fprog #2   fprog #2
>              cache-hit  cache-miss  cache-hit  cache-miss
> old BPF         90        101         192        202
> ext BPF         31         71          47         97
> old BPF jit     12         34          17         44
> ext BPF jit    TBD
>
> --i386--
>              fprog #1   fprog #1    fprog #2   fprog #2
>              cache-hit  cache-miss  cache-hit  cache-miss
> old BPF        107        136         227        252
> ext BPF         40        119          69        172
>
> --arm32--
>              fprog #1   fprog #1    fprog #2   fprog #2
>              cache-hit  cache-miss  cache-hit  cache-miss
> old BPF        202        300         475        540
> ext BPF        180        270         330        470
> old BPF jit     26        182          37        202
> new BPF jit    TBD
>
> Tested with trinity BPF fuzzer
>
> Future work:
>
> 0. add bpf/ebpf testsuite to tools/testing/selftests/net/bpf
>
> 1. add extended BPF JIT for x86_64
>
> 2. add inband old/new demux and extended BPF verifier, so that new programs
>    can be loaded through old sk_attach_filter() and sk_unattached_filter_create()
>    interfaces
>
> 3. tracing filters systemtap-like with extended BPF
>
> 4. OVS with extended BPF
>
> 5. nftables with extended BPF
>
> Signed-off-by: Alexei Starovoitov <ast@...mgrid.com>
> Acked-by: Hagen Paul Pfeifer <hagen@...u.net>
> Reviewed-by: Daniel Borkmann <dborkman@...hat.com>

One more question or possible issue that came to my mind: when someone
attaches a socket filter from user space and bpf_ext_enable=1, the old
filter is transparently converted to the new representation. If user
space (e.g. through checkpoint/restore) then issues sk_get_filter(), we
call sk_decode_filter() on sk->sk_filter and therefore try to decode
what we stored in insns_ext[] under the assumption that it is still the
old code. Would that actually crash (or leak memory, or just return
garbage), as we access the decodes[] array with filt->code? Would be
great if you could double-check.

The assumption with sk_get_filter() is that it returns the same filter
that was previously attached, so that it can be re-attached at a later
point in time.

Cheers,

Daniel
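
To make the sk_get_filter() concern concrete, here is a minimal
user-space sketch. It is a toy model under stated assumptions, not the
kernel code: decodes[] is named after the table mentioned above, but
the internal opcode numbering, decode_code(), struct toy_filter and
toy_get_filter() are hypothetical. Only the three classic BPF opcode
values (0x20, 0x04, 0x06) are real. The sketch shows that a reverse
table indexed by an opcode from a different encoding needs a bounds
check, and that one possible way to keep the round-trip guarantee is to
retain the program exactly as user space attached it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical internal numbering for three instructions. */
enum { INT_LD_W_ABS, INT_ALU_ADD_K, INT_RET_K, INT_NR_CODES };

/* Reverse table: internal code -> classic BPF opcode. */
static const uint16_t decodes[INT_NR_CODES] = {
	[INT_LD_W_ABS]  = 0x20,	/* BPF_LD  | BPF_W   | BPF_ABS */
	[INT_ALU_ADD_K] = 0x04,	/* BPF_ALU | BPF_ADD | BPF_K   */
	[INT_RET_K]     = 0x06,	/* BPF_RET | BPF_K             */
};

/*
 * Decode one opcode. Returns 0 on success and -1 if 'code' is not a
 * valid index, which is what happens when the stored program uses a
 * different encoding than the table expects. Without this check,
 * decodes[code] would read past the end of the array.
 */
static int decode_code(uint16_t code, uint16_t *out)
{
	assert(out != NULL);
	if (code >= INT_NR_CODES)
		return -1;
	*out = decodes[code];
	return 0;
}

/* Assumed remedy: keep the program exactly as user space supplied it. */
struct toy_filter {
	uint16_t orig_codes[8];	/* copy of the attached opcodes */
	size_t   orig_len;	/* number of valid entries */
};

/* Return the attached program unchanged, whatever the internal form. */
static size_t toy_get_filter(const struct toy_filter *f,
			     uint16_t *dst, size_t max)
{
	size_t n = f->orig_len < max ? f->orig_len : max;

	memcpy(dst, f->orig_codes, n * sizeof(dst[0]));
	return n;
}
```

In this model, decode_code(INT_RET_K, &out) succeeds and yields 0x06,
while an opcode from a foreign encoding such as 0x15 is rejected
instead of indexing past the table. toy_get_filter() hands back exactly
the opcodes that were stored at attach time, which is the property
sk_get_filter() callers rely on.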