linux-kernel - Re: [PATCH v7 net-next 1/3] filter: add Extended BPF interpreter and converter

Message-ID: <531CE47E.40700@iogearbox.net>
Date:	Sun, 09 Mar 2014 23:00:30 +0100
From:	Daniel Borkmann <borkmann@...earbox.net>
To:	Alexei Starovoitov <ast@...mgrid.com>
CC:	"David S. Miller" <davem@...emloft.net>,
	Daniel Borkmann <dborkman@...hat.com>,
	Ingo Molnar <mingo@...nel.org>, Will Drewry <wad@...omium.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	"H. Peter Anvin" <hpa@...or.com>,
	Hagen Paul Pfeifer <hagen@...u.net>,
	Jesse Gross <jesse@...ira.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>,
	Tom Zanussi <tom.zanussi@...ux.intel.com>,
	Jovi Zhangwei <jovi.zhangwei@...il.com>,
	Eric Dumazet <edumazet@...gle.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Arnaldo Carvalho de Melo <acme@...radead.org>,
	Pekka Enberg <penberg@....fi>,
	Arjan van de Ven <arjan@...radead.org>,
	Christoph Hellwig <hch@...radead.org>,
	LKML <linux-kernel@...r.kernel.org>, netdev@...r.kernel.org,
	Pavel Emelyanov <xemul@...allels.com>
Subject: Re: [PATCH v7 net-next 1/3] filter: add Extended BPF interpreter
 and converter

On 03/09/2014 06:08 PM, Alexei Starovoitov wrote:
> On Sun, Mar 9, 2014 at 5:29 AM, Daniel Borkmann <borkmann@...earbox.net> wrote:
>> On 03/09/2014 12:15 AM, Alexei Starovoitov wrote:
>>>
>>> Extended BPF extends old BPF in the following ways:
>>> - from 2 to 10 registers
>>>     Original BPF has two registers (A and X) and hidden frame pointer.
>>>     Extended BPF has ten registers and read-only frame pointer.
>>> - from 32-bit registers to 64-bit registers
>>>     semantics of old 32-bit ALU operations are preserved via 32-bit
>>>     subregisters
>>> - if (cond) jump_true; else jump_false;
>>>     old BPF insns are replaced with:
>>>     if (cond) jump_true; /* else fallthrough */
>>> - adds signed > and >= insns
>>> - 16 4-byte stack slots for register spill-fill replaced with
>>>     up to 512 bytes of multi-use stack space
>>> - introduces bpf_call insn and register passing convention for zero
>>>     overhead calls from/to other kernel functions (not part of this patch)
>>> - adds arithmetic right shift insn
>>> - adds swab32/swab64 insns
>>> - adds atomic_add insn
>>> - old tax/txa insns are replaced with 'mov dst,src' insn
>>>
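To make the 32-bit subregister point above concrete, here is a minimal user-space sketch. It assumes that a 32-bit ALU operation works on the low 32 bits only and leaves the upper half of the 64-bit register zeroed; the description above does not spell that out, so treat the zero-extension as an assumption. alu64_add() and alu32_add() are made-up names for illustration, not kernel functions.

```c
#include <stdint.h>

/* 64-bit add: uses the full register width. */
static uint64_t alu64_add(uint64_t dst, uint64_t src)
{
	return dst + src;
}

/*
 * 32-bit add on a 64-bit register: computed on the low 32 bits, wraps
 * modulo 2^32 as in old BPF, and the result is stored zero-extended
 * (assumption, see the note above).
 */
static uint64_t alu32_add(uint64_t dst, uint64_t src)
{
	return (uint32_t)((uint32_t)dst + (uint32_t)src);
}
```

With dst = 0xffffffff and src = 1, the 32-bit form wraps to 0 like an old BPF filter would expect, while the 64-bit form carries into bit 32.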
>>> Extended BPF is designed to be JITed with one to one mapping, which
>>> allows GCC/LLVM backends to generate optimized BPF code that performs
>>> almost as fast as natively compiled code
>>>
>>> sk_convert_filter() remaps old style insns into extended:
>>> 'sock_filter' instructions are remapped on the fly to
>>> 'sock_filter_ext' extended instructions when
>>> sysctl net.core.bpf_ext_enable=1
>>>
>>> Old filter comes through sk_attach_filter() or
>>> sk_unattached_filter_create()
>>>    if (bpf_ext_enable) {
>>>       convert to new
>>>       sk_chk_filter() - check old bpf
>>>       use sk_run_filter_ext() - new interpreter
>>>    } else {
>>>       sk_chk_filter() - check old bpf
>>>       if (bpf_jit_enable)
>>>           use old jit
>>>       else
>>>           use sk_run_filter() - old interpreter
>>>    }
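The selection logic in the pseudocode above boils down to two knobs. A small sketch, only to spell out which runner ends up being used for each combination; the enum and pick_runner() are hypothetical names and not part of the patch:

```c
#include <stdbool.h>

/* Hypothetical labels for the three execution paths named above. */
enum filter_runner {
	RUN_OLD_INTERP,		/* sk_run_filter() */
	RUN_OLD_JIT,		/* existing classic BPF JIT */
	RUN_EXT_INTERP,		/* sk_run_filter_ext() */
};

/*
 * Mirrors the pseudocode: bpf_ext_enable takes precedence, and
 * bpf_jit_enable only matters on the old path.
 */
static enum filter_runner pick_runner(bool bpf_ext_enable, bool bpf_jit_enable)
{
	if (bpf_ext_enable)
		return RUN_EXT_INTERP;

	return bpf_jit_enable ? RUN_OLD_JIT : RUN_OLD_INTERP;
}
```

Note that with both sysctls set to 1 the old JIT is bypassed, at least as long as there is no extended BPF JIT.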
>>>
>>> sk_run_filter_ext() interpreter is noticeably faster
>>> than sk_run_filter() for two reasons:
>>>
>>> 1.fall-through jumps
>>>     Old BPF jump instructions are forced to go either 'true' or 'false'
>>>     branch which causes branch-miss penalty.
>>>     Extended BPF jump instructions have one branch and fall-through,
>>>     which fit CPU branch predictor logic better.
>>>     'perf stat' shows drastic difference for branch-misses.
>>>
>>> 2.jump-threaded implementation of interpreter vs switch statement
>>>     Instead of single tablejump at the top of 'switch' statement, GCC will
>>>     generate multiple tablejump instructions, which helps CPU branch
>>>     predictor
>>>
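For readers not familiar with the second point, here is a toy comparison of the two dispatch styles. This is a sketch on a made-up three-opcode machine (struct toy_insn, OP_ADD, OP_MUL and OP_RET are all hypothetical), not the interpreter from the patch. The threaded variant relies on the GCC/Clang "labels as values" extension.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical three-opcode toy machine; not the eBPF instruction set. */
enum { OP_ADD, OP_MUL, OP_RET, OP_MAX };

struct toy_insn {
	uint8_t  op;
	uint32_t imm;
};

/*
 * Classic form: after every instruction, control returns to the single
 * indirect jump the compiler emits for the 'switch'.
 */
static uint64_t run_switch(const struct toy_insn *insn, uint64_t acc)
{
	for (;; insn++) {
		assert(insn->op < OP_MAX);
		switch (insn->op) {
		case OP_ADD: acc += insn->imm; break;
		case OP_MUL: acc *= insn->imm; break;
		case OP_RET: return acc;
		}
	}
}

/*
 * Jump-threaded form: every handler ends with its own indirect jump, so
 * the CPU can keep separate branch history per handler.
 */
static uint64_t run_threaded(const struct toy_insn *insn, uint64_t acc)
{
	static const void *const jumptable[OP_MAX] = {
		[OP_ADD] = &&do_add,
		[OP_MUL] = &&do_mul,
		[OP_RET] = &&do_ret,
	};
#define DISPATCH() \
	do { assert(insn->op < OP_MAX); goto *jumptable[insn->op]; } while (0)

	DISPATCH();
do_add:
	acc += insn->imm;
	insn++;
	DISPATCH();
do_mul:
	acc *= insn->imm;
	insn++;
	DISPATCH();
do_ret:
	return acc;
#undef DISPATCH
}
```

Both functions compute the same result for the same program; only the shape of the generated indirect jumps differs.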
>>> Performance of two BPF filters generated by libpcap was measured
>>> on x86_64, i386 and arm32.
>>>
>>> fprog #1 is taken from Documentation/networking/filter.txt:
>>> tcpdump -i eth0 port 22 -dd
>>>
>>> fprog #2 is taken from 'man tcpdump':
>>> tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
>>>      ((tcp[12]&0xf0)>>2)) != 0)' -dd
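The arithmetic in fprog #2 is the usual "TCP segment carries payload" test: IP total length minus IP header length minus TCP header length. Written out in plain C as a rough sketch; tcp_payload_len() is a made-up helper, the port 22 match is left out, and the caller is assumed to have validated that both headers are present in the buffer:

```c
#include <stdint.h>

/*
 * 'ip' points at the first byte of an IPv4 header. No bounds checking
 * is done here; a malformed packet can make the unsigned subtraction
 * wrap around.
 */
static uint32_t tcp_payload_len(const uint8_t *ip)
{
	/* ip[2:2]: 16-bit total length, network byte order */
	uint32_t total_len = ((uint32_t)ip[2] << 8) | ip[3];
	/* (ip[0] & 0xf) << 2: IHL field converted from words to bytes */
	uint32_t ip_hlen = (uint32_t)(ip[0] & 0x0f) << 2;
	const uint8_t *tcp = ip + ip_hlen;
	/* (tcp[12] & 0xf0) >> 2: data offset converted from words to bytes */
	uint32_t tcp_hlen = (uint32_t)(tcp[12] & 0xf0) >> 2;

	return total_len - ip_hlen - tcp_hlen;
}
```

The filter accepts the packet when this value is non-zero, i.e. pure ACKs and other empty segments are dropped.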
>>>
>>> Other libpcap programs have similar performance differences.
>>>
>>> Raw performance data from BPF micro-benchmark:
>>> SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
>>> time in nsec per call, smaller is better
>>> --x86_64--
>>>              fprog #1   fprog #1    fprog #2   fprog #2
>>>              cache-hit  cache-miss  cache-hit  cache-miss
>>> old BPF         90        101         192        202
>>> ext BPF         31         71          47         97
>>> old BPF jit     12         34          17         44
>>> ext BPF jit    TBD
>>>
>>> --i386--
>>>              fprog #1   fprog #1    fprog #2   fprog #2
>>>              cache-hit  cache-miss  cache-hit  cache-miss
>>> old BPF        107        136         227        252
>>> ext BPF         40        119          69        172
>>>
>>> --arm32--
>>>              fprog #1   fprog #1    fprog #2   fprog #2
>>>              cache-hit  cache-miss  cache-hit  cache-miss
>>> old BPF        202        300         475        540
>>> ext BPF        180        270         330        470
>>> old BPF jit     26        182          37        202
>>> new BPF jit    TBD
>>>
>>> Tested with trinity BPF fuzzer
>>>
>>> Future work:
>>>
>>> 0. add bpf/ebpf testsuite to tools/testing/selftests/net/bpf
>>>
>>> 1. add extended BPF JIT for x86_64
>>>
>>> 2. add inband old/new demux and extended BPF verifier, so that new
>>>      programs can be loaded through old sk_attach_filter() and
>>>      sk_unattached_filter_create() interfaces
>>>
>>> 3. tracing filters systemtap-like with extended BPF
>>>
>>> 4. OVS with extended BPF
>>>
>>> 5. nftables with extended BPF
>>>
>>> Signed-off-by: Alexei Starovoitov <ast@...mgrid.com>
>>> Acked-by: Hagen Paul Pfeifer <hagen@...u.net>
>>> Reviewed-by: Daniel Borkmann <dborkman@...hat.com>
>>
>>
>> One more question or possible issue that came to my mind: when
>> someone attaches a socket filter from user space and bpf_ext_enable=1,
>> the old filter is transparently converted to the new representation.
>> If user space (e.g. through checkpoint restore) then issues a
>> sk_get_filter(), we call sk_decode_filter() on sk->sk_filter and try
>> to decode what we stored in insns_ext[] under the assumption that it
>> is still the old code. Would that actually crash (or leak memory, or
>> just return garbage), as we index the decodes[] array with filt->code?
>> Would be great if you could double-check.
>
> ohh. yes. missed that.
> when bpf_ext_enable=1 I think it's cleaner to return ebpf filter.
> This way the user space can see how old bpf filter was converted.
>
> Of course we can allocate extra memory and keep original bpf code there
> just to return it via sk_get_filter(), but that seems overkill.

Cc'ing Pavel for a8fc92778080 ("sk-filter: Add ability to get socket
filter program (v2)").

I think the issue is that when applications get migrated from one
machine to another and the target kernel doesn't support ebpf yet,
the filter cannot be loaded there; sk_get_filter() is expected to
return what the user originally loaded. The trade-off, however, is
that the original BPF code needs to be stored as well. :(
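Roughly what I have in mind, as a user-space sketch only. struct filter_sketch, filter_save_orig() and filter_get_orig() are hypothetical names, and struct old_insn merely mirrors the layout of struct sock_filter so that this builds without kernel headers:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same field layout as classic BPF's struct sock_filter. */
struct old_insn {
	uint16_t code;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

/* Hypothetical container; the converted insns would live next to this. */
struct filter_sketch {
	unsigned int	 orig_len;	/* number of original instructions */
	struct old_insn	*orig;		/* verbatim copy of what was attached */
};

/* Keep a private copy of the program at attach time. */
static int filter_save_orig(struct filter_sketch *f,
			    const struct old_insn *prog, unsigned int len)
{
	if (len == 0)
		return -1;

	f->orig = malloc(len * sizeof(*prog));
	if (!f->orig)
		return -1;

	memcpy(f->orig, prog, len * sizeof(*prog));
	f->orig_len = len;
	return 0;
}

/*
 * Hand back exactly what was attached, regardless of how the filter is
 * represented internally. Returns the instruction count or -1 if the
 * caller's buffer is too small.
 */
static int filter_get_orig(const struct filter_sketch *f,
			   struct old_insn *out, unsigned int out_len)
{
	if (out_len < f->orig_len)
		return -1;

	memcpy(out, f->orig, f->orig_len * sizeof(*out));
	return (int)f->orig_len;
}
```

The cost is one extra allocation of len * 8 bytes per attached filter, which is the trade-off mentioned above.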

>> The assumption with sk_get_filter() is that it returns the same filter
>> that was previously attached, so that it can be re-attached again at
>> a later point in time.
>
> when bpf_ext_enable=1, load old, sk_get_filter() returns new ebpf,
> this ebpf will be re-attachable, since there will be inband demux for bpf/ebpf.
>
> Thanks
> Alexei
>
