Message-ID: <CAMEtUuyh4_je-vbtZ91bG4Qo5rKAHMuVd1==TJFxQS1oR3L4Mg@mail.gmail.com>
Date: Tue, 11 Mar 2014 11:03:47 -0700
From: Alexei Starovoitov <ast@...mgrid.com>
To: Pavel Emelyanov <xemul@...allels.com>
Cc: Daniel Borkmann <borkmann@...earbox.net>,
"David S. Miller" <davem@...emloft.net>,
Daniel Borkmann <dborkman@...hat.com>,
Ingo Molnar <mingo@...nel.org>, Will Drewry <wad@...omium.org>,
Steven Rostedt <rostedt@...dmis.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
"H. Peter Anvin" <hpa@...or.com>,
Hagen Paul Pfeifer <hagen@...u.net>,
Jesse Gross <jesse@...ira.com>,
Thomas Gleixner <tglx@...utronix.de>,
Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>,
Tom Zanussi <tom.zanussi@...ux.intel.com>,
Jovi Zhangwei <jovi.zhangwei@...il.com>,
Eric Dumazet <edumazet@...gle.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Frederic Weisbecker <fweisbec@...il.com>,
Arnaldo Carvalho de Melo <acme@...radead.org>,
Pekka Enberg <penberg@....fi>,
Arjan van de Ven <arjan@...radead.org>,
Christoph Hellwig <hch@...radead.org>,
LKML <linux-kernel@...r.kernel.org>,
Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH v7 net-next 1/3] filter: add Extended BPF interpreter and converter
On Tue, Mar 11, 2014 at 10:40 AM, Pavel Emelyanov <xemul@...allels.com> wrote:
> On 03/10/2014 02:00 AM, Daniel Borkmann wrote:
>> On 03/09/2014 06:08 PM, Alexei Starovoitov wrote:
>>> On Sun, Mar 9, 2014 at 5:29 AM, Daniel Borkmann <borkmann@...earbox.net> wrote:
>>>> On 03/09/2014 12:15 AM, Alexei Starovoitov wrote:
>>>>>
>>>>> Extended BPF extends old BPF in the following ways:
>>>>> - from 2 to 10 registers
>>>>> Original BPF has two registers (A and X) and hidden frame pointer.
>>>>> Extended BPF has ten registers and read-only frame pointer.
>>>>> - from 32-bit registers to 64-bit registers
>>>>> semantics of old 32-bit ALU operations are preserved via 32-bit
>>>>> subregisters
>>>>> - if (cond) jump_true; else jump_false;
>>>>> old BPF insns are replaced with:
>>>>> if (cond) jump_true; /* else fallthrough */
>>>>> - adds signed > and >= insns
>>>>> - 16 4-byte stack slots for register spill-fill replaced with
>>>>> up to 512 bytes of multi-use stack space
>>>>> - introduces bpf_call insn and register passing convention for zero
>>>>> overhead calls from/to other kernel functions (not part of this patch)
>>>>> - adds arithmetic right shift insn
>>>>> - adds swab32/swab64 insns
>>>>> - adds atomic_add insn
>>>>> - old tax/txa insns are replaced with 'mov dst,src' insn
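For orientation, a minimal sketch of what one extended instruction could look like, given the extensions listed above. The struct name 'sock_filter_ext' comes from this patch description; the field names are an assumption (they follow the layout eBPF eventually settled on upstream) and may differ from the actual patch:

struct sock_filter_ext {	/* hedged sketch, field names assumed */
	__u8	code;		/* opcode */
	__u8	dst_reg:4;	/* destination register; 4 bits cover the ten
				 * registers plus the read-only frame pointer */
	__u8	src_reg:4;	/* source register */
	__s16	off;		/* signed offset: jump delta or memory offset */
	__s32	imm;		/* signed 32-bit immediate constant */
};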
>>>>>
>>>>> Extended BPF is designed to be JITed with one to one mapping, which
>>>>> allows GCC/LLVM backends to generate optimized BPF code that performs
>>>>> almost as fast as natively compiled code
>>>>>
>>>>> sk_convert_filter() remaps old style insns into extended:
>>>>> 'sock_filter' instructions are remapped on the fly to
>>>>> 'sock_filter_ext' extended instructions when
>>>>> sysctl net.core.bpf_ext_enable=1
>>>>>
>>>>> Old filter comes through sk_attach_filter() or
>>>>> sk_unattached_filter_create()
>>>>> if (bpf_ext_enable) {
>>>>> convert to new
>>>>> sk_chk_filter() - check old bpf
>>>>> use sk_run_filter_ext() - new interpreter
>>>>> } else {
>>>>> sk_chk_filter() - check old bpf
>>>>> if (bpf_jit_enable)
>>>>> use old jit
>>>>> else
>>>>> use sk_run_filter() - old interpreter
>>>>> }
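In rough C terms, the decision above might look like the sketch below. It is illustrative only, built from the function and sysctl names already mentioned in this description; signatures, fields and error handling are simplified assumptions, not the patch code:

	/* illustrative sketch, not the actual patch code */
	err = sk_chk_filter(fprog->filter, fprog->len);	/* old BPF is always checked */
	if (err)
		return err;

	if (bpf_ext_enable) {
		/* remap sock_filter insns into sock_filter_ext on the fly */
		err = sk_convert_filter(fprog->filter, fprog->len,
					fp->insns_ext, &new_len); /* argument list assumed */
		fp->bpf_func = sk_run_filter_ext;	/* new interpreter */
	} else if (bpf_jit_enable) {
		bpf_jit_compile(fp);			/* old JIT */
	} else {
		fp->bpf_func = sk_run_filter;		/* old interpreter */
	}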
>>>>>
>>>>> sk_run_filter_ext() interpreter is noticeably faster
>>>>> than sk_run_filter() for two reasons:
>>>>>
>>>>> 1. fall-through jumps
>>>>> Old BPF jump instructions are forced to take either the 'true' or the
>>>>> 'false' branch, which causes a branch-miss penalty.
>>>>> Extended BPF jump instructions have one branch and a fall-through,
>>>>> which fits CPU branch-predictor logic better.
>>>>> 'perf stat' shows a drastic difference in branch-misses.
>>>>>
>>>>> 2. jump-threaded implementation of the interpreter vs. a switch statement
>>>>> Instead of a single tablejump at the top of the 'switch' statement, GCC
>>>>> generates multiple tablejump instructions, which helps the CPU branch
>>>>> predictor.
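The "jump-threaded" dispatch referred to in point 2 is the computed-goto technique (a GNU C extension). Below is a small, self-contained toy illustration of the idea, not the interpreter code from this patch:

#include <stdio.h>

enum { OP_ADD, OP_SUB, OP_EXIT };

/* Each handler ends in its own indirect jump ("goto *jt[...]"), so the CPU
 * sees one indirect branch per opcode instead of a single shared one at the
 * top of a switch; that gives the branch predictor more context. */
static int run(const unsigned char *insn)
{
	static const void *jt[] = {
		[OP_ADD]  = &&do_add,
		[OP_SUB]  = &&do_sub,
		[OP_EXIT] = &&do_exit,
	};
	int acc = 0;

	goto *jt[*insn];
do_add:
	acc++; insn++;
	goto *jt[*insn];
do_sub:
	acc--; insn++;
	goto *jt[*insn];
do_exit:
	return acc;
}

int main(void)
{
	unsigned char prog[] = { OP_ADD, OP_ADD, OP_SUB, OP_EXIT };

	printf("%d\n", run(prog));	/* prints 1 */
	return 0;
}

A switch-based loop would funnel every instruction through one shared indirect jump, which is exactly the branch-predictor pressure the commit message describes.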
>>>>>
>>>>> Performance of two BPF filters generated by libpcap was measured
>>>>> on x86_64, i386 and arm32.
>>>>>
>>>>> fprog #1 is taken from Documentation/networking/filter.txt:
>>>>> tcpdump -i eth0 port 22 -dd
>>>>>
>>>>> fprog #2 is taken from 'man tcpdump':
>>>>> tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
>>>>> ((tcp[12]&0xf0)>>2)) != 0)' -dd
>>>>>
>>>>> Other libpcap programs have similar performance differences.
>>>>>
>>>>> Raw performance data from BPF micro-benchmark:
>>>>> SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
>>>>> time in nsec per call, smaller is better
>>>>> --x86_64--
>>>>>              fprog #1   fprog #1    fprog #2   fprog #2
>>>>>              cache-hit  cache-miss  cache-hit  cache-miss
>>>>> old BPF         90        101         192        202
>>>>> ext BPF         31         71          47         97
>>>>> old BPF jit     12         34          17         44
>>>>> ext BPF jit    TBD
>>>>>
>>>>> --i386--
>>>>>              fprog #1   fprog #1    fprog #2   fprog #2
>>>>>              cache-hit  cache-miss  cache-hit  cache-miss
>>>>> old BPF        107        136         227        252
>>>>> ext BPF         40        119          69        172
>>>>>
>>>>> --arm32--
>>>>>              fprog #1   fprog #1    fprog #2   fprog #2
>>>>>              cache-hit  cache-miss  cache-hit  cache-miss
>>>>> old BPF        202        300         475        540
>>>>> ext BPF        180        270         330        470
>>>>> old BPF jit     26        182          37        202
>>>>> new BPF jit    TBD
>>>>>
>>>>> Tested with the trinity BPF fuzzer
>>>>>
>>>>> Future work:
>>>>>
>>>>> 0. add bpf/ebpf testsuite to tools/testing/selftests/net/bpf
>>>>>
>>>>> 1. add extended BPF JIT for x86_64
>>>>>
>>>>> 2. add inband old/new demux and extended BPF verifier, so that new
>>>>> programs
>>>>> can be loaded through old sk_attach_filter() and
>>>>> sk_unattached_filter_create()
>>>>> interfaces
>>>>>
>>>>> 3. systemtap-like tracing filters with extended BPF
>>>>>
>>>>> 4. OVS with extended BPF
>>>>>
>>>>> 5. nftables with extended BPF
>>>>>
>>>>> Signed-off-by: Alexei Starovoitov <ast@...mgrid.com>
>>>>> Acked-by: Hagen Paul Pfeifer <hagen@...u.net>
>>>>> Reviewed-by: Daniel Borkmann <dborkman@...hat.com>
>>>>
>>>>
>>>> One more question or possible issue that came to my mind: when
>>>> someone attaches a socket filter from user space and bpf_ext_enable=1,
>>>> the old filter will transparently be converted to the new
>>>> representation. If user space (e.g. through checkpoint/restore) then
>>>> issues sk_get_filter(), we call sk_decode_filter() on sk->sk_filter
>>>> and try to decode what we stored in insns_ext[] under the assumption
>>>> that it is still the old code. Would that actually crash (or leak
>>>> memory, or just return garbage), since we access the decodes[] array
>>>> with filt->code? Would be great if you could double-check.
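(For readers skimming the thread, the hazard being described is roughly the fragment below; this is a paraphrase of the concern, not the real sk_decode_filter() code.)

	/* decodes[] is indexed by *old* BPF opcodes; if sk->sk_filter already
	 * holds converted extended insns, filt->code is an ebpf opcode, so
	 * this lookup reads an unrelated or out-of-range entry. */
	to->code = decodes[filt->code];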
>>>
>>> ohh. yes. missed that.
>>> when bpf_ext_enable=1 I think it's cleaner to return the ebpf filter.
>>> This way user space can see how the old bpf filter was converted.
>>>
>>> Of course we could allocate extra memory and keep the original bpf code
>>> there just to return it via sk_get_filter(), but that seems overkill.
>>
>> Cc'ing Pavel for a8fc92778080 ("sk-filter: Add ability to get socket
>> filter program (v2)").
>>
>> I think the issue could be that when applications get migrated from one
>> machine to another whose kernel doesn't support ebpf yet, the filter
>> could not be loaded this way, since sk_get_filter() is expected to
>> return what the user loaded. The trade-off, however, is that the
>> original BPF code needs to be stored as well. :(
>
> Sorry if I'm missing the point, but isn't the original filter kept on the socket?
> sk_attach_filter() does so, then calls __sk_prepare_filter(), which
> in turn calls bpf_jit_compile(), and the latter two keep the insns in place.
Yes, in the V8/V9 series the original filter is kept on the socket,
and your crtools/test/zdtm/live/static/socket_filter.c test passes.
Let me know if there are any other tests I can try.
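
For illustration, the shape of that approach as a hedged sketch; the field names below are assumptions, not necessarily what the V8/V9 series uses:

	/* keep the user-supplied program on the socket filter so that
	 * sk_get_filter() can return it verbatim, no matter whether the
	 * kernel runs it as old BPF, a JIT image, or converted extended BPF */
	struct sk_filter {
		/* ... existing fields ... */
		struct sock_filter	*orig_insns;	/* assumed name: copy of what user space loaded */
		unsigned int		orig_len;	/* assumed name: number of original insns */
	};

sk_get_filter() would then copy orig_insns back out, so checkpoint/restore sees exactly the program it attached.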
Thanks
Alexei
>
> Thanks,
> Pavel
>