[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMEtUuwV8HpmT7ejMD+atBmGuynX9J9X2uHedHtWacZgKK+iQw@mail.gmail.com>
Date: Fri, 28 Feb 2014 16:10:10 -0800
From: Alexei Starovoitov <ast@...mgrid.com>
To: Daniel Borkmann <dborkman@...hat.com>
Cc: "David S. Miller" <davem@...emloft.net>,
Ingo Molnar <mingo@...nel.org>,
Steven Rostedt <rostedt@...dmis.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
"H. Peter Anvin" <hpa@...or.com>,
Thomas Gleixner <tglx@...utronix.de>,
Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>,
Tom Zanussi <tom.zanussi@...ux.intel.com>,
Jovi Zhangwei <jovi.zhangwei@...il.com>,
Eric Dumazet <edumazet@...gle.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Frederic Weisbecker <fweisbec@...il.com>,
Arnaldo Carvalho de Melo <acme@...radead.org>,
Pekka Enberg <penberg@....fi>,
Arjan van de Ven <arjan@...radead.org>,
Christoph Hellwig <hch@...radead.org>,
linux-kernel@...r.kernel.org, netdev@...r.kernel.org,
Hagen Paul Pfeifer <hagen@...u.net>,
Jesse Gross <jesse@...ira.com>
Subject: Re: [PATCH v3 net-next 1/1] bpf32->bpf64 mapper and bpf64 interpreter
On Fri, Feb 28, 2014 at 12:53 PM, Alexei Starovoitov <ast@...mgrid.com> wrote:
> On Fri, Feb 28, 2014 at 4:45 AM, Daniel Borkmann <dborkman@...hat.com> wrote:
>> Hi Alexei,
>>
>> [also cc'ing Hagen and Jesse]
>>
>> Just some minor comments below ... let me know what you think.
>
> Thank you for review! Comments below.
>
>> On 02/27/2014 03:38 AM, Alexei Starovoitov wrote:
>>>
>>> Extended BPF (or 64-bit BPF) is an instruction set to
>>> create safe dynamically loadable filters that can call fixed set
>>> of kernel functions and take generic bpf_context as an input.
>>> BPF filter is a glue between kernel functions and bpf_context.
>>> Different kernel subsystems can define their own set of available
>>> functions
>>> and alter BPF machinery for specific use case.
>>> BPF64 instruction set is designed for efficient mapping to native
>>> instructions on 64-bit CPUs
>>>
>>> Old BPF instructions are remapped on the fly to BPF64
>>> when sysctl net.core.bpf64_enable=1
>>
>>
>> Would be nice if the commit message could be extended with what you
>> have posted in your [PATCH v3 net-next 0/1] email (but without the
>> changelog, changelog should go after "---" line), so that also this
>> information will appear here and in the Git log later on. Also please
>> elaborate more on this commit message. I think, at least, as it's a
>> bigger change it also needs to include future planned usage scenarios
>> for user space and for inside the kernel e.g. for OVS and ftrace.
>
> Ok will do
>
>> You could make this patch a 2 patch patch-series and put into patch
>> 2/2 all documentation you had in your previous version of the set.
>> Would be nice to extend the file Documentation/networking/filter.txt
>> with a description of your extended BPF so that users can read about
>> them in the same file.
>
> Sure.
>
>> Did you also test that seccomp-BPF still works out?
>
> yes. Have a prototype, but it needs a bit more cleanup.
>
>>
>>> Signed-off-by: Alexei Starovoitov <ast@...mgrid.com>
>>> ---
>>> include/linux/filter.h | 9 +-
>>> include/linux/netdevice.h | 1 +
>>> include/uapi/linux/filter.h | 37 ++-
>>> net/core/Makefile | 2 +-
>>> net/core/bpf_run.c | 766
>>> +++++++++++++++++++++++++++++++++++++++++++
>>> net/core/filter.c | 114 ++++++-
>>
>>
>> I would have liked to see the content from net/core/bpf_run.c to go
>> directly into net/core/filter.c, not as a separate file, if that's
>> possible.
>
> sure. that's fine.
>
>>
>>> net/core/sysctl_net_core.c | 7 +
>>> 7 files changed, 913 insertions(+), 23 deletions(-)
>>> create mode 100644 net/core/bpf_run.c
>>>
>>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>>> index e568c8ef896b..bf3085258f4c 100644
>>> --- a/include/linux/filter.h
>>> +++ b/include/linux/filter.h
>>> @@ -53,6 +53,13 @@ extern int sk_chk_filter(struct sock_filter *filter,
>>> unsigned int flen);
>>> extern int sk_get_filter(struct sock *sk, struct sock_filter __user
>>> *filter, unsigned len);
>>> extern void sk_decode_filter(struct sock_filter *filt, struct
>>> sock_filter *to);
>>>
>>> +/* function remaps 'sock_filter' style insns to 'bpf_insn' style insns */
>>> +int bpf_convert(struct sock_filter *fp, int len, struct bpf_insn
>>> *new_prog,
>>> + int *p_new_len);
>>> +/* execute bpf64 program */
>>> +u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn);
>>> +
>>> +#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB,
>>> FILTER->insns)
>>> #ifdef CONFIG_BPF_JIT
>>> #include <stdarg.h>
>>> #include <linux/linkage.h>
>>> @@ -70,7 +77,6 @@ static inline void bpf_jit_dump(unsigned int flen,
>>> unsigned int proglen,
>>> print_hex_dump(KERN_ERR, "JIT code: ", DUMP_PREFIX_OFFSET,
>>> 16, 1, image, proglen, false);
>>> }
>>> -#define SK_RUN_FILTER(FILTER, SKB) (*FILTER->bpf_func)(SKB,
>>> FILTER->insns)
>>> #else
>>> #include <linux/slab.h>
>>> static inline void bpf_jit_compile(struct sk_filter *fp)
>>> @@ -80,7 +86,6 @@ static inline void bpf_jit_free(struct sk_filter *fp)
>>> {
>>> kfree(fp);
>>> }
>>> -#define SK_RUN_FILTER(FILTER, SKB) sk_run_filter(SKB, FILTER->insns)
>>> #endif
>>>
>>> static inline int bpf_tell_extensions(void)
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index 5e84483c0650..7b1acefc244e 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -2971,6 +2971,7 @@ extern int netdev_max_backlog;
>>> extern int netdev_tstamp_prequeue;
>>> extern int weight_p;
>>> extern int bpf_jit_enable;
>>> +extern int bpf64_enable;
>>
>>
>> We should keep naming consistent (so either extended BPF or BPF64),
>> so maybe bpf_ext_enable ? I'd presume rather {bpf,sk_filter}*_ext
>
> agree. we need consistent naming for both (old and new).
> I'll try all-out rename of bpf_*() into sk_filter64_*() and sk_filter_ext_*()
> to see which one looks better.
> I'm kinda leaning to sk_filter64, since it's easier to quickly spot
> the difference
> and more descriptive.
After going back and forth between sk_filter64 vs sk_filter_ext,
decided to follow your suggestion and stick with sk_filter_ext,
bpf_ext_enable, etc. since calling it bpf64 may raise unnecessary
concerns on 32-bit architectures... but performance numbers
show that extended bpf on 32-bit cpus is faster than old bpf,
so 'extended bpf' from now on.
Thanks
>> as in 'struct bpf_insn' the immediate value is 32 bit, so for 64 bit
>> comparisons, you'd still need to load to immediate values, right?
>
> there is no insn that use 64-bit immediate, since 64-bit immediates
> are extremely rare. grep x86-64 asm code for movabsq will return very few.
> llvm or gcc can easily construct any constant by combination of mov,
> shifts and ors.
> bpf64 comparisons are all 64-bit right now. So far I didn't see a need to do
> 32-bit comparison, since old bpf is all unsigned, mapping 32->64 of
> Jxx is painless.
> Notice I added signed comparison, since real life programs cannot do
> without them.
> I also kept the spirit of old bpf having > and >= only. (no < and <=)
> that made llvm/gcc backends a bit harder to do, since no real cpu has
> such limitations.
> I'm still contemplating do add < and <= (both signed and unsigned), which is
> interesting trade-off: number of instructions vs complexity of compiler
>
>> After all your changes, we will still have the bpf_jit_enable knob
>> in tact, right?
>
> Yes.
> In this diff the workflow is the following:
>
> old filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (bpf64) {
> convert to new
> sk_chk_filter() - check old bpf
> use new interpreter
> } else {
> sk_chk_filter() - check old bpf
> if (bpf_jit_enable)
> use old jit
> else
> use old interpreter
> }
> soon I'll add bpf64 jit and will reuse the same bpf_jit_enable knob for it.
> then add new/old inband demux into sk_attach_filter(),
> so that workflow will become:
>
> a filter comes through sk_attach_filter() or sk_unattached_filter_create()
> if (new filter prog) {
> sk_chk_filter64() - check new bpf
> if (bpf_jit_enable)
> use new jit
> else
> use new interpreter
> } else {
> if (bpf64_enable) {
> convert to new
> sk_chk_filter() - check old bpf
> if (bpf_jit_enable)
> use new jit
> else
> use new interpreter
> } else {
> sk_chk_filter()
> if (bpf_jit_enable)
> use old jit
> else
> use old interpreter
> }
> }
> eventually bpf64_enable can be made default,
> the last 'else' can be retired and 'bpf64_enable' removed along with
> old interpreter.
>
> bpf_jit_enable knob will stay for foreseeable future.
>
>>> bool netdev_has_upper_dev(struct net_device *dev, struct net_device
>>> *upper_dev);
>>> struct net_device *netdev_all_upper_get_next_dev_rcu(struct net_device
>>> *dev,
>>> diff --git a/include/uapi/linux/filter.h b/include/uapi/linux/filter.h
>>> index 8eb9ccaa5b48..70ff29ee6825 100644
>>> --- a/include/uapi/linux/filter.h
>>> +++ b/include/uapi/linux/filter.h
>>> @@ -1,3 +1,4 @@
>>> +/* extended BPF is Copyright (c) 2011-2014, PLUMgrid, http://plumgrid.com
>>> */
>>> /*
>>> * Linux Socket Filter Data Structures
>>> */
>>
>>
>> You can merge both comments into one.
>
> ok.
>
>>
>>> @@ -19,7 +20,7 @@
>>> * Try and keep these values and structures similar to BSD,
>>> especially
>>> * the BPF code definitions which need to match so you can share
>>> filters
>>> */
>>> -
>>> +
>>> struct sock_filter { /* Filter block */
>>> __u16 code; /* Actual filter code */
>>> __u8 jt; /* Jump true */
>>> @@ -45,12 +46,26 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>>> */
>>> #define BPF_JMP 0x05
>>> #define BPF_RET 0x06
>>> #define BPF_MISC 0x07
>>> +#define BPF_ALU64 0x07
>>> +
>>> +struct bpf_insn {
>>> + __u8 code; /* opcode */
>>> + __u8 a_reg:4; /* dest register*/
>>> + __u8 x_reg:4; /* source register */
>>> + __s16 off; /* signed offset */
>>> + __s32 imm; /* signed immediate constant */
>>> +};
>>
>>
>> As we have struct sock_filter and it's immutable due to uapi, I
>> would have liked to see the new data structure with a consistent
>> naming scheme, e.g. struct sock_filter_ext {...} for extended
>> BPF.
>
> ok. will rename bpf_insn to sock_filter64 and sock_filter_ext to see
> which one looks better through out the code.
>
>>> +/* pointer to bpf_context is the first and only argument to BPF program
>>> + * its definition is use-case specific */
>>> +struct bpf_context;
>>>
>>> /* ld/ldx fields */
>>> #define BPF_SIZE(code) ((code) & 0x18)
>>> #define BPF_W 0x00
>>> #define BPF_H 0x08
>>> #define BPF_B 0x10
>>> +#define BPF_DW 0x18
>>> #define BPF_MODE(code) ((code) & 0xe0)
>>> #define BPF_IMM 0x00
>>> #define BPF_ABS 0x20
>>> @@ -58,6 +73,7 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>>> */
>>> #define BPF_MEM 0x60
>>> #define BPF_LEN 0x80
>>> #define BPF_MSH 0xa0
>>> +#define BPF_XADD 0xc0 /* exclusive add */
>>>
>>> /* alu/jmp fields */
>>> #define BPF_OP(code) ((code) & 0xf0)
>>> @@ -68,16 +84,24 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>>> */
>>> #define BPF_OR 0x40
>>> #define BPF_AND 0x50
>>> #define BPF_LSH 0x60
>>> -#define BPF_RSH 0x70
>>> +#define BPF_RSH 0x70 /* logical shift right */
>>> #define BPF_NEG 0x80
>>> #define BPF_MOD 0x90
>>> #define BPF_XOR 0xa0
>>> +#define BPF_MOV 0xb0 /* mov reg to reg */
>>> +#define BPF_ARSH 0xc0 /* sign extending arithmetic
>>> shift right */
>>> +#define BPF_BSWAP32 0xd0 /* swap lower 4 bytes of
>>> 64-bit register */
>>> +#define BPF_BSWAP64 0xe0 /* swap all 8 bytes of 64-bit
>>> register */
>>>
>>> #define BPF_JA 0x00
>>> -#define BPF_JEQ 0x10
>>> -#define BPF_JGT 0x20
>>> -#define BPF_JGE 0x30
>>> -#define BPF_JSET 0x40
>>> +#define BPF_JEQ 0x10 /* jump == */
>>> +#define BPF_JGT 0x20 /* GT is unsigned '>', JA in x86 */
>>> +#define BPF_JGE 0x30 /* GE is unsigned '>=', JAE in x86
>>> */
>>> +#define BPF_JSET 0x40 /* if (A & X) */
>>> +#define BPF_JNE 0x50 /* jump != */
>>> +#define BPF_JSGT 0x60 /* SGT is signed '>', GT in x86 */
>>> +#define BPF_JSGE 0x70 /* SGE is signed '>=', GE in x86 */
>>> +#define BPF_CALL 0x80 /* function call */
>>> #define BPF_SRC(code) ((code) & 0x08)
>>> #define BPF_K 0x00
>>> #define BPF_X 0x08
>>> @@ -134,5 +158,4 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER.
>>> */
>>> #define SKF_NET_OFF (-0x100000)
>>> #define SKF_LL_OFF (-0x200000)
>>>
>>> -
>>> #endif /* _UAPI__LINUX_FILTER_H__ */
>>> diff --git a/net/core/Makefile b/net/core/Makefile
>>> index 9628c20acff6..e622b97f58dc 100644
>>> --- a/net/core/Makefile
>>> +++ b/net/core/Makefile
>>> @@ -8,7 +8,7 @@ obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o
>>> stream.o scm.o \
>>> obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
>>>
>>> obj-y += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o
>>> \
>>> - neighbour.o rtnetlink.o utils.o link_watch.o
>>> filter.o \
>>> + neighbour.o rtnetlink.o utils.o link_watch.o
>>> filter.o bpf_run.o \
>>> sock_diag.o dev_ioctl.o
>>>
>>> obj-$(CONFIG_XFRM) += flow.o
>>> diff --git a/net/core/bpf_run.c b/net/core/bpf_run.c
>>> new file mode 100644
>>> index 000000000000..fa1862fcbc74
>>> --- /dev/null
>>> +++ b/net/core/bpf_run.c
>>> @@ -0,0 +1,766 @@
>>> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
>>> + *
>>> + * This program is free software; you can redistribute it and/or
>>> + * modify it under the terms of version 2 of the GNU General Public
>>> + * License as published by the Free Software Foundation.
>>> + *
>>> + * This program is distributed in the hope that it will be useful, but
>>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>>> + * General Public License for more details.
>>> + */
>>> +#include <linux/kernel.h>
>>> +#include <linux/types.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/uaccess.h>
>>> +#include <linux/filter.h>
>>> +#include <linux/skbuff.h>
>>> +#include <asm/unaligned.h>
>>> +
>>> +int bpf64_enable __read_mostly;
>>> +
>>> +void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int
>>> k,
>>> + unsigned int size);
>>> +
>>> +static inline void *load_pointer(const struct sk_buff *skb, int k,
>>> + unsigned int size, void *buffer)
>>> +{
>>> + if (k >= 0)
>>> + return skb_header_pointer(skb, k, size, buffer);
>>> + return bpf_internal_load_pointer_neg_helper(skb, k, size);
>>> +}
>>> +
>>> +static const char *const bpf_class_string[] = {
>>> + "ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc"
>>> +};
>>> +
>>> +static const char *const bpf_alu_string[] = {
>>> + "+=", "-=", "*=", "/=", "|=", "&=", "<<=", ">>=", "neg",
>>> + "%=", "^=", "=", "s>>=", "bswap32", "bswap64", "???"
>>> +};
>>> +
>>> +static const char *const bpf_ldst_string[] = {
>>> + "u32", "u16", "u8", "u64"
>>> +};
>>> +
>>> +static const char *const bpf_jmp_string[] = {
>>> + "jmp", "==", ">", ">=", "&", "!=", "s>", "s>=", "call"
>>> +};
>>> +
>>> +static const char *reg_to_str(int regno, u64 *regs)
>>> +{
>>> + static char reg_value[16][32];
>>> + if (!regs)
>>> + return "";
>>> + snprintf(reg_value[regno], sizeof(reg_value[regno]), "(0x%llx)",
>>> + regs[regno]);
>>> + return reg_value[regno];
>>> +}
>>> +
>>> +#define R(regno) reg_to_str(regno, regs)
>>> +
>>> +void pr_info_bpf_insn(const struct bpf_insn *insn, u64 *regs)
>>> +{
>>> + u16 class = BPF_CLASS(insn->code);
>>> + if (class == BPF_ALU || class == BPF_ALU64) {
>>> + if (BPF_SRC(insn->code) == BPF_X)
>>> + pr_info("code_%02x %sr%d%s %s r%d%s\n",
>>> + insn->code, class == BPF_ALU ? "(u32)" :
>>> "",
>>> + insn->a_reg, R(insn->a_reg),
>>> + bpf_alu_string[BPF_OP(insn->code) >> 4],
>>> + insn->x_reg, R(insn->x_reg));
>>> + else
>>> + pr_info("code_%02x %sr%d%s %s %d\n",
>>> + insn->code, class == BPF_ALU ? "(u32)" :
>>> "",
>>> + insn->a_reg, R(insn->a_reg),
>>> + bpf_alu_string[BPF_OP(insn->code) >> 4],
>>> + insn->imm);
>>> + } else if (class == BPF_STX) {
>>> + if (BPF_MODE(insn->code) == BPF_MEM)
>>> + pr_info("code_%02x *(%s *)(r%d%s %+d) = r%d%s\n",
>>> + insn->code,
>>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> + insn->a_reg, R(insn->a_reg),
>>> + insn->off, insn->x_reg, R(insn->x_reg));
>>> + else if (BPF_MODE(insn->code) == BPF_XADD)
>>> + pr_info("code_%02x lock *(%s *)(r%d%s %+d) +=
>>> r%d%s\n",
>>> + insn->code,
>>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> + insn->a_reg, R(insn->a_reg), insn->off,
>>> + insn->x_reg, R(insn->x_reg));
>>> + else
>>> + pr_info("BUG_%02x\n", insn->code);
>>> + } else if (class == BPF_ST) {
>>> + if (BPF_MODE(insn->code) != BPF_MEM) {
>>> + pr_info("BUG_st_%02x\n", insn->code);
>>> + return;
>>> + }
>>> + pr_info("code_%02x *(%s *)(r%d%s %+d) = %d\n",
>>> + insn->code,
>>> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
>>> + insn->a_reg, R(insn->a_reg),
>>> + insn->off, insn->imm);
>>> + } else if (class == BPF_LDX) {
>>> + if (BPF_MODE(insn->code) == BPF_MEM) {
>>> + pr_info("code_%02x r%d = *(%s *)(r%d%s %+d)\n",
>>> + insn->code, insn->a_reg,
>>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> + insn->x_reg, R(insn->x_reg), insn->off);
>>> + } else {
>>> + pr_info("BUG_ldx_%02x\n", insn->code);
>>> + }
>>> + } else if (class == BPF_LD) {
>>> + if (BPF_MODE(insn->code) == BPF_ABS) {
>>> + pr_info("code_%02x r%d = *(%s *)(skb %+d)\n",
>>> + insn->code, insn->a_reg,
>>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> + insn->imm);
>>> + } else if (BPF_MODE(insn->code) == BPF_IND) {
>>> + pr_info("code_%02x r%d = *(%s *)(skb + r%d%s
>>> %+d)\n",
>>> + insn->code, insn->a_reg,
>>> + bpf_ldst_string[BPF_SIZE(insn->code) >>
>>> 3],
>>> + insn->x_reg, R(insn->x_reg), insn->imm);
>>> + } else {
>>> + pr_info("BUG_ld_%02x\n", insn->code);
>>> + }
>>> + } else if (class == BPF_JMP) {
>>> + u16 opcode = BPF_OP(insn->code);
>>> + if (opcode == BPF_CALL) {
>>> + pr_info("code_%02x call %d\n", insn->code,
>>> insn->imm);
>>> + } else if (insn->code == (BPF_JMP | BPF_JA)) {
>>> + pr_info("code_%02x goto pc%+d\n",
>>> + insn->code, insn->off);
>>> + } else if (BPF_SRC(insn->code) == BPF_X) {
>>> + pr_info("code_%02x if r%d%s %s r%d%s goto
>>> pc%+d\n",
>>> + insn->code, insn->a_reg, R(insn->a_reg),
>>> + bpf_jmp_string[BPF_OP(insn->code) >> 4],
>>> + insn->x_reg, R(insn->x_reg), insn->off);
>>> + } else {
>>> + pr_info("code_%02x if r%d%s %s 0x%x goto pc%+d\n",
>>> + insn->code, insn->a_reg, R(insn->a_reg),
>>> + bpf_jmp_string[BPF_OP(insn->code) >> 4],
>>> + insn->imm, insn->off);
>>> + }
>>> + } else {
>>> + pr_info("code_%02x %s\n", insn->code,
>>> bpf_class_string[class]);
>>> + }
>>> +}
>>> +EXPORT_SYMBOL(pr_info_bpf_insn);
>>
>>
>> Why would that need to be exported as a symbol?
>
> the performance numbers I mentioned are from bpf_bench that is done
> as kernel module, so I used this for debugging from it.
> also to see what execution traces I get while running trinity bpf fuzzer :)
>
>> I would actually like to avoid having this pr_info* inside the kernel.
>> Couldn't this be done e.g. through systemtap script that could e.g. be
>> placed under tools/net/ or inside the documentation file?
>
> like the idea!
> Will drop it from the diff and eventually will move it to tools/net.
>
>>> +/* remap 'sock_filter' style BPF instruction set to 'bpf_insn' style
>>> (BPF64)
>>> + *
>>> + * first, call bpf_convert(old_prog, len, NULL, &new_len) to calculate
>>> new
>>> + * program length in one pass
>>> + *
>>> + * then new_prog = kmalloc(sizeof(struct bpf_insn) * new_len);
>>> + *
>>> + * and call it again: bpf_convert(old_prog, len, new_prog, &new_len);
>>> + * to remap in two passes: 1st pass finds new jump offsets, 2nd pass
>>> remaps
>>> + */
>>
>>
>> Would be nice to have the API comment it in kdoc format.
>
> good point. will do.
>
>>> +int bpf_convert(struct sock_filter *old_prog, int len,
>>> + struct bpf_insn *new_prog, int *p_new_len)
>>> +{
>>
>>
>> If we place it into net/core/filter.c, it would be nice to keep naming
>> conventions consistent, e.g. sk_convert_filter() or so.
>
> ok.
>
>>> + struct bpf_insn *new_insn;
>>> + struct sock_filter *fp;
>>> + int *addrs = NULL;
>>> + int new_len = 0;
>>> + int pass = 0;
>>> + int tgt, i;
>>> +
>>> + if (len <= 0 || len >= BPF_MAXINSNS)
>>> + return -EINVAL;
>>> +
>>> + if (new_prog) {
>>> + addrs = kzalloc(len * sizeof(*addrs), GFP_KERNEL);
>>> + if (!addrs)
>>> + return -ENOMEM;
>>> + }
>>> +
>>> +do_pass:
>>> + new_insn = new_prog;
>>> + fp = old_prog;
>>> + for (i = 0; i < len; fp++, i++) {
>>> + struct bpf_insn tmp_insns[3] = {};
>>> + struct bpf_insn *insn = tmp_insns;
>>> +
>>> + if (addrs)
>>> + addrs[i] = new_insn - new_prog;
>>> +
>>> + switch (fp->code) {
>>> + /* all arithmetic insns and skb loads map as-is */
>>> + case BPF_ALU | BPF_ADD | BPF_X:
>>> + case BPF_ALU | BPF_ADD | BPF_K:
>>> + case BPF_ALU | BPF_SUB | BPF_X:
>>> + case BPF_ALU | BPF_SUB | BPF_K:
>>> + case BPF_ALU | BPF_AND | BPF_X:
>>> + case BPF_ALU | BPF_AND | BPF_K:
>>> + case BPF_ALU | BPF_OR | BPF_X:
>>> + case BPF_ALU | BPF_OR | BPF_K:
>>> + case BPF_ALU | BPF_LSH | BPF_X:
>>> + case BPF_ALU | BPF_LSH | BPF_K:
>>> + case BPF_ALU | BPF_RSH | BPF_X:
>>> + case BPF_ALU | BPF_RSH | BPF_K:
>>> + case BPF_ALU | BPF_XOR | BPF_X:
>>> + case BPF_ALU | BPF_XOR | BPF_K:
>>> + case BPF_ALU | BPF_MUL | BPF_X:
>>> + case BPF_ALU | BPF_MUL | BPF_K:
>>> + case BPF_ALU | BPF_DIV | BPF_X:
>>> + case BPF_ALU | BPF_DIV | BPF_K:
>>> + case BPF_ALU | BPF_MOD | BPF_X:
>>> + case BPF_ALU | BPF_MOD | BPF_K:
>>> + case BPF_ALU | BPF_NEG:
>>> + case BPF_LD | BPF_ABS | BPF_W:
>>> + case BPF_LD | BPF_ABS | BPF_H:
>>> + case BPF_LD | BPF_ABS | BPF_B:
>>> + case BPF_LD | BPF_IND | BPF_W:
>>> + case BPF_LD | BPF_IND | BPF_H:
>>> + case BPF_LD | BPF_IND | BPF_B:
>>
>>
>> I think here and elsewhere for similar constructions, we could use
>> BPF_S_* helpers that was introduced by Hagen in commit 01f2f3f6ef4d076
>> ("net: optimize Berkeley Packet Filter (BPF) processing").
>
> well, old bpf opcodes were just unlucky to be on the wrong side of
> compiler heuristics at that time.
> Instead of doing remapping while loading and opposite remapping in
> sk_get_filter()
> the problem could have been solved with few dummy 'case' statements.
> That would have only added few lines of code instead of hundred lines of mapping
> back and forth.
> Also I suspect commit 01f2f3f6ef4d076 was valid only when whole kernel
> is compiled
> with -Os. With -O2 GCC should have done it right.
> Heuristic is: number_of_case_stmts * ratio < range_of_case_values
> and ratio is 3 for -Os and 10 for -O2.
> old bpf has ~70 stmts and ~200 range.
> See expand_switch_as_decision_tree_p() in gcc/stmt.c
> Eventually when current interpreter is retired, I would like to remove
> remapping as well.
> Especially for sk_convert_filter() I would like to keep normal BPF_
> values from uapi/filter.h,
> since for majority of them I reuse as-is and remapping would have
> added unnecessary code.
>
>>> + insn->code = fp->code;
>>> + insn->a_reg = 6;
>>> + insn->x_reg = 7;
>>> + insn->imm = fp->k;
>>> + break;
>>> +
>>> + /* jump opcodes map as-is, but offsets need adjustment */
>>> + case BPF_JMP | BPF_JA:
>>> + tgt = i + fp->k + 1;
>>> + insn->code = fp->code;
>>> +#define EMIT_JMP \
>>> + do { \
>>> + if (tgt >= len || tgt < 0) \
>>> + goto err; \
>>> + insn->off = addrs ? addrs[tgt] - addrs[i] - 1 : 0; \
>>> + } while (0)
>>> +
>>> + EMIT_JMP;
>>> + break;
>>> +
>>> + case BPF_JMP | BPF_JEQ | BPF_K:
>>> + case BPF_JMP | BPF_JEQ | BPF_X:
>>> + case BPF_JMP | BPF_JSET | BPF_K:
>>> + case BPF_JMP | BPF_JSET | BPF_X:
>>> + case BPF_JMP | BPF_JGT | BPF_K:
>>> + case BPF_JMP | BPF_JGT | BPF_X:
>>> + case BPF_JMP | BPF_JGE | BPF_K:
>>> + case BPF_JMP | BPF_JGE | BPF_X:
>>> + insn->a_reg = 6;
>>> + insn->x_reg = 7;
>>> + insn->imm = fp->k;
>>> + /* common case where 'jump_false' is next insn */
>>> + if (fp->jf == 0) {
>>> + insn->code = fp->code;
>>> + tgt = i + fp->jt + 1;
>>> + EMIT_JMP;
>>> + break;
>>> + }
>>> + /* convert JEQ into JNE when 'jump_true' is next
>>> insn */
>>> + if (fp->jt == 0 && BPF_OP(fp->code) == BPF_JEQ) {
>>> + insn->code = BPF_JMP | BPF_JNE |
>>> + BPF_SRC(fp->code);
>>> + tgt = i + fp->jf + 1;
>>> + EMIT_JMP;
>>> + break;
>>> + }
>>> + /* other jumps are mapped into two insns: Jxx and
>>> JA */
>>> + tgt = i + fp->jt + 1;
>>> + insn->code = fp->code;
>>> + EMIT_JMP;
>>> +
>>> + insn++;
>>> + insn->code = BPF_JMP | BPF_JA;
>>> + tgt = i + fp->jf + 1;
>>> + EMIT_JMP;
>>> + /* adjust pc relative offset, since it's a 2nd
>>> insn */
>>> + insn->off--;
>>> + break;
>>> +
>>> + /* ldxb 4*([14]&0xf) is remaped into 3 insns */
>>> + case BPF_LDX | BPF_MSH | BPF_B:
>>> + insn->code = BPF_LD | BPF_ABS | BPF_B;
>>> + insn->a_reg = 7;
>>> + insn->imm = fp->k;
>>> +
>>> + insn++;
>>> + insn->code = BPF_ALU | BPF_AND | BPF_K;
>>> + insn->a_reg = 7;
>>> + insn->imm = 0xf;
>>> +
>>> + insn++;
>>> + insn->code = BPF_ALU | BPF_LSH | BPF_K;
>>> + insn->a_reg = 7;
>>> + insn->imm = 2;
>>> + break;
>>> +
>>> + /* RET_K, RET_A are remaped into 2 insns */
>>> + case BPF_RET | BPF_A:
>>> + case BPF_RET | BPF_K:
>>> + insn->code = BPF_ALU | BPF_MOV |
>>> + (BPF_SRC(fp->code) == BPF_K ? BPF_K :
>>> BPF_X);
>>> + insn->a_reg = 0;
>>> + insn->x_reg = 6;
>>> + insn->imm = fp->k;
>>> +
>>> + insn++;
>>> + insn->code = BPF_RET | BPF_K;
>>> + break;
>>> +
>>> + /* store to stack */
>>> + case BPF_ST:
>>> + case BPF_STX:
>>> + insn->code = BPF_STX | BPF_MEM | BPF_W;
>>> + insn->a_reg = 10;
>>> + insn->x_reg = fp->code == BPF_ST ? 6 : 7;
>>> + insn->off = -(BPF_MEMWORDS - fp->k) * 4;
>>> + break;
>>> +
>>> + /* load from stack */
>>> + case BPF_LD | BPF_MEM:
>>> + case BPF_LDX | BPF_MEM:
>>> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
>>> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>>> 7;
>>> + insn->x_reg = 10;
>>> + insn->off = -(BPF_MEMWORDS - fp->k) * 4;
>>> + break;
>>> +
>>> + /* A = K or X = K */
>>> + case BPF_LD | BPF_IMM:
>>> + case BPF_LDX | BPF_IMM:
>>> + insn->code = BPF_ALU | BPF_MOV | BPF_K;
>>> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>>> 7;
>>> + insn->imm = fp->k;
>>> + break;
>>> +
>>> + /* X = A */
>>> + case BPF_MISC | BPF_TAX:
>>> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
>>> + insn->a_reg = 7;
>>> + insn->x_reg = 6;
>>> + break;
>>> +
>>> + /* A = X */
>>> + case BPF_MISC | BPF_TXA:
>>> + insn->code = BPF_ALU64 | BPF_MOV | BPF_X;
>>> + insn->a_reg = 6;
>>> + insn->x_reg = 7;
>>> + break;
>>> +
>>> + /* A = skb->len or X = skb->len */
>>> + case BPF_LD|BPF_W|BPF_LEN:
>>> + case BPF_LDX|BPF_W|BPF_LEN:
>>> + insn->code = BPF_LDX | BPF_MEM | BPF_W;
>>> + insn->a_reg = BPF_CLASS(fp->code) == BPF_LD ? 6 :
>>> 7;
>>> + insn->x_reg = 1;
>>> + insn->off = offsetof(struct sk_buff, len);
>>> + break;
>>> +
>>> + default:
>>> + /* pr_err("unknown opcode %02x\n", fp->code); */
>>> + goto err;
>>> + }
>>> +
>>> + insn++;
>>> + if (new_prog) {
>>> + memcpy(new_insn, tmp_insns,
>>> + sizeof(*insn) * (insn - tmp_insns));
>>> + }
>>> + new_insn += insn - tmp_insns;
>>> + }
>>> +
>>> + if (!new_prog) {
>>> + /* only calculating new length */
>>> + *p_new_len = new_insn - new_prog;
>>> + return 0;
>>> + }
>>> +
>>> + pass++;
>>> + if (new_len != new_insn - new_prog) {
>>> + new_len = new_insn - new_prog;
>>> + if (pass > 2)
>>> + goto err;
>>> + goto do_pass;
>>> + }
>>> + kfree(addrs);
>>> + if (*p_new_len != new_len)
>>> + /* inconsistent new program length */
>>> + pr_err("bpf_convert() usage error\n");
>>> + return 0;
>>> +err:
>>> + kfree(addrs);
>>> + return -EINVAL;
>>> +}
>>> +
>>> +notrace u32 bpf_run(struct bpf_context *ctx, const struct bpf_insn *insn)
>>> +{
>>
>>
>> Similarly, sk_run_filter_ext()?
>
> ok.
>
>>> + u64 stack[64];
>>> + u64 regs[16];
>>> + void *ptr;
>>> + u64 tmp;
>>> + int off;
>>> +
>>> +#ifdef __x86_64
>>> +#define LOAD_IMM /**/
>>> +#define K insn->imm
>>> +#else
>>> +#define LOAD_IMM (K = insn->imm)
>>> + s32 K = insn->imm;
>>> +#endif
>>> +
>>> +#ifdef DEBUG
>>> +#define DEBUG_INSN pr_info_bpf_insn(insn, regs)
>>> +#else
>>> +#define DEBUG_INSN
>>> +#endif
>>
>>
>> This DEBUG hunk could then be removed when we use a stap script
>> instead, for example.
>
> ok
>
>>> +#define A regs[insn->a_reg]
>>> +#define X regs[insn->x_reg]
>>> +
>>> +#define DL(A, B, C) [A|B|C] = &&L_##A##B##C,
>>> +#define L(A, B, C) L_##A##B##C:
>>
>>
>> Could we get rid of these two defines? I know you're defining
>
> I think it's actually more readable this way and lesser chance of typos.
> DL - define label
> L - label
> will try to remove 2nd marco to see how it looks...
>
>> and using that as labels, but it's not obvious at first what
>> 'jt' does. Maybe 'jt_labels' ?
>
> ok will rename.
>
>>> +#define CONT ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
>>> +#define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto select_insn; })
>>
>>
>> Not a big fan of macros, but here it seems fine though.
>>
>>
>>> +/* some compilers may need help:
>>> + * #define CONT_JMP ({insn++; LOAD_IMM; DEBUG_INSN; goto *jt[insn->code];
>>> })
>>> + */
>>> +
>>> + static const void *jt[256] = {
>>> + [0 ... 255] = &&L_default,
>>
>>
>> Nit: please just one define per line below:
>
> ok.
>
>>> + DL(BPF_ALU, BPF_ADD, BPF_X) DL(BPF_ALU, BPF_ADD, BPF_K)
>>> + DL(BPF_ALU, BPF_SUB, BPF_X) DL(BPF_ALU, BPF_SUB, BPF_K)
>>> + DL(BPF_ALU, BPF_AND, BPF_X) DL(BPF_ALU, BPF_AND, BPF_K)
>>> + DL(BPF_ALU, BPF_OR, BPF_X) DL(BPF_ALU, BPF_OR, BPF_K)
>>> + DL(BPF_ALU, BPF_LSH, BPF_X) DL(BPF_ALU, BPF_LSH, BPF_K)
>>> + DL(BPF_ALU, BPF_RSH, BPF_X) DL(BPF_ALU, BPF_RSH, BPF_K)
>>> + DL(BPF_ALU, BPF_XOR, BPF_X) DL(BPF_ALU, BPF_XOR, BPF_K)
>>> + DL(BPF_ALU, BPF_MUL, BPF_X) DL(BPF_ALU, BPF_MUL, BPF_K)
>>> + DL(BPF_ALU, BPF_MOV, BPF_X) DL(BPF_ALU, BPF_MOV, BPF_K)
>>> + DL(BPF_ALU, BPF_DIV, BPF_X) DL(BPF_ALU, BPF_DIV, BPF_K)
>>> + DL(BPF_ALU, BPF_MOD, BPF_X) DL(BPF_ALU, BPF_MOD, BPF_K)
>>> + DL(BPF_ALU64, BPF_ADD, BPF_X) DL(BPF_ALU64, BPF_ADD,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_SUB, BPF_X) DL(BPF_ALU64, BPF_SUB,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_AND, BPF_X) DL(BPF_ALU64, BPF_AND,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_OR, BPF_X) DL(BPF_ALU64, BPF_OR, BPF_K)
>>> + DL(BPF_ALU64, BPF_LSH, BPF_X) DL(BPF_ALU64, BPF_LSH,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_RSH, BPF_X) DL(BPF_ALU64, BPF_RSH,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_XOR, BPF_X) DL(BPF_ALU64, BPF_XOR,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_MUL, BPF_X) DL(BPF_ALU64, BPF_MUL,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_MOV, BPF_X) DL(BPF_ALU64, BPF_MOV,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_ARSH, BPF_X)DL(BPF_ALU64, BPF_ARSH,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_DIV, BPF_X) DL(BPF_ALU64, BPF_DIV,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_MOD, BPF_X) DL(BPF_ALU64, BPF_MOD,
>>> BPF_K)
>>> + DL(BPF_ALU64, BPF_BSWAP32, BPF_X)
>>> + DL(BPF_ALU64, BPF_BSWAP64, BPF_X)
>>> + DL(BPF_ALU, BPF_NEG, 0)
>>> + DL(BPF_JMP, BPF_CALL, 0)
>>> + DL(BPF_JMP, BPF_JA, 0)
>>> + DL(BPF_JMP, BPF_JEQ, BPF_X) DL(BPF_JMP, BPF_JEQ, BPF_K)
>>> + DL(BPF_JMP, BPF_JNE, BPF_X) DL(BPF_JMP, BPF_JNE, BPF_K)
>>> + DL(BPF_JMP, BPF_JGT, BPF_X) DL(BPF_JMP, BPF_JGT, BPF_K)
>>> + DL(BPF_JMP, BPF_JGE, BPF_X) DL(BPF_JMP, BPF_JGE, BPF_K)
>>> + DL(BPF_JMP, BPF_JSGT, BPF_X) DL(BPF_JMP, BPF_JSGT, BPF_K)
>>> + DL(BPF_JMP, BPF_JSGE, BPF_X) DL(BPF_JMP, BPF_JSGE, BPF_K)
>>> + DL(BPF_JMP, BPF_JSET, BPF_X) DL(BPF_JMP, BPF_JSET, BPF_K)
>>> + DL(BPF_STX, BPF_MEM, BPF_B) DL(BPF_STX, BPF_MEM, BPF_H)
>>> + DL(BPF_STX, BPF_MEM, BPF_W) DL(BPF_STX, BPF_MEM, BPF_DW)
>>> + DL(BPF_ST, BPF_MEM, BPF_B) DL(BPF_ST, BPF_MEM, BPF_H)
>>> + DL(BPF_ST, BPF_MEM, BPF_W) DL(BPF_ST, BPF_MEM, BPF_DW)
>>> + DL(BPF_LDX, BPF_MEM, BPF_B) DL(BPF_LDX, BPF_MEM, BPF_H)
>>> + DL(BPF_LDX, BPF_MEM, BPF_W) DL(BPF_LDX, BPF_MEM, BPF_DW)
>>> + DL(BPF_STX, BPF_XADD, BPF_W)
>>> +#ifdef CONFIG_64BIT
>>> + DL(BPF_STX, BPF_XADD, BPF_DW)
>>> +#endif
>>> + DL(BPF_LD, BPF_ABS, BPF_W) DL(BPF_LD, BPF_ABS, BPF_H)
>>> + DL(BPF_LD, BPF_ABS, BPF_B) DL(BPF_LD, BPF_IND, BPF_W)
>>> + DL(BPF_LD, BPF_IND, BPF_H) DL(BPF_LD, BPF_IND, BPF_B)
>>> + DL(BPF_RET, BPF_K, 0)
>>> + };
>>> +
>>> + regs[10/* BPF R10 */] = (u64)(ulong)&stack[64];
>>> + regs[1/* BPF R1 */] = (u64)(ulong)ctx;
>>> +
>>> + DEBUG_INSN;
>>> + /* execute 1st insn */
>>> +select_insn:
>>> + goto *jt[insn->code];
>>> +
>>> + /* ALU */
>>> +#define ALU(OPCODE, OP) \
>>> + L_BPF_ALU64##OPCODE##BPF_X: \
>>> + A = A OP X; \
>>> + CONT; \
>>> + L_BPF_ALU##OPCODE##BPF_X: \
>>> + A = (u32)A OP (u32)X; \
>>> + CONT; \
>>> + L_BPF_ALU64##OPCODE##BPF_K: \
>>> + A = A OP K; \
>>> + CONT; \
>>> + L_BPF_ALU##OPCODE##BPF_K: \
>>> + A = (u32)A OP (u32)K; \
>>> + CONT;
>>> +
>>> + ALU(BPF_ADD, +)
>>> + ALU(BPF_SUB, -)
>>> + ALU(BPF_AND, &)
>>> + ALU(BPF_OR, |)
>>> + ALU(BPF_LSH, <<)
>>> + ALU(BPF_RSH, >>)
>>> + ALU(BPF_XOR, ^)
>>> + ALU(BPF_MUL, *)
>>> +
>>> + L(BPF_ALU, BPF_NEG, 0)
>>> + A = (u32)-A;
>>> + CONT;
>>> + L(BPF_ALU, BPF_MOV, BPF_X)
>>> + A = (u32)X;
>>> + CONT;
>>> + L(BPF_ALU, BPF_MOV, BPF_K)
>>> + A = (u32)K;
>>> + CONT;
>>> + L(BPF_ALU64, BPF_MOV, BPF_X)
>>> + A = X;
>>> + CONT;
>>> + L(BPF_ALU64, BPF_MOV, BPF_K)
>>> + A = K;
>>> + CONT;
>>> + L(BPF_ALU64, BPF_ARSH, BPF_X)
>>> + (*(s64 *) &A) >>= X;
>>> + CONT;
>>> + L(BPF_ALU64, BPF_ARSH, BPF_K)
>>> + (*(s64 *) &A) >>= K;
>>> + CONT;
>>> + L(BPF_ALU64, BPF_MOD, BPF_X)
>>> + tmp = A;
>>> + if (X)
>>> + A = do_div(tmp, X);
>>> + CONT;
>>> + L(BPF_ALU, BPF_MOD, BPF_X)
>>> + tmp = (u32)A;
>>> + if (X)
>>> + A = do_div(tmp, (u32)X);
>>> + CONT;
>>> + L(BPF_ALU64, BPF_MOD, BPF_K)
>>> + tmp = A;
>>> + if (K)
>>> + A = do_div(tmp, K);
>>> + CONT;
>>> + L(BPF_ALU, BPF_MOD, BPF_K)
>>> + tmp = (u32)A;
>>> + if (K)
>>> + A = do_div(tmp, (u32)K);
>>> + CONT;
>>> + L(BPF_ALU64, BPF_DIV, BPF_X)
>>> + if (X)
>>> + do_div(A, X);
>>> + CONT;
>>> + L(BPF_ALU, BPF_DIV, BPF_X)
>>> + tmp = (u32)A;
>>> + if (X)
>>> + do_div(tmp, (u32)X);
>>> + A = (u32)tmp;
>>> + CONT;
>>> + L(BPF_ALU64, BPF_DIV, BPF_K)
>>> + if (K)
>>> + do_div(A, K);
>>> + CONT;
>>> + L(BPF_ALU, BPF_DIV, BPF_K)
>>> + tmp = (u32)A;
>>> + if (K)
>>> + do_div(tmp, (u32)K);
>>> + A = (u32)tmp;
>>> + CONT;
>>> + L(BPF_ALU64, BPF_BSWAP32, BPF_X)
>>> + A = swab32(A);
>>> + CONT;
>>> + L(BPF_ALU64, BPF_BSWAP64, BPF_X)
>>> + A = swab64(A);
>>> + CONT;
>>> +
>>> + /* CALL */
>>> + L(BPF_JMP, BPF_CALL, 0)
>>> + /* TODO execute_func(K, regs); */
>>> + CONT;
>>
>>
>> Would the filter checker for now just return -EINVAL when this is used?
>
> sure.
> I was planning to address that in a week or so, but ok.
> Will remove 'call' insn in the first patch and will re-add in a clean
> way in a next diff.
>
>>> + /* JMP */
>>> + L(BPF_JMP, BPF_JA, 0)
>>> + insn += insn->off;
>>> + CONT;
>>> + L(BPF_JMP, BPF_JEQ, BPF_X)
>>> + if (A == X) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JEQ, BPF_K)
>>> + if (A == K) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JNE, BPF_X)
>>> + if (A != X) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JNE, BPF_K)
>>> + if (A != K) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JGT, BPF_X)
>>> + if (A > X) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JGT, BPF_K)
>>> + if (A > K) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JGE, BPF_X)
>>> + if (A >= X) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JGE, BPF_K)
>>> + if (A >= K) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JSGT, BPF_X)
>>> + if (((s64)A) > ((s64)X)) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JSGT, BPF_K)
>>> + if (((s64)A) > ((s64)K)) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JSGE, BPF_X)
>>> + if (((s64)A) >= ((s64)X)) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JSGE, BPF_K)
>>> + if (((s64)A) >= ((s64)K)) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JSET, BPF_X)
>>> + if (A & X) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>> + L(BPF_JMP, BPF_JSET, BPF_K)
>>> + if (A & (u32)K) {
>>> + insn += insn->off;
>>> + CONT_JMP;
>>> + }
>>> + CONT;
>>
>>
>> Okay, so right now we only support forward jumps, right? And these
>> are checked by the old checker code I'd assume?
>
> yes. just like old interpreter new interpreter doesn't care. It just jumps.
> and you're correct. old checker code will allow only forward jumps.
> That restriction is preserved by sk_convert_filter() as well.
>
>>> + /* STX and ST and LDX*/
>>> +#define LDST(SIZEOP, SIZE) \
>>> + L_BPF_STXBPF_MEM##SIZEOP: \
>>> + *(SIZE *)(ulong)(A + insn->off) = X; \
>>> + CONT; \
>>> + L_BPF_STBPF_MEM##SIZEOP: \
>>> + *(SIZE *)(ulong)(A + insn->off) = K; \
>>> + CONT; \
>>> + L_BPF_LDXBPF_MEM##SIZEOP: \
>>> + A = *(SIZE *)(ulong)(X + insn->off); \
>>> + CONT;
>>> +
>>> + LDST(BPF_B, u8)
>>> + LDST(BPF_H, u16)
>>> + LDST(BPF_W, u32)
>>> + LDST(BPF_DW, u64)
>>> +
>>> + /* STX XADD */
>>> + L(BPF_STX, BPF_XADD, BPF_W)
>>> + atomic_add((u32)X, (atomic_t *)(ulong)(A + insn->off));
>>> + CONT;
>>> +#ifdef CONFIG_64BIT
>>> + L(BPF_STX, BPF_XADD, BPF_DW)
>>> + atomic64_add((u64)X, (atomic64_t *)(ulong)(A +
>>> insn->off));
>>> + CONT;
>>> +#endif
>>> +
>>> + /* LD from SKB + K */
>>
>>
>> Nit: indent one tab too much (here and elsewhere)
>
> ahh. ok.
>
>>> + L(BPF_LD, BPF_ABS, BPF_W)
>>> + off = K;
>>> +load_word:
>>> + ptr = load_pointer((struct sk_buff *)ctx, off, 4, &tmp);
>>> + if (likely(ptr != NULL)) {
>>> + A = get_unaligned_be32(ptr);
>>> + CONT;
>>> + }
>>> + return 0;
>>> +
>>> + L(BPF_LD, BPF_ABS, BPF_H)
>>> + off = K;
>>> +load_half:
>>> + ptr = load_pointer((struct sk_buff *)ctx, off, 2, &tmp);
>>> + if (likely(ptr != NULL)) {
>>> + A = get_unaligned_be16(ptr);
>>> + CONT;
>>> + }
>>> + return 0;
>>> +
>>> + L(BPF_LD, BPF_ABS, BPF_B)
>>> + off = K;
>>> +load_byte:
>>> + ptr = load_pointer((struct sk_buff *)ctx, off, 1, &tmp);
>>> + if (likely(ptr != NULL)) {
>>> + A = *(u8 *)ptr;
>>> + CONT;
>>> + }
>>> + return 0;
>>> +
>>> + /* LD from SKB + X + K */
>>
>>
>> Nit: ditto
>
> ok
>
>>> + L(BPF_LD, BPF_IND, BPF_W)
>>> + off = K + X;
>>> + goto load_word;
>>> +
>>> + L(BPF_LD, BPF_IND, BPF_H)
>>> + off = K + X;
>>> + goto load_half;
>>> +
>>> + L(BPF_LD, BPF_IND, BPF_B)
>>> + off = K + X;
>>> + goto load_byte;
>>> +
>>> + /* RET */
>>> + L(BPF_RET, BPF_K, 0)
>>> + return regs[0/* R0 */];
>>> +
>>> + L_default:
>>> + /* bpf_check() or bpf_convert() will guarantee that
>>> + * we never reach here
>>> + */
>>> + WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code);
>>> + return 0;
>>> +#undef DL
>>> +#undef L
>>> +#undef CONT
>>> +#undef A
>>> +#undef X
>>> +#undef K
>>> +#undef LOAD_IMM
>>> +#undef DEBUG_INSN
>>> +#undef ALU
>>> +#undef LDST
>>> +}
>>> +EXPORT_SYMBOL(bpf_run);
>>> +
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index ad30d626a5bd..575bf5fd4335 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -637,6 +637,10 @@ void sk_filter_release_rcu(struct rcu_head *rcu)
>>> {
>>> struct sk_filter *fp = container_of(rcu, struct sk_filter, rcu);
>>>
>>> + if ((void *)fp->bpf_func == (void *)bpf_run)
>>> + /* arch specific jit_free are expecting this value */
>>> + fp->bpf_func = sk_run_filter;
>>> +
>>> bpf_jit_free(fp);
>>> }
>>> EXPORT_SYMBOL(sk_filter_release_rcu);
>>> @@ -655,6 +659,81 @@ static int __sk_prepare_filter(struct sk_filter *fp)
>>> return 0;
>>> }
>>>
>>> +static int bpf64_prepare(struct sk_filter **pfp, struct sock_fprog
>>> *fprog,
>>> + struct sock *sk)
>>> +{
>>
>>
>> sk_prepare_filter_ext()?
>
> ok.
>
>>> + unsigned int fsize = sizeof(struct sock_filter) * fprog->len;
>>> + struct sock_filter *old_prog;
>>> + unsigned int sk_fsize;
>>> + struct sk_filter *fp;
>>> + int new_len;
>>> + int err;
>>> +
>>> + /* store old program into buffer, since chk_filter will remap
>>> opcodes */
>>> + old_prog = kmalloc(fsize, GFP_KERNEL);
>>> + if (!old_prog)
>>> + return -ENOMEM;
>>> +
>>> + if (sk) {
>>> + if (copy_from_user(old_prog, fprog->filter, fsize)) {
>>> + err = -EFAULT;
>>> + goto free_prog;
>>> + }
>>> + } else {
>>> + memcpy(old_prog, fprog->filter, fsize);
>>> + }
>>> +
>>> + /* calculate bpf64 program length */
>>> + err = bpf_convert(fprog->filter, fprog->len, NULL, &new_len);
>>> + if (err)
>>> + goto free_prog;
>>> +
>>> + sk_fsize = sk_filter_size(new_len);
>>> + /* allocate sk_filter to store bpf64 program */
>>> + if (sk)
>>> + fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>>> + else
>>> + fp = kmalloc(sk_fsize, GFP_KERNEL);
>>> + if (!fp) {
>>> + err = -ENOMEM;
>>> + goto free_prog;
>>> + }
>>> +
>>> + /* remap old insns into bpf64 */
>>> + err = bpf_convert(old_prog, fprog->len,
>>> + (struct bpf_insn *)fp->insns, &new_len);
>>> + if (err)
>>> + /* 2nd bpf_convert() can fail only if it fails
>>> + * to allocate memory, remapping must succeed
>>> + */
>>> + goto free_fp;
>>> +
>>> + /* now chk_filter can overwrite old_prog while checking */
>>> + err = sk_chk_filter(old_prog, fprog->len);
>>> + if (err)
>>> + goto free_fp;
>>> +
>>> + /* discard old prog */
>>> + kfree(old_prog);
>>> +
>>> + atomic_set(&fp->refcnt, 1);
>>> + fp->len = new_len;
>>> +
>>> + /* bpf64 insns must be executed by bpf_run */
>>> + fp->bpf_func = (typeof(fp->bpf_func))bpf_run;
>>> +
>>> + *pfp = fp;
>>> + return 0;
>>> +free_fp:
>>> + if (sk)
>>> + sock_kfree_s(sk, fp, sk_fsize);
>>> + else
>>> + kfree(fp);
>>> +free_prog:
>>> + kfree(old_prog);
>>> + return err;
>>> +}
>>> +
>>> /**
>>> * sk_unattached_filter_create - create an unattached filter
>>> * @fprog: the filter program
>>> @@ -676,6 +755,9 @@ int sk_unattached_filter_create(struct sk_filter
>>> **pfp,
>>> if (fprog->filter == NULL)
>>> return -EINVAL;
>>>
>>> + if (bpf64_enable)
>>> + return bpf64_prepare(pfp, fprog, NULL);
>>> +
>>> fp = kmalloc(sk_filter_size(fprog->len), GFP_KERNEL);
>>> if (!fp)
>>> return -ENOMEM;
>>> @@ -726,21 +808,27 @@ int sk_attach_filter(struct sock_fprog *fprog,
>>> struct sock *sk)
>>> if (fprog->filter == NULL)
>>> return -EINVAL;
>>>
>>> - fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>>> - if (!fp)
>>> - return -ENOMEM;
>>> - if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>>> - sock_kfree_s(sk, fp, sk_fsize);
>>> - return -EFAULT;
>>> - }
>>> + if (bpf64_enable) {
>>> + err = bpf64_prepare(&fp, fprog, sk);
>>> + if (err)
>>> + return err;
>>> + } else {
>>> + fp = sock_kmalloc(sk, sk_fsize, GFP_KERNEL);
>>> + if (!fp)
>>> + return -ENOMEM;
>>> + if (copy_from_user(fp->insns, fprog->filter, fsize)) {
>>> + sock_kfree_s(sk, fp, sk_fsize);
>>> + return -EFAULT;
>>> + }
>>>
>>> - atomic_set(&fp->refcnt, 1);
>>> - fp->len = fprog->len;
>>> + atomic_set(&fp->refcnt, 1);
>>> + fp->len = fprog->len;
>>>
>>> - err = __sk_prepare_filter(fp);
>>> - if (err) {
>>> - sk_filter_uncharge(sk, fp);
>>> - return err;
>>> + err = __sk_prepare_filter(fp);
>>> + if (err) {
>>> + sk_filter_uncharge(sk, fp);
>>> + return err;
>>> + }
>>> }
>>>
>>> old_fp = rcu_dereference_protected(sk->sk_filter,
>>> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
>>> index cf9cd13509a7..f03acc0e8950 100644
>>> --- a/net/core/sysctl_net_core.c
>>> +++ b/net/core/sysctl_net_core.c
>>> @@ -273,6 +273,13 @@ static struct ctl_table net_core_table[] = {
>>> },
>>> #endif
>>> {
>>> + .procname = "bpf64_enable",
>>> + .data = &bpf64_enable,
>>> + .maxlen = sizeof(int),
>>> + .mode = 0644,
>>> + .proc_handler = proc_dointvec
>>> + },
>>> + {
>>> .procname = "netdev_tstamp_prequeue",
>>> .data = &netdev_tstamp_prequeue,
>>> .maxlen = sizeof(int),
>>>
>>
>> Hope some of the comments made sense. ;-)
>
> Yes. Indeed. Thanks a lot for the thorough review!
> Will address things and resend V4.
>
> Alexei
>
>> Thanks,
>>
>> Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists