Date:	Mon, 19 May 2014 15:23:53 -0700
From:	Chema Gonzalez <chema@...gle.com>
To:	Alexei Starovoitov <ast@...mgrid.com>
Cc:	David Miller <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Daniel Borkmann <dborkman@...hat.com>,
	Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH v5 net-next 1/3] net: flow_dissector: avoid multiple calls in BPF

On Fri, May 16, 2014 at 3:00 PM, Alexei Starovoitov <ast@...mgrid.com> wrote:
> this is way too long for commit log. These details can go after '---' line.
This can be fixed easily ;). I must admit I'm surprised by the
request: the way I see it, a commit log is not part of the code but
text that justifies a patch, so being very descriptive is a good
thing. In any case, it's an easy fix.

>> +        * such field has been inited)
>> +        */
>> +       context = (void *)FP - BPF_MEMWORDS * 4 - sizeof(*context);
>> +       context->flow_initted = false;
>> +
>
> Neither this code nor comment belong in interpreter which is striving to
> be generic. Both have a meaning only for your new instructions in
> socket filters. They do not apply to seccomp and to tracing.
> Calling it 'struct sk_run_filter_ctx' is equally misleading.
> 'struct flow_keys' has nothing to do with seccomp and even with
> normal tcpdump filters.
I'd argue specificity already pervades the BPF filter. Almost all the
ancillary loads (now bpf calls) assume an sk_buff in the ctx field (all
except get_cpu and random). If the user's filter has a "ld #nla"
insn, the eBPF filter ends up with "mov(ctx, arg1); mov(A, arg2);
...; call(__skb_get_nlattr)", which assumes ctx holds an skb, and runs
"nla_find((struct nlattr *) &skb->data[a], ...)".

Now, this specificity is needed: We're using BPF filters to filter
packets. It would make no sense not to use all the packet-related
functions in the kernel.
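
To make that concrete, this is roughly what the existing helper in
net/core/filter.c looks like (simplified by me, bounds checks trimmed);
the u64 ctx argument is simply assumed to be a struct sk_buff pointer:

static u64 __skb_get_nlattr(u64 ctx, u64 a, u64 x, u64 r4, u64 r5)
{
	/* only meaningful for socket filters: ctx is an skb */
	struct sk_buff *skb = (struct sk_buff *)(unsigned long) ctx;
	struct nlattr *nla;

	if (skb_is_nonlinear(skb) || skb->len < sizeof(struct nlattr))
		return 0;

	/* A: byte offset where the attributes start, X: type to find */
	nla = nla_find((struct nlattr *) &skb->data[a], skb->len - a, x);

	return nla ? (void *) nla - (void *) skb->data : 0;
}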

> We've spent so much time to generalize it and now you want to take
> it back into socket specific territory. As I said in the previous email
> it's totally not ok to customize eBPF just for one use case especially
> when you're not sharing final usage scenario.
Sorry if I haven't mentioned it before.

Right now I use the following filter to detect some packets I'm
interested in (roughly 'ip[1] == 0x03 and tcp and dst port 80' in
tcpdump syntax):

$ cat tools/net/tos_and_tcp_and_dport.bpf
ldh [12]
jne #0x800, drop
ldb [15]
jne #0x3, drop
ldb [23]
jne #0x6, drop
ldh [20]
jset #0x1fff, drop
ldxb 4*([14]&0xf)
ldh [x + 16]
jne #0x50, drop
ret #-1
drop: ret #0
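
(For context, a program like this can be attached from user space
roughly as follows -- my sketch, not part of the patch; it is the same
program hand-assembled into struct sock_filter, assuming plain
Ethernet framing as in the listing above:)

#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/filter.h>
#include <linux/if_ether.h>

int main(void)
{
	struct sock_filter insns[] = {
		BPF_STMT(BPF_LD  | BPF_H | BPF_ABS, 12),            /* ldh [12]               */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x0800, 0, 10),  /* not IPv4 -> drop       */
		BPF_STMT(BPF_LD  | BPF_B | BPF_ABS, 15),             /* ldb [15] (ip[1], tos)  */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x03, 0, 8),     /* tos != 3 -> drop       */
		BPF_STMT(BPF_LD  | BPF_B | BPF_ABS, 23),             /* ldb [23] (ip proto)    */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x06, 0, 6),     /* not TCP -> drop        */
		BPF_STMT(BPF_LD  | BPF_H | BPF_ABS, 20),             /* ldh [20] (frag field)  */
		BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, 0x1fff, 4, 0),  /* non-first frag -> drop */
		BPF_STMT(BPF_LDX | BPF_B | BPF_MSH, 14),             /* ldxb 4*([14]&0xf)      */
		BPF_STMT(BPF_LD  | BPF_H | BPF_IND, 16),             /* ldh [x + 16] (dport)   */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x50, 0, 1),     /* dport != 80 -> drop    */
		BPF_STMT(BPF_RET | BPF_K, 0xffffffff),               /* ret #-1 (accept)       */
		BPF_STMT(BPF_RET | BPF_K, 0),                        /* drop: ret #0           */
	};
	struct sock_fprog prog = {
		.len	= sizeof(insns) / sizeof(insns[0]),
		.filter	= insns,
	};
	/* needs CAP_NET_RAW */
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

	if (fd < 0 || setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
				 &prog, sizeof(prog)) < 0) {
		perror("attach filter");
		return 1;
	}
	/* reads on fd now only see tos==3 TCP packets to dst port 80 */
	return 0;
}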

Now, I want to do the same filtering in encapsulated traffic (vlan and
gre). I'd like to be able to write something like:

$ cat tools/net/tos_and_tcp_and_dport_generic.bpf
ld #nproto
jne #0x800, drop
ld #nhoff
ldh [x + 1]
jne #0x3, drop
...

This will automatically support all the de-encapsulation performed by
the flow dissector.

> Any uapi change must be strongly justified.
I hope I have managed to convince you with the example above. Still,
my proposal only adds a couple of ancillary loads.

> Other than design issue there is technical problem too.
> You're not updating eBPF JIT and context->flow_initted will have junk,
> so ld_poff will be returning junk too.
> JITs and interpreter must match. That was always the case even in
> old days of classic BPF. Interpreter initialized A and X to zero,
> JITs had to do the same. If you add stuff like this to interpreter,
> JITs would have to follow. Fast forward few years and this is not scalable.
> You have several JITs around and adding custom stuff to interpreter
> may not even be possible to do in JITs.
This is a fair point. In fact my patch is breaking "ld poff."

I think the previous mechanism, where an insn newly added to BPF made
the BPF JIT engine refuse to run (and the filter fell back to the
interpreter), was more scalable: new stuff always went into the
non-jitted runner, and people interested in the jitted version ported
it when they needed it. We definitely do not want to require people
adding new features to provide 6 versions of the code.
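
(The mechanism I mean is the pattern the classic JITs follow, roughly
-- this is a paraphrase, not copied from any particular arch: any
opcode the JIT does not know how to emit makes it bail out, and the
filter quietly keeps running in the interpreter.)

	for (i = 0; i < flen; i++) {
		switch (filter[i].code) {
		case BPF_S_LD_W_ABS:
			/* ... emit native code for this insn ... */
			break;
		/* ... */
		default:
			/* unknown/unsupported insn: give up, the
			 * filter runs in the interpreter instead */
			goto out;
		}
	}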

> Therefore I'm strongly against adding custom C code to interpreter.
> It's an interpreter of eBPF instructions. Nothing else.
> At the same time adding new eBPF instructions would be completely fine.
> Like right now we don't have < and <= conditional jumps, which adds
> unnecessary complexity to tracing filters. So we may still expand
> interpreter and JITs in the future.
>
> Going back to your use case… assuming that you share how user
> space will use new anc loads… you can do equivalent initialization
> of 'flow_initted' by emitting extra eBPF insn while converting classic
> to internal. Don't do it unconditionally though. tcpdump filters should
> not suffer. You can scan program for presence of ld_poff, ld_toff,
> ld_tproto and if they're present, emit single eBPF 'st [fp - off], 0'
> ST_MEM instruction as a first instruction of converted filter. All eBPF
> JITs will compile it to native automatically. You'll have good
> performance and all existing filters will not regress.
Sounds fair. I'll think of a good mechanism to allow different
instructions to "share" state among themselves.
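
Something along these lines, I guess (just a sketch of your suggestion
as I understand it; FLOW_INITTED_OFF, the helper name and the converter
variable names are made up, and the exact insn macro may differ in this
tree):

/* does the classic filter use any flow-dissector ancillary load? */
static bool bpf_needs_flow_dissector(const struct sock_filter *fp, int flen)
{
	int i;

	for (i = 0; i < flen; i++) {
		if (fp[i].code == (BPF_LD | BPF_W | BPF_ABS) &&
		    fp[i].k == SKF_AD_OFF + SKF_AD_PAY_OFFSET)
			/* same check for the new toff/tproto loads */
			return true;
	}
	return false;
}

	/* in the classic->internal converter, before translating the
	 * program: emit "st [fp - FLOW_INITTED_OFF], 0" exactly once,
	 * so every JIT clears the flag as a plain ST_MEM and needs no
	 * special casing; filters without these loads pay nothing */
	if (bpf_needs_flow_dissector(prog, len))
		*new_insn++ = BPF_ST_MEM(BPF_W, BPF_REG_FP,
					 -FLOW_INITTED_OFF, 0);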

-Chema


>
>>         /* Register for user BPF programs need to be reset first. */
>>         regs[BPF_REG_A] = 0;
>>         regs[BPF_REG_X] = 0;
>> @@ -602,7 +619,10 @@ static unsigned int pkt_type_offset(void)
>>
>>  static u64 __skb_get_pay_offset(u64 ctx, u64 a, u64 x, u64 r4, u64 r5)
>>  {
>> -       return __skb_get_poff((struct sk_buff *)(unsigned long) ctx);
>> +       struct sk_run_filter_ctx *context = (void *) r4 - BPF_MEMWORDS * 4 -
>> +                       sizeof(*context);
>> +       return __skb_get_poff((struct sk_buff *)(unsigned long) ctx,
>> +                       &context->flow, &context->flow_initted);
>>  }
>>
>>  static u64 __skb_get_nlattr(u64 ctx, u64 a, u64 x, u64 r4, u64 r5)
>> @@ -783,6 +803,10 @@ static bool convert_bpf_extensions(struct sock_filter *fp,
>>                 *insn = BPF_ALU64_REG(BPF_MOV, BPF_REG_ARG3, BPF_REG_X);
>>                 insn++;
>>
>> +               /* arg4 = FP */
>> +               *insn = BPF_ALU64_REG(BPF_MOV, BPF_REG_ARG4, BPF_REG_FP);
>> +               insn++;
>> +
>>                 /* Emit call(ctx, arg2=A, arg3=X) */
>>                 insn->code = BPF_JMP | BPF_CALL;
>>                 switch (fp->k) {
>> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>> index 107ed12..cefe1d2 100644
>> --- a/net/core/flow_dissector.c
>> +++ b/net/core/flow_dissector.c
>> @@ -275,16 +275,20 @@ EXPORT_SYMBOL(__skb_tx_hash);
>>   * truncate packets without needing to push actual payload to the user
>>   * space and can analyze headers only, instead.
>>   */
>> -u32 __skb_get_poff(const struct sk_buff *skb)
>> +u32 __skb_get_poff(const struct sk_buff *skb, struct flow_keys *flow,
>> +               bool *flow_initted)
>>  {
>> -       struct flow_keys keys;
>>         u32 poff = 0;
>>
>> -       if (!skb_flow_dissect(skb, &keys))
>> -               return 0;
>> +       /* check whether the flow dissector has already been run */
>> +       if (!*flow_initted) {
>> +               if (!skb_flow_dissect(skb, flow))
>> +                       return 0;
>> +               *flow_initted = true;
>> +       }
>>
>> -       poff += keys.thoff;
>> -       switch (keys.ip_proto) {
>> +       poff += flow->thoff;
>> +       switch (flow->ip_proto) {
>>         case IPPROTO_TCP: {
>>                 const struct tcphdr *tcph;
>>                 struct tcphdr _tcph;
>> --
>> 1.9.1.423.g4596e3a
>>
