Message-ID: <CA+ZOOTPmSPeBrSHAsJtyxkAUnZDpgdhhw2PkrCJ3LhmG+ZSRYA@mail.gmail.com>
Date: Mon, 19 May 2014 15:23:53 -0700
From: Chema Gonzalez <chema@...gle.com>
To: Alexei Starovoitov <ast@...mgrid.com>
Cc: David Miller <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Daniel Borkmann <dborkman@...hat.com>,
Network Development <netdev@...r.kernel.org>
Subject: Re: [PATCH v5 net-next 1/3] net: flow_dissector: avoid multiple calls
in BPF
On Fri, May 16, 2014 at 3:00 PM, Alexei Starovoitov <ast@...mgrid.com> wrote:
> this is way too long for commit log. These details can go after '---' line.
This is easy to fix ;). I must admit I'm surprised by the request:
the way I see it, a commit log is not part of the code but text that
justifies a patch, so being extremely descriptive is a good thing.
Still, it's easy enough to move the details after the '---' line.
>> + * such field has been inited)
>> + */
>> + context = (void *)FP - BPF_MEMWORDS * 4 - sizeof(*context);
>> + context->flow_initted = false;
>> +
>
> Neither this code nor the comment belongs in the interpreter, which is
> striving to be generic. Both have meaning only for your new instructions
> in socket filters. They do not apply to seccomp or to tracing.
> Calling it 'struct sk_run_filter_ctx' is equally misleading:
> 'struct flow_keys' has nothing to do with seccomp, or even with
> normal tcpdump filters.
I'd argue specificity already pervades the BPF filter. Almost all the
ancillary loads (now bpf_calls) assume that ctx points to an sk_buff
(all but get_cpu and random). If the user's filter has a "ld #nla"
insn, the eBPF filter ends up containing "mov(ctx, arg1); mov(A, arg2);
...; call(__skb_get_nlattr)", which assumes ctx holds an skb and runs
"nla_find((struct nlattr *) &skb->data[a], ...)".
Now, this specificity is needed: We're using BPF filters to filter
packets. It would make no sense not to use all the packet-related
functions in the kernel.
> We've spent so much time to generalize it and now you want to take
> it back into socket specific territory. As I said in the previous email
> it's totally not ok to customize eBPF just for one use case especially
> when you're not sharing final usage scenario.
Sorry if I haven't mentioned it before.
Right now I use the following filter to detect the packets I'm
interested in ('ip[1] == 0x03 and tcp and dst port 80' in tcpdump-ish
expression syntax):
$ cat tools/net/tos_and_tcp_and_dport.bpf
ldh [12]
jne #0x800, drop
ldb [15]
jne #0x3, drop
ldb [23]
jne #0x6, drop
ldh [20]
jset #0x1fff, drop
ldxb 4*([14]&0xf)
ldh [x + 16]
jne #0x50, drop
ret #-1
drop: ret #0
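(For completeness, this is roughly how such a filter gets used from
userspace; a minimal sketch with the bpf_asm output pasted in as raw
sock_filter entries, most opcodes and all error handling omitted.)

#include <arpa/inet.h>
#include <linux/filter.h>
#include <linux/if_ether.h>
#include <sys/socket.h>

/* opcodes as emitted by tools/net/bpf_asm */
static struct sock_filter code[] = {
        { 0x28, 0, 0, 0x0000000c },     /* ldh [12]     */
        /* ... remaining assembled insns ... */
        { 0x06, 0, 0, 0x00000000 },     /* drop: ret #0 */
};

static int attach_filter(void)
{
        struct sock_fprog prog = {
                .len    = sizeof(code) / sizeof(code[0]),
                .filter = code,
        };
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

        if (fd < 0)
                return -1;
        return setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
                          &prog, sizeof(prog));
}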
Now, I want to do the same filtering on encapsulated traffic (vlan and
gre). I'd like to be able to write something like:
$ cat tools/net/tos_and_tcp_and_dport_generic.bpf
ld #nproto
jne #0x800, drop
ld #nhoff
tax
ldb [x + 1]
jne #0x3, drop
...
This will automatically support all the de-encapsulation performed by
the flow dissector.
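(To be concrete, this is the state the flow dissector already computes
after walking the encapsulation; an illustrative sketch against the
current skb_flow_dissect() and struct flow_keys:)

/* illustration only: what the dissector leaves behind */
static void flow_info_example(const struct sk_buff *skb)
{
        struct flow_keys keys;

        if (!skb_flow_dissect(skb, &keys))
                return;
        /* keys.thoff    - transport header offset, past vlan/gre encap */
        /* keys.ip_proto - innermost L4 protocol (IPPROTO_TCP, ...)     */
        /* keys.ports    - L4 ports, when present                       */
}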
> Any uapi change must be strongly justified.
I hope I have managed to convince you with the example above. Still,
my proposal only adds a couple of ancillary loads.
> Other than the design issue, there is a technical problem too.
> You're not updating the eBPF JIT, so context->flow_initted will contain
> junk and ld_poff will return junk too.
> JITs and the interpreter must match. That was always the case even in
> the old days of classic BPF: the interpreter initialized A and X to
> zero, so the JITs had to do the same. If you add stuff like this to the
> interpreter, the JITs have to follow. Fast forward a few years and this
> is not scalable: there are several JITs around, and custom stuff added
> to the interpreter may not even be possible to do in the JITs.
This is a fair point. In fact my patch breaks "ld poff" for JITed
filters.
I think the previous mechanism, where a new insn added to BPF made the
BPF JIT engines refuse the filter (falling back to the interpreter),
was more scalable: new stuff always went into the non-jitted
interpreter first, and people interested in a jitted version ported it
when they needed it. We definitely do not want to require people adding
new features to provide 6 versions of the code.
> Therefore I'm strongly against adding custom C code to interpreter.
> It's an interpreter of eBPF instructions. Nothing else.
> At the same time adding new eBPF instructions would be completely fine.
> Like right now we don't have < and <= conditional jumps, which adds
> unnecessary complexity to tracing filters. So we may still expand
> interpreter and JITs in the future.
>
> Going back to your use case… assuming that you share how user
> space will use the new anc loads… you can do the equivalent
> initialization of 'flow_initted' by emitting an extra eBPF insn while
> converting classic to internal. Don't do it unconditionally though;
> tcpdump filters should not suffer. You can scan the program for the
> presence of ld_poff, ld_toff, ld_tproto and, if they're present, emit
> a single eBPF 'st [fp - off], 0' ST_MEM instruction as the first
> instruction of the converted filter. All eBPF JITs will compile it to
> native automatically. You'll have good performance and existing
> filters will not regress.
Sounds fair. I'll think of a good mechanism to let instructions share
state among themselves.
-Chema
>
>> /* Register for user BPF programs need to be reset first. */
>> regs[BPF_REG_A] = 0;
>> regs[BPF_REG_X] = 0;
>> @@ -602,7 +619,10 @@ static unsigned int pkt_type_offset(void)
>>
>> static u64 __skb_get_pay_offset(u64 ctx, u64 a, u64 x, u64 r4, u64 r5)
>> {
>> - return __skb_get_poff((struct sk_buff *)(unsigned long) ctx);
>> + struct sk_run_filter_ctx *context = (void *) r4 - BPF_MEMWORDS * 4 -
>> + sizeof(*context);
>> + return __skb_get_poff((struct sk_buff *)(unsigned long) ctx,
>> + &context->flow, &context->flow_initted);
>> }
>>
>> static u64 __skb_get_nlattr(u64 ctx, u64 a, u64 x, u64 r4, u64 r5)
>> @@ -783,6 +803,10 @@ static bool convert_bpf_extensions(struct sock_filter *fp,
>> *insn = BPF_ALU64_REG(BPF_MOV, BPF_REG_ARG3, BPF_REG_X);
>> insn++;
>>
>> + /* arg4 = FP */
>> + *insn = BPF_ALU64_REG(BPF_MOV, BPF_REG_ARG4, BPF_REG_FP);
>> + insn++;
>> +
>> /* Emit call(ctx, arg2=A, arg3=X) */
>> insn->code = BPF_JMP | BPF_CALL;
>> switch (fp->k) {
>> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>> index 107ed12..cefe1d2 100644
>> --- a/net/core/flow_dissector.c
>> +++ b/net/core/flow_dissector.c
>> @@ -275,16 +275,20 @@ EXPORT_SYMBOL(__skb_tx_hash);
>> * truncate packets without needing to push actual payload to the user
>> * space and can analyze headers only, instead.
>> */
>> -u32 __skb_get_poff(const struct sk_buff *skb)
>> +u32 __skb_get_poff(const struct sk_buff *skb, struct flow_keys *flow,
>> + bool *flow_initted)
>> {
>> - struct flow_keys keys;
>> u32 poff = 0;
>>
>> - if (!skb_flow_dissect(skb, &keys))
>> - return 0;
>> + /* check whether the flow dissector has already been run */
>> + if (!*flow_initted) {
>> + if (!skb_flow_dissect(skb, flow))
>> + return 0;
>> + *flow_initted = true;
>> + }
>>
>> - poff += keys.thoff;
>> - switch (keys.ip_proto) {
>> + poff += flow->thoff;
>> + switch (flow->ip_proto) {
>> case IPPROTO_TCP: {
>> const struct tcphdr *tcph;
>> struct tcphdr _tcph;
>> --
>> 1.9.1.423.g4596e3a
>>