Message-ID: <CAF=yD-L3805_=+fiWcFrm=pi0soC05yzXZPOCk6aeKBBpubg=A@mail.gmail.com>
Date: Wed, 12 Sep 2018 18:53:26 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Petar Penkov <peterpenkov96@...il.com>,
Network Development <netdev@...r.kernel.org>,
David Miller <davem@...emloft.net>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
simon.horman@...ronome.com, ecree@...arflare.com,
songliubraving@...com, Tom Herbert <tom@...bertland.com>,
Petar Penkov <ppenkov@...gle.com>,
Willem de Bruijn <willemb@...gle.com>
Subject: Re: [bpf-next, v2 1/3] flow_dissector: implements flow dissector BPF hook

On Wed, Sep 12, 2018 at 6:25 PM Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
>
> On Wed, Sep 12, 2018 at 02:43:37PM -0400, Willem de Bruijn wrote:
> > On Tue, Sep 11, 2018 at 11:47 PM Alexei Starovoitov
> > <alexei.starovoitov@...il.com> wrote:
> > >
> > > On Fri, Sep 07, 2018 at 05:11:08PM -0700, Petar Penkov wrote:
> > > > From: Petar Penkov <ppenkov@...gle.com>
> > > >
> > > > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > > > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > > > path. The BPF program is per-network namespace.
> > > >
> > > > Signed-off-by: Petar Penkov <ppenkov@...gle.com>
> > > > Signed-off-by: Willem de Bruijn <willemb@...gle.com>
> > > > ---
> > > > include/linux/bpf.h | 1 +
> > > > include/linux/bpf_types.h | 1 +
> > > > include/linux/skbuff.h | 7 ++
> > > > include/net/net_namespace.h | 3 +
> > > > include/net/sch_generic.h | 12 ++-
> > > > include/uapi/linux/bpf.h | 25 ++++++
> > > > kernel/bpf/syscall.c | 8 ++
> > > > kernel/bpf/verifier.c | 32 ++++++++
> > > > net/core/filter.c | 67 ++++++++++++++++
> > > > net/core/flow_dissector.c | 136 +++++++++++++++++++++++++++++++++
> > > > tools/bpf/bpftool/prog.c | 1 +
> > > > tools/include/uapi/linux/bpf.h | 25 ++++++
> > > > tools/lib/bpf/libbpf.c | 2 +
> > >
> > > please split the update to tools/include/uapi/linux/bpf.h into a separate patch 2.
> > > We often have conflicts in there, so it is best to keep it separate.
> > > Also please split the tools/lib and tools/bpf changes into patch 3.
> >
> > Will do in v3.
> >
> > > > struct qdisc_skb_cb {
> > > > - unsigned int pkt_len;
> > > > - u16 slave_dev_queue_mapping;
> > > > - u16 tc_classid;
> > > > + union {
> > > > + struct {
> > > > + unsigned int pkt_len;
> > > > + u16 slave_dev_queue_mapping;
> > > > + u16 tc_classid;
> > > > + };
> > > > + struct bpf_flow_keys *flow_keys;
> > > > + };
> > >
> > > is this magic really necessary? flow_dissector runs very early in recv path.
> > > There is no qdisc or conflicts with tcp/ip use of cb.
> > > I think the whole cb block can be used.
> >
> > The flow dissector also runs in the context of TC, from flower.
> > But that is not a reason to use this struct.
> >
> > We need both (a) data shared with the BPF application and between
> > applications using tail-calls, to pass along the parse offset (nhoff),
> > and (b) data not accessible by the program, to store the flow_keys
> > pointer.
> >
> > qdisc_skb_cb already has this split, exposing only the 20B .data
> > field to the application. Flow dissection currently reuses the existing
> > bpf_convert_ctx_access logic for this field.
> >
> > We could create a separate flowdissect_skb_cb struct with the
> > same split setup, but a second constraint is that relevant internal
> > BPF interfaces already expect qdisc_skb_cb, notably
> > bpf_skb_data_end. So the union was easier.
>
> got it. all makes sense.
>
> >
> > There is another way to pass nhoff besides cb[] (see below). But
> > I don't immediately see another place to store the flow_keys ptr.
> >
> > At least, if we pass skb as context. One larger change would
> > be to introduce another ctx struct, similar to sk_reuseport_(kern|md).
>
> yeah. thought about this too, but your approach looks easier and faster.
> Accesses to skb have one less dereference.
>
> > > > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> > > > /* ... here. */
> > > >
> > > > __u32 data_meta;
> > > > + __u32 flow_keys;
> > >
> > > please use
> > > struct bpf_flow_keys *flow_keys;
> > > instead.
> > >
> > > See what we did in 'struct sk_msg_md' and in 'struct sk_reuseport_md'.
> > > There is no need to hide pointers in u32.
> > >
> >
> > Will do in v3.
> >
> > > > @@ -658,6 +754,46 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> > > > FLOW_DISSECTOR_KEY_BASIC,
> > > > target_container);
> > > >
> > > > + rcu_read_lock();
> > > > + attached = skb ? rcu_dereference(dev_net(skb->dev)->flow_dissector_prog)
> > > > + : NULL;
> > > > + if (attached) {
> > > > + /* Note that even though the const qualifier is discarded
> > > > + * throughout the execution of the BPF program, all changes (the
> > > > + * control block) are reverted after the BPF program returns.
> > > > + * Therefore, __skb_flow_dissect does not alter the skb.
> > > > + */
> > > > + struct bpf_flow_keys flow_keys = {};
> > > > + struct qdisc_skb_cb cb_saved;
> > > > + struct qdisc_skb_cb *cb;
> > > > + u16 *pseudo_cb;
> > > > + u32 result;
> > > > +
> > > > + cb = qdisc_skb_cb(skb);
> > > > + pseudo_cb = (u16 *)bpf_skb_cb((struct sk_buff *)skb);
> > > > +
> > > > + /* Save Control Block */
> > > > + memcpy(&cb_saved, cb, sizeof(cb_saved));
> > > > + memset(cb, 0, sizeof(cb_saved));
> > > > +
> > > > + /* Pass parameters to the BPF program */
> > > > + cb->flow_keys = &flow_keys;
> > > > + *pseudo_cb = nhoff;
> > >
> > > I don't understand this bit.
> > > What is this pseudo_cb and why nhoff goes in there?
> > > Some odd way to pass it into the prog?
> >
> > Yes. nhoff passes the offset to the program to start parsing from.
> > Applications also pass this during tail calls.
> >
> > Alternatively we can just add a new field to struct bpf_flow_keys.
>
> I think that certainly will be cleaner and easier to use from
> bpf prog pov. Since flow_keys stay constant any change to nhoff
> between tail_calls will be preserved too. I see no cons to such approach.

Yes, it's definitely simpler. We'll do that.
Thanks!
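
For illustration only, a minimal userspace sketch of the approach agreed on
above: the parse offset lives in the keys structure itself, so each parsing
stage reads and advances it, and the next stage picks up where the previous
one stopped. Everything here is an assumption made for the sketch, not code
from the patch: struct flow_keys_sketch is a hypothetical stand-in for
struct bpf_flow_keys, the parse_* functions and dissect_sketch are invented
names, and plain C function calls stand in for BPF tail calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct bpf_flow_keys: only what the sketch needs. */
struct flow_keys_sketch {
	uint16_t nhoff;    /* offset of the next header to parse */
	uint16_t n_proto;  /* EtherType, host byte order */
	uint8_t  ip_proto; /* IPv4 protocol field */
};

#define SKETCH_ETH_HLEN   14u
#define SKETCH_ETH_P_IP   0x0800u
#define SKETCH_IPV4_MIN   20u

/* Stage 2: parse IPv4 starting at keys->nhoff, then advance nhoff. */
static int parse_ipv4(const uint8_t *pkt, size_t len,
		      struct flow_keys_sketch *keys)
{
	size_t off = keys->nhoff;

	if (off + SKETCH_IPV4_MIN > len)
		return -1;

	size_t ihl = (size_t)(pkt[off] & 0x0f) * 4;

	if (ihl < SKETCH_IPV4_MIN || off + ihl > len)
		return -1;

	keys->ip_proto = pkt[off + 9];
	keys->nhoff = (uint16_t)(off + ihl);
	return 0;
}

/* Stage 1: parse Ethernet, advance nhoff, hand over to the next stage. */
static int parse_eth(const uint8_t *pkt, size_t len,
		     struct flow_keys_sketch *keys)
{
	size_t off = keys->nhoff;

	if (off + SKETCH_ETH_HLEN > len)
		return -1;

	keys->n_proto = (uint16_t)((pkt[off + 12] << 8) | pkt[off + 13]);
	keys->nhoff = (uint16_t)(off + SKETCH_ETH_HLEN);

	/* Stands in for a tail call: the callee sees the updated nhoff. */
	if (keys->n_proto == SKETCH_ETH_P_IP)
		return parse_ipv4(pkt, len, keys);
	return 0;
}

/* Caller side: seed nhoff in the keys instead of a control-block slot. */
int dissect_sketch(const uint8_t *pkt, size_t len, uint16_t nhoff,
		   struct flow_keys_sketch *keys)
{
	assert(pkt != NULL && keys != NULL);

	keys->nhoff = nhoff;
	keys->n_proto = 0;
	keys->ip_proto = 0;
	return parse_eth(pkt, len, keys);
}
```

Because the same keys object is visible to every stage, no separate
scratch area is needed to carry the offset between stages, which is the
simplification the thread settles on.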