
Message-ID: <20190214055725.GC10595@mini-arch>
Date:   Wed, 13 Feb 2019 21:57:25 -0800
From:   Stanislav Fomichev <sdf@...ichev.me>
To:     Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc:     Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        Stanislav Fomichev <sdf@...gle.com>,
        Network Development <netdev@...r.kernel.org>,
        David Miller <davem@...emloft.net>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        simon.horman@...ronome.com, Willem de Bruijn <willemb@...gle.com>
Subject: Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when
 called from eth_get_headlen

On 02/13, Alexei Starovoitov wrote:
> On Tue, Feb 12, 2019 at 09:02:32AM -0800, Stanislav Fomichev wrote:
> > On 02/05, Stanislav Fomichev wrote:
> > > On 02/05, Alexei Starovoitov wrote:
> > > > On Tue, Feb 05, 2019 at 07:56:19PM -0800, Stanislav Fomichev wrote:
> > > > > On 02/05, Alexei Starovoitov wrote:
> > > > > > On Tue, Feb 05, 2019 at 04:59:31PM -0800, Stanislav Fomichev wrote:
> > > > > > > On 02/05, Alexei Starovoitov wrote:
> > > > > > > > On Tue, Feb 05, 2019 at 12:40:03PM -0800, Stanislav Fomichev wrote:
> > > > > > > > > On 02/05, Willem de Bruijn wrote:
> > > > > > > > > > On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@...gle.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Currently, when eth_get_headlen calls flow dissector, it doesn't pass any
> > > > > > > > > > > skb. Because we use passed skb to lookup associated networking namespace
> > > > > > > > > > > to find whether we have a BPF program attached or not, we always use
> > > > > > > > > > > C-based flow dissector in this case.
> > > > > > > > > > >
> > > > > > > > > > > The goal of this patch series is to add new networking namespace argument
> > > > > > > > > > > to the eth_get_headlen and make BPF flow dissector programs be able to
> > > > > > > > > > > work in the skb-less case.
> > > > > > > > > > >
> > > > > > > > > > > The series goes like this:
> > > > > > > > > > > 1. introduce __init_skb and __init_skb_shinfo; those will be used to
> > > > > > > > > > >    initialize temporary skb
> > > > > > > > > > > 2. introduce skb_net which can be used to get networking namespace
> > > > > > > > > > >    associated with an skb
> > > > > > > > > > > 3. add new optional network namespace argument to __skb_flow_dissect and
> > > > > > > > > > >    plumb through the callers
> > > > > > > > > > > 4. add new __flow_bpf_dissect which constructs temporary on-stack skb
> > > > > > > > > > >    (using __init_skb) and calls BPF flow dissector program
> > > > > > > > > > 
> > > > > > > > > > The main concern I see with this series is this cost of skb zeroing
> > > > > > > > > > for every packet in the device driver receive routine, *independent*
> > > > > > > > > > from the real skb allocation and zeroing which will likely happen
> > > > > > > > > > later.
> > > > > > > > > Yes, plus ~200 bytes on the stack for the callers.
> > > > > > > > > 
> > > > > > > > > > Not sure how visible this zeroing is though, I can probably try to get some
> > > > > > > > > numbers from BPF_PROG_TEST_RUN (running current version vs running with
> > > > > > > > > on-stack skb).
> > > > > > > > 
> > > > > > > > imo extra 256 byte memset for every packet is non starter.
> > > > > > > We can put pre-allocated/initialized skbs without data into percpu or even
> > > > > > > use pcpu_freelist_pop/pcpu_freelist_push to make sure we don't have to think
> > > > > > > about having multiple percpu for irq/softirq/process contexts.
> > > > > > > Any concerns with that approach?
> > > > > > > Any other possible concerns with the overall series?
> > > > > > 
> > > > > > I'm missing why the whole thing is needed.
> > > > > > You're saying:
> > > > > > " make BPF flow dissector programs be able to work in the skb-less case".
> > > > > > What does it mean specifically?
> > > > > > The only non-skb case is XDP.
> > > > > > Are you saying you want flow_dissector prog to be run in XDP?
> > > > > eth_get_headlen that drivers call on RX path on a chunk of data to
> > > > > guesstimate the length of the headers calls flow dissector without an skb
> > > > > (__skb_flow_dissect was a weird interface where it accepts skb or
> > > > > data+len). Right now, there is no way to trigger BPF flow dissector
> > > > > for this case (we don't have an skb to get associated namespace/etc/etc).
> > > > > The patch series tries to fix that to make sure that we always trigger
> > > > > BPF program if it's attached to a device's namespace.
> > > > 
> > > > then why not to create flow_dissector prog type that works without skb?
> > > > Why do you need to fake an skb?
> > > > XDP progs work just fine without it.
> > > What's the advantage of having another prog type? In this case we would have
> > > > to write the same flow dissector program twice: first time against __sk_buff
> > > interface, second time against xdp_md.
> > > By using fake skb, we make the same flow dissector __sk_buff BPF program
> > > work in both contexts without a rewrite to an xdp interface (I don't
> > > > think users should care whether flow dissector was called from "xdp" vs skb
> > > context; and we're sort of stuck with __sk_buff interface already).
> > Should I follow up with v2 where I address memset(,,256) for each packet?
> > Or you still have some questions/doubts/suggestions regarding the problem
> > I'm trying to solve?
> 
> sorry for delay. I'm still thinking what is the path forward here.
No worries, thanks for sticking with me :-)

> That 'stuck with __sk_buff' is what bothers me.
I might have used the wrong word here. I don't think there is another
option, to be honest. Using __sk_buff makes flow dissector programs work
with fragmented packets; if we were to use xdp_md instead, it would
not work in this case. Another point here: the fact that
eth_get_headlen calls flow dissector on a chunk of data instead of an skb
feels like an implementation detail. Imo, application writers should not
care about this context; coding against __sk_buff feels like the best we
can do.
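
To illustrate the fragmented-packet point, here is a rough user-space
analogy, not kernel code. It assumes a hypothetical two-part packet
(struct frag_pkt, pkt_load_bytes and pkt_linear_ptr are made-up names):
a copy-based read, similar in spirit to what bpf_skb_load_bytes() offers
__sk_buff programs, can cross the boundary between the linear area and a
fragment, while a pointer-only view of the linear area cannot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical packet: a linear head plus one page-like fragment. */
struct frag_pkt {
	const uint8_t *head;
	size_t head_len;
	const uint8_t *frag;
	size_t frag_len;
};

/* Copy len bytes starting at off into to, crossing the head/frag
 * boundary if needed. Returns 0 on success, -1 if out of range. */
static int pkt_load_bytes(const struct frag_pkt *p, size_t off,
			  void *to, size_t len)
{
	size_t total = p->head_len + p->frag_len;
	uint8_t *dst = to;
	size_t n;

	assert(p && to);
	if (off > total || len > total - off)
		return -1;

	if (off < p->head_len) {
		/* First part comes from the linear area. */
		n = p->head_len - off;
		if (n > len)
			n = len;
		memcpy(dst, p->head + off, n);
		dst += n;
		len -= n;
		off = 0;
	} else {
		off -= p->head_len;
	}

	/* The remainder, if any, comes from the fragment. */
	if (len)
		memcpy(dst, p->frag + off, len);
	return 0;
}

/* Direct-pointer access only works while the whole range is linear. */
static const uint8_t *pkt_linear_ptr(const struct frag_pkt *p,
				     size_t off, size_t len)
{
	if (off > p->head_len || len > p->head_len - off)
		return NULL;
	return p->head + off;
}
```

With a 3-byte head and a 3-byte fragment, a 3-byte read at offset 2
succeeds through pkt_load_bytes() but gets NULL from pkt_linear_ptr(),
which is the situation a linear-only context would be in.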

> It's an indication that api wasn't thought through if first thing
> it needs is this fake skb hack.
> If bpf_flow.c is a realistic example of such flow dissector prog
> it means that real skb fields are accessed.
> In particular skb->vlan_proto, skb->protocol.
I do manually set skb->protocol to eth->h_proto in my proposal. This is later
correctly handled by bpf_flow.c: parse_eth_proto() is called on skb->protocol
and we correctly handle bpf_htons(ETH_P_8021Q) there. So existing
bpf_flow.c works as expected.
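
A minimal user-space sketch of that idea follows; it is not the patch
itself. It assumes a hypothetical context (struct fake_ctx, fake_ctx_init
and resolve_l3 are made-up names) whose protocol is filled in from the
Ethernet header with vlan_present left false, and a dissector-side step
that then handles an in-band 802.1Q tag on its own. For simplicity the
protocol is kept in host byte order here, unlike skb->protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN     14
#define VLAN_HLEN    4
#define ETH_P_IP     0x0800
#define ETH_P_8021Q  0x8100
#define ETH_P_8021AD 0x88A8

/* Hypothetical stand-in for the temporary context; not a kernel type. */
struct fake_ctx {
	const uint8_t *data;
	size_t len;
	uint16_t protocol;	/* host byte order in this sketch */
	int vlan_present;
};

static uint16_t read_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Fill the context from a raw frame: protocol is taken from the
 * Ethernet header, vlan_present stays false. Returns 0 or -1. */
static int fake_ctx_init(struct fake_ctx *ctx, const uint8_t *data, size_t len)
{
	assert(ctx && data);
	if (len < ETH_HLEN)
		return -1;
	ctx->data = data;
	ctx->len = len;
	ctx->protocol = read_be16(data + 12);	/* EtherType field */
	ctx->vlan_present = 0;
	return 0;
}

/* Skip up to two in-band VLAN tags and report the L3 protocol and the
 * network header offset. Returns 0 or -1 on a truncated/odd frame. */
static int resolve_l3(const struct fake_ctx *ctx, uint16_t *l3_proto,
		      size_t *nhoff)
{
	uint16_t proto = ctx->protocol;
	size_t off = ETH_HLEN;
	int tags = 0;

	while (proto == ETH_P_8021Q || proto == ETH_P_8021AD) {
		if (++tags > 2 || ctx->len < off + VLAN_HLEN)
			return -1;
		proto = read_be16(ctx->data + off + 2);	/* inner EtherType */
		off += VLAN_HLEN;
	}
	*l3_proto = proto;
	*nhoff = off;
	return 0;
}
```

For a frame carrying EtherType 0x8100 followed by an inner 0x0800, the
context reports 802.1Q as its protocol and resolve_l3() ends up at IPv4
with the network header 18 bytes in.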

Related: I was also thinking about moving this check out of bpf_flow.c
and passing n_proto directly. I don't see why bpf_flow_keys "exports"
n_proto; we know it when we call flow dissector, so there is no need to
test for skb->vlan_present in the bpf program.

> These fields in case of 'fake skb' will not be set, since eth_type_trans()
> isn't called yet.
Just to reiterate, I do set skb->protocol manually and
skb->vlan_present == false (and bpf_flow.c handles this case).
We can also set the correct vlan_present with an additional check, but I
decided not to do it since bpf_flow.c handles that.

> So either flow_dissector needs a real __sk_buff and all of its fields
> should be real or it's a different flow_dissector prog type that
> needs ctx->data, ctx->data_end, ctx->flow_keys only.
> Either way going with fake skb is incorrect, since bpf_flow.c example
> will be broken and for program writers it will be hard to figure why
> it's broken.
It's fake in the sense that it's stack allocated in this particular case.
To bpf_flow.c it looks like a real __sk_buff. I don't see why the
bpf_flow.c example might be broken in this case, can you elaborate?

The goal of this patch series was to essentially make this skb/no-skb
context transparent to bpf_flow.c (i.e. no changes to users' flow
programs). Adding another flow dissector for the eth_get_headlen case
also seems like a no-go.