Message-ID: <9258f0767307489cb54428a8c1d91a0d@EX13D11EUB003.ant.amazon.com>
Date:   Sun, 19 Jan 2020 07:34:00 +0000
From:   "Jubran, Samih" <sameehj@...zon.com>
To:     Luigi Rizzo <rizzo@....unipi.it>,
        Jesper Dangaard Brouer <brouer@...hat.com>
CC:     "Machulsky, Zorik" <zorik@...zon.com>,
        Daniel Borkmann <borkmann@...earbox.net>,
        David Miller <davem@...emloft.net>,
        "Tzalik, Guy" <gtzalik@...zon.com>,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Toke Høiland-Jørgensen <toke@...hat.com>,
        "Kiyanovski, Arthur" <akiyano@...zon.com>,
        Alexei Starovoitov <ast@...nel.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        David Ahern <dsahern@...il.com>
Subject: RE: XDP multi-buffer design discussion

> -----Original Message-----
> From: Luigi Rizzo <rizzo@....unipi.it>
> Sent: Thursday, December 19, 2019 7:29 PM
> To: Jesper Dangaard Brouer <brouer@...hat.com>
> Cc: Jubran, Samih <sameehj@...zon.com>; Machulsky, Zorik
> <zorik@...zon.com>; Daniel Borkmann <borkmann@...earbox.net>; David
> Miller <davem@...emloft.net>; Tzalik, Guy <gtzalik@...zon.com>; Ilias
> Apalodimas <ilias.apalodimas@...aro.org>; Toke Høiland-Jørgensen
> <toke@...hat.com>; Kiyanovski, Arthur <akiyano@...zon.com>; Alexei
> Starovoitov <ast@...nel.org>; netdev@...r.kernel.org; David Ahern
> <dsahern@...il.com>
> Subject: Re: XDP multi-buffer design discussion
> 
> On Thu, Dec 19, 2019 at 2:44 AM Jesper Dangaard Brouer
> <brouer@...hat.com> wrote:
> >
> > On Wed, 18 Dec 2019 16:03:54 +0000
> > "Jubran, Samih" <sameehj@...zon.com> wrote:
> >
> > > > -----Original Message-----
> > > > From: Luigi Rizzo <rizzo@....unipi.it> ...
> > > > Here is a possible course of action (please let me know if you
> > > > find loose ends)
> > > >
> > > > 1. extend struct xdp_buff with a total length and sk_buff * (NULL
> > > > by default);
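
For illustration, a rough sketch of what such an extension could look like
(the added field names are hypothetical, existing fields abbreviated):

struct xdp_buff {
        void *data;
        void *data_end;
        void *data_meta;
        void *data_hard_start;
        struct xdp_rxq_info *rxq;
        /* proposed additions, names hypothetical */
        u32 total_len;          /* length summed over all segments  */
        struct sk_buff *skb;    /* NULL by default, built on demand */
};
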
> 
> Top comment: thank you for the feedback, I see you disagree with my
> proposal :) so let me ask some questions to understand its shortcomings,
> identify use cases, and perhaps address subproblems individually.
> 
> > I don't like extending xdp_buff with an skb pointer. (Remember
> > xdp_buff is only "allocated" on the stack).
> 
> Nit, what are your specific concerns in adding a field or two (pointer +
> total length) to the xdp_buff (and to the struct visible to the bpf program)?
> 
> > Perhaps extend xdp_buff with a total-length, *BUT* this can also be solved
> > in other ways.  Extend xdp_buff with a flag indicating that this xdp_buff
> > has multiple buffers (segments).  Then we know that the area where we store
> > the segments is valid, and we can also store the total-length there.
> > (We need this total-length later, after the xdp_buff is freed.)
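
For comparison, a rough sketch of the flag-based alternative (all names
hypothetical): the driver sets a flag on the xdp_buff, and the segment list
plus total length live in a well-known area, e.g. at the tail of the first
buffer, so they survive conversion to an xdp_frame:

#define XDP_FLAGS_MULTI_BUFF    BIT(0) /* set by the driver on multi-segment frames */

struct xdp_multi_buff_area {
        u32        total_len;                   /* sum over all segments    */
        u16        nr_frags;                    /* segments after the first */
        skb_frag_t frags[MAX_SKB_FRAGS];        /* page + offset + length   */
};
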
> 
> No matter how we extend the structure we must address problem #1
>  #1: old bpf programs may not be aware of the new format
>        hence may process packets incorrectly.
> I don't think there is a way out of this other than have a mechanism for the
> bpf programs to indicate which ABI(s) they understand.
> Since there are several ways to address it (via BTF, etc.) at load time and with
> minimal code changes in the kernel (and none in the loader programs), I
> think we can discuss this separately if needed, and proceed with the
> assumption that using a slightly different xdp_buff with additional info (or
> bpf helpers to retrieve them) is a solved problem.
> 
> ...
> > > > 2. add a netdev callback to construct the skb for the current
> > > >    packet. This code obviously already exists in all drivers; it just
> > > >    needs to be exposed as a function callable by a bpf helper (next bullet);
> > > >
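
A possible shape for such a callback, as a hypothetical net_device_ops member
(the name is invented for illustration, nothing like it exists today):

/* build an skb for the packet the driver is currently processing in its
 * NAPI/XDP loop; the driver already has this logic in its normal RX path
 */
struct sk_buff *(*ndo_xdp_build_skb)(struct net_device *dev,
                                     struct xdp_buff *xdp);
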
> >
> > Today we already have functions that create an SKB from an xdp_frame.
> > ...
> 
> I probably should have been more clear on this point.
> 
> I introduced the skb to address problem #2:
>   #2:  how do we access a multi-segment packet from within the bpf program
>       (in other words, how do we represent a multi-segment packet)
> 
> and suggested the helper to address problem #3
>   #3: how do we avoid constructing such a representation if it is not needed?
> 
> I don't care whether this format is an skb (more details below), but the
> xdp_frame that you mention seems to have only information on the first
> segment so I am not sure how it can help with multi-segment frames.
> 
> To elaborate: for #2 we should definitely find some optimized solution for
> the common cases. The current xdp_buff is optimized for one segment (alas,
> it only supports that), and so drivers disable header split when doing xdp to
> comply.
> Header split with hdr+body is possibly very common too (especially if we
> don't artificially disable it), so we could/should redefine the
> xdp_{buff|frame} with static room for two segments (header + rest). This
> can be populated unconditionally at relatively low cost, for both the caller
> and the bpf program.
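
For illustration, one possible shape of such a two-segment layout (names and
layout purely hypothetical):

struct xdp_seg {
        void *data;
        u32   len;
};

/* fixed room for the common header-split case: seg[0] = headers,
 * seg[1] = rest of the packet; no per-segment allocation needed
 */
struct xdp_buff_2seg {
        struct xdp_seg seg[2];
        u32            total_len;
        /* ... remaining xdp_buff fields unchanged ... */
};
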
> 
> For larger number of segments, though, there is going to be an O(N)
> space/time cost, with potentially large N (say when a NIC does RSC before
> calling the driver, resulting in many segments), and non-negligible constants
> (each segment may require a dma_sync, and perhaps a decision to recycle or
> replenish the buffer).
> These are all things that we don't want to do upfront because it could be
> wasted work, hence my suggestion for a netdev method and matching bpf
> helper to create whatever multi-segment representation only when the bpf
> program asks for it.
> 
> Again, this would be problem #3, which relates to the next bullet:
> 
> > > > 3. add a new helper 'bpf_create_skb' that when invoked calls the
> > > >    previously mentioned netdev callback to construct an skb for
> > > >    the current packet, and sets the pointer in the xdp_buff, if not
> > > >    there already. A bpf program that needs to access segments beyond
> > > >    the first one can call bpf_create_skb() if needed, and then use
> > > >    existing helpers (skb_load_bytes, skb_store_bytes, etc.) to access
> > > >    the skb.
> > > >
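
For illustration, how a program might use such a helper; bpf_create_skb()
does not exist and the skb helpers are not callable from XDP context today,
so this only shows the intended call flow:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

/* hypothetical helper declaration, using an unused helper id for illustration */
static long (*bpf_create_skb)(struct xdp_md *ctx) = (void *) 999;

SEC("xdp")
int xdp_filter(struct xdp_md *ctx)
{
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* decisions on the first segment need no extra work */
        if (data + sizeof(struct ethhdr) > data_end)
                return XDP_DROP;

        /* only when bytes beyond the first segment are needed, build the
         * full representation and fall back to the skb-based helpers
         */
        if (bpf_create_skb(ctx) < 0)            /* proposed helper */
                return XDP_ABORTED;

        /* ... bpf_skb_load_bytes() / bpf_skb_store_bytes() ... */
        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
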
> >
> > I respectfully disagree with this approach.
> >
> > One reason is that we want to support creating SKBs on a remote CPU.
> > Like we do today with CPUMAP redirect.  The proposed helper
> 'bpf_create_skb'
> > implies that this happens in BPF-context, during the NAPI loop.
> 
> restricting the discussion to the non-optimized case (>2 segments):
> there is an O(N) phase to grab the segments that has to happen in the napi
> loop. Likewise, storage for the list/array of segments must be procured in the
> napi loop (it could be in the page containing the packet, but we have no
> guarantee that there is space), and then copied on the remote CPU into a
> suitable format for xmit (which again does not have to be an skb, but see
> more details in response to the last item).
> 
> > > >   My rationale is that if we need to access multiple segments, we
> > > >   are already in expensive territory and it makes little sense to
> > > >   define a multi segment format that would essentially be an skb.
> > > >
> >
> > I also disagree. Multi-segment frames also have some high-speed
> > use-cases.  Especially HW header-split at the IP+TCP boundary is
> > interesting.
> 
> Acknowledged and agreed. I have specifically added the 2-segment case in
> the discussion above.
> 
> > > > The native driver for a device that cannot guarantee a single
> > > > segment would just refuse to load a program that does not
> > > > understand them (same as today), so the code would be:
> > > >
> > > > <construct xdp_buff with first segment and empty skb>
> > > > <call bpf program>
> > > >
> > > > On return, we might find that an skb has been built by the xdp
> > > > program, and can be immediately used for XDP_PASS (or dropped in
> > > > case of XDP_DROP)
> >
> > I also disagree here.  SKB should first be created AFTER we know if
> > this will be an XDP_PASS action.
> 
> Note that I said "we might" - the decision to build the "full_descriptor"
> (skb or other format) is made by the bpf program itself. As a consequence, it
> won't build a full_descriptor before making a decision, UNLESS it needs to
> see the whole packet to make a decision, in which case there is no better
> solution anyways.
> 
> The only thing where we may look for optimizations is what format this
> full_descriptor should have. Depending on the outcome:
>   XDP_DROP:     don't care, it is going to be dropped
>   XDP_PASS:     an skb seems convenient since the network stack expects that
>   XDP_TX:       if the buffers are mapped dma_bidir, just an array of
>                 dma_addr + len would work (no dma_map or page refcounts needed)
>   XDP_REDIRECT: an array of page fragments (since we need to dma-map them
>                 for the destination and play with refcounts)
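
For the XDP_TX case above, the per-segment descriptor could be as small as
the following (hypothetical type, not an existing kernel struct):

struct xdp_tx_seg {
        dma_addr_t addr;        /* buffer already mapped DMA_BIDIRECTIONAL */
        u16        len;
};
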
> 
> cheers
> Luigi

Hi all,

We have two proposed design solutions.  One (Jesper's) seems easier to implement and gives the average XDP developer a framework for dealing with multiple buffers.  The other (Luigi's) seems more complete but raises a few questions:
1.	The netdev callback might be too intrusive for the net drivers and would require the driver to somehow save the context of the currently processed packet.
2.	The solution might be overkill for the average XDP developer.  Does the average XDP developer really need full access to the packet?

Since Jesper's design is easier to implement and leaves room for a future extension toward Luigi's design, I'm going to implement it and share it with you.
