Date:   Thu, 19 Dec 2019 11:44:38 +0100
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     "Jubran, Samih" <sameehj@...zon.com>
Cc:     Luigi Rizzo <rizzo@....unipi.it>,
        "Machulsky, Zorik" <zorik@...zon.com>,
        Daniel Borkmann <borkmann@...earbox.net>,
        David Miller <davem@...emloft.net>,
        "Tzalik, Guy" <gtzalik@...zon.com>,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Toke Høiland-Jørgensen 
        <toke@...hat.com>, "Kiyanovski, Arthur" <akiyano@...zon.com>,
        Alexei Starovoitov <ast@...nel.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        David Ahern <dsahern@...il.com>, brouer@...hat.com
Subject: Re: XDP multi-buffer design discussion

On Wed, 18 Dec 2019 16:03:54 +0000
"Jubran, Samih" <sameehj@...zon.com> wrote:

> > -----Original Message-----
> > From: Luigi Rizzo <rizzo@....unipi.it>
> > Sent: Wednesday, December 18, 2019 12:30 AM
> > To: Jesper Dangaard Brouer <brouer@...hat.com>
> > Cc: Jubran, Samih <sameehj@...zon.com>; Machulsky, Zorik
> > <zorik@...zon.com>; Daniel Borkmann <borkmann@...earbox.net>; David
> > Miller <davem@...emloft.net>; Tzalik, Guy <gtzalik@...zon.com>; Ilias
> > Apalodimas <ilias.apalodimas@...aro.org>; Toke Høiland-Jørgensen
> > <toke@...hat.com>; Kiyanovski, Arthur <akiyano@...zon.com>; Alexei
> > Starovoitov <ast@...nel.org>; netdev@...r.kernel.org; David Ahern
> > <dsahern@...il.com>
> > Subject: Re: XDP multi-buffer design discussion
> > 
> > On Tue, Dec 17, 2019 at 12:46 AM Jesper Dangaard Brouer
> > <brouer@...hat.com> wrote:  
> > >
> > > On Mon, 16 Dec 2019 20:15:12 -0800
> > > Luigi Rizzo <rizzo@....unipi.it> wrote:
> > >...  
> > > > For some use cases, the bpf program could deduce the total
> > > > length by looking at the L3 header.  
> > >
> > > Yes, that is actually a good insight.  I guess the BPF-program
> > > could also use this to detect that it doesn't have access to the
> > > full linear packet this way(?)
> > >  
> > > > It won't work for XDP_TX response though.  
> > >
> > > The XDP_TX case also needs to be discussed/handled. IMHO we need
> > > to support XDP_TX for multi-buffer frames.  XDP_TX *can* be driver
> > > specific, but most drivers choose to convert xdp_buff to
> > > xdp_frame, which makes it possible to use/share part of the
> > > XDP_REDIRECT code from ndo_xdp_xmit.  
> > >
> > > We also need to handle XDP_REDIRECT, which becomes challenging, as the
> > > ndo_xdp_xmit functions of *all* drivers need to be updated (or
> > > introduce a flag to handle this incrementally).  
> > 
> > Here is a possible course of action (please let me know if you find
> > loose ends)
> > 
> > 1. extend struct xdp_buff with a total length and sk_buff * (NULL by default);
> >

I don't like extending xdp_buff with an skb pointer. (Remember xdp_buff
is only "allocated" on the stack).

Perhaps extend xdp_buff with a total-length, *BUT* this can also be
solved in other ways.  Extend xdp_buff with a flag indicating that this
xdp_buff has multiple buffers (segments).  Then we know that the area
where we store the segments is valid, and we can also store the
total-length there.  (We need this total-length later, after the
xdp_buff is freed.)

To make life easier for BPF-developers, we could export/create a
BPF-helper bpf_xdp_total_len(ctx) that hides whether segments are there
or not.
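
A rough sketch of the idea (the flag bit, the segment-area layout and
the helper are all hypothetical at this point):

    /* Hypothetical: a new flags field in xdp_buff, bit 0 marking a
     * multi-buffer frame. */
    #define XDP_BUFF_FLAG_MULTI_BUFF    BIT(0)

    /* Hypothetical segment area, stored in the packet memory itself,
     * so it stays valid after the stack-allocated xdp_buff is gone. */
    struct xdp_segment_area {
            u32 total_len;
            /* ... per-segment address/length entries ... */
    };

    /* Hypothetical BPF-helper: hides whether segments exist or not */
    static u32 bpf_xdp_total_len(struct xdp_buff *xdp)
    {
            struct xdp_segment_area *sa;

            if (!(xdp->flags & XDP_BUFF_FLAG_MULTI_BUFF))
                    return xdp->data_end - xdp->data;

            sa = xdp->data_hard_start;      /* placement is TBD */
            return sa->total_len;
    }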


> > 2. add a netdev callback to construct the skb for the current
> >    packet. This code obviously already exists in all drivers; it
> >    just needs to be exposed as a function callable by a bpf helper
> >    (next bullet);
> >

Today we already have functions that create an SKB from an xdp_frame.

First the xdp_buff is converted to an xdp_frame, whose memory area
lives at the top (32 bytes) of the packet-page.
(See convert_to_xdp_frame() in include/net/xdp.h.)

Today two functions create an SKB from this xdp_frame, and they should
likely be consolidated: (1) cpu_map_build_skb and (2)
veth_xdp_rcv_one (which calls into veth_build_skb).  (Hint: both use a
variant of build_skb.)
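
The common core of both is the same build_skb() pattern; a consolidated
version could look roughly like this (a sketch modelled on
cpu_map_build_skb, not a drop-in patch; the function name is made up):

    static struct sk_buff *xdp_frame_build_skb(struct xdp_frame *xdpf)
    {
            unsigned int headroom, frame_size;
            void *hard_start;
            struct sk_buff *skb;

            /* Headroom covers the xdp_frame struct itself plus the
             * driver-reserved headroom in front of the packet data. */
            headroom = sizeof(*xdpf) + xdpf->headroom;

            /* build_skb() needs room for skb_shared_info at the end */
            frame_size = SKB_DATA_ALIGN(xdpf->len + headroom) +
                         SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

            hard_start = xdpf->data - headroom;
            skb = build_skb(hard_start, frame_size);
            if (unlikely(!skb))
                    return NULL;

            skb_reserve(skb, headroom);
            __skb_put(skb, xdpf->len);
            if (xdpf->metasize)
                    skb_metadata_set(skb, xdpf->metasize);

            return skb;
    }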

The challenge for you, Samih, is the placement of skb_shared_info
within the packet-page memory area.  These two xdp_frame->SKB
functions place skb_shared_info after the xdp_frame->len packet length
(aligned).  IMHO the placement of skb_shared_info should NOT vary this
much.  Instead it should be at xdp->data_hard_end - 320 bytes (the size
of skb_shared_info).  The first challenge is fixing this...
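
In code, the fixed placement would be computed from the end of the
packet-page instead of from the packet length, something like this
(a sketch assuming one page per packet; the helper name is made up):

    /* Sketch: fixed skb_shared_info placement at the end of the
     * packet-page, independent of the current packet length. */
    static struct skb_shared_info *xdp_get_shared_info(struct xdp_buff *xdp)
    {
            void *data_hard_end = xdp->data_hard_start + PAGE_SIZE;

            return data_hard_end - SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
    }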


> > 3. add a new helper 'bpf_create_skb' that, when invoked, calls the
> >    previously mentioned netdev callback to construct an skb for
> >    the current packet, and sets the pointer in the xdp_buff if not
> >    there already. A bpf program that needs to access segments beyond
> >    the first one can call bpf_create_skb() if needed, and then use
> >    existing helpers (skb_load_bytes, skb_store_bytes, etc.) to
> >    access the skb.
> >

I respectfully disagree with this approach.

One reason is that we want to support creating SKBs on a remote CPU,
like we do today with CPUMAP redirect.  The proposed helper
'bpf_create_skb' implies that this happens in BPF context, during the
NAPI loop.
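
For reference, the CPUMAP flow today looks roughly like this
(simplified from kernel/bpf/cpumap.c; the real code bulk-allocates
SKBs and the exact signatures differ):

    /* RX CPU, still inside NAPI/BPF context: only the raw xdp_frame
     * is enqueued, no SKB work happens here. */
    static void rx_cpu_side(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp)
    {
            struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);

            if (xdpf)
                    bq_enqueue(rcpu, xdpf);
    }

    /* Remote CPU, inside the cpu_map_kthread_run() loop: this is
     * where the SKB is actually built and passed up the stack. */
    static void remote_cpu_side(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
    {
            struct sk_buff *skb = cpu_map_build_skb(rcpu, xdpf);

            if (skb)
                    netif_receive_skb_core(skb);
    }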


> >   My rationale is that if we need to access multiple segments, we
> >   are already in expensive territory and it makes little sense to
> >   define a multi-segment format that would essentially be an skb.
> >

I also disagree.  Multi-segment frames also have some high-speed
use-cases.  Especially HW header-split at the IP+TCP boundary is
interesting.


> > 
> > 4. implement a mechanism to let the driver know whether the
> >    currently loaded bpf program understands the new format.  There
> >    are multiple ways to do that; a trivial one would be to check,
> >    during load, that the program calls some known helper, e.g.
> >    bpf_understands_fragments(), which is then jit-ed to something
> >    inexpensive
> > 
> >   Note that today, a netdev that cannot guarantee single-segment
> > packets would not be able to enable xdp. Hence, without loss of
> > functionality, such a netdev can refuse to load a program without
> > bpf_understands_fragments().
> > 
> >
> > With all the above, the generic xdp handler would do the following:
> >   if (!skb_is_linear() && !bpf_understands_fragments()) {
> >       <linearize skb>;
> >   }
> >   <construct xdp_buff with first segment and skb>  // skb is unused by old-style programs
> >   <call bpf program>
> > 
> > The native driver for a device that cannot guarantee single-segment
> > packets would just refuse to load a program that does not
> > understand fragments (same as today), so the code would be:
> >
> > <construct xdp_buff with first segment and empty skb>
> > <call bpf program>
> >
> > On return, we might find that an skb has been built by the xdp
> > program, and can be immediately used for XDP_PASS (or dropped in
> > case of XDP_DROP)

I also disagree here.  The SKB should first be created AFTER we know
whether this will be an XDP_PASS action.
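
I.e. the RX path should look roughly like this (sketch;
xdp_build_skb() is a made-up helper name):

    static void rx_handle_verdict(struct napi_struct *napi,
                                  struct bpf_prog *prog, struct xdp_buff *xdp)
    {
            u32 act = bpf_prog_run_xdp(prog, xdp);

            switch (act) {
            case XDP_PASS: {
                    /* Only now do we pay the SKB-allocation cost */
                    struct sk_buff *skb = xdp_build_skb(xdp); /* hypothetical */

                    if (skb)
                            napi_gro_receive(napi, skb);
                    break;
            }
            case XDP_DROP:
                    /* recycle the packet-page; no SKB was ever built */
                    break;
            default:
                    break;
            }
    }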


> > For XDP_TX and XDP_REDIRECT, something similar:
> > if the packet is a single segment and there is no skb, use the
> > existing accelerated path. If there are multiple segments,
> > construct the skb if it does not already exist, and pass it to the
> > standard tx path.
> > 
> > cheers
> > luigi  
> 
> I have gone over your suggestion, and it looks good to me! I couldn't
> find any loose ends. One thing to note is that the driver now needs
> to save the context of the currently processed packet for each
> queue so that it can support the netdev callback for creating the skb.
>
> This sounds a bit messy, but I think it should work.
> 
> I'd love to hear more thoughts on this,
> Jesper, Toke what do you guys think?

As you can see from my comments I (respectfully) disagree with this
approach.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
