Date:   Fri, 31 Aug 2018 09:55:21 +0200
From:   Jakub Kicinski <jakub.kicinski@...ronome.com>
To:     Björn Töpel <bjorn.topel@...el.com>
Cc:     Björn Töpel <bjorn.topel@...il.com>,
        magnus.karlsson@...el.com, magnus.karlsson@...il.com,
        alexander.h.duyck@...el.com, alexander.duyck@...il.com,
        ast@...nel.org, brouer@...hat.com, daniel@...earbox.net,
        netdev@...r.kernel.org, jesse.brandeburg@...el.com,
        anjali.singhai@...el.com, peter.waskiewicz.jr@...el.com,
        michael.lundkvist@...csson.com, willemdebruijn.kernel@...il.com,
        john.fastabend@...il.com, neerav.parikh@...el.com,
        mykyta.iziumtsev@...aro.org, francois.ozog@...aro.org,
        ilias.apalodimas@...aro.org, brian.brooks@...aro.org,
        u9012063@...il.com, pavel@...tnetmon.com, qi.z.zhang@...el.com
Subject: Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support

On Thu, 30 Aug 2018 14:06:24 +0200, Björn Töpel wrote:
> On 2018-08-29 21:14, Jakub Kicinski wrote:
>  > On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote:  
>  >> From: Björn Töpel <bjorn.topel@...el.com>
>  >>
>  >> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
>  >> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
>  >> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
>  >> queue.
>  >>
>  >> All AF_XDP specific functions are added to a new file, i40e_xsk.c.
>  >>
>  >> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
>  >> will allocate a new buffer and copy the zero-copy frame prior passing
>  >> it to the kernel stack.
>  >>
>  >> Signed-off-by: Björn Töpel <bjorn.topel@...el.com>  
>  >
>  > Mm.. I'm surprised you don't run into buffer reuse issues that I had
>  > when playing with AF_XDP.  What happens in i40e if someone downs the
>  > interface?  Will UMEMs get destroyed?  Will the RX buffers get freed?
>  >  
> 
> The UMEM will linger in the driver until the sockets are dead.
> 
>  > I'll shortly send an RFC with my quick and dirty RX buffer reuse queue,
>  > FWIW.
>  >  
> 
> Some background for folks that don't know the details: A zero-copy
> capable driver picks buffers off the fill ring and places them on the
> hardware Rx ring to be completed at a later point when DMA is
> complete. The Tx side is analogous: the driver picks buffers off the
> Tx ring and places them on the Tx hardware ring.
> 
> In the typical flow, the Rx buffer will be placed onto an Rx ring
> (completed to the user), and the Tx buffer will be placed on the
> completion ring to notify the user that the transfer is done.
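For reference, that producer/consumer flow can be modeled with free-running ring indices. A toy sketch (all names invented here, not the i40e code):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the zero-copy Rx direction: user space puts buffer
 * addresses on the fill ring, the driver moves them to the HW ring,
 * and completed frames are placed on the Rx ring for the user. */
#define RING_SZ 4  /* power of two */

struct ring {
	uint64_t addr[RING_SZ];
	uint32_t prod, cons;	/* free-running indices */
};

static int ring_put(struct ring *r, uint64_t a)
{
	if (r->prod - r->cons == RING_SZ)
		return -1;			/* full */
	r->addr[r->prod++ & (RING_SZ - 1)] = a;
	return 0;
}

static int ring_get(struct ring *r, uint64_t *a)
{
	if (r->prod == r->cons)
		return -1;			/* empty */
	*a = r->addr[r->cons++ & (RING_SZ - 1)];
	return 0;
}

/* Driver refill step: move one buffer from the fill ring to the HW
 * ring; once DMA completes it would be put on the Rx ring the same
 * way. */
static int rx_refill(struct ring *fill, struct ring *hw)
{
	uint64_t a;

	if (ring_get(fill, &a))
		return -1;
	return ring_put(hw, a);
}
```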
> 
> However, if the driver needs to tear down the hardware rings for some
> reason (interface goes down, reconfiguration and such), what should be
> done with the Rx and Tx buffers that have been given to the driver?
> 
> So, to frame the problem: What should a driver do when this happens,
> so that buffers aren't leaked?
> 
> Note that when the UMEM is going down, there's no need to complete
> anything, since the sockets are dying/dead already.
> 
> This is, as you state, a missing piece in the implementation and needs
> to be fixed.
> 
> Now on to possible solutions:
> 
> 1. Complete the buffers back to the user. For Tx, this is probably the
>     best way -- just place the buffers onto the completion ring.
> 
>     For Rx, we can give buffers back to user space by setting the
>     length in the Rx descriptor to zero and putting them on the Rx
>     ring. However, one complication here is that we do not have any
>     back-pressure mechanism for the Rx side like we have on Tx. If the
>     Rx ring(s) are full, the kernel will have to leak the buffers or
>     implement a retry mechanism (ugly and should be avoided).
> 
>     Another option to solve this without needing any retry or leaking
>     for Rx is to implement the same back-pressure mechanism that we
>     have on the Tx path in the Rx path. In the Tx path, the driver will
>     only get a Tx packet to send if there is space for it in the
>     completion ring. On Rx, this would be that the driver would only
>     get a buffer from the fill ring if there is space for it to put it
>     on the Rx ring. The drawback of this is that it would likely impact
>     performance negatively since the Rx ring would have to be touched
>     more time (in the Tx path, it increased performance since it made
>     it possible to implement the Tx path without any buffering), but it
>     would guarantee that all buffers can always be returned to user
>     space, making this solution a viable option.
> 
> 2. Store the buffers internally in the driver, and make sure that they
>     are inserted into the "normal flow" again. For Rx that would be
>     putting the buffers back into the allocation scheme that the driver
>     is using. For Tx, placing the buffers back onto the Tx HW ring
>     (plus all the logic for making sure that all corner cases work).
> 
> 3. Mark the socket(s) as in error state, and require the user to redo
>     the setup. This is a bit harsh...
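The back-pressure variant of option 1 boils down to one extra admission check in the refill path: only consume a fill ring entry when the Rx ring is guaranteed to have room for the eventual completion. A toy sketch, with invented names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of Rx-side back-pressure: mirror the Tx rule "only accept a
 * packet if the completion ring has room" onto Rx, so every buffer
 * handed to hardware is guaranteed an Rx ring slot and can always be
 * returned to user space on teardown. */
struct toy_ring {
	uint32_t prod, cons, size;	/* size is a power of two */
};

static bool ring_full(const struct toy_ring *r)
{
	return r->prod - r->cons == r->size;
}

/* Only take a buffer off the fill ring if the Rx ring can absorb all
 * buffers already outstanding on the HW ring, plus this one. */
static bool rx_can_refill(const struct toy_ring *rx, uint32_t outstanding)
{
	return rx->prod - rx->cons + outstanding < rx->size;
}
```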

That's a good summary, one more (bad) option:

4. Disallow any operations on the device which could lead to RX buffer
   discards if any UMEMs are attached.

> For i40e I think #2 for Rx (buffers reside in kernel, return to
> allocator) and #1 for Tx (complete to userland).
> 
> Your RFC is plumbing to implement #2 for Rx in a driver. I'm not a fan
> of extending the umem with the "reuse queue". This decision is really
> up to the driver. Some drivers might prefer another scheme, or simply
> prefer storing the buffers in another manner.

The only performance cost is the extra pointer in xdp_umem.  Drivers
have to opt-in by using the *_rq() helpers.  We can move the extra
pointer to driver structs, and have them pass it to the helpers, so that
would be zero extra cost, and we can reuse the implementation.
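As a sketch of what hanging the reuse queue off a driver struct could look like (hypothetical names, not the actual RFC helpers):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a per-driver Rx buffer reuse queue: on ring teardown,
 * buffers already given to hardware are parked here, and the
 * allocation path drains this queue before touching the fill ring.
 * Names are invented for illustration. */
struct buf_rq {
	uint64_t *addrs;
	uint32_t len, cap;
};

static int rq_push(struct buf_rq *q, uint64_t addr)
{
	if (q->len == q->cap)
		return -1;		/* caller sizes cap to the HW ring */
	q->addrs[q->len++] = addr;
	return 0;
}

/* Allocation path: prefer recycled buffers, fall back to the fill
 * ring when the queue is empty. */
static int rq_pop(struct buf_rq *q, uint64_t *addr)
{
	if (!q->len)
		return -1;		/* empty: consult the fill ring */
	*addr = q->addrs[--q->len];
	return 0;
}
```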

> Looking forward, as both you and Jesper have alluded to, we need a
> proper allocator for zero-copy. Then it would be a matter of injecting
> the Rx buffers back to the allocator.
> 
> I'll send out a patch, so that we don't leak the buffers, but I'd like
> to hear your thoughts on what the behavior should be.
> 
> And what should the behavior be when the netdev is removed from the
> kernel?

What would AF_PACKET do, return ENXIO?

> And thanks for looking into the code!
> 
> Björn
