Date:   Thu, 30 Aug 2018 14:06:24 +0200
From:   Björn Töpel <bjorn.topel@...el.com>
To:     Jakub Kicinski <jakub.kicinski@...ronome.com>,
        Björn Töpel <bjorn.topel@...il.com>
Cc:     magnus.karlsson@...el.com, magnus.karlsson@...il.com,
        alexander.h.duyck@...el.com, alexander.duyck@...il.com,
        ast@...nel.org, brouer@...hat.com, daniel@...earbox.net,
        netdev@...r.kernel.org, jesse.brandeburg@...el.com,
        anjali.singhai@...el.com, peter.waskiewicz.jr@...el.com,
        michael.lundkvist@...csson.com, willemdebruijn.kernel@...il.com,
        john.fastabend@...il.com, neerav.parikh@...el.com,
        mykyta.iziumtsev@...aro.org, francois.ozog@...aro.org,
        ilias.apalodimas@...aro.org, brian.brooks@...aro.org,
        u9012063@...il.com, pavel@...tnetmon.com, qi.z.zhang@...el.com
Subject: Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support

On 2018-08-29 21:14, Jakub Kicinski wrote:
 > On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote:
 >> From: Björn Töpel <bjorn.topel@...el.com>
 >>
 >> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
 >> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
 >> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
 >> queue.
 >>
 >> All AF_XDP specific functions are added to a new file, i40e_xsk.c.
 >>
 >> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
 >> will allocate a new buffer and copy the zero-copy frame prior to
 >> passing it to the kernel stack.
 >>
 >> Signed-off-by: Björn Töpel <bjorn.topel@...el.com>
 >
 > Mm.. I'm surprised you don't run into buffer reuse issues that I had
 > when playing with AF_XDP.  What happens in i40e if someone downs the
 > interface?  Will UMEMs get destroyed?  Will the RX buffers get freed?
 >

The UMEM will linger in the driver until the sockets are dead.

 > I'll shortly send an RFC with my quick and dirty RX buffer reuse queue,
 > FWIW.
 >

Some background for folks that don't know the details: A zero-copy
capable driver picks buffers off the fill ring and places them on the
hardware Rx ring, to be completed at a later point when DMA is
complete. The Tx side is analogous: the driver picks buffers off the
Tx ring and places them on the hardware Tx ring.

In the typical flow, the Rx buffer will be placed onto an Rx ring
(completed to the user), and the Tx buffer will be placed on the
completion ring to notify the user that the transfer is done.
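
To make the flow concrete, here is a rough sketch in C. All of the
helper and struct names (zc_queue, fill_ring_dequeue() and friends)
are made up for illustration; they are not actual i40e or AF_XDP
symbols:

/* Hypothetical sketch of the steady-state zero-copy buffer flow. */

/* Rx: fill ring -> HW Rx ring -> Rx ring */
static void rx_refill(struct zc_queue *q)
{
        u64 addr;

        /* Pick a buffer (a UMEM offset) off the fill ring... */
        while (fill_ring_dequeue(q->umem, &addr) == 0)
                /* ...and hand it to hardware for DMA. */
                hw_rx_ring_post(q, addr);
}

static void rx_complete(struct zc_queue *q, u64 addr, u32 len)
{
        /* DMA done: complete the frame to user space on the Rx ring. */
        rx_ring_complete(q->umem, addr, len);
}

/* Tx: Tx ring -> HW Tx ring -> completion ring */
static void tx_xmit(struct zc_queue *q)
{
        u64 addr;
        u32 len;

        while (tx_ring_dequeue(q->umem, &addr, &len) == 0)
                hw_tx_ring_post(q, addr, len);
}

static void tx_complete(struct zc_queue *q, u64 addr)
{
        /* Transfer done: notify user space via the completion ring. */
        completion_ring_enqueue(q->umem, addr);
}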

However, if the driver needs to tear down the hardware rings for some
reason (the interface goes down, reconfiguration and such), what should
be done with the Rx and Tx buffers that have been given to the driver?

So, to frame the problem: What should a driver do when this happens,
so that buffers aren't leaked?

Note that when the UMEM is going down, there's no need to complete
anything, since the sockets are dying/dead already.

This is, as you state, a missing piece in the implementation and needs
to be fixed.

Now on to possible solutions:

1. Complete the buffers back to the user. For Tx, this is probably the
    best way -- just place the buffers onto the completion ring.

    For Rx, we can give buffers back to user space by setting the
    length in the Rx descriptor to zero and putting them on the Rx
    ring. However, one complication here is that we do not have any
    back-pressure mechanism on the Rx side like we have on Tx. If the
    Rx ring(s) is (are) full, the kernel will have to leak the buffers
    or implement a retry mechanism (ugly and should be avoided).

    Another option, which avoids any retrying or leaking on Rx, is to
    implement the same back-pressure mechanism that we have on the Tx
    path in the Rx path. In the Tx path, the driver will only get a Tx
    packet to send if there is space for it in the completion ring. On
    Rx, this would mean that the driver only gets a buffer from the
    fill ring if there is space to put it on the Rx ring (a sketch
    follows the list). The drawback of this is that it would likely
    impact performance negatively, since the Rx ring would have to be
    touched one more time (in the Tx path, it increased performance,
    since it made it possible to implement the Tx path without any
    buffering), but it would guarantee that all buffers can always be
    returned to user space, making this solution a viable option.

2. Store the buffers internally in the driver, and make sure that they
    are inserted into the "normal flow" again. For Rx, that would mean
    putting the buffers back into the allocation scheme that the driver
    is using. For Tx, it would mean placing the buffers back onto the
    Tx HW ring (plus all the logic for making sure that all corner
    cases work). The teardown sketch further down illustrates the Rx
    part of this.

3. Mark the socket(s) as being in an error state, and require the user
    to redo the setup. This is a bit harsh...
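
As a sketch of the back-pressure variant from #1, again with made-up
helper names (rx_ring_reserve(), rx_ring_cancel(), etc., none of which
are real AF_XDP symbols):

/*
 * Only take a buffer off the fill ring if a slot on the Rx ring has
 * been reserved for it, mirroring what the Tx path does with the
 * completion ring.
 */
static int rx_refill_with_backpressure(struct zc_queue *q)
{
        u64 addr;

        /* Reserving the Rx ring slot up front is the extra touch of
         * the Rx ring that may cost performance. */
        if (rx_ring_reserve(q->umem, 1))
                return -ENOSPC;

        if (fill_ring_dequeue(q->umem, &addr)) {
                rx_ring_cancel(q->umem, 1);     /* give the slot back */
                return -ENOENT;
        }

        hw_rx_ring_post(q, addr);
        return 0;
}

With this in place, every buffer sitting on the hardware Rx ring has a
reserved Rx ring slot, so at teardown it can always be completed back
to user space with a zero-length descriptor.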

For i40e, I think #2 is the way to go for Rx (the buffers reside in
the kernel; return them to the allocator) and #1 for Tx (complete the
buffers to userland).
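
Roughly, with the same hypothetical helpers as above (zc_queue,
hw_tx_ring_drain(), zc_allocator_recycle() and friends are invented
for illustration, not i40e code):

static void zc_queue_teardown(struct zc_queue *q)
{
        u64 addr;

        /* Tx (#1): everything on the HW ring is "done" as far as user
         * space is concerned; place it on the completion ring. */
        while (hw_tx_ring_drain(q, &addr) == 0)
                completion_ring_enqueue(q->umem, addr);

        /* Rx (#2): no descriptor to complete, so recycle the buffers
         * into the driver-local allocator instead of leaking them. */
        while (hw_rx_ring_drain(q, &addr) == 0)
                zc_allocator_recycle(q, addr);
}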

Your RFC is the plumbing to implement #2 for Rx in a driver. I'm not a
fan of extending the umem with the "reuse queue", though. This decision
is really up to the driver. Some drivers might prefer another scheme,
or simply prefer storing the buffers in another manner.

Looking forward, as both you and Jesper have alluded to, we need a
proper allocator for zero-copy. Then it would be a matter of injecting
the Rx buffers back to the allocator.

I'll send out a patch, so that we don't leak the buffers, but I'd like
to hear your thoughts on what the behavior should be.

And what should the behavior be when the netdev is removed from the
kernel?

And thanks for looking into the code!


Björn
