Message-ID: <CAPvmOuHFXKOqfY5z4UUh7qEMygp7NMBsMG_AW4BQu1vG+qJBjg@mail.gmail.com>
Date:   Tue, 22 May 2018 09:45:53 +0200
From:   Mykyta Iziumtsev <mykyta.iziumtsev@...aro.org>
To:     Björn Töpel <bjorn.topel@...il.com>
Cc:     Netdev <netdev@...r.kernel.org>,
        Björn Töpel <bjorn.topel@...el.com>,
        "Karlsson, Magnus" <magnus.karlsson@...el.com>,
        "Zhang, Qi Z" <qi.z.zhang@...el.com>,
        Francois Ozog <francois.ozog@...aro.org>,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Brian Brooks <brian.brooks@...aro.org>,
        Jesper Dangaard Brouer <brouer@...hat.com>,
        andy@...yhouse.net, michael.chan@...adcom.com, luke@...bb.co
Subject: Re: [RFC feedback] AF_XDP and non-Intel hardware

On 21 May 2018 at 20:55, Björn Töpel <bjorn.topel@...il.com> wrote:
> 2018-05-21 14:34 GMT+02:00 Mykyta Iziumtsev <mykyta.iziumtsev@...aro.org>:
>> Hi Björn and Magnus,
>>
>> (This thread is a follow-up to a private dialogue. The intention is to
>> let the community know that AF_XDP can be enhanced further to make it
>> compatible with a wider range of NIC vendors.)
>>
>
> Mykyta, thanks for doing the write-up and sending it to the netdev
> list! The timing could not be better -- we need to settle on an uapi
> that works for all vendors prior to enabling it in the kernel.
>
>> There are two NIC variations which don't fit well with the current AF_XDP proposal.
>>
>> The first variation is represented by some NXP and Cavium NICs. AF_XDP
>> expects the NIC to put incoming frames into slots defined by the UMEM
>> area. Here the slot size is set at XDP_UMEM_REG time via
>> xdp_umem_reg.frame_size, and the slots available to the NIC are
>> communicated to the kernel via the UMEM fill queue. While Intel NICs
>> support only one slot size, NXP and Cavium support multiple slot sizes
>> to optimize memory usage (e.g. { 128, 512, 2048 }, please refer to [1]
>> for the rationale). On frame reception the NIC pulls a slot from the
>> best-fit pool based on the frame size.
>>
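A minimal sketch of the best-fit selection described above, with
illustrative pool sizes and made-up names (nothing here is taken from
any real driver):

#include <stddef.h>
#include <stdint.h>

static const uint32_t pool_sizes[] = { 128, 512, 2048 };
#define NUM_POOLS (sizeof(pool_sizes) / sizeof(pool_sizes[0]))

/* Return the index of the smallest pool whose slots fit 'frame_len',
 * or -1 if the frame is larger than the biggest slot. */
static int best_fit_pool(uint32_t frame_len)
{
        size_t i;

        for (i = 0; i < NUM_POOLS; i++)
                if (frame_len <= pool_sizes[i])
                        return (int)i;
        return -1;
}

With these sizes a 60-byte frame would land in the 128-byte pool and a
1500-byte frame in the 2048-byte pool.
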
>> The second variation is represented by e.g. Chelsio T5/T6 and Netcope
>> NICs. As shown above, AF_XDP expects the NIC to put incoming frames at
>> predefined addresses. This is not the case here, as the NIC is in
>> charge of placing frames in memory based on its own algorithm. For
>> example, Chelsio T5/T6 expects to get whole pages from the driver
>> and puts incoming frames on the page in a 'tape recorder' fashion.
>> Assuming 'base' is the page address and frame_len[N] is an array of
>> frame lengths, the frame placement in memory will look like this:
>>   base + 0
>>   base + frame_len[0]
>>   base + frame_len[0] + frame_len[1]
>>   base + frame_len[0] + frame_len[1] + frame_len[2]
>>   ...
>>
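As a toy illustration of that placement, ignoring any alignment the HW
might apply (the values and names are made up, only 'base' and
'frame_len' come from the example above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t base = 0;                        /* page address (example)  */
        uint16_t frame_len[] = { 64, 1514, 256 }; /* received frame lengths  */
        uint64_t off = 0;
        int i;

        for (i = 0; i < 3; i++) {
                printf("frame %d placed at base + %llu\n", i,
                       (unsigned long long)off);
                off += frame_len[i];              /* next frame follows right after */
        }
        return 0;
}
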
>> To better support these two NIC variations I suggest abandoning the
>> 'frame size' structuring of UMEM and sticking with 'pages' instead. The
>> NIC kernel driver is then responsible for splitting the provided pages
>> into the slots expected by the underlying HW (or not splitting at all,
>> in the case of Chelsio/Netcope).
>>
>
> Let's call the first variation "multi-pool" and the second
> "type-writer" for simplicity. The type-writer model is very
> interesting, and makes a lot of sense when the PCIe bus is a
> constraint.
>
>> On XDP_UMEM_REG the application needs to specify the page_size. Then the
>> application can pass empty pages to the kernel driver using the UMEM
>> 'fill' queue by specifying the page offset within the UMEM area. The
>> xdp_desc format needs to be changed as well: the frame location will be
>> defined by an offset from the beginning of the UMEM area instead of a
>> frame index. As the payload headroom can vary with AF_XDP, we'll need to
>> specify it in xdp_desc as well. Besides that, it could be worth
>> considering changing the payload length to u16, as 64k+ frames aren't
>> very common in networking. The resulting xdp_desc would look like this:
>>
>> struct xdp_desc {
>>         __u64 offset;
>>         __u16 headroom;
>>         __u16 len;
>>         __u8 flags;
>>         __u8 padding[3];
>> };
>>
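One way userspace might resolve the payload pointer from such a
descriptor, assuming (and this is only an assumption for illustration)
that the frame data starts at offset + headroom within the mmap()ed
UMEM area:

#include <stdint.h>

/* 'umem_base' is the mmap()ed start of the UMEM area; 'offset' and
 * 'headroom' come straight from the xdp_desc above. */
static void *desc_payload(void *umem_base, uint64_t offset, uint16_t headroom)
{
        return (uint8_t *)umem_base + offset + headroom;
}
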
>
> Let's expand a bit here; instead of passing indices to fixed-size
> frames to the fill ring, we now pass a memory region. For simplicity,
> let's say a page. The fill ring descriptor requires offset and
> len. The offset is a global offset from a UMEM perspective, and len
> is the size of the region.
>

I would rather stick with a region equal to a page (regular or huge page,
as defined by the application). The page size can be extracted from the
vm_area_struct in XDP_UMEM_REG (preferred) or configured by the
application.
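
If the 'configured by application' alternative were chosen, the
registration struct could look roughly like this. This is a hypothetical
variant, not the existing uapi definition, and the field names are
assumptions; per-packet headroom would live in xdp_desc as discussed
above:

#include <stdint.h>

struct xdp_umem_reg {
        uint64_t addr;       /* start address of the UMEM area           */
        uint64_t len;        /* total length of the area                 */
        uint32_t page_size;  /* region granularity, replacing frame_size */
};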

> On the Rx ring the descriptor, as you wrote, must be changed as well,
> per your suggestion above. Note that headroom is still needed, since
> XDP can change the size of a packet, so the fixed headroom stated at
> UMEM registration is not sufficient.
>
> This model is obviously more flexible, but then also a bit more
> complex. E.g. for a fixed-frame NIC (like ixgbe), what should the
> behaviour be? Should the fill ring entry be used for only *one* frame,
> or chopped up into multiple receive frames? Should it be configurable?
> Driver specific?

I think driver-specific makes the most sense here. In the case of a
fixed-frame NIC, the driver shall chop the ring entry into multiple
receive frames.
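
Roughly along these lines (all names are made up, just to illustrate
the chopping on the driver side):

#include <stdint.h>

/* Split one application-provided page (given by its offset within the
 * UMEM area) into fixed-size HW slots and collect the slot offsets the
 * driver would post to its RX free list. Returns the number of slots. */
static int chop_page_into_slots(uint64_t page_off, uint32_t page_size,
                                uint32_t slot_size, uint64_t *slots,
                                int max_slots)
{
        uint32_t off;
        int n = 0;

        for (off = 0; off + slot_size <= page_size && n < max_slots;
             off += slot_size)
                slots[n++] = page_off + off;
        return n;
}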

>
> Also, validating the entries in the fill queue requires more work
> (compared to the current index variant). Currently, we only skip
> invalid indices. What should we do when, say, you pass a memory window
> that is too narrow (say 128B) but correct from a UMEM
> perspective? Going this path, we need to add pretty hard constraints
> so we don't end up with too complex code -- because then it'll be too
> slow.

If we stick with pages, the only possible erroneous input will be
'page out of UMEM boundaries'. The validation will essentially be:

if ((offset >= umem->size) || (offset & (umem->page_size - 1)))
    fail

The question is what shall be done if validation fails? Would
SEGFAULT be reasonable? This is more or less equivalent to
dereferencing an invalid pointer.

>
>> In the current proposal you have a notion of 'frame ownership': 'owned
>> by the kernel' or 'owned by the application'. The ownership is
>> transferred by means of enqueueing a frame index in the UMEM 'fill'
>> queue (from application to kernel) or in the UMEM 'tx completion' queue
>> (from kernel to application). If you decide to adopt the 'page'
>> approach, this notion needs to be changed a bit. This is because, in the
>> case of packet forwarding, one and the same page can be used for RX
>> (parts of it enqueued in HW 'free lists') and TX (forwarding of
>> previously RXed packets).
>>
>> I propose to define 'ownership' as the right to manipulate the
>> partitioning of the page into frames. Whenever the application passes a
>> page to the kernel via the UMEM 'fill' queue, the ownership is
>> transferred to the kernel. The application can't allocate packets on
>> this page until the kernel is done with it, but it can change the
>> payload of RXed packets before forwarding them. The kernel can pass
>> ownership back by means of an 'end-of-page' bit in xdp_desc.flags.
>>
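In userspace that could be consumed along these lines (the flag name is
hypothetical, just standing in for the proposed 'end-of-page' bit):

#include <stdint.h>

#define XDP_DESC_END_OF_PAGE (1 << 0)  /* assumed bit, not an existing uapi flag */

/* When the kernel sets this bit on an RX descriptor it will place no
 * further frames on that page, so the right to repartition the page
 * returns to the application. */
static int page_back_to_app(uint8_t flags)
{
        return !!(flags & XDP_DESC_END_OF_PAGE);
}
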
>
> I like the end-of-page mechanism.
>
>> The pages are definitely supposed to be recycled sooner or later. Even
>> if it's not part of the kernel API and the housekeeping implementation
>> resides completely in the application, I would still like to propose a
>> possible (hopefully cost-efficient) solution. The recycling could be
>> achieved by keeping a refcount on pages and recycling a page only when
>> it's owned by the application and its refcount reaches 0.
>>
>> Whenever the application transfers page ownership to the kernel, the
>> refcount shall be initialized to 0. With each incoming RX xdp_desc the
>> corresponding page needs to be identified (xdp_desc.offset >>
>> PAGE_SHIFT) and its refcount incremented. When the packet gets freed,
>> the refcount shall be decremented. If a packet is forwarded in a TX
>> xdp_desc, the refcount gets decremented only on TX completion (again,
>> tx_completion.desc >> PAGE_SHIFT). For packets originating from the
>> application itself, the payload buffers need to be allocated from an
>> empty page owned by the application and the refcount needs to be
>> incremented as well.
>>
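A minimal, single-threaded sketch of that bookkeeping on the
application side (all names are illustrative, and this is userspace
housekeeping, not uapi):

#include <stdbool.h>
#include <stdint.h>

#define UMEM_PAGE_SHIFT 12              /* assume 4 KiB pages for the example */

struct page_state {
        uint32_t refcount;              /* frames in flight on this page */
        bool     owned_by_kernel;       /* true while queued on 'fill'   */
};

/* RX descriptor received: bump the refcount of the page the frame is on. */
static void on_rx(struct page_state *pages, uint64_t desc_offset)
{
        pages[desc_offset >> UMEM_PAGE_SHIFT].refcount++;
}

/* Frame freed, or a forwarded frame seen on the TX completion queue.
 * Returns true when the page is back in application hands and
 * unreferenced, i.e. ready to be recycled via the 'fill' queue. */
static bool on_frame_done(struct page_state *pages, uint64_t desc_offset)
{
        struct page_state *p = &pages[desc_offset >> UMEM_PAGE_SHIFT];

        if (p->refcount > 0)
                p->refcount--;
        return !p->owned_by_kernel && p->refcount == 0;
}
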
>
> Strictly speaking, we're not enforcing correctness in the current
> solution. If the userspace application passes index 1 multiple times
> to the fill ring, and at the same time sends index 1, things will
> break. So, with the existing solution the userspace application
> *still* needs to track the frames. With this new model, the
> tracking/refcounting will be more complex. That might be a concern.
>
> For the multi-pool NICs I think we can still just have one UMEM, and
> let the driver decide in which pool to place a certain chunk of
> memory. Thoughts?

I definitely agree with that. This is HW-specific, and exposing it to the
application would only harm portability.

>
> Now, how do we go forward? I think this is very important, and I will
> hack a copy-mode version for this new API. I'm a bit worried that the
> complexity/configuration space will grow and impact performance, but
> let's see.
>
> To prove that the API works for the NICs you mentioned, we need an
> actual zero-copy implementation for them. Do you think Linaro could
> work on a zero-copy variant for any of the NICs above?
>

Linaro will definitely contribute a zero-copy implementation for some
ARM-based NICs with the 'multi-pool' variation. Implementation of the
'type-writer' variation is up to Chelsio/Netcope; we are only trying to
come up with an API which (most probably) will fit them as well.

>
> Again thanks for bringing this up!
> Björn
>
>
>
>> [1] https://en.wikipedia.org/wiki/Internet_Mix
>>
>> With best regards,
>> Mykyta
