Date:   Mon, 21 May 2018 14:34:45 +0200
From:   Mykyta Iziumtsev <mykyta.iziumtsev@...aro.org>
To:     netdev@...r.kernel.org
Cc:     Mykyta Iziumtsev <mykyta.iziumtsev@...aro.org>,
        Björn Töpel <bjorn.topel@...el.com>,
        "Karlsson, Magnus" <magnus.karlsson@...el.com>,
        qi.z.zhang@...el.com, Francois Ozog <francois.ozog@...aro.org>,
        Ilias Apalodimas <ilias.apalodimas@...aro.org>,
        Brian Brooks <brian.brooks@...aro.org>
Subject: [RFC feedback] AF_XDP and non-Intel hardware

Hi Björn and Magnus,

(This thread is a follow-up to a private dialogue. The intention is
to let the community know that AF_XDP can be enhanced further to make
it compatible with a wider range of NIC vendors.)

There are two NIC variations which don't fit well with the current AF_XDP proposal.

The first variation is represented by some NXP and Cavium NICs. AF_XDP
expects the NIC to put incoming frames into slots defined by the UMEM
area. Here the slot size is set via XDP_UMEM_REG in
xdp_umem.reg.frame_size, and the slots available to the NIC are
communicated to the kernel via the UMEM 'fill' queue. While Intel NICs
support only one slot size, NXP and Cavium NICs support multiple slot
sizes to optimize memory usage (e.g. { 128, 512, 2048 }; please refer
to [1] for the rationale). On frame reception the NIC pulls a slot
from the best-fit pool based on the frame size, as sketched below.
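
As an illustration of the best-fit selection (a minimal sketch, not
existing driver code; the pool sizes and function name are made up):

#include <stddef.h>

static const size_t pool_slot_size[] = { 128, 512, 2048 };
#define NUM_POOLS (sizeof(pool_slot_size) / sizeof(pool_slot_size[0]))

/* Return the index of the smallest pool whose slots can hold the
 * frame, or -1 if the frame exceeds the largest slot size. */
static int best_fit_pool(size_t frame_len)
{
        size_t i;

        for (i = 0; i < NUM_POOLS; i++)
                if (frame_len <= pool_slot_size[i])
                        return (int)i;
        return -1;
}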

The second variation is represented by e.g. Chelsio T5/T6 and Netcope
NICs. As shown above, AF_XDP expects the NIC to put incoming frames at
predefined addresses. This is not the case here, as the NIC is in
charge of placing frames in memory based on its own algorithm. For
example, Chelsio T5/T6 expects to get whole pages from the driver and
puts incoming frames on the page in a 'tape recorder' fashion.
Assuming 'base' is the page address and frame_len[N] is an array of
frame lengths, the frame placement in memory will look like this:
  base + 0
  base + frame_len[0]
  base + frame_len[0] + frame_len[1]
  base + frame_len[0] + frame_len[1] + frame_len[2]
  ...
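
The same placement expressed as code (illustrative only; 'nframes' and
the output array 'addr' are assumptions of the example):

static void place_frames(unsigned char *base, const size_t *frame_len,
                         int nframes, unsigned char **addr)
{
        size_t off = 0;
        int i;

        for (i = 0; i < nframes; i++) {
                addr[i] = base + off;   /* frame i starts here */
                off += frame_len[i];    /* frames follow back-to-back */
        }
}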

To better support these two NIC variations I suggest abandoning the
'frame size' structuring in UMEM and sticking with 'pages' instead.
The NIC kernel driver is then responsible for splitting the provided
pages into the slots expected by the underlying HW (or not splitting
them at all in the case of Chelsio/Netcope).

On XDP_UMEM_REG the application needs to specify page_size. The
application can then pass empty pages to the kernel driver via the
UMEM 'fill' queue by specifying the page offset within the UMEM area.
The xdp_desc format needs to change as well: the frame location will
be defined by an offset from the beginning of the UMEM area instead of
a frame index. As the payload headroom can vary with AF_XDP, we'll
need to specify it in xdp_desc as well. Besides that, it could be
worth considering changing the payload length to u16, as 64k+ frames
aren't very common in networking. The resulting xdp_desc would look
like this:

struct xdp_desc {
        __u64 offset;
        __u16 headroom;
        __u16 len;
        __u8 flags;
        __u8 padding[3];
};
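
For completeness, registration could then look roughly like this (a
sketch only; apart from page_size replacing the per-frame fields, the
layout follows the RFC's xdp_umem_reg and may not match the final API):

struct xdp_umem_reg {
        __u64 addr;      /* start of UMEM area */
        __u64 len;       /* length of UMEM area */
        __u32 page_size; /* replaces frame_size and frame_headroom,
                          * as headroom moves into xdp_desc */
};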

In the current proposal you have a notion of 'frame ownership': 'owned
by the kernel' or 'owned by the application'. The ownership is
transferred by enqueueing the frame index in the UMEM 'fill' queue
(from application to kernel) or in the UMEM 'tx completion' queue
(from kernel to application). If you decide to adopt the 'page'
approach, this notion needs to change a bit, because in the
packet-forwarding case one and the same page can be used for RX (parts
of it enqueued in HW 'free lists') and TX (forwarding of previously
RXed packets).

I propose to define 'ownership' as the right to manipulate the
partitioning of the page into frames. Whenever the application passes
a page to the kernel via the UMEM 'fill' queue, the ownership is
transferred to the kernel. The application can't allocate packets on
this page until the kernel is done with it, but it can change the
payload of RXed packets before forwarding them. The kernel can pass
ownership back by means of an 'end-of-page' marker in xdp_desc.flags,
as sketched below.
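
A sketch of what the application RX side might do with such a marker
(the flag name and the helper functions here are hypothetical, shown
only to illustrate the ownership flow):

#define XDP_DESC_END_OF_PAGE (1 << 0)   /* hypothetical flag name */

/* Assumed application helpers. */
int rx_dequeue(struct xdp_desc *desc);
void handle_frame(const struct xdp_desc *desc);
void page_mark_app_owned(__u64 page_idx);

static void rx_poll(void)
{
        struct xdp_desc desc;

        while (rx_dequeue(&desc) == 0) {
                handle_frame(&desc);
                if (desc.flags & XDP_DESC_END_OF_PAGE)
                        /* the kernel is done carving this page; the
                         * application may repartition or refill it */
                        page_mark_app_owned(desc.offset >> PAGE_SHIFT);
        }
}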

The pages are definitely supposed to be recycled sooner or later. Even
though this is not part of the kernel API and the housekeeping
implementation resides completely in the application, I would still
like to propose a possible (hopefully cost-efficient) solution.
Recycling can be achieved by keeping a refcount on each page and
recycling the page only when it's owned by the application and the
refcount reaches 0.

Whenever the application transfers page ownership to the kernel, the
refcount shall be initialized to 0. With each incoming RX xdp_desc,
the corresponding page is identified (xdp_desc.offset >> PAGE_SHIFT)
and its refcount incremented. When the packet gets freed, the refcount
is decremented. If a packet is forwarded in a TX xdp_desc, the
refcount gets decremented only on TX completion (again,
tx_completion.desc >> PAGE_SHIFT). For packets originating from the
application itself, the payload buffers need to be allocated from an
empty page owned by the application, and the refcount needs to be
incremented as well. A sketch of this scheme follows.
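
Putting the lifecycle together (again just an application-side sketch;
all names are illustrative):

#include <stdint.h>

#define PAGE_SHIFT 12
#define NUM_PAGES  1024

enum page_owner { OWNER_APP, OWNER_KERNEL };

struct page_state {
        enum page_owner owner;
        uint32_t refcount;
};

static struct page_state pages[NUM_PAGES];

static void recycle_page(uint64_t page_idx); /* assumed: returns the
                                              * page to the empty pool */

/* Called when the page is enqueued in the UMEM 'fill' queue. */
static void page_to_kernel(uint64_t page_idx)
{
        pages[page_idx].owner = OWNER_KERNEL;
        pages[page_idx].refcount = 0;
}

/* Called when the kernel signals 'end-of-page' for this page. */
static void page_to_app(uint64_t page_idx)
{
        pages[page_idx].owner = OWNER_APP;
        if (pages[page_idx].refcount == 0)
                recycle_page(page_idx);
}

/* Called per RX xdp_desc, and when allocating TX payload on an
 * application-owned empty page. */
static void page_get(uint64_t offset)
{
        pages[offset >> PAGE_SHIFT].refcount++;
}

/* Called when a packet is freed, or on TX completion for
 * forwarded packets. */
static void page_put(uint64_t offset)
{
        struct page_state *p = &pages[offset >> PAGE_SHIFT];

        if (--p->refcount == 0 && p->owner == OWNER_APP)
                recycle_page(offset >> PAGE_SHIFT);
}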

[1] https://en.wikipedia.org/wiki/Internet_Mix

With best regards,
Mykyta
