[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87plfbcq4m.fsf@toke.dk>
Date: Tue, 10 Jun 2025 22:12:41 +0200
From: Toke Høiland-Jørgensen <toke@...hat.com>
To: Daniel Borkmann <daniel@...earbox.net>, Stanislav Fomichev
<stfomichev@...il.com>, Jesper Dangaard Brouer <hawk@...nel.org>
Cc: bpf@...r.kernel.org, netdev@...r.kernel.org, Jakub Kicinski
<kuba@...nel.org>, lorenzo@...nel.org, Alexei Starovoitov
<ast@...nel.org>, Daniel Borkmann <borkmann@...earbox.net>, Eric Dumazet
<eric.dumazet@...il.com>, "David S. Miller" <davem@...emloft.net>, Paolo
Abeni <pabeni@...hat.com>, sdf@...ichev.me, kernel-team@...udflare.com,
arthur@...hurfabre.com, jakub@...udflare.com, Magnus Karlsson
<magnus.karlsson@...el.com>, Maciej Fijalkowski
<maciej.fijalkowski@...el.com>
Subject: Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for
xdp-rx-metadata.rst
Daniel Borkmann <daniel@...earbox.net> writes:
> On 6/6/25 4:45 AM, Stanislav Fomichev wrote:
>> On 06/03, Jesper Dangaard Brouer wrote:
>>> Update the documentation[1] based on the changes in this patchset.
>>>
>>> [1] https://docs.kernel.org/networking/xdp-rx-metadata.html
>>>
>>> Signed-off-by: Jesper Dangaard Brouer <hawk@...nel.org>
>>> ---
>>> Documentation/networking/xdp-rx-metadata.rst | 74 ++++++++++++++++++++------
>>> net/core/xdp.c | 32 +++++++++++
>>> 2 files changed, 90 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
>>> index a6e0ece18be5..2c54208e4f7e 100644
>>> --- a/Documentation/networking/xdp-rx-metadata.rst
>>> +++ b/Documentation/networking/xdp-rx-metadata.rst
>>> @@ -90,22 +90,64 @@ the ``data_meta`` pointer.
>>> In the future, we'd like to support a case where an XDP program
>>> can override some of the metadata used for building ``skbs``.
>>>
>>> -bpf_redirect_map
>>> -================
>>> -
>>> -``bpf_redirect_map`` can redirect the frame to a different device.
>>> -Some devices (like virtual ethernet links) support running a second XDP
>>> -program after the redirect. However, the final consumer doesn't have
>>> -access to the original hardware descriptor and can't access any of
>>> -the original metadata. The same applies to XDP programs installed
>>> -into devmaps and cpumaps.
>>> -
>>> -This means that for redirected packets only custom metadata is
>>> -currently supported, which has to be prepared by the initial XDP program
>>> -before redirect. If the frame is eventually passed to the kernel, the
>>> -``skb`` created from such a frame won't have any hardware metadata populated
>>> -in its ``skb``. If such a packet is later redirected into an ``XSK``,
>>> -that will also only have access to the custom metadata.
>>> +XDP_REDIRECT
>>> +============
>>> +
>>> +The ``XDP_REDIRECT`` action forwards an XDP frame to another net device or a CPU
>>> +(via cpumap/devmap) for further processing. It is invoked using BPF helpers like
>>> +``bpf_redirect_map()`` or ``bpf_redirect()``. When an XDP frame is redirected,
>>> +the recipient (e.g., an XDP program on a veth device, or the kernel stack via
>>> +cpumap) loses direct access to the original NIC's hardware descriptor and thus
>>> +its hardware metadata
>>> +
>>> +By default, this loss of access means that if an ``xdp_frame`` is redirected and
>>> +then converted to an ``skb``, its ``skb`` fields for hardware-derived metadata
>>> +(like ``skb->hash`` or VLAN info) are not populated from the original
>>> +packet. This can impact features like Generic Receive Offload (GRO). While XDP
>>> +programs can manually save custom data (e.g., using ``bpf_xdp_adjust_meta()``),
>>> +propagating specific *hardware* RX hints to ``skb`` creation requires using the
>>> +kfuncs described below.
>>> +
>>> +To enable propagating selected hardware RX hints, store BPF kfuncs allow an
>>> +XDP program on the initial NIC to read these hints and then explicitly
>>> +*store* them. The kfuncs place this metadata in locations associated with
>>> +the XDP packet buffer, making it available if an ``skb`` is later built or
>>> +the frame is otherwise processed. For instance, RX hash and VLAN tags are
>>> +stored within the XDP frame's addressable headroom, while RX timestamps are
>>> +typically written to an area corresponding to ``skb_shared_info``.
>>> +
>>> +**Crucially, the BPF programmer must call these "store" kfuncs to save the
>>> +desired hardware hints for propagation.** The system does not do this
>>> +automatically. The NIC driver is responsible for ensuring sufficient headroom is
>>> +available; kfuncs may return ``-ENOSPC`` if space is inadequate for storing
>>> +these hints.
>>
>> Why not have a new flag for bpf_redirect that transparently stores all
>> available metadata? If you care only about the redirect -> skb case.
>> Might give us more wiggle room in the future to make it work with
>> traits.
>
> Also q from my side: If I understand the proposal correctly, in order to fully
> populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
> to collect the data from the driver descriptors (indirect call), and then yet
> again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
> xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
> meta data aren't you better off switching to tc(x) directly so the driver can
> do all this natively? :/
I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
hope was (back when we added the initial HW metadata support) that we
would be able to inline them to avoid the function call overhead.
That being said, even with half a dozen function calls, that's still a
lot less overhead from going all the way to TC(x). The goal of the use
case here is to do as little work as possible on the CPU that initially
receives the packet, instead moving the network stack processing (and
skb allocation) to a different CPU with cpumap.
So even if the *total* amount of work being done is a bit higher because
of the kfunc overhead, that can still be beneficial because it's split
between two (or more) CPUs.
I'm sure Jesper has some concrete benchmarks for this lying around
somewhere, hopefully he can share those :)
> Also, have you thought about taking the opportunity to generalize the existing
> struct xsk_tx_metadata? It would be nice to actually use the same/similar struct
> for RX and TX, similarly as done in struct virtio_net_hdr. Such that we have
> XDP_{RX,TX}_METADATA and XDP_{RX,TX}MD_FLAGS_* to describe what meta data we
> have and from a developer PoV this will be a nicely consistent API in XDP. Then
> you could store at the right location in the meta data region just with
> bpf_xdp_metadata_* kfuncs (and/or plain BPF code) and finally set XDP_RX_METADATA
> indicator bit.
Wouldn't this make the whole thing (effectively) UAPI?
-Toke
Powered by blists - more mailing lists