Message-ID: <ca38f2ed-999f-4ce1-8035-8ee9247f27f2@kernel.org>
Date: Fri, 13 Jun 2025 12:59:32 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Daniel Borkmann <borkmann@...earbox.net>,
Stanislav Fomichev <stfomichev@...il.com>,
Lorenzo Bianconi <lorenzo@...nel.org>
Cc: Toke Høiland-Jørgensen <toke@...hat.com>,
Daniel Borkmann <daniel@...earbox.net>, bpf@...r.kernel.org,
netdev@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
Alexei Starovoitov <ast@...nel.org>, Eric Dumazet <eric.dumazet@...il.com>,
"David S. Miller" <davem@...emloft.net>, Paolo Abeni <pabeni@...hat.com>,
sdf@...ichev.me, kernel-team@...udflare.com, arthur@...hurfabre.com,
jakub@...udflare.com, Magnus Karlsson <magnus.karlsson@...el.com>,
Maciej Fijalkowski <maciej.fijalkowski@...el.com>, arzeznik@...udflare.com,
Yan Zhai <yan@...udflare.com>
Subject: Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for
xdp-rx-metadata.rst
On 11/06/2025 05.40, Stanislav Fomichev wrote:
> On 06/11, Lorenzo Bianconi wrote:
>>> Daniel Borkmann <daniel@...earbox.net> writes:
>>>
>> [...]
>>>>>
>>>>> Why not have a new flag for bpf_redirect that transparently stores all
>>>>> available metadata? If you care only about the redirect -> skb case.
>>>>> Might give us more wiggle room in the future to make it work with
>>>>> traits.
>>>>
>>>> Also q from my side: If I understand the proposal correctly, in order to fully
>>>> populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
>>>> to collect the data from the driver descriptors (indirect call), and then yet
>>>> again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
>>>> xdp_rx_meta again. This seems rather costly, and once you add more metadata
>>>> kfuncs, aren't you better off switching to tc(x) directly so the driver can
>>>> do all this natively? :/
>>>
>>> I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
>>> hope was (back when we added the initial HW metadata support) that we
>>> would be able to inline them to avoid the function call overhead.
>>>
>>> That being said, even with half a dozen function calls, that's still a
>>> lot less overhead than going all the way to TC(x). The goal of the use
>>> case here is to do as little work as possible on the CPU that initially
>>> receives the packet, instead moving the network stack processing (and
>>> skb allocation) to a different CPU with cpumap.
>>>
>>> So even if the *total* amount of work being done is a bit higher because
>>> of the kfunc overhead, that can still be beneficial because it's split
>>> between two (or more) CPUs.
>>>
>>> I'm sure Jesper has some concrete benchmarks for this lying around
>>> somewhere, hopefully he can share those :)
>>
>> Another possible approach would be to have some utility functions (not kfuncs)
>> used to 'store' the hw metadata in the xdp_frame that are executed in each
>> driver codebase before performing XDP_REDIRECT. The downside of this approach
>> is that we need to parse the hw metadata twice if the eBPF program that is
>> bound to the NIC is consuming this info. What do you think?
>
> That's the option I was asking about. I'm assuming we should be able
> to reuse the existing xmo metadata callbacks for this. Hopefully we should
> also be able to hide it from the drivers.

I'm not against this idea of transparently storing all available metadata
in the xdp_frame (via some flag/config), but it does not fit our
production use-case. I also think that this can be added later.

We need the ability to overwrite the RX-hash value before redirecting the
packet to CPUMAP (remember, as the cover letter describes, the RX-hash is
needed *before* the GRO engine processes the packet in CPUMAP, which is
before TC/BPF).
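
To make the flow concrete, something along these lines is what we have in
mind (rough sketch only; the store kfunc name/signature is assumed here as
bpf_xdp_store_rx_hash(), the exact form in this series may differ):

/* SPDX-License-Identifier: GPL-2.0 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Existing RX metadata kfunc (see xdp-rx-metadata.rst) */
extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
				    enum xdp_rss_hash_type *rss_type) __ksym;

/* Store kfunc proposed in this series -- signature assumed for this sketch */
extern int bpf_xdp_store_rx_hash(struct xdp_md *ctx, __u32 hash,
				 enum xdp_rss_hash_type rss_type) __ksym;

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

SEC("xdp")
int xdp_store_hash_and_redirect(struct xdp_md *ctx)
{
	enum xdp_rss_hash_type rss_type;
	__u32 hash;

	/* Read the HW RX-hash from the driver descriptor ... */
	if (bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type))
		return XDP_PASS;

	/* ... and persist it in the xdp_frame, so the CPUMAP side can
	 * seed skb->hash before the GRO engine runs (before TC/BPF).
	 */
	if (bpf_xdp_store_rx_hash(ctx, hash, rss_type))
		return XDP_PASS;

	/* Spread flows across remote CPUs based on that same hash */
	return bpf_redirect_map(&cpu_map, hash % 64, XDP_PASS);
}

char _license[] SEC("license") = "GPL";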

Our use-case for overwriting the RX-hash value is load-balancing IPSEC
encapsulated tunnel traffic at the XDP stage via CPUMAP redirects. This is
generally applicable to tunneling, in that we want to store the RX-hash
of the tunnel's inner headers. Our IPSEC use-case has a variation where
we only decrypt[1] the first 32 bytes to calculate an LB hash over the
inner headers, and then redirect the original packet to CPUMAP. The
IPSEC packets travel into a veth device, which we discovered will send
everything on a single RX-queue... because the RX-hash (calculated by the
netstack) will obviously use the outer headers, meaning this LB doesn't scale.
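
For the tunnel case, the LB step then looks roughly like this (reusing the
map and kfunc declarations from the sketch above; it assumes the inner
IPv4/L4 tuple has already been recovered, e.g. via bpf_crypto_decrypt()[1]
of the first 32 bytes into a stack buffer, and the hash mix here is only
illustrative, not our production algorithm):

struct inner_tuple {
	__be32 saddr;
	__be32 daddr;
	__be16 sport;
	__be16 dport;
};

/* Derive the LB hash from the *inner* headers, so RX-queue selection and
 * GRO flow separation no longer collapse onto the outer ESP tuple. */
static __always_inline __u32 inner_flow_hash(const struct inner_tuple *t)
{
	__u32 h = t->saddr ^ t->daddr;

	h ^= ((__u32)t->sport << 16) | (__u32)t->dport;
	h ^= h >> 16;
	return h ?: 1;	/* avoid 0, like the flow dissector does */
}

SEC("xdp")
int xdp_ipsec_lb(struct xdp_md *ctx)
{
	struct inner_tuple t = {};
	__u32 hash;

	/* ... decrypt of the first 32 bytes fills 't' (elided here) ... */

	hash = inner_flow_hash(&t);
	bpf_xdp_store_rx_hash(ctx, hash, XDP_RSS_TYPE_L4_IPV4_TCP);
	return bpf_redirect_map(&cpu_map, hash % 64, XDP_PASS);
}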

I hope this makes it clear why we need the BPF-prog ability to explicitly
"store" the RX-hash in the xdp_frame.

--Jesper
[1] https://docs.ebpf.io/linux/kfuncs/bpf_crypto_decrypt/