Message-ID: <1221e418-a9b8-41e8-a940-4e7a25288fe0@kernel.org>
Date: Tue, 17 Jun 2025 18:15:33 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Stanislav Fomichev <stfomichev@...il.com>
Cc: Daniel Borkmann <borkmann@...earbox.net>,
Lorenzo Bianconi <lorenzo@...nel.org>,
Toke Høiland-Jørgensen <toke@...hat.com>,
Daniel Borkmann <daniel@...earbox.net>, bpf@...r.kernel.org,
netdev@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
Alexei Starovoitov <ast@...nel.org>, Eric Dumazet <eric.dumazet@...il.com>,
"David S. Miller" <davem@...emloft.net>, Paolo Abeni <pabeni@...hat.com>,
sdf@...ichev.me, kernel-team@...udflare.com, arthur@...hurfabre.com,
jakub@...udflare.com, Magnus Karlsson <magnus.karlsson@...el.com>,
Maciej Fijalkowski <maciej.fijalkowski@...el.com>, arzeznik@...udflare.com,
Yan Zhai <yan@...udflare.com>
Subject: Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for
xdp-rx-metadata.rst
On 16/06/2025 17.34, Stanislav Fomichev wrote:
> On 06/13, Jesper Dangaard Brouer wrote:
>>
>> On 11/06/2025 05.40, Stanislav Fomichev wrote:
>>> On 06/11, Lorenzo Bianconi wrote:
>>>>> Daniel Borkmann <daniel@...earbox.net> writes:
>>>>>
>>>> [...]
>>>>>>>
>>>>>>> Why not have a new flag for bpf_redirect that transparently stores all
>>>>>>> available metadata? If you care only about the redirect -> skb case.
>>>>>>> Might give us more wiggle room in the future to make it work with
>>>>>>> traits.
>>>>>>
>>>>>> Also q from my side: If I understand the proposal correctly, in order to fully
>>>>>> populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
>>>>>> to collect the data from the driver descriptors (indirect call), and then yet
>>>>>> again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
>>>>>> xdp_rx_meta again. This seems rather costly, and once you add more kfuncs with
>>>>>> metadata, aren't you better off switching to tc(x) directly so the driver can
>>>>>> do all this natively? :/
>>>>>
>>>>> I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
>>>>> hope was (back when we added the initial HW metadata support) that we
>>>>> would be able to inline them to avoid the function call overhead.
>>>>>
>>>>> That being said, even with half a dozen function calls, that's still a
>>>>> lot less overhead than going all the way to TC(x). The goal of the use
>>>>> case here is to do as little work as possible on the CPU that initially
>>>>> receives the packet, instead moving the network stack processing (and
>>>>> skb allocation) to a different CPU with cpumap.
>>>>>
>>>>> So even if the *total* amount of work being done is a bit higher because
>>>>> of the kfunc overhead, that can still be beneficial because it's split
>>>>> between two (or more) CPUs.
>>>>>
>>>>> I'm sure Jesper has some concrete benchmarks for this lying around
>>>>> somewhere, hopefully he can share those :)
>>>>
>>>> Another possible approach would be to have some utility functions (not kfuncs)
>>>> used to 'store' the hw metadata in the xdp_frame that are executed in each
>>>> driver codebase before performing XDP_REDIRECT. The downside of this approach
>>>> is that we need to parse the hw metadata twice if the eBPF program that is bound
>>>> to the NIC is consuming this info. What do you think?
>>>
>>> That's the option I was asking about. I'm assuming we should be able
>>> to reuse existing xmo metadata callbacks for this. We should be able
>>> to hide it from the drivers also hopefully.
>>
>> I'm not against this idea of transparently storing all available metadata
>> into the xdp_frame (via some flag/config), but it does not fit our
>> production use-case. I also think that this can be added later.
>>
>> We need the ability to overwrite the RX-hash value before redirecting the
>> packet to CPUMAP (remember, as the cover-letter describes, the RX-hash is
>> needed *before* the GRO engine processes the packet in CPUMAP. This is
>> before TC/BPF).
>
> Makes sense. Can we make GRO not flush a bucket for same_flow=0 instead?
> This will also make it work better for other regular tunneled traffic.
> Setting hash in BPF to make GRO go fast seems too implementation specific :-(
I feel misunderstood here. This was a GRO side-note to remind reviewers
that netstack expects the RX-hash to be non-zero at napi_gro_receive().
This is not about making GRO faster, but about complying with netstack.
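
To make the "store before redirect" step concrete, below is a minimal
sketch (not our production prog) of the kind of BPF-prog this enables.
The reader kfunc bpf_xdp_metadata_rx_hash() already exists upstream; the
store kfunc is the one this series adds (written here as
bpf_xdp_store_rx_hash() with a guessed signature, see the other patches
for the exact interface):

/* Minimal sketch: read HW RX-hash, store it in the xdp_frame, redirect
 * to CPUMAP.  The cpu_map entries are populated from userspace.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
				    enum xdp_rss_hash_type *rss_type) __ksym;
extern int bpf_xdp_store_rx_hash(struct xdp_md *ctx, __u32 hash,
				 enum xdp_rss_hash_type rss_type) __ksym;

#define NR_CPUS 64

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, NR_CPUS);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

SEC("xdp")
int xdp_store_hash_redirect(struct xdp_md *ctx)
{
	enum xdp_rss_hash_type rss_type;
	__u32 hash;

	/* Read the HW RX-hash from the driver descriptor (indirect call) */
	if (bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type) < 0)
		return XDP_PASS;

	/* Store it in the xdp_frame, so the skb built later on the remote
	 * CPU carries a non-zero hash before napi_gro_receive().
	 */
	bpf_xdp_store_rx_hash(ctx, hash, rss_type);

	/* Do as little as possible on this CPU; the skb allocation and
	 * netstack processing happen on the CPUMAP CPU.
	 */
	return bpf_redirect_map(&cpu_map, hash % NR_CPUS, 0);
}

char _license[] SEC("license") = "GPL";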
The important BPF optimization is the part that you forgot to quote in
the reply, so let me reproduce what I wrote below. TL;DR: the RX-hash
needs to be calculated over the tunnel inner-headers, else the SW hash
calc over the outer-headers will land everything on the same veth RX-queue.
On 13/06/2025 12.59, Jesper Dangaard Brouer wrote:
>
>> Our use-case for overwriting the RX-hash value is load-balancing
>> IPSEC encapsulated tunnel traffic at XDP stage via CPUMAP redirects.
>> This is generally applicable to tunneling in that we want to store
>> the RX-hash of the tunnel's inner-headers. Our IPSEC use-case has a
>> variation that we only decrypt[1] the first 32 bytes to calc a LB
>> hash over inner-headers, and then redirect the original packet to
>> CPUMAP. The IPSEC packets travel into a veth device, which we
>> discovered will send everything on a single RX-queue... because
>> RX-hash (calc by netstack) will obviously use the outer-headers,
>> meaning this LB doesn't scale.
>>
>> I hope this makes it clear why we need the BPF-prog ability to
>> explicitly "store" the RX-hash in the xdp-frame.
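
For the tunnel use-case quoted above, the store step is exactly the
same; the only difference is which hash gets stored. Very roughly (the
ESP partial-decrypt and inner-header parsing are elided, and the hash
mix below is a made-up placeholder rather than what we actually use),
the tail of the sketch further up becomes:

/* Placeholder inner-flow hash (a real prog would compile a jhash
 * implementation into the prog).  Result kept non-zero, per the
 * napi_gro_receive() expectation mentioned above.
 */
static __always_inline __u32 inner_flow_hash(__u32 saddr, __u32 daddr,
					     __u32 ports)
{
	__u32 h = saddr ^ daddr ^ ports;

	h ^= h >> 16;
	h *= 0x7feb352d;
	h ^= h >> 15;
	return h ?: 1;
}

	/* In the prog body, after decrypting just enough of the ESP
	 * payload to read the inner IP/L4 headers (elided):
	 */
	hash = inner_flow_hash(inner_saddr, inner_daddr, inner_ports);
	bpf_xdp_store_rx_hash(ctx, hash, XDP_RSS_TYPE_L4_IPV4_TCP);
	return bpf_redirect_map(&cpu_map, hash % NR_CPUS, 0);

This way both the CPUMAP CPU selection and the RX-hash that ends up in
the skb are based on the inner flow, which is what lets the veth RX
side spread the flows instead of hashing identical outer-headers.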