lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <76a5330e-dc52-41ea-89c2-ddcde4b414bd@kernel.org>
Date: Tue, 17 Jun 2025 16:47:18 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Toke Høiland-Jørgensen <toke@...hat.com>,
 Lorenzo Bianconi <lorenzo@...nel.org>,
 Stanislav Fomichev <stfomichev@...il.com>
Cc: Daniel Borkmann <daniel@...earbox.net>, bpf@...r.kernel.org,
 netdev@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
 Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann
 <borkmann@...earbox.net>, Eric Dumazet <eric.dumazet@...il.com>,
 "David S. Miller" <davem@...emloft.net>, Paolo Abeni <pabeni@...hat.com>,
 sdf@...ichev.me, kernel-team@...udflare.com, arthur@...hurfabre.com,
 jakub@...udflare.com, Magnus Karlsson <magnus.karlsson@...el.com>,
 Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Subject: Re: [PATCH bpf-next V1 7/7] net: xdp: update documentation for
 xdp-rx-metadata.rst



On 17/06/2025 13.50, Toke Høiland-Jørgensen wrote:
> Lorenzo Bianconi <lorenzo@...nel.org> writes:
> 
>>> On 06/16, Lorenzo Bianconi wrote:
>>>> On Jun 10, Stanislav Fomichev wrote:
>>>>> On 06/11, Lorenzo Bianconi wrote:
>>>>>>> Daniel Borkmann <daniel@...earbox.net> writes:
>>>>>>>
>>>>>> [...]
>>>>>>>>>
>>>>>>>>> Why not have a new flag for bpf_redirect that transparently stores all
>>>>>>>>> available metadata? If you care only about the redirect -> skb case.
>>>>>>>>> Might give us more wiggle room in the future to make it work with
>>>>>>>>> traits.
>>>>>>>>
>>>>>>>> Also q from my side: If I understand the proposal correctly, in order to fully
>>>>>>>> populate an skb at some point, you have to call all the bpf_xdp_metadata_* kfuncs
>>>>>>>> to collect the data from the driver descriptors (indirect call), and then yet
>>>>>>>> again all equivalent bpf_xdp_store_rx_* kfuncs to re-store the data in struct
>>>>>>>> xdp_rx_meta again. This seems rather costly and once you add more kfuncs with
>>>>>>>> meta data aren't you better off switching to tc(x) directly so the driver can
>>>>>>>> do all this natively? :/
>>>>>>>
>>>>>>> I agree that the "one kfunc per metadata item" scales poorly. IIRC, the
>>>>>>> hope was (back when we added the initial HW metadata support) that we
>>>>>>> would be able to inline them to avoid the function call overhead.
>>>>>>>
>>>>>>> That being said, even with half a dozen function calls, that's still a
>>>>>>> lot less overhead from going all the way to TC(x). The goal of the use
>>>>>>> case here is to do as little work as possible on the CPU that initially
>>>>>>> receives the packet, instead moving the network stack processing (and
>>>>>>> skb allocation) to a different CPU with cpumap.
>>>>>>>
>>>>>>> So even if the *total* amount of work being done is a bit higher because
>>>>>>> of the kfunc overhead, that can still be beneficial because it's split
>>>>>>> between two (or more) CPUs.
>>>>>>>
>>>>>>> I'm sure Jesper has some concrete benchmarks for this lying around
>>>>>>> somewhere, hopefully he can share those :)
>>>>>>
>>>>>> Another possible approach would be to have some utility functions (not kfuncs)
>>>>>> used to 'store' the hw metadata in the xdp_frame that are executed in each
>>>>>> driver codebase before performing XDP_REDIRECT. The downside of this approach
>>>>>> is we need to parse the hw metadata twice if the eBPF program that is bounded
>>>>>> to the NIC is consuming these info. What do you think?
>>>>>
>>>>> That's the option I was asking about. I'm assuming we should be able
>>>>> to reuse existing xmo metadata callbacks for this. We should be able
>>>>> to hide it from the drivers also hopefully.
>>>>
>>>> If we move the hw metadata 'store' operations to the driver codebase (running
>>>> xmo metadata callbacks before performing XDP_REDIRECT), we will parse the hw
>>>> metadata twice if we attach to the NIC an AF_XDP program consuming the hw
>>>> metadata, right? One parsing is done by the AF_XDP hw metadata kfunc, and the
>>>> second one would be performed by the native driver codebase.
>>>
>>> The native driver codebase will parse the hw metadata only if the
>>> bpf_redirect set some flag, so unless I'm missing something, there
>>> should not be double parsing. (but it's all user controlled, so doesn't
>>> sound like a problem?)
>>
>> I do not have a strong opinion about it, I guess it is fine, but I am not
>> 100% sure if it fits in Jesper's use case.
>> @Jesper: any input on it?
> 
> FWIW, one of the selling points of XDP is (IMO) that it allows you to
> basically override any processing the stack does. I think this should
> apply to hardware metadata as well (for instance, if the HW metadata
> indicates that a packet is TCP, and XDP performs encapsulation before
> PASSing it, the metadata should be overridden to reflect this).

I fully agree :-)

> So if the driver populates these fields natively, I think this should
> either happen before the XDP program is run (so it can be overridden),
> or it should check if the XDP program already set the values and leave
> them be if so. Both of those obviously incur overhead; not sure which
> would be more expensive, though...
> 

Yes, if the XDP BPF-prog choose to override a field, then it's value
should "win". As I explained in [0], this is our first main production
use-case.  To override the RX-hash with the tunnel inner-headers.

  [0] 
https://lore.kernel.org/all/ca38f2ed-999f-4ce1-8035-8ee9247f27f2@kernel.org/

Later we will look at using the vlan tag. Today we have disabled HW
vlan-offloading, because XDP originally didn't support accessing HW vlan
tags.  Next hurdle for us is our usage of tail-calls, which cannot
access the HW RX-metadata via the "get" kfuncs.  Not part of this
patchset, but we have considered if someone calls the "store" kfunc, if
tail-calls invoking the "get" kfunc could return the stored value
(even-though it is not device bound).

The last field is the HW timestamp, which we also don't currently use.
Notice that this patchset stores the timestamp in skb_shared_info area.
Storing in this area will likely cause a cache-miss. Thus, "if the
driver populates these fields natively" and automatically then it will
be a performance concern.

For now, I just need to override the RX-hash and have this survive to
CPUMAP (+veth).  Adding some flag that stores all three HW metadata
natively in the driver is something that we can add later IMHO.

--Jesper



Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ