lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aHeKYZY7l2i1xwel@lore-desk>
Date: Wed, 16 Jul 2025 13:17:53 +0200
From: Lorenzo Bianconi <lorenzo@...nel.org>
To: Stanislav Fomichev <stfomichev@...il.com>
Cc: Jesper Dangaard Brouer <hawk@...nel.org>, bpf@...r.kernel.org,
	netdev@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
	Alexei Starovoitov <ast@...nel.org>,
	Daniel Borkmann <borkmann@...earbox.net>,
	Eric Dumazet <eric.dumazet@...il.com>,
	"David S. Miller" <davem@...emloft.net>,
	Paolo Abeni <pabeni@...hat.com>, sdf@...ichev.me,
	kernel-team@...udflare.com, arthur@...hurfabre.com,
	jakub@...udflare.com
Subject: Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for
 XDP_REDIRECTed packets

On Jul 11, Stanislav Fomichev wrote:
> On 07/09, Lorenzo Bianconi wrote:
> > On Jul 07, Stanislav Fomichev wrote:
> > > On 07/03, Jesper Dangaard Brouer wrote:
> > > > 
> > > > 
> > > > On 02/07/2025 18.05, Stanislav Fomichev wrote:
> > > > > On 07/02, Jesper Dangaard Brouer wrote:
> > > > > > This patch series introduces a mechanism for an XDP program to store RX
> > > > > > metadata hints - specifically rx_hash, rx_vlan_tag, and rx_timestamp -
> > > > > > into the xdp_frame. These stored hints are then used to populate the
> > > > > > corresponding fields in the SKB that is created from the xdp_frame
> > > > > > following an XDP_REDIRECT.
> > > > > > 
> > > > > > The chosen RX metadata hints intentionally map to the existing NIC
> > > > > > hardware metadata that can be read via kfuncs [1]. While this design
> > > > > > allows a BPF program to read and propagate existing hardware hints, our
> > > > > > primary motivation is to enable setting custom values. This is important
> > > > > > for use cases where the hardware-provided information is insufficient or
> > > > > > needs to be calculated based on packet contents unavailable to the
> > > > > > hardware.
> > > > > > 
> > > > > > The primary motivation for this feature is to enable scalable load
> > > > > > balancing of encapsulated tunnel traffic at the XDP layer. When tunnelled
> > > > > > packets (e.g., IPsec, GRE) are redirected via cpumap or to a veth device,
> > > > > > the networking stack later calculates a software hash based on the outer
> > > > > > headers. For a single tunnel, these outer headers are often identical,
> > > > > > causing all packets to be assigned the same hash. This collapses all
> > > > > > traffic onto a single RX queue, creating a performance bottleneck and
> > > > > > defeating receive-side scaling (RSS).
> > > > > > 
> > > > > > Our immediate use case involves load balancing IPsec traffic. For such
> > > > > > tunnelled traffic, any hardware-provided RX hash is calculated on the
> > > > > > outer headers and is therefore incorrect for distributing inner flows.
> > > > > > There is no reason to read the existing value, as it must be recalculated.
> > > > > > In our XDP program, we perform a partial decryption to access the inner
> > > > > > headers and calculate a new load-balancing hash, which provides better
> > > > > > flow distribution. However, without this patch set, there is no way to
> > > > > > persist this new hash for the network stack to use post-redirect.
> > > > > > 
> > > > > > This series solves the problem by introducing new BPF kfuncs that allow an
> > > > > > XDP program to write e.g. the hash value into the xdp_frame. The
> > > > > > __xdp_build_skb_from_frame() function is modified to use this stored value
> > > > > > to set skb->hash on the newly created SKB. As a result, the veth driver's
> > > > > > queue selection logic uses the BPF-supplied hash, achieving proper
> > > > > > traffic distribution across multiple CPU cores. This also ensures that
> > > > > > consumers, like the GRO engine, can operate effectively.
> > > > > > 
> > > > > > We considered XDP traits as an alternative to adding static members to
> > > > > > struct xdp_frame. Given the immediate need for this functionality and the
> > > > > > current development status of traits, we believe this approach is a
> > > > > > pragmatic solution. We are open to migrating to a traits-based
> > > > > > implementation if and when they become a generally accepted mechanism for
> > > > > > such extensions.
> > > > > > 
> > > > > > [1] https://docs.kernel.org/networking/xdp-rx-metadata.html
> > > > > > ---
> > > > > > V1: https://lore.kernel.org/all/174897271826.1677018.9096866882347745168.stgit@firesoul/
> > > > > 
> > > > > No change log?
> > > > 
> > > > We have fixed selftest as requested by Alexie.
> > > > And we have updated cover-letter and doc as you Stanislav requested.
> > > > 
> > > > > 
> > > > > Btw, any feedback on the following from v1?
> > > > > - https://lore.kernel.org/netdev/aFHUd98juIU4Rr9J@mini-arch/
> > > > 
> > > > Addressed as updated cover-letter and documentation. I hope this helps
> > > > reviewers understand the use-case, as the discussion turn into "how do we
> > > > transfer all HW metadata", which is NOT what we want (and a waste of
> > > > precious cycles).
> > > > 
> > > > For our use-case, it doesn't make sense to "transfer all HW metadata".
> > > > In fact we don't even want to read the hardware RH-hash, because we already
> > > > know it is wrong (for tunnels), we just want to override the RX-hash used at
> > > > SKB creation.  We do want the BPF programmers flexibility to call these
> > > > kfuncs individually (when relevant).
> > > > 
> > > > > - https://lore.kernel.org/netdev/20250616145523.63bd2577@kernel.org/
> > > > 
> > > > I feel pressured into critiquing Jakub's suggestion, hope this is not too
> > > > harsh.  First of all it is not relevant to our this patchset use-case, as it
> > > > focus on all HW metadata.
> > > 
> > > [..]
> > > 
> > > > Second, I disagree with the idea/mental model of storing in a
> > > > "driver-specific format". The current implementation of driver-specific
> > > > kfunc helpers that "get the metadata" is already doing a conversion to a
> > > > common format, because the BPF-programmer naturally needs this to be the
> > > > same across drivers.  Thus, it doesn't make sense to store it back in a
> > > > "driver-specific format", as that just complicate things.  My mental model
> > > > is thus, that after the driver-specific "get" operation to result is in a
> > > > common format, that is simply defined by the struct type of the kfunc, which
> > > > is both known by the kernel and BPF-prog.
> > > 
> > > Having get/set model seems a bit more generic, no? Potentially giving us the
> > > ability to "correct" HW metadata for the non-redirected cases as well.
> > > Plus we don't hard-code the (internal) layout. Solving only xdp_redirect
> > > seems a bit too narrow, idk..
> > 
> > I can't see what the non-redirected use-case could be. Can you please provide
> > more details?
> > Moreover, can it be solved without storing the rx_hash (or the other
> > hw-metadata) in a non-driver specific format?
> 
> Having setters feels more generic than narrowly solving only the redirect,
> but I don't have a good use-case in mind.
> 
> > Storing the hw-metadata in some of hw-specific format in xdp_frame will not
> > allow to consume them directly building the skb and we will require to decode
> > them again. What is the upside/use-case of this approach? (not considering the
> > orthogonality with the get method).
> 
> If we add the store kfuncs to regular drivers, the metadata  won't be stored
> in the xdp_frame; it will go into the rx descriptors so regular path that
> builds skbs will use it.

IIUC, the described use-case would be to modify the hw metadata via a
'setter' kfunc executed by an eBPF program bounded to the NIC and to store
the new metadata in the DMA descriptor in order to be consumed by the driver
codebase building the skb, right?
If so:
- we can get the same result just storing (running a kfunc) the modified hw
  metadata in the xdp_buff struct using a well-known/generic layout and
  consume it in the driver codebase (e.g. if the bounded eBPF program
  returns XDP_PASS) using a generic xdp utility routine. This part is not in
  the current series.
- Using this approach we are still not preserving the hw metadata if we pass
  the xdp_frame to a remote CPU returning XDP_REDIRCT (we need to add more
  code)
- I am not completely sure if can always modify the DMA descriptor directly
  since it is DMA mapped.

What do you think?

Regards,
Lorenzo


Download attachment "signature.asc" of type "application/pgp-signature" (229 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ