[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <D9KNCG5YG8DF.284WM1C7ABNNX@arthurfabre.com>
Date: Thu, 01 May 2025 09:30:36 +0200
From: "Arthur Fabre" <arthur@...hurfabre.com>
To: "Alexei Starovoitov" <alexei.starovoitov@...il.com>,
Toke Høiland-Jørgensen <toke@...hat.com>
Cc: "Network Development" <netdev@...r.kernel.org>, "bpf"
<bpf@...r.kernel.org>, "Jakub Sitnicki" <jakub@...udflare.com>, "Jesper
Dangaard Brouer" <hawk@...nel.org>, "Yan Zhai" <yan@...udflare.com>,
<jbrandeburg@...udflare.com>, <lbiancon@...hat.com>, "Alexei Starovoitov"
<ast@...nel.org>, "Jakub Kicinski" <kuba@...nel.org>, "Eric Dumazet"
<edumazet@...gle.com>
Subject: Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for
packet metadata
On Wed Apr 30, 2025 at 6:29 PM CEST, Alexei Starovoitov wrote:
> On Wed, Apr 30, 2025 at 2:19 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
> >
> > Alexei Starovoitov <alexei.starovoitov@...il.com> writes:
> >
> > > On Fri, Apr 25, 2025 at 12:27 PM Arthur Fabre <arthur@...hurfabre.com> wrote:
> > >>
> > >> On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
> > >> > On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@...hurfabre.com> wrote:
> > >> > >
> > >> > > +/**
> > >> > > + * trait_set() - Set a trait key.
> > >> > > + * @traits: Start of trait store area.
> > >> > > + * @hard_end: Hard limit the trait store can currently grow up against.
> > >> > > + * @key: The key to set.
> > >> > > + * @val: The value to set.
> > >> > > + * @len: The length of the value.
> > >> > > + * @flags: Unused for now. Should be 0.
> > >> > > + *
> > >> > > + * Return:
> > >> > > + * * %0 - Success.
> > >> > > + * * %-EINVAL - Key or length invalid.
> > >> > > + * * %-EBUSY - Key previously set with different length.
> > >> > > + * * %-ENOSPC - Not enough room left to store value.
> > >> > > + */
> > >> > > +static __always_inline
> > >> > > +int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u64 flags)
> > >> > > +{
> > >> > > + if (!__trait_valid_key(key) || !__trait_valid_len(len))
> > >> > > + return -EINVAL;
> > >> > > +
> > >> > > + struct __trait_hdr *h = (struct __trait_hdr *)traits;
> > >> > > +
> > >> > > + /* Offset of value of this key. */
> > >> > > + int off = __trait_offset(*h, key);
> > >> > > +
> > >> > > + if ((h->high & (1ull << key)) || (h->low & (1ull << key))) {
> > >> > > + /* Key is already set, but with a different length */
> > >> > > + if (__trait_total_length(__trait_and(*h, (1ull << key))) != len)
> > >> > > + return -EBUSY;
> > >> > > + } else {
> > >> > > + /* Figure out if we have enough room left: total length of everything now. */
> > >> > > + if (traits + sizeof(struct __trait_hdr) + __trait_total_length(*h) + len > hard_end)
> > >> > > + return -ENOSPC;
> > >> >
> > >> > I'm still not excited about having two metadata-s
> > >> > in front of the packet.
> > >> > Why cannot traits use the same metadata space ?
> > >> >
> > >> > For trait_set() you already pass hard_end and have to check it
> > >> > at run-time.
> > >> > If you add the same hard_end to trait_get/del the kfuncs will deal
> > >> > with possible corruption of metadata by the program.
> > >> > Transition from xdp to skb will be automatic. The code won't care that traits
> > >> > are there. It will just copy all metadata from xdp to skb. Corrupted or not.
> > >> > bpf progs in xdp and skb might even use the same kfuncs
> > >> > (or two different sets if the verifier is not smart enough right now).
> > >>
> > >> Good idea, that would solve the corruption problem.
> > >>
> > >> But I think storing metadata at the "end" of the headroom (ie where
> > >> XDP metadata is today) makes it harder to persist in the SKB layer.
> > >> Functions like __skb_push() assume that skb_headroom() bytes are
> > >> available just before skb->data.
> > >>
> > >> They can be updated to move XDP metadata out of the way, but this
> > >> assumption seems pretty pervasive.
> > >
> > > The same argument can be flipped.
> > > Why does the skb layer need to push?
> > > If it needs to encapsulate it will forward to tunnel device
> > > to go on the wire. At this point any kind of metadata is going
> > > to be lost on the wire. bpf prog would need to convert
> > > metadata into actual on the wire format or stash it
> > > or send to user space.
> > > I don't see a use case where skb layer would move medadata by N
> > > bytes, populate these N bytes with "???" and pass to next skb layer.
> > > skb layers strip (pop) the header when it goes from ip to tcp to user space.
> > > No need to move metadata.
> > >
> > >> By using the "front" of the headroom, we can hide that from the rest of
> > >> the SKB code. We could even update skb->head to completely hide the
> > >> space used at the front of the headroom.
> > >> It also avoids the cost of moving the metadata around (but maybe that
> > >> would be insignificant).
> > >
> > > That's a theory. Do you actually have skb layers pushing things
> > > while metadata is there?
> >
> > Erm, any encapsulation? UDP tunnels, wireguard, WiFi, etc. There are
> > almost 1000 calls to skb_push() all over the kernel. One of the primary
> > use cases for traits is to tag a packet with an ID to follow it
> > throughout its lifetime inside the kernel. This absolutely includes
> > encapsulation; in fact, having the tracing ID survive these kinds of
> > transformations is one of the primary motivators for this work.
>
> and the assumption here is that placing traits in the front will
> be enough for all possible encaps that the stack can do?
> Hopefully not. Moving traits due to skb_push() is mandatory anyway.
The network stack seems to assume there is at least 32 bytes of headroom
available, that's guaranteed by NET_SKB_PAD.
Callers that need more seem to check if there's enough headroom
available first, and pskb_expand_head() if not.
Placing traits at the front lets most of this work naturally if we leave
NET_SKB_PAD headroom.
But I haven't looked at every single __skb_push() call, maybe there are
cases I don't know about.
>
> The other point you're arguing is that the current metadata approach
> is broken, since it doesn't work for encap?
> pskb_expand_head() does skb_metadata_clear().
> Though I don't remember anyone complaining of that behavior,
No one can complain yet, because metadata isn't accessible after TC
today. It's limited to just XDP and TC, which avoids most of the SKB
processing and these trickier cases.
> let's assume it's a problem indeed.
> With traits in front the pskb_expand_head() should either fail or move
> traits when it runs out of room.
> Let's be consistent and do the same for regular metadata.
>
> My main point is we should not introduce a second kind of metadata.
> If the current metadata doesn't quite work, fix it instead of adding
> a new one.
I'll think about this more. It's tempting to just move the current
metadata area to the front of the headroom, but that breaks things like
AF_XDP that expect the metadata to be right before the packet data.
Powered by blists - more mailing lists