Message-ID: <CAADnVQKe3Jfd+pVt868P32-m2a-moP4H7ms_kdZnrYALCxx53Q@mail.gmail.com>
Date: Tue, 29 Apr 2025 16:36:28 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Arthur Fabre <arthur@...hurfabre.com>
Cc: Network Development <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
Jakub Sitnicki <jakub@...udflare.com>, Jesper Dangaard Brouer <hawk@...nel.org>, Yan Zhai <yan@...udflare.com>,
jbrandeburg@...udflare.com, Toke Høiland-Jørgensen <thoiland@...hat.com>,
lbiancon@...hat.com, Alexei Starovoitov <ast@...nel.org>, Jakub Kicinski <kuba@...nel.org>,
Eric Dumazet <edumazet@...gle.com>
Subject: Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
On Fri, Apr 25, 2025 at 12:27 PM Arthur Fabre <arthur@...hurfabre.com> wrote:
>
> On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
> > On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@...hurfabre.com> wrote:
> > >
> > > +/**
> > > + * trait_set() - Set a trait key.
> > > + * @traits: Start of trait store area.
> > > + * @hard_end: Hard limit the trait store can currently grow up against.
> > > + * @key: The key to set.
> > > + * @val: The value to set.
> > > + * @len: The length of the value.
> > > + * @flags: Unused for now. Should be 0.
> > > + *
> > > + * Return:
> > > + * * %0 - Success.
> > > + * * %-EINVAL - Key or length invalid.
> > > + * * %-EBUSY - Key previously set with different length.
> > > + * * %-ENOSPC - Not enough room left to store value.
> > > + */
> > > +static __always_inline
> > > +int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u64 flags)
> > > +{
> > > + if (!__trait_valid_key(key) || !__trait_valid_len(len))
> > > + return -EINVAL;
> > > +
> > > + struct __trait_hdr *h = (struct __trait_hdr *)traits;
> > > +
> > > + /* Offset of value of this key. */
> > > + int off = __trait_offset(*h, key);
> > > +
> > > + if ((h->high & (1ull << key)) || (h->low & (1ull << key))) {
> > > + /* Key is already set, but with a different length */
> > > + if (__trait_total_length(__trait_and(*h, (1ull << key))) != len)
> > > + return -EBUSY;
> > > + } else {
> > > + /* Figure out if we have enough room left: total length of everything now. */
> > > + if (traits + sizeof(struct __trait_hdr) + __trait_total_length(*h) + len > hard_end)
> > > + return -ENOSPC;
> >
> > I'm still not excited about having two metadata-s
> > in front of the packet.
> > Why cannot traits use the same metadata space ?
> >
> > For trait_set() you already pass hard_end and have to check it
> > at run-time.
> > If you add the same hard_end to trait_get/del the kfuncs will deal
> > with possible corruption of metadata by the program.
> > Transition from xdp to skb will be automatic. The code won't care that traits
> > are there. It will just copy all metadata from xdp to skb. Corrupted or not.
> > bpf progs in xdp and skb might even use the same kfuncs
> > (or two different sets if the verifier is not smart enough right now).
>
> Good idea, that would solve the corruption problem.
>
> But I think storing metadata at the "end" of the headroom (ie where
> XDP metadata is today) makes it harder to persist in the SKB layer.
> Functions like __skb_push() assume that skb_headroom() bytes are
> available just before skb->data.
>
> They can be updated to move XDP metadata out of the way, but this
> assumption seems pretty pervasive.
The same argument can be flipped.
Why does the skb layer need to push?
If it needs to encapsulate, it will forward to a tunnel device
to go on the wire. At that point any kind of metadata is going
to be lost on the wire anyway. A bpf prog would need to convert
the metadata into an actual on-the-wire format, or stash it,
or send it to user space.
I don't see a use case where an skb layer would move metadata by N
bytes, populate these N bytes with "???", and pass to the next skb layer.
skb layers strip (pop) headers as the packet goes from ip to tcp to user space.
No need to move metadata.
> By using the "front" of the headroom, we can hide that from the rest of
> the SKB code. We could even update skb->head to completely hide the
> space used at the front of the headroom.
> It also avoids the cost of moving the metadata around (but maybe that
> would be insignificant).
That's a theory. Do you actually have skb layers pushing things
while metadata is there?
> XDP metadata also doesn't work well with GRO (see below).
>
> > Ideally we add hweight64 as new bpf instructions then maybe
> > we won't need any kernel changes at all.
> > bpf side will do:
> > bpf_xdp_adjust_meta(xdp, -max_offset_for_largest_key);
> > and then
> > trait_set(xdp->data_meta /* pointer to trait header */, xdp->data /*
> > hard end */, ...);
> > can be implemented as bpf prog.
> >
> > Same thing for skb progs.
> > netfilter/iptable can use another bpf prog to make decisions.
>
> There are (at least) two reasons for wanting the kernel to understand the
> format:
>
> * GRO: With an opaque binary blob, the kernel can either forbid GRO when
> the metadata differs (like XDP metadata today), or keep the entire blob
> of one of the packets.
> But maybe some users will want to keep a KV of the first packet, or
> the last packet, eg for receive timestamps.
> With a KV store we can have a sane default option for merging the
> different KV stores, and even add a per KV policy in the future if
> needed.
We can have this default for metadata too.
If all bytes in the metadata blob are the same -> let it GRO.
I don't think it's a good idea to extend GRO with bpf to make it "smart".
> * Hardware metadata: metadata from NICs (like the receive
> timestamp, 4-tuple hash...) is currently only exposed to XDP programs
> (via kfuncs).
> But that doesn't make it available to the rest of the stack.
> Storing them in traits would allow XDP, other BPF programs, and the
> kernel to access and modify them (for example, to take into account
> decapsulating a packet).
Sure. If traits == existing metadata, a bpf prog in xdp can communicate
with a bpf prog in the skb layer via that "trait" format.
xdp can take tuple hash and store it as key==0 in the trait.
The kernel doesn't need to know how to parse that format.
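Roughly (sketch only, not runnable as-is: bpf_xdp_metadata_rx_hash() is
the existing rx metadata kfunc, while trait_set() here stands in for
whatever bpf-facing kfunc or inlined helper the series ends up exposing):

```
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
				    enum xdp_rss_hash_type *rss_type) __ksym;

SEC("xdp")
int store_rx_hash(struct xdp_md *ctx)
{
	__u32 hash;
	enum xdp_rss_hash_type type;

	/* Pull the hw hash via the existing kfunc and stash it at key 0.
	 * A later tc/skb prog does the matching trait_get(skb, 0, ...). */
	if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &type))
		trait_set(ctx, 0 /* key */, &hash, sizeof(hash), 0);

	return XDP_PASS;
}
```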