netdev - Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <D9FYTORERFI7.36F4WG8G3NHGX@arthurfabre.com>
Date: Fri, 25 Apr 2025 21:26:59 +0200
From: "Arthur Fabre" <arthur@...hurfabre.com>
To: "Alexei Starovoitov" <alexei.starovoitov@...il.com>
Cc: "Network Development" <netdev@...r.kernel.org>, "bpf"
 <bpf@...r.kernel.org>, "Jakub Sitnicki" <jakub@...udflare.com>, "Jesper
 Dangaard Brouer" <hawk@...nel.org>, "Yan Zhai" <yan@...udflare.com>,
 <jbrandeburg@...udflare.com>, Toke Høiland-Jørgensen
 <thoiland@...hat.com>, <lbiancon@...hat.com>, "Alexei Starovoitov"
 <ast@...nel.org>, "Jakub Kicinski" <kuba@...nel.org>, "Eric Dumazet"
 <edumazet@...gle.com>
Subject: Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for
 packet metadata

On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
> On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@...hurfabre.com> wrote:
> >
> > +/**
> > + * trait_set() - Set a trait key.
> > + * @traits: Start of trait store area.
> > + * @hard_end: Hard limit the trait store can currently grow up against.
> > + * @key: The key to set.
> > + * @val: The value to set.
> > + * @len: The length of the value.
> > + * @flags: Unused for now. Should be 0.
> > + *
> > + * Return:
> > + * * %0       - Success.
> > + * * %-EINVAL - Key or length invalid.
> > + * * %-EBUSY  - Key previously set with different length.
> > + * * %-ENOSPC - Not enough room left to store value.
> > + */
> > +static __always_inline
> > +int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u64 flags)
> > +{
> > +       if (!__trait_valid_key(key) || !__trait_valid_len(len))
> > +               return -EINVAL;
> > +
> > +       struct __trait_hdr *h = (struct __trait_hdr *)traits;
> > +
> > +       /* Offset of value of this key. */
> > +       int off = __trait_offset(*h, key);
> > +
> > +       if ((h->high & (1ull << key)) || (h->low & (1ull << key))) {
> > +               /* Key is already set, but with a different length */
> > +               if (__trait_total_length(__trait_and(*h, (1ull << key))) != len)
> > +                       return -EBUSY;
> > +       } else {
> > +               /* Figure out if we have enough room left: total length of everything now. */
> > +               if (traits + sizeof(struct __trait_hdr) + __trait_total_length(*h) + len > hard_end)
> > +                       return -ENOSPC;
>
> I'm still not excited about having two metadata-s
> in front of the packet.
> Why cannot traits use the same metadata space ?
>
> For trait_set() you already pass hard_end and have to check it
> at run-time.
> If you add the same hard_end to trait_get/del the kfuncs will deal
> with possible corruption of metadata by the program.
> Transition from xdp to skb will be automatic. The code won't care that traits
> are there. It will just copy all metadata from xdp to skb. Corrupted or not.
> bpf progs in xdp and skb might even use the same kfuncs
> (or two different sets if the verifier is not smart enough right now).

Good idea, that would solve the corruption problem.

But I think storing metadata at the "end" of the headroom (ie where 
XDP metadata is today) makes it harder to persist in the SKB layer.
Functions like __skb_push() assume that skb_headroom() bytes are
available just before skb->data.

They can be updated to move XDP metadata out of the way, but this
assumption seems pretty pervasive.

By using the "front" of the headroom, we can hide that from the rest of
the SKB code. We could even update skb->head to completely hide the
space used at the front of the headroom.
It also avoids the cost of moving the metadata around (but maybe that
would be insignificant).

XDP metadata also doesn't work well with GRO (see below).

> Ideally we add hweight64 as new bpf instructions then maybe
> we won't need any kernel changes at all.
> bpf side will do:
> bpf_xdp_adjust_meta(xdp, -max_offset_for_largest_key);
> and then
> trait_set(xdp->data_meta /* pointer to trait header */, xdp->data /*
> hard end */, ...);
> can be implemented as bpf prog.
>
> Same thing for skb progs.
> netfilter/iptable can use another bpf prog to make decisions.

There are (at least) two reasons for wanting the kernel to understand the
format:

* GRO: With an opaque binary blob, the kernel can either forbid GRO when
  the metadata differs (like XDP metadata today), or keep the entire blob
  of one of the packets.
  But maybe some users will want to keep a KV of the first packet, or
  the last packet, eg for receive timestamps.
  With a KV store we can have a sane default option for merging the
  different KV stores, and even add a per KV policy in the future if
  needed.

* Hardware metadata: metadata exposed from NICs (like the receive
  timestamp, 4 tuple hash...) is currently only exposed to XDP programs
  (via kfuncs).
  But that doesn't expose them to the rest of the stack.
  Storing them in traits would allow XDP, other BPF programs, and the
  kernel to access and modify them (for example to into account
  decapsulating a packet).