[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZwAXgm-wV8-WQAqU@mini-arch>
Date: Fri, 4 Oct 2024 09:27:46 -0700
From: Stanislav Fomichev <stfomichev@...il.com>
To: Daniel Xu <dxu@...uu.xyz>
Cc: Arthur Fabre <afabre@...udflare.com>,
Toke Høiland-Jørgensen <toke@...hat.com>,
Lorenzo Bianconi <lorenzo@...nel.org>,
Lorenzo Bianconi <lorenzo.bianconi@...hat.com>,
Jesper Dangaard Brouer <hawk@...nel.org>,
Jakub Sitnicki <jakub@...udflare.com>,
Alexander Lobakin <aleksander.lobakin@...el.com>,
bpf@...r.kernel.org, netdev@...r.kernel.org, ast@...nel.org,
daniel@...earbox.net, davem@...emloft.net, kuba@...nel.org,
john.fastabend@...il.com, edumazet@...gle.com, pabeni@...hat.com,
sdf@...ichev.me, tariqt@...dia.com, saeedm@...dia.com,
anthony.l.nguyen@...el.com, przemyslaw.kitszel@...el.com,
intel-wired-lan@...ts.osuosl.org, mst@...hat.com,
jasowang@...hat.com, mcoquelin.stm32@...il.com,
alexandre.torgue@...s.st.com,
kernel-team <kernel-team@...udflare.com>,
Yan Zhai <yan@...udflare.com>
Subject: Re: [RFC bpf-next 0/4] Add XDP rx hw hints support performing
XDP_REDIRECT
On 10/03, Daniel Xu wrote:
> Hi Stan,
>
> On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote:
> > On 10/03, Arthur Fabre wrote:
> > > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote:
> > > > On 10/02, Toke Høiland-Jørgensen wrote:
> > > > > Stanislav Fomichev <stfomichev@...il.com> writes:
> > > > >
> > > > > > On 10/01, Toke Høiland-Jørgensen wrote:
> > > > > >> Lorenzo Bianconi <lorenzo@...nel.org> writes:
> > > > > >>
> > > > > >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > > > > >> >> > > Lorenzo Bianconi <lorenzo@...nel.org> writes:
> > > > > >> >> > >
> > > > > >> >> > > >> > We could combine such a registration API with your header format, so
> > > > > >> >> > > >> > that the registration just becomes a way of allocating one of the keys
> > > > > >> >> > > >> > from 0-63 (and the registry just becomes a global copy of the header).
> > > > > >> >> > > >> > This would basically amount to moving the "service config file" into the
> > > > > >> >> > > >> > kernel, since that seems to be the only common denominator we can rely
> > > > > >> >> > > >> > on between BPF applications (as all attempts to write a common daemon
> > > > > >> >> > > >> > for BPF management have shown).
> > > > > >> >> > > >>
> > > > > >> >> > > >> That sounds reasonable. And I guess we'd have set() check the global
> > > > > >> >> > > >> registry to enforce that the key has been registered beforehand?
> > > > > >> >> > > >>
> > > > > >> >> > > >> >
> > > > > >> >> > > >> > -Toke
> > > > > >> >> > > >>
> > > > > >> >> > > >> Thanks for all the feedback!
> > > > > >> >> > > >
> > > > > >> >> > > > I like this 'fast' KV approach but I guess we should really evaluate its
> > > > > >> >> > > > impact on performances (especially for xdp) since, based on the kfunc calls
> > > > > >> >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > > > > >> >> > > > each packet, right?
> > > > > >> >> > >
> > > > > >> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> > > > > >> >> > > a global registry for offsets would sidestep this, but have the
> > > > > >> >> > > synchronisation issues we discussed up-thread. So on balance, I think
> > > > > >> >> > > the memmove() suggestion will probably lead to the least pain.
> > > > > >> >> > >
> > > > > >> >> > > For the HW metadata we could sidestep this by always having a fixed
> > > > > >> >> > > struct for it (but using the same set/get() API with reserved keys). The
> > > > > >> >> > > only drawback of doing that is that we statically reserve a bit of
> > > > > >> >> > > space, but I'm not sure that is such a big issue in practice (at least
> > > > > >> >> > > not until this becomes to popular that the space starts to be contended;
> > > > > >> >> > > but surely 256 bytes ought to be enough for everybody, right? :)).
> > > > > >> >> >
> > > > > >> >> > I am fine with the proposed approach, but I think we need to verify what is the
> > > > > >> >> > impact on performances (in the worst case??)
> > > > > >> >>
> > > > > >> >> If drivers are responsible for populating the hardware metadata before
> > > > > >> >> XDP, we could make sure drivers set the fields in order to avoid any
> > > > > >> >> memove() (and maybe even provide a helper to ensure this?).
> > > > > >> >
> > > > > >> > nope, since the current APIs introduced by Stanislav are consuming NIC
> > > > > >> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding,
> > > > > >> > we want to add a kfunc to store the info for each NIC metadata (e.g rx-hash,
> > > > > >> > timestamping, ..) into the packet (this is what Toke is proposing, right?).
> > > > > >> > In this case kfunc calling order makes a difference.
> > > > > >> > We can think even to add single kfunc to store all the info for all the NIC
> > > > > >> > metadata (maybe via a helping struct) but it seems not scalable to me and we
> > > > > >> > are losing kfunc versatility.
> > > > > >>
> > > > > >> Yes, I agree we should have separate kfuncs for each metadata field.
> > > > > >> Which means it makes a lot of sense to just use the same setter API that
> > > > > >> we use for the user-registered metadata fields, but using reserved keys.
> > > > > >> So something like:
> > > > > >>
> > > > > >> #define BPF_METADATA_HW_HASH BIT(60)
> > > > > >> #define BPF_METADATA_HW_TIMESTAMP BIT(61)
> > > > > >> #define BPF_METADATA_HW_VLAN BIT(62)
> > > > > >> #define BPF_METADATA_RESERVED (0xffff << 48)
> > > > > >>
> > > > > >> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);
> > > > > >>
> > > > > >>
> > > > > >> As for the internal representation, we can just have the kfunc do
> > > > > >> something like:
> > > > > >>
> > > > > >> int bpf_packet_metadata_set(field_id, value) {
> > > > > >> switch(field_id) {
> > > > > >> case BPF_METADATA_HW_HASH:
> > > > > >> pkt->xdp_hw_meta.hash = value;
> > > > > >> break;
> > > > > >> [...]
> > > > > >> default:
> > > > > >> /* do the key packing thing */
> > > > > >> }
> > > > > >> }
> > > > > >>
> > > > > >>
> > > > > >> that way the order of setting the HW fields doesn't matter, only the
> > > > > >> user-defined metadata.
> > > > > >
> > > > > > Can you expand on why we need the flexibility of picking the metadata fields
> > > > > > here? Presumably we are talking about the use-cases where the XDP program
> > > > > > is doing redirect/pass and it doesn't really know who's the final
> > > > > > consumer is (might be another xdp program or might be the xdp->skb
> > > > > > kernel case), so the only sensible option here seems to be store everything?
> > > > >
> > > > > For the same reason that we have separate kfuncs for each metadata field
> > > > > when getting it from the driver: XDP programs should have the
> > > > > flexibility to decide which pieces of metadata they need, and skip the
> > > > > overhead of stuff that is not needed.
> > > > >
> > > > > For instance, say an XDP program knows that nothing in the system uses
> > > > > timestamps; in that case, it can skip both the getter and the setter
> > > > > call for timestamps.
> >
> > Original RFC is talking about XDP -> XDP_REDIRECT -> skb use-case,
> > right? For this we pretty much know what kind of metadata we want to
> > preserve, so why not ship it in the existing metadata area and have
> > a kfunc that the xdp program will call prior to doing xdp_redirect?
> > This kfunc can do exactly what you're suggesting - skip the timestamp
> > if we know that the timestamping is off.
> >
> > Or have we moved to discussing some other use-cases? What am I missing
> > about the need for some other new mechanism?
> >
> > > > But doesn't it put us in the same place? Where the first (native) xdp program
> > > > needs to know which metadata the final consumer wants. At this point
> > > > why not propagate metadata layout as well?
> > > >
> > > > (or maybe I'm still missing what exact use-case we are trying to solve)
> > >
> > > There are two different use-cases for the metadata:
> > >
> > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a
> > > few well known fields, and only XDP can access them to set them as
> > > metadata, so storing them in a struct somewhere could make sense.
> > >
> > > * Arbitrary metadata used by services. Eg a TC filter could set a field
> > > describing which service a packet is for, and that could be reused for
> > > iptables, routing, socket dispatch...
> > > Similarly we could set a "packet_id" field that uniquely identifies a
> > > packet so we can trace it throughout the network stack (through
> > > clones, encap, decap, userspace services...).
> > > The skb->mark, but with more room, and better support for sharing it.
> > >
> > > We can only know the layout ahead of time for the first one. And they're
> > > similar enough in their requirements (need to be stored somewhere in the
> > > SKB, have a way of retrieving each one individually, that it seems to
> > > make sense to use a common API).
> >
> > Why not have the following layout then?
> >
> > +---------------+-------------------+----------------------------------------+------+
> > | more headroom | user-defined meta | hw-meta (potentially fixed skb format) | data |
> > +---------------+-------------------+----------------------------------------+------+
> > ^ ^
> > data_meta data
> >
> > You obviously still have a problem of communicating the layout if you
> > have some redirects in between, but you, in theory still have this
> > problem with user-defined metadata anyway (unless I'm missing
> > something).
> >
> > > > > I suppose we *could* support just a single call to set the skb meta,
> > > > > like:
> > > > >
> > > > > bpf_set_skb_meta(struct xdp_md *pkt, struct xdp_hw_meta *data);
> > > > >
> > > > > ...but in that case, we'd need to support some fields being unset
> > > > > anyway, and the program would have to populate the struct on the stack
> > > > > before performing the call. So it seems simpler to just have symmetry
> > > > > between the get (from HW) and set side? :)
> > > >
> > > > Why not simply bpf_set_skb_meta(struct xdp_md *pkt) and let it store
> > > > the metadata somewhere in xdp_md directly? (also presumably by
> > > > reusing most of the existing kfuncs/xmo_xxx helpers)
> > >
> > > If we store it in xdp_md, the metadata won't be available higher up the
> > > stack (or am I missing something?). I think one of the goals is to let
> > > things other than XDP access it (maybe even the network stack itself?).
> >
> > IIRC, xdp metadata gets copied to skb metadata, so it does propagate.
> > Although, it might have a detrimental effect on the gro, but I'm
> > assuming that is something that can be fixed separately.
>
> I was thinking about this today so I'm glad you brought it up.
>
> IIUC putting unique data in the metadata area will prevent GRO from
> working. I wonder if as a part of this work there's a cleaner way to
> indicate to XDP or GRO engine that some metadata should be ignored for
> coalescing purposes. Otherwise the final XDP prog before GRO would need
> to memset() all the relevant bytes to 0 (assuming that even works).
I'm assuming it is that way because there has to be a conscious decision
(TBD somewhere) about how to merge the metadata. IOW, there needs to be
some definition of 'ignored for coalescing purposes'. Do we ignore
the old metadata (the one that's already in the gro queue) or the new
metadata (in the skb)? What if there is a timestamp in the metadata?
In this case, depending on what you ignore, you get a different
timestamp.
Powered by blists - more mailing lists