Message-ID: <87wmiysi37.fsf@toke.dk>
Date: Thu, 26 Sep 2024 14:41:16 +0200
From: Toke Høiland-Jørgensen <toke@...hat.com>
To: Arthur Fabre <afabre@...udflare.com>
Cc: Lorenzo Bianconi <lorenzo.bianconi@...hat.com>, Jesper Dangaard Brouer
<hawk@...nel.org>, Jakub Sitnicki <jakub@...udflare.com>, Alexander
Lobakin <aleksander.lobakin@...el.com>, Lorenzo Bianconi
<lorenzo@...nel.org>, bpf@...r.kernel.org, netdev@...r.kernel.org,
ast@...nel.org, daniel@...earbox.net, davem@...emloft.net,
kuba@...nel.org, john.fastabend@...il.com, edumazet@...gle.com,
pabeni@...hat.com, sdf@...ichev.me, tariqt@...dia.com, saeedm@...dia.com,
anthony.l.nguyen@...el.com, przemyslaw.kitszel@...el.com,
intel-wired-lan@...ts.osuosl.org, mst@...hat.com, jasowang@...hat.com,
mcoquelin.stm32@...il.com, alexandre.torgue@...s.st.com, kernel-team
<kernel-team@...udflare.com>, Yan Zhai <yan@...udflare.com>
Subject: Re: [RFC bpf-next 0/4] Add XDP rx hw hints support performing
XDP_REDIRECT
Arthur Fabre <afabre@...udflare.com> writes:
> On Sun, Sep 22, 2024 at 1:12 PM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>> FYI, we also had a discussion related to this at LPC on Friday, in this
>> session: https://lpc.events/event/18/contributions/1935/
>>
>> The context here was that Arthur and Jakub want to also support extended
>> rich metadata all the way through the SKB path, and are looking at the
>> same area used for XDP metadata to store it. So there's a need to manage
>> both the kernel's own usage of that area, and userspace/BPF usage of it.
>>
>> I'll try to summarise some of the points of that discussion (all
>> interpretations are my own, of course):
>>
>> - We want something that can be carried with a frame all the way from
>> the XDP layer, through all SKB layers and to userspace (to replace the
>> use of skb->mark for this purpose).
>>
>> - We want different applications running on the system (of which the
>> kernel itself is one, cf this discussion) to be able to share this
>> field, without having to have an out of band registry (like a Github
>> repository where applications can agree on which bits to use). Which
>> probably means that the kernel needs to be in the loop somehow to
>> explicitly allocate space in the metadata area and track offsets.
>>
>> - Having an explicit API to access this from userspace, without having
>> to go through BPF (i.e., a socket- or CMSG-based API) would be useful.
>>
>
> Thanks for looping us in, and the great summary Toke!
You're welcome :)
>> The TLV format was one of the suggestions in Arthur and Jakub's talk,
>> but AFAICT, there was not a lot of enthusiasm about this in the room
>> (myself included), because of the parsing overhead and complexity. I
>> believe the alternative that was seen as most favourable was a map
>> lookup-style API, where applications can request a metadata area of
>> arbitrary size and get an ID assigned that they can then use to set/get
>> values in the data path.
>>
>> So, sketching this out, this could be realised by something like:
>>
>> /* could be called from BPF, or through netlink or sysfs; may fail, if
>> * there is no more space
>> */
>> int metadata_id = register_packet_metadata_field(sizeof(struct my_meta));
>>
>> The ID is just an opaque identifier that can then be passed to
>> getter/setter functions (for both SKB and XDP), like:
>>
>> ret = bpf_set_packet_metadata_field(pkt, metadata_id,
>> &my_meta_value, sizeof(my_meta_value))
>>
>> ret = bpf_get_packet_metadata_field(pkt, metadata_id,
>> &my_meta_value, sizeof(my_meta_value))
>>
>>
>> On the kernel side, the implementation would track registered fields in
>> a global structure somewhere, say:
>>
>> struct pkt_metadata_entry {
>> int id;
>> u8 sz;
>> u8 offset;
>> u8 bit;
>> };
>>
>> struct pkt_metadata_registry { /* allocated as a system-wide global */
>> u8 num_entries;
>> u8 total_size;
>> struct pkt_metadata_entry entries[MAX_ENTRIES];
>> };
>>
>> struct xdp_rx_meta { /* at the end of xdp_frame */
>> u8 sz; /* set to pkt_metadata_registry->total_size on alloc */
>> u8 fields_set; /* bitmap of fields that have been set, see below */
>> u8 data[];
>> };
>>
>> int register_packet_metadata_field(u8 size) {
>> struct pkt_metadata_registry *reg = get_global_registry();
>> struct pkt_metadata_entry *entry;
>>
>> if (size + reg->total_size > MAX_METADATA_SIZE)
>> return -ENOSPC;
>>
>> entry = &reg->entries[reg->num_entries++];
>> entry->id = assign_id();
>> entry->sz = size;
>> entry->offset = reg->total_size;
>> entry->bit = reg->num_entries - 1;
>> reg->total_size += size;
>>
>> return entry->id;
>> }
>>
>> int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id, void
>> *value, size_t sz)
>> {
>> struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
>>
>> if (!entry)
>> return -ENOENT;
>>
>> if (entry->sz != sz)
>> return -EINVAL; /* user error */
>>
>> if (frm->rx_meta.sz < entry->offset + sz)
>> return -EFAULT; /* entry allocated after xdp_frame was initialised */
>>
>> memcpy(frm->rx_meta.data + entry->offset, value, sz);
>> frm->rx_meta.fields_set |= BIT(entry->bit);
>>
>> return 0;
>> }
>>
>> int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id, void
>> *value, size_t sz)
>> {
>> struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
>>
>> if (!entry)
>> return -ENOENT;
>>
>> if (entry->sz != sz)
>> return -EINVAL;
>>
>> if (frm->rx_meta.sz < entry->offset + sz)
>> return -EFAULT; /* entry allocated after xdp_frame was initialised */
>>
>> if (!(frm->rx_meta.fields_set & BIT(entry->bit)))
>> return -ENOENT;
>>
>> memcpy(value, frm->rx_meta.data + entry->offset, sz);
>>
>> return 0;
>> }
>>
>> I'm hinting at some complications here (with the EFAULT return) that
>> needs to be resolved: there is no guarantee that a given packet will be
>> in sync with the current status of the registered metadata, so we need
>> explicit checks for this. If metadata entries are de-registered again
>> this also means dealing with holes and/or reshuffling the metadata
>> layout to reuse the released space (incidentally, this is the one place
>> where a TLV format would have advantages).
>>
>> The nice thing about an API like this, though, is that it's extensible,
>> and the kernel itself can be just another consumer of it for the
>> metadata fields Lorenzo is adding in this series. I.e., we could just
>> pre-define some IDs for metadata vlan, timestamp etc, and use the same
>> functions as above from within the kernel to set and get those values;
>> using the registry, there could even be an option to turn those off if
>> an application wants more space for its own usage. Or, alternatively, we
>> could keep the kernel-internal IDs hardcoded and always allocated, and
>> just use the getter/setter functions as the BPF API for accessing them.
>
> That's exactly what I'm thinking of too, a simple API like:
>
> get(u8 key, u8 len, void *val);
> set(u8 key, u8 len, void *val);
>
> With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
>
> If a NIC doesn't support a certain well-known metadata, the key
> wouldn't be set, and get() would return ENOENT.
>
> I think this also lets us avoid having to "register" keys or bits of
> metadata with the kernel.
> We'd reserve some number of keys for hardware metadata.
Right, but how do you allocate space/offset for each key without an
explicit allocation step? You'd basically have to encode the list of IDs
in the metadata area itself, which implies a TLV format that you have to
walk on every access? The registry idea in my example above was
basically to avoid that...
> The remaining keys would be up to users. They'd have to allocate keys
> to services, and configure services to use those keys.
> This is similar to the way listening on a certain port works: only one
> service can use port 80 or 443, and that can typically be configured in
> a service's config file.
Right, well, port numbers *do* actually have an out of band service
registry (IANA), which I thought was what we wanted to avoid? ;)
> This side-steps the whole question of how to change the registered
> metadata for in-flight packets, and how to deal with different NICs
> with different hardware metadata.
>
> I think I've figured out a suitable encoding format, hopefully we'll
> have an RFC soon!
Alright, cool!
-Toke