[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7eb3e22a-c416-e898-dff0-1146d3cc82c0@linux.dev>
Date: Thu, 10 Nov 2022 11:04:28 -0800
From: Martin KaFai Lau <martin.lau@...ux.dev>
To: Toke Høiland-Jørgensen <toke@...hat.com>
Cc: Stanislav Fomichev <sdf@...gle.com>, ast@...nel.org,
daniel@...earbox.net, andrii@...nel.org, song@...nel.org,
yhs@...com, john.fastabend@...il.com, kpsingh@...nel.org,
haoluo@...gle.com, jolsa@...nel.org,
David Ahern <dsahern@...il.com>,
Jakub Kicinski <kuba@...nel.org>,
Willem de Bruijn <willemb@...gle.com>,
Jesper Dangaard Brouer <brouer@...hat.com>,
Anatoly Burakov <anatoly.burakov@...el.com>,
Alexander Lobakin <alexandr.lobakin@...el.com>,
Magnus Karlsson <magnus.karlsson@...il.com>,
Maryam Tahhan <mtahhan@...hat.com>, xdp-hints@...-project.net,
netdev@...r.kernel.org, bpf@...r.kernel.org
Subject: Re: [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp
metadata into skb context
On 11/10/22 6:19 AM, Toke Høiland-Jørgensen wrote:
> Martin KaFai Lau <martin.lau@...ux.dev> writes:
>
>> On 11/9/22 3:10 AM, Toke Høiland-Jørgensen wrote:
>>> Snipping a bit of context to reply to this bit:
>>>
>>>>>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
>>>>>>> sure it is solid enough by asking the xdp prog not to use the same random number
>>>>>>> in its own metadata + not to change the metadata through xdp->data_meta after
>>>>>>> calling bpf_xdp_metadata_export_to_skb().
>>>>>>
>>>>>> What do you think the usecase here might be? Or are you suggesting we
>>>>>> reject further access to data_meta after
>>>>>> bpf_xdp_metadata_export_to_skb somehow?
>>>>>>
>>>>>> If we want to let the programs override some of this
>>>>>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
>>>>>> more kfuncs instead of exposing the layout?
>>>>>>
>>>>>> bpf_xdp_metadata_export_to_skb(ctx);
>>>>>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
>>>
>>> There are several use cases for needing to access the metadata after
>>> calling bpf_xdp_metdata_export_to_skb():
>>>
>>> - Accessing the metadata after redirect (in a cpumap or devmap program,
>>> or on a veth device)
>>> - Transferring the packet+metadata to AF_XDP
>> fwiw, the xdp prog could also be more selective and only stores one of the hints
>> instead of the whole 'struct xdp_to_skb_metadata'.
>
> Yup, absolutely! In that sense, reusing the SKB format is mostly a
> convenience feature. However, lots of people consume AF_XDP through the
> default program installed by libxdp in the XSK setup code, and including
> custom metadata there is awkward. So having the metadata consumed by the
> stack as the "default subset" would enable easy consumption by
> non-advanced users, while advanced users can still do custom stuff by
> writing their own XDP program that calls the kfuncs.
>
>>> - Returning XDP_PASS, but accessing some of the metadata first (whether
>>> to read or change it)
>>>
>>> The last one could be solved by calling additional kfuncs, but that
>>> would be less efficient than just directly editing the struct which
>>> will be cache-hot after the helper returns.
>>
>> Yeah, it is more efficient to directly write if possible. I think this set
>> allows the direct reading and writing already through data_meta (as a _u8 *).
>
> Yup, totally fine with just keeping that capability :)
>
>>> And yeah, this will allow the XDP program to inject arbitrary metadata
>>> into the netstack; but it can already inject arbitrary *packet* data
>>> into the stack, so not sure if this is much of an additional risk? If it
>>> does lead to trivial crashes, we should probably harden the stack
>>> against that?
>>>
>>> As for the random number, Jesper and I discussed replacing this with the
>>> same BTF-ID scheme that he was using in his patch series. I.e., instead
>>> of just putting in a random number, we insert the BTF ID of the metadata
>>> struct at the end of it. This will allow us to support multiple
>>> different formats in the future (not just changing the layout, but
>>> having multiple simultaneous formats in the same kernel image), in case
>>> we run out of space.
>>
>> This seems a bit hypothetical. How much headroom does it usually have for the
>> xdp prog? Potentially the hints can use all the remaining space left after the
>> header encap and the current bpf_xdp_adjust_meta() usage?
>
> For the metadata consumed by the stack right now it's a bit
> hypothetical, yeah. However, there's a bunch of metadata commonly
> supported by hardware that the stack currently doesn't consume and that
> hopefully this feature will end up making more accessible. My hope is
> that the stack can also learn how to use this in the future, in which
> case we may run out of space. So I think of that bit mostly as
> future-proofing...
ic. in this case, Can the btf_id be added to 'struct xdp_to_skb_metadata' later
if it is indeed needed? The 'struct xdp_to_skb_metadata' is not in UAPI and
doing it with CO-RE is to give us flexibility to make this kind of changes in
the future.
>
>>> We should probably also have a flag set on the xdp_frame so the stack
>>> knows that the metadata area contains relevant-to-skb data, to guard
>>> against an XDP program accidentally hitting the "magic number" (BTF_ID)
>>> in unrelated stuff it puts into the metadata area.
>>
>> Yeah, I think having a flag is useful. The flag will be set at xdp_buff and
>> then transfer to the xdp_frame?
>
> Yeah, exactly!
>
>>>> After re-reading patch 6, have another question. The 'void
>>>> bpf_xdp_metadata_export_to_skb();' function signature. Should it at
>>>> least return ok/err? or even return a 'struct xdp_to_skb_metadata *'
>>>> pointer and the xdp prog can directly read (or even write) it?
>>>
>>> Hmm, I'm not sure returning a failure makes sense? Failure to read one
>>> or more fields just means that those fields will not be populated? We
>>> should probably have a flags field inside the metadata struct itself to
>>> indicate which fields are set or not, but I'm not sure returning an
>>> error value adds anything? Returning a pointer to the metadata field
>>> might be convenient for users (it would just be an alias to the
>>> data_meta pointer, but the verifier could know its size, so the program
>>> doesn't have to bounds check it).
>>
>> If some hints are not available, those hints should be initialized to
>> 0/CHECKSUM_NONE/...etc.
>
> The problem with that is that then we have to spend cycles writing
> eight bytes of zeroes into the checksum field :)
With a common 'struct xdp_to_skb_metadata', I am not sure how some of these zero
writes can be avoided. If the xdp prog wants to optimize, it can call
individual kfunc to get individual hints.
>
>> The xdp prog needs a direct way to tell hard failure when it cannot
>> write the meta area because of not enough space. Comparing
>> xdp->data_meta with xdp->data as a side effect is not intuitive.
>
> Yeah, hence a flags field so we can just see if setting each field
> succeeded?
How testing a flag is different from checking 0/invalid-value of a field? or
some fields just don't have an invalid value to check for like vlan_tci?
You meant a flags field as a return value or in the 'struct xdp_to_skb_metadata' ?
>
>> It is more than saving the bound check. With type info of 'struct
>> xdp_to_skb_metadata *', the verifier can do more checks like reading in the
>> middle of an integer member. The verifier could also limit write access only to
>> a few struct's members if it is needed.
>>
>> The returning 'struct xdp_to_skb_metadata *' should not be an alias to the
>> xdp->data_meta. They should actually point to different locations in the
>> headroom. bpf_xdp_metadata_export_to_skb() sets a flag in xdp_buff.
>> xdp->data_meta won't be changed and keeps pointing to the last
>> bpf_xdp_adjust_meta() location. The kernel will know if there is
>> xdp_to_skb_metadata before the xdp->data_meta when that bit is set in the
>> xdp_{buff,frame}. Would it work?
>
> Hmm, logically splitting the program metadata and the xdp_hints metadata
> (but having them share the same area) *could* work, I guess, I'm just
> not sure it's worth the extra complexity?
It shouldn't stop the existing xdp prog writing its own metadata from using the
the new bpf_xdp_metadata_export_to_skb().
>
>>>> A related question, why 'struct xdp_to_skb_metadata' needs
>>>> __randomize_layout?
>>>
>>> The __randomize_layout thing is there to force BPF programs to use CO-RE
>>> to access the field. This is to avoid the struct layout accidentally
>>> ossifying because people in practice rely on a particular layout, even
>>> though we tell them to use CO-RE. There are lots of examples of this
>>> happening in other domains (IP header options, TCP options, etc), and
>>> __randomize_layout seemed like a neat trick to enforce CO-RE usage :)
>>
>> I am not sure if it is necessary or helpful to only enforce __randomize_layout
>> in 'struct xdp_to_skb_metadata'. There are other CO-RE use cases (tracing and
>> non tracing) that already have direct access (reading and/or writing) to other
>> kernel structures.
>>
>> It is more important for the verifier to see the xdp prog accessing it as a
>> 'struct xdp_to_skb_metadata *' instead of xdp->data_meta which is a __u8 * so
>> that the verifier can enforce the rules of access.
>
> That only works inside the kernel, though. Since the metadata field can
> be copied wholesale to AF_XDP, having it randomized forces userspace
> consumers to also write code to deal with the layout being dynamic...
hm... I still don't see how useful it is, in particular you mentioned the libxdp
will install a xdp prog to write this default format (xdp_to_skb_metadata) and
likely libxdp will also provide some helpers to parse the xdp_to_skb_metadata
and the libxdp user should not need to know if CO-RE is used or not.
Considering it is a kernel internal struct, I think it is fine to keep it and
can be revisited later if needed. Lets get on to other things first :)
Powered by blists - more mailing lists