[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKH8qBsfzYmQ9SZXhFetf_zQPNmE_L=_H_rRxJEwZzNbqtoKJA@mail.gmail.com>
Date: Thu, 10 Nov 2022 10:52:21 -0800
From: Stanislav Fomichev <sdf@...gle.com>
To: Toke Høiland-Jørgensen <toke@...hat.com>
Cc: Martin KaFai Lau <martin.lau@...ux.dev>, ast@...nel.org,
daniel@...earbox.net, andrii@...nel.org, song@...nel.org,
yhs@...com, john.fastabend@...il.com, kpsingh@...nel.org,
haoluo@...gle.com, jolsa@...nel.org,
David Ahern <dsahern@...il.com>,
Jakub Kicinski <kuba@...nel.org>,
Willem de Bruijn <willemb@...gle.com>,
Jesper Dangaard Brouer <brouer@...hat.com>,
Anatoly Burakov <anatoly.burakov@...el.com>,
Alexander Lobakin <alexandr.lobakin@...el.com>,
Magnus Karlsson <magnus.karlsson@...il.com>,
Maryam Tahhan <mtahhan@...hat.com>, xdp-hints@...-project.net,
netdev@...r.kernel.org, bpf@...r.kernel.org
Subject: Re: [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp
metadata into skb context
On Thu, Nov 10, 2022 at 6:27 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>
> Stanislav Fomichev <sdf@...gle.com> writes:
>
> > On Wed, Nov 9, 2022 at 4:13 PM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> >>
> >> On 11/9/22 1:33 PM, Stanislav Fomichev wrote:
> >> > On Wed, Nov 9, 2022 at 10:22 AM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> >> >>
> >> >> On 11/9/22 3:10 AM, Toke Høiland-Jørgensen wrote:
> >> >>> Snipping a bit of context to reply to this bit:
> >> >>>
> >> >>>>>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
> >> >>>>>>> sure it is solid enough by asking the xdp prog not to use the same random number
> >> >>>>>>> in its own metadata + not to change the metadata through xdp->data_meta after
> >> >>>>>>> calling bpf_xdp_metadata_export_to_skb().
> >> >>>>>>
> >> >>>>>> What do you think the usecase here might be? Or are you suggesting we
> >> >>>>>> reject further access to data_meta after
> >> >>>>>> bpf_xdp_metadata_export_to_skb somehow?
> >> >>>>>>
> >> >>>>>> If we want to let the programs override some of this
> >> >>>>>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
> >> >>>>>> more kfuncs instead of exposing the layout?
> >> >>>>>>
> >> >>>>>> bpf_xdp_metadata_export_to_skb(ctx);
> >> >>>>>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
> >> >>>
> >> >>> There are several use cases for needing to access the metadata after
> >> >>> calling bpf_xdp_metdata_export_to_skb():
> >> >>>
> >> >>> - Accessing the metadata after redirect (in a cpumap or devmap program,
> >> >>> or on a veth device)
> >> >>> - Transferring the packet+metadata to AF_XDP
> >> >> fwiw, the xdp prog could also be more selective and only stores one of the hints
> >> >> instead of the whole 'struct xdp_to_skb_metadata'.
> >> >>
> >> >>> - Returning XDP_PASS, but accessing some of the metadata first (whether
> >> >>> to read or change it)
> >> >>>
> >> >>> The last one could be solved by calling additional kfuncs, but that
> >> >>> would be less efficient than just directly editing the struct which
> >> >>> will be cache-hot after the helper returns.
> >> >>
> >> >> Yeah, it is more efficient to directly write if possible. I think this set
> >> >> allows the direct reading and writing already through data_meta (as a _u8 *).
> >> >>
> >> >>>
> >> >>> And yeah, this will allow the XDP program to inject arbitrary metadata
> >> >>> into the netstack; but it can already inject arbitrary *packet* data
> >> >>> into the stack, so not sure if this is much of an additional risk? If it
> >> >>> does lead to trivial crashes, we should probably harden the stack
> >> >>> against that?
> >> >>>
> >> >>> As for the random number, Jesper and I discussed replacing this with the
> >> >>> same BTF-ID scheme that he was using in his patch series. I.e., instead
> >> >>> of just putting in a random number, we insert the BTF ID of the metadata
> >> >>> struct at the end of it. This will allow us to support multiple
> >> >>> different formats in the future (not just changing the layout, but
> >> >>> having multiple simultaneous formats in the same kernel image), in case
> >> >>> we run out of space.
> >> >>
> >> >> This seems a bit hypothetical. How much headroom does it usually have for the
> >> >> xdp prog? Potentially the hints can use all the remaining space left after the
> >> >> header encap and the current bpf_xdp_adjust_meta() usage?
> >> >>
> >> >>>
> >> >>> We should probably also have a flag set on the xdp_frame so the stack
> >> >>> knows that the metadata area contains relevant-to-skb data, to guard
> >> >>> against an XDP program accidentally hitting the "magic number" (BTF_ID)
> >> >>> in unrelated stuff it puts into the metadata area.
> >> >>
> >> >> Yeah, I think having a flag is useful. The flag will be set at xdp_buff and
> >> >> then transfer to the xdp_frame?
> >> >>
> >> >>>
> >> >>>> After re-reading patch 6, have another question. The 'void
> >> >>>> bpf_xdp_metadata_export_to_skb();' function signature. Should it at
> >> >>>> least return ok/err? or even return a 'struct xdp_to_skb_metadata *'
> >> >>>> pointer and the xdp prog can directly read (or even write) it?
> >> >>>
> >> >>> Hmm, I'm not sure returning a failure makes sense? Failure to read one
> >> >>> or more fields just means that those fields will not be populated? We
> >> >>> should probably have a flags field inside the metadata struct itself to
> >> >>> indicate which fields are set or not, but I'm not sure returning an
> >> >>> error value adds anything? Returning a pointer to the metadata field
> >> >>> might be convenient for users (it would just be an alias to the
> >> >>> data_meta pointer, but the verifier could know its size, so the program
> >> >>> doesn't have to bounds check it).
> >> >>
> >> >> If some hints are not available, those hints should be initialized to
> >> >> 0/CHECKSUM_NONE/...etc. The xdp prog needs a direct way to tell hard failure
> >> >> when it cannot write the meta area because of not enough space. Comparing
> >> >> xdp->data_meta with xdp->data as a side effect is not intuitive.
> >> >>
> >> >> It is more than saving the bound check. With type info of 'struct
> >> >> xdp_to_skb_metadata *', the verifier can do more checks like reading in the
> >> >> middle of an integer member. The verifier could also limit write access only to
> >> >> a few struct's members if it is needed.
> >> >>
> >> >> The returning 'struct xdp_to_skb_metadata *' should not be an alias to the
> >> >> xdp->data_meta. They should actually point to different locations in the
> >> >> headroom. bpf_xdp_metadata_export_to_skb() sets a flag in xdp_buff.
> >> >> xdp->data_meta won't be changed and keeps pointing to the last
> >> >> bpf_xdp_adjust_meta() location. The kernel will know if there is
> >> >> xdp_to_skb_metadata before the xdp->data_meta when that bit is set in the
> >> >> xdp_{buff,frame}. Would it work?
> >> >>
> >> >>>
> >> >>>> A related question, why 'struct xdp_to_skb_metadata' needs
> >> >>>> __randomize_layout?
> >> >>>
> >> >>> The __randomize_layout thing is there to force BPF programs to use CO-RE
> >> >>> to access the field. This is to avoid the struct layout accidentally
> >> >>> ossifying because people in practice rely on a particular layout, even
> >> >>> though we tell them to use CO-RE. There are lots of examples of this
> >> >>> happening in other domains (IP header options, TCP options, etc), and
> >> >>> __randomize_layout seemed like a neat trick to enforce CO-RE usage :)
> >> >>
> >> >> I am not sure if it is necessary or helpful to only enforce __randomize_layout
> >> >> in 'struct xdp_to_skb_metadata'. There are other CO-RE use cases (tracing and
> >> >> non tracing) that already have direct access (reading and/or writing) to other
> >> >> kernel structures.
> >> >>
> >> >> It is more important for the verifier to see the xdp prog accessing it as a
> >> >> 'struct xdp_to_skb_metadata *' instead of xdp->data_meta which is a __u8 * so
> >> >> that the verifier can enforce the rules of access.
> >> >>
> >> >>>
> >> >>>>>>> Does xdp_to_skb_metadata have a use case for XDP_PASS (like patch 7) or the
> >> >>>>>>> xdp_to_skb_metadata can be limited to XDP_REDIRECT only?
> >> >>>>>>
> >> >>>>>> XDP_PASS cases where we convert xdp_buff into skb in the drivers right
> >> >>>>>> now usually have C code to manually pull out the metadata (out of hw
> >> >>>>>> desc) and put it into skb.
> >> >>>>>>
> >> >>>>>> So, currently, if we're calling bpf_xdp_metadata_export_to_skb() for
> >> >>>>>> XDP_PASS, we're doing a double amount of work:
> >> >>>>>> skb_metadata_import_from_xdp first, then custom driver code second.
> >> >>>>>>
> >> >>>>>> In theory, maybe we should completely skip drivers custom parsing when
> >> >>>>>> there is a prog with BPF_F_XDP_HAS_METADATA?
> >> >>>>>> Then both xdp->skb paths (XDP_PASS+XDP_REDIRECT) will be bpf-driven
> >> >>>>>> and won't require any mental work (plus, the drivers won't have to
> >> >>>>>> care either in the future).
> >> >>>>>> > WDYT?
> >> >>>>>
> >> >>>>>
> >> >>>>> Yeah, not sure if it can solely depend on BPF_F_XDP_HAS_METADATA but it makes
> >> >>>>> sense to only use the hints (if ever written) from xdp prog especially if it
> >> >>>>> will eventually support xdp prog changing some of the hints in the future. For
> >> >>>>> now, I think either way is fine since they are the same and the xdp prog is sort
> >> >>>>> of doing extra unnecessary work anyway by calling
> >> >>>>> bpf_xdp_metadata_export_to_skb() with XDP_PASS and knowing nothing can be
> >> >>>>> changed now.
> >> >>>
> >> >>> I agree it would be best if the drivers also use the XDP metadata (if
> >> >>> present) on XDP_PASS. Longer term my hope is we can make the XDP
> >> >>> metadata support the only thing drivers need to implement (i.e., have
> >> >>> the stack call into that code even when no XDP program is loaded), but
> >> >>> for now just for consistency (and allowing the XDP program to update the
> >> >>> metadata), we should probably at least consume it on XDP_PASS.
> >> >>>
> >> >>> -Toke
> >> >>>
> >> >
> >> > Not to derail the discussion (left the last message intact on top,
> >> > feel free to continue), but to summarize. The proposed changes seem to
> >> > be:
> >> >
> >> > 1. bpf_xdp_metadata_export_to_skb() should return pointer to "struct
> >> > xdp_to_skb_metadata"
> >> > - This should let bpf programs change the metadata passed to the skb
> >> >
> >> > 2. "struct xdp_to_skb_metadata" should have its btf_id as the first
> >> > __u32 member (and remove the magic)
> >> > - This is for the redirect case where the end users, including
> >> > AF_XDP, can parse this metadata from btf_id
> >>
> >> I think Toke's idea is to put the btf_id at the end of xdp_to_skb_metadata. I
> >> can see why the end is needed for the userspace AF_XDP because, afaict, AF_XDP
> >> rx_desc currently cannot tell if there is metadata written by the xdp prog or
> >> not. However, if the 'has_skb_metadata' bit can also be passed to the AF_XDP
> >> rx_desc->options, the btf_id may as well be not needed now. However, the btf_id
> >> and other future new members can be added to the xdp_to_skb_metadata later if
> >> there is a need.
> >>
> >> For the kernel and xdp prog, a bit in the xdp->flags should be enough to get to
> >> the xdp_to_skb_metadata. The xdp prog will use CO-RE to access the members in
> >> xdp_to_skb_metadata.
> >
> > Ack, good points on putting it at the end.
> > Regarding bit in desc->options vs btf_id: since it seems that btf_id
> > is useful anyway, let's start with that? We can add a bit later on if
> > it turns out using metadata is problematic otherwise.
>
> I think the bit is mostly useful so that the stack can know that the
> metadata has been set before consuming it (to guard against regular
> xdp_metadata usage accidentally hitting the "right" BTF ID). I don't
> think it needs to be exposed to the XDP programs themselves.
SG! A flag for xdp_buff/frame and a kfunc to query it for bpf.
> >> > - This, however, is not all the metadata that the device can
> >> > support, but a much narrower set that the kernel is expected to use
> >> > for skb construction
> >> >
> >> > 3. __randomize_layout isn't really helping, CO-RE will trigger
> >> > regardless; maybe only the case where it matters is probably AF_XDP,
> >> > so still useful?
>
> Yeah, see my response to Martin, I think the randomisation is useful for
> AF_XDP transfer.
SG. Let's keep it for now. Worst case, if it hurts, we can remove it later...
> >> > 4. The presence of the metadata generated by
> >> > bpf_xdp_metadata_export_to_skb should be indicated by a flag in
> >> > xdp_{buff,frame}->flags
> >> > - Assuming exposing it via xdp_md->has_skb_metadata is ok?
> >>
> >> probably __bpf_md_ptr(struct xdp_to_skb_metadata *, skb_metadata) and the type
> >> will be PTR_TO_BTF_ID_OR_NULL.
> >
> > Oh, that seems even better than returning it from
> > bpf_xdp_metadata_export_to_skb.
> > bpf_xdp_metadata_export_to_skb can return true/false and the rest goes
> > via default verifier ctx resolution mechanism..
> > (returning ptr from a kfunc seems to be a bit complicated right now)
>
> See my response to John in the other thread about mixing stable UAPI (in
> xdp_md) and unstable BTF structures in the xdp_md struct: I think this
> is confusing and would prefer a kfunc.
SG!
> >> > - Since the programs probably need to do the following:
> >> >
> >> > if (xdp_md->has_skb_metadata) {
> >> > access/change skb metadata by doing struct xdp_to_skb_metadata *p
> >> > = data_meta;
> >>
> >> and directly access/change xdp->skb_metadata instead of using xdp->data_meta.
> >
> > Ack.
> >
> >> > } else {
> >> > use kfuncs
> >> > }
> >> >
> >> > 5. Support the case where we keep program's metadata and kernel's
> >> > xdp_to_skb_metadata
> >> > - skb_metadata_import_from_xdp() will "consume" it by mem-moving the
> >> > rest of the metadata over it and adjusting the headroom
> >>
> >> I was thinking the kernel's xdp_to_skb_metadata is always before the program's
> >> metadata. xdp prog should usually work in this order also: read/write headers,
> >> write its own metadata, call bpf_xdp_metadata_export_to_skb(), and return
> >> XDP_PASS/XDP_REDIRECT. When it is XDP_PASS, the kernel just needs to pop the
> >> xdp_to_skb_metadata and pass the remaining program's metadata to the bpf-tc.
> >>
> >> For the kernel and xdp prog, I don't think it matters where the
> >> xdp_to_skb_metadata is. However, the xdp->data_meta (program's metadata) has to
> >> be before xdp->data because of the current data_meta and data comparison usage
> >> in the xdp prog.
> >>
> >> The order of the kernel's xdp_to_skb_metadata and the program's metadata
> >> probably only matters to the userspace AF_XDP. However, I don't see how AF_XDP
> >> supports the program's metadata now. afaict, it can only work now if there is
> >> some sort of contract between them or the AF_XDP currently does not use the
> >> program's metadata. Either way, we can do the mem-moving only for AF_XDP and it
> >> should be a no op if there is no program's metadata? This behavior could also
> >> be configurable through setsockopt?
> >
> > Agreed on all of the above. For now it seems like the safest thing to
> > do is to put xdp_to_skb_metadata last to allow af_xdp to properly
> > locate btf_id.
> > Let's see if Toke disagrees :-)
>
> As I replied to Martin, I'm not sure it's worth the complexity to
> logically split the SKB metadata from the program's own metadata (as
> opposed to just reusing the existing data_meta pointer)?
I'd gladly keep my current requirement where it's either or, but not both :-)
We can relax it later if required?
> However, if we do, the layout that makes most sense to me is putting the
> skb metadata before the program metadata, like:
>
> --------------
> | skb_metadata
> --------------
> | data_meta
> --------------
> | data
> --------------
>
> Not sure if that's what you meant? :)
I was suggesting the other way around: |custom meta|skb_metadata|data|
(but, as Martin points out, consuming skb_metadata in the kernel
becomes messier)
af_xdp can check whether skb_metdata is present by looking at data -
offsetof(struct skb_metadata, btf_id).
progs that know how to handle custom metadata, will look at data -
sizeof(skb_metadata)
Otherwise, if it's the other way around, how do we find skb_metadata
in a redirected frame?
Let's say we have |skb_metadata|custom meta|data|, how does the final
program find skb_metadata?
All the progs have to agree on the sizeof(tc/custom meta), right?
Powered by blists - more mailing lists