Message-ID: <e9f3ccc7-750b-4d60-ae03-ac493b766b56@linux.dev>
Date: Thu, 23 Oct 2025 19:32:44 -0700
From: Martin KaFai Lau <martin.lau@...ux.dev>
To: Jakub Sitnicki <jakub@...udflare.com>
Cc: "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
John Fastabend <john.fastabend@...il.com>,
Stanislav Fomichev <sdf@...ichev.me>, Alexei Starovoitov <ast@...nel.org>,
Andrii Nakryiko <andrii@...nel.org>, Eduard Zingerman <eddyz87@...il.com>,
Song Liu <song@...nel.org>, Yonghong Song <yonghong.song@...ux.dev>,
KP Singh <kpsingh@...nel.org>, Hao Luo <haoluo@...gle.com>,
Jiri Olsa <jolsa@...nel.org>, Arthur Fabre <arthur@...hurfabre.com>,
bpf@...r.kernel.org, netdev@...r.kernel.org, kernel-team@...udflare.com
Subject: Re: [PATCH bpf-next v2 11/15] selftests/bpf: Expect unclone to
preserve skb metadata
On 10/23/25 4:55 AM, Jakub Sitnicki wrote:
> On Wed, Oct 22, 2025 at 04:12 PM -07, Martin KaFai Lau wrote:
>> On 10/19/25 5:45 AM, Jakub Sitnicki wrote:
>>> @@ -447,12 +448,14 @@ int clone_dynptr_empty_on_meta_slice_write(struct __sk_buff *ctx)
>>> /*
>>> * Check that skb_meta dynptr is read-only before prog writes to packet payload
>>> - * using dynptr_write helper. Applies only to cloned skbs.
>>> + * using dynptr_write helper, and becomes read-write afterwards. Applies only to
>>> + * cloned skbs.
>>> */
>>> SEC("tc")
>>> -int clone_dynptr_rdonly_before_data_dynptr_write(struct __sk_buff *ctx)
>>> +int clone_dynptr_rdonly_before_data_dynptr_write_then_rw(struct __sk_buff *ctx)
>>> {
>>> struct bpf_dynptr data, meta;
>>> + __u8 meta_have[META_SIZE];
>>> const struct ethhdr *eth;
>>> bpf_dynptr_from_skb(ctx, 0, &data);
>>> @@ -465,15 +468,23 @@ int clone_dynptr_rdonly_before_data_dynptr_write(struct __sk_buff *ctx)
>>> /* Expect read-only metadata before unclone */
>>> bpf_dynptr_from_skb_meta(ctx, 0, &meta);
>>> - if (!bpf_dynptr_is_rdonly(&meta) || bpf_dynptr_size(&meta) != META_SIZE)
>>> + if (!bpf_dynptr_is_rdonly(&meta))
>>
>> Can the bpf_dynptr_set_rdonly() be lifted from the bpf_dynptr_from_skb_meta()?
>>
>> iiuc, the remaining thing left should be handling a cloned skb in
>> __bpf_dynptr_write()? The __bpf_skb_store_bytes() is using
>> bpf_try_make_writable, so maybe something similar can be done for the
>> BPF_DYNPTR_TYPE_SKB_META?
>
> I'm with you. This is not user-friendly at all currently.
>
> This patch set has already gotten quite long so how about I split out
> the pskb_expand_head patch (#1) and the related selftest change (patch
> #11) from this series, expand it to lift bpf_dynptr_set_rdonly()
> limitation for skb_meta dynptr, and do that first in a dedicated series?
A followup on lifting the bpf_dynptr_set_rdonly() is fine; keep this
set as is. I just wanted to check if there is anything stopping it.
That said, imo, having one or two extra patches in the set is fine.
The set is not difficult to follow.
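
Fwiw, to make the suggestion concrete, something along these lines in
__bpf_dynptr_write() is what I had in mind, mirroring how
__bpf_skb_store_bytes() uses bpf_try_make_writable(). Completely
untested sketch, helper names approximate:

	case BPF_DYNPTR_TYPE_SKB_META:
		/* Sketch only: unclone the skb the same way the payload
		 * write path does; the metadata then becomes writable
		 * without requiring a prior bpf_dynptr_write() to data.
		 */
		if (bpf_try_make_writable(ptr->data /* skb */, 0))
			return -EFAULT;
		memcpy(skb_metadata_end(ptr->data) -
		       skb_metadata_len(ptr->data) + ptr->offset + offset,
		       src, len);
		return 0;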
>
>>
>>> + goto out;
>>> +
>>> + bpf_dynptr_read(meta_have, META_SIZE, &meta, 0, 0);
>>> + if (!check_metadata(meta_have))
>>> goto out;
>>> /* Helper write to payload will unclone the packet */
>>> bpf_dynptr_write(&data, offsetof(struct ethhdr, h_proto), "x", 1, 0);
>>> - /* Expect no metadata after unclone */
>>> + /* Expect r/w metadata after unclone */
>>> bpf_dynptr_from_skb_meta(ctx, 0, &meta);
>>> - if (bpf_dynptr_is_rdonly(&meta) || bpf_dynptr_size(&meta) != 0)
>>> + if (bpf_dynptr_is_rdonly(&meta))
>>
>> then it does not have to rely on the bpf_dynptr_write(&data, ...) above to make
>> the metadata writable.
>>
>> I have a high level question about the set. I assume the skb_data_move() in
>> patch 2 will be useful in the future to preserve the metadata across the
>> stack. Preserving the metadata across different tc progs (which this set does)
>> is nice to have but it is not the end goal. Can you shed some light on the plan
>> for building on top of this set?
> Right. Starting at the highest level, I want to work toward preserving
> the metadata on RX path first (ongoing), forward path next, and TX path
> last.
>
> On RX path, the end game is for sk_filter prog to be able to access
> metadata thru dynptr. For that we need to know where the metadata
> resides. I see two ways how we can tackle that:
>
> A) We keep relying on metadata being in front of skb_mac_header().
>
> Fun fact - if you don't call any TC BPF helpers that touch
> skb->mac_header and don't have any tunnel or tagging devices on RX
> path, this works out of the box today. But we need to make sure that
> any call site that changes the MAC header offset, moves the
> metadata. I expect this approach will be a pain on TX path.
>
> ... or ...
>
> B) We track the metadata offset separately from MAC header offset
>
> This requires additional state, we need to store the metadata offset
> somewhere. However, in exchange for a couple bytes we gain some
> benefits:
>
> 1. We don't need to move the metadata after skb_pull.
>
> 2. We only need to move the metadata for skb_push if there's not
> enough space left, that is the gap between skb->data and where
> metadata ends is too small.
>
> (This means that anyone who is not using skb->data_meta on RX path
> but the skb_meta dynptr instead, can avoid any memmove's of the
> metadata itself.)
I don't think I get this part. For example, shouldn't
bpf_dynptr_slice_rdwr(&meta_dynptr) be treated like
skb->data_meta also?
>
> 3. We can place the metadata at skb->head, which plays nicely with TX
> path, where we need the headroom for pushing headers.
Having a way to separately track the metadata start/end is useful.
An unrelated dumb/lazy question: would it be possible (or would it take
a lot of changes) to put the metadata at the head (or after the
xdp_frame?) in the RX path?
>
> I've been trying out how (B) plays out when safe-proofing the tunnel &
> tagging devices, your VLANs and GREs, to preserve the metadata.
>
> To that end I've added a new u16 field in skb_shinfo to track
> meta_end. There's a 4B hole there currently, and we load the whole
> cacheline from skb_shinfo to access meta_len anyway.
>
> Once I had that, I could modify the skb_data_move() to relocate the
> metadata only if necessary, which looks like so:
>
> static inline void skb_data_move(struct sk_buff *skb, const int len,
>                                  const unsigned int n)
> {
>         const u8 meta_len = skb_metadata_len(skb);
>         u8 *meta, *meta_end;
>
>         if (!len || (!n && !meta_len))
>                 return;
>
>         if (!meta_len)
>                 goto no_metadata;
>
>         /* Not enough headroom left for metadata. Drop it. */
>         if (WARN_ON_ONCE(meta_len > skb_headroom(skb))) {
>                 skb_metadata_clear(skb);
>                 goto no_metadata;
>         }
>
>         meta_end = skb_metadata_end(skb);
>         meta = meta_end - meta_len;
>
>         /* Metadata in front of data before push/pull. Keep it that way. */
>         if (meta_end == skb->data - len) {
>                 memmove(meta + len, meta, meta_len + n);
>                 skb_shinfo(skb)->meta_end += len;
>                 return;
>         }
>
>         if (len < 0) {
>                 /* Data pushed. Move metadata to the top. */
>                 memmove(skb->head, meta, meta_len);
>                 skb_shinfo(skb)->meta_end = meta_len;
>         }
> no_metadata:
>         memmove(skb->data, skb->data - len, n);
> }
>
> The goal for the RX path is to hit just the last memmove() everywhere,
> since we will usually be pulling from skb->data, as long as you're not
> using the skb->data_meta pseudo-pointer in your TC(X) BPF programs.
>
> There are some verifier changes needed to keep skb->data_meta
> working. We need to move the metadata back in front of the MAC header
> before a TC(X) prog that uses skb->data_meta runs, or things break.
>
> Early code for that is also available for a preview. I've pushed it to:
>
> https://github.com/jsitnicki/linux/commits/skb-meta/safeproof-netdevs/
Thanks. I will take a look.