Message-ID: <87ms5hvnlk.fsf@cloudflare.com>
Date: Thu, 23 Oct 2025 13:55:51 +0200
From: Jakub Sitnicki <jakub@...udflare.com>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: "David S. Miller" <davem@...emloft.net>, Eric Dumazet
<edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni
<pabeni@...hat.com>, Simon Horman <horms@...nel.org>, Daniel Borkmann
<daniel@...earbox.net>, John Fastabend <john.fastabend@...il.com>,
Stanislav Fomichev <sdf@...ichev.me>, Alexei Starovoitov
<ast@...nel.org>, Andrii Nakryiko <andrii@...nel.org>, Eduard Zingerman
<eddyz87@...il.com>, Song Liu <song@...nel.org>, Yonghong Song
<yonghong.song@...ux.dev>, KP Singh <kpsingh@...nel.org>, Hao Luo
<haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>, Arthur Fabre
<arthur@...hurfabre.com>, bpf@...r.kernel.org, netdev@...r.kernel.org,
kernel-team@...udflare.com
Subject: Re: [PATCH bpf-next v2 11/15] selftests/bpf: Expect unclone to
preserve skb metadata
On Wed, Oct 22, 2025 at 04:12 PM -07, Martin KaFai Lau wrote:
> On 10/19/25 5:45 AM, Jakub Sitnicki wrote:
>> @@ -447,12 +448,14 @@ int clone_dynptr_empty_on_meta_slice_write(struct __sk_buff *ctx)
>> /*
>> * Check that skb_meta dynptr is read-only before prog writes to packet payload
>> - * using dynptr_write helper. Applies only to cloned skbs.
>> + * using dynptr_write helper, and becomes read-write afterwards. Applies only to
>> + * cloned skbs.
>> */
>> SEC("tc")
>> -int clone_dynptr_rdonly_before_data_dynptr_write(struct __sk_buff *ctx)
>> +int clone_dynptr_rdonly_before_data_dynptr_write_then_rw(struct __sk_buff *ctx)
>> {
>> struct bpf_dynptr data, meta;
>> + __u8 meta_have[META_SIZE];
>> const struct ethhdr *eth;
>> bpf_dynptr_from_skb(ctx, 0, &data);
>> @@ -465,15 +468,23 @@ int clone_dynptr_rdonly_before_data_dynptr_write(struct __sk_buff *ctx)
>> /* Expect read-only metadata before unclone */
>> bpf_dynptr_from_skb_meta(ctx, 0, &meta);
>> - if (!bpf_dynptr_is_rdonly(&meta) || bpf_dynptr_size(&meta) != META_SIZE)
>> + if (!bpf_dynptr_is_rdonly(&meta))
>
> Can the bpf_dynptr_set_rdonly() be lifted from the bpf_dynptr_from_skb_meta()?
>
> iiuc, the remaining thing left should be handling a cloned skb in
> __bpf_dynptr_write()? The __bpf_skb_store_bytes() is using
> bpf_try_make_writable, so maybe something similar can be done for the
> BPF_DYNPTR_TYPE_SKB_META?
I'm with you. This is not user-friendly at all currently.
This patch set has already gotten quite long, so how about I split out
the pskb_expand_head patch (#1) and the related selftest change (patch
#11) from this series, expand it to lift the bpf_dynptr_set_rdonly()
limitation for the skb_meta dynptr, and do that first in a dedicated
series?
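Something along these lines is what I have in mind for a
BPF_DYNPTR_TYPE_SKB_META case in __bpf_dynptr_write() (untested sketch;
I'm hand-waving over the exact dynptr layout, bounds checks, and error
codes):

	case BPF_DYNPTR_TYPE_SKB_META: {
		struct sk_buff *skb = dst->data;

		if (flags)
			return -EINVAL;
		/* Unclone the head so that the metadata area is safe
		 * to write. Same idea as bpf_try_make_writable() in
		 * __bpf_skb_store_bytes(), but for the headroom.
		 */
		if (skb_unclone(skb, GFP_ATOMIC))
			return -ENOMEM;
		memcpy(skb_metadata_end(skb) - skb_metadata_len(skb) + offset,
		       src, len);
		return 0;
	}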
>
>> + goto out;
>> +
>> + bpf_dynptr_read(meta_have, META_SIZE, &meta, 0, 0);
>> + if (!check_metadata(meta_have))
>> goto out;
>> /* Helper write to payload will unclone the packet */
>> bpf_dynptr_write(&data, offsetof(struct ethhdr, h_proto), "x", 1, 0);
>> - /* Expect no metadata after unclone */
>> + /* Expect r/w metadata after unclone */
>> bpf_dynptr_from_skb_meta(ctx, 0, &meta);
>> - if (bpf_dynptr_is_rdonly(&meta) || bpf_dynptr_size(&meta) != 0)
>> + if (bpf_dynptr_is_rdonly(&meta))
>
> then it does not have to rely on the bpf_dynptr_write(&data, ...) above to make
> the metadata writable.
>
> I have a high level question about the set. I assume the skb_data_move() in
> patch 2 will be useful in the future to preserve the metadata across the
> stack. Preserving the metadata across different tc progs (which this set does)
> is nice to have but it is not the end goal. Can you shed some light on the plan
> for building on top of this set?
Right. Starting at the highest level, I want to work toward preserving
the metadata on the RX path first (ongoing), the forward path next, and
the TX path last.
On the RX path, the end game is for a sk_filter prog to be able to
access the metadata through a dynptr. For that we need to know where
the metadata resides. I see two ways to tackle that:
A) We keep relying on the metadata being in front of skb_mac_header().

   Fun fact - if you don't call any TC BPF helpers that touch
   skb->mac_header and don't have any tunnel or tagging devices on the
   RX path, this works out of the box today. But we need to make sure
   that any call site that changes the MAC header offset also moves the
   metadata. I expect this approach will be a pain on the TX path.
... or ...
B) We track the metadata offset separately from the MAC header offset.

   This requires additional state: we need to store the metadata offset
   somewhere. However, in exchange for a couple of bytes we gain some
   benefits:

   1. We don't need to move the metadata after skb_pull.

   2. We only need to move the metadata for skb_push if there's not
      enough space left, that is, if the gap between skb->data and
      where the metadata ends is too small.

      (This means that anyone who uses the skb_meta dynptr instead of
      skb->data_meta on the RX path can avoid any memmoves of the
      metadata itself.)

   3. We can place the metadata at skb->head, which plays nicely with
      the TX path, where we need the headroom for pushing headers.
I've been trying out how (B) plays out when making the tunnel and
tagging devices, your VLANs and GREs, preserve the metadata.

To that end I've added a new u16 field in skb_shinfo to track meta_end.
There's a 4-byte hole there currently, and we load the whole cacheline
from skb_shinfo to access meta_len anyway.
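With that field in place, the end of the metadata area no longer has to
be derived from the MAC header. Roughly (hypothetical helper body, same
shape as the existing skb_metadata_end()):

	static inline u8 *skb_metadata_end(const struct sk_buff *skb)
	{
		/* Track the offset directly instead of deriving it
		 * from skb_mac_header().
		 */
		return skb->head + skb_shinfo(skb)->meta_end;
	}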
Once I had that, I could modify skb_data_move() to relocate the
metadata only when necessary, which looks like so:
static inline void skb_data_move(struct sk_buff *skb, const int len,
				 const unsigned int n)
{
	const u8 meta_len = skb_metadata_len(skb);
	u8 *meta, *meta_end;

	if (!len || (!n && !meta_len))
		return;

	if (!meta_len)
		goto no_metadata;

	/* Not enough headroom left for metadata. Drop it. */
	if (WARN_ON_ONCE(meta_len > skb_headroom(skb))) {
		skb_metadata_clear(skb);
		goto no_metadata;
	}

	meta_end = skb_metadata_end(skb);
	meta = meta_end - meta_len;

	/* Metadata in front of data before push/pull. Keep it that way. */
	if (meta_end == skb->data - len) {
		memmove(meta + len, meta, meta_len + n);
		skb_shinfo(skb)->meta_end += len;
		return;
	}

	if (len < 0) {
		/* Data pushed. Move metadata to the top. */
		memmove(skb->head, meta, meta_len);
		skb_shinfo(skb)->meta_end = meta_len;
	}

no_metadata:
	memmove(skb->data, skb->data - len, n);
}
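For a concrete picture, a VLAN untag path could then use it roughly
like so (hypothetical call site; mainline open-codes this memmove in
vlan_remove_tag() today):

	/* Pull the VLAN tag and carry the 2 * ETH_ALEN bytes of MAC
	 * addresses, plus any metadata sitting in front of them,
	 * along with skb->data.
	 */
	__skb_pull(skb, VLAN_HLEN);
	skb_data_move(skb, VLAN_HLEN, 2 * ETH_ALEN);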
The goal for the RX path is to hit just the last memmove() everywhere,
since we will usually be pulling from skb->data, as long as you're not
using the skb->data_meta pseudo-pointer in your TC(X) BPF programs.
There are some verifier changes needed to keep skb->data_meta
working. We need to move the metadata back in front of the MAC header
before a TC(X) prog that uses skb->data_meta runs, or things break.
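The prologue would have to do something like this (hypothetical helper;
the name and exact bookkeeping are mine):

	/* Put the metadata back in front of the MAC header so that
	 * skb->data_meta keeps working for legacy progs.
	 */
	static void skb_metadata_relocate(struct sk_buff *skb)
	{
		const u8 meta_len = skb_metadata_len(skb);
		u8 *mac = skb_mac_header(skb);

		if (!meta_len || skb_metadata_end(skb) == mac)
			return;

		memmove(mac - meta_len, skb_metadata_end(skb) - meta_len,
			meta_len);
		skb_shinfo(skb)->meta_end = mac - skb->head;
	}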
Early code for that is also available for preview. I've pushed it to:
https://github.com/jsitnicki/linux/commits/skb-meta/safeproof-netdevs/
Thanks,
-jkbs