netdev - Re: [PATCH bpf-next v2 11/15] selftests/bpf: Expect unclone to preserve skb metadata

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <87ecqsux46.fsf@cloudflare.com>
Date: Fri, 24 Oct 2025 17:40:09 +0200
From: Jakub Sitnicki <jakub@...udflare.com>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: "David S. Miller" <davem@...emloft.net>,  Eric Dumazet
 <edumazet@...gle.com>,  Jakub Kicinski <kuba@...nel.org>,  Paolo Abeni
 <pabeni@...hat.com>,  Simon Horman <horms@...nel.org>,  Daniel Borkmann
 <daniel@...earbox.net>,  John Fastabend <john.fastabend@...il.com>,
  Stanislav Fomichev <sdf@...ichev.me>,  Alexei Starovoitov
 <ast@...nel.org>,  Andrii Nakryiko <andrii@...nel.org>,  Eduard Zingerman
 <eddyz87@...il.com>,  Song Liu <song@...nel.org>,  Yonghong Song
 <yonghong.song@...ux.dev>,  KP Singh <kpsingh@...nel.org>,  Hao Luo
 <haoluo@...gle.com>,  Jiri Olsa <jolsa@...nel.org>,  Arthur Fabre
 <arthur@...hurfabre.com>,  bpf@...r.kernel.org,  netdev@...r.kernel.org,
  kernel-team@...udflare.com, Jesper Dangaard Brouer <hawk@...nel.org>
Subject: Re: [PATCH bpf-next v2 11/15] selftests/bpf: Expect unclone to
 preserve skb metadata

On Thu, Oct 23, 2025 at 07:32 PM -07, Martin KaFai Lau wrote:
> On 10/23/25 4:55 AM, Jakub Sitnicki wrote:
>> On Wed, Oct 22, 2025 at 04:12 PM -07, Martin KaFai Lau wrote:
>>> On 10/19/25 5:45 AM, Jakub Sitnicki wrote:
>>>> @@ -447,12 +448,14 @@ int clone_dynptr_empty_on_meta_slice_write(struct __sk_buff *ctx)
>>>>      /*
>>>>     * Check that skb_meta dynptr is read-only before prog writes to packet payload
>>>> - * using dynptr_write helper. Applies only to cloned skbs.
>>>> + * using dynptr_write helper, and becomes read-write afterwards. Applies only to
>>>> + * cloned skbs.
>>>>     */
>>>>    SEC("tc")
>>>> -int clone_dynptr_rdonly_before_data_dynptr_write(struct __sk_buff *ctx)
>>>> +int clone_dynptr_rdonly_before_data_dynptr_write_then_rw(struct __sk_buff *ctx)
>>>>    {
>>>>    	struct bpf_dynptr data, meta;
>>>> +	__u8 meta_have[META_SIZE];
>>>>    	const struct ethhdr *eth;
>>>>      	bpf_dynptr_from_skb(ctx, 0, &data);
>>>> @@ -465,15 +468,23 @@ int clone_dynptr_rdonly_before_data_dynptr_write(struct __sk_buff *ctx)
>>>>      	/* Expect read-only metadata before unclone */
>>>>    	bpf_dynptr_from_skb_meta(ctx, 0, &meta);
>>>> -	if (!bpf_dynptr_is_rdonly(&meta) || bpf_dynptr_size(&meta) != META_SIZE)
>>>> +	if (!bpf_dynptr_is_rdonly(&meta))
>>>
>>> Can the bpf_dynptr_set_rdonly() be lifted from the bpf_dynptr_from_skb_meta()?
>>>
>>> iiuc, the remaining thing left should be handling a cloned skb in
>>> __bpf_dynptr_write()? The __bpf_skb_store_bytes() is using
>>> bpf_try_make_writable, so maybe something similar can be done for the
>>> BPF_DYNPTR_TYPE_SKB_META?
>> I'm with you. This is not user-friendly at all currently.
>> This patch set has already gotten quite long so how about I split out
>> the pskb_expand_head patch (#1) and the related selftest change (patch
>> #11) from this series, expand it to lift bpf_dynptr_set_rdonly()
>> limitation for skb_meta dynptr, and do that first in a dedicated series?
>
> A followup on lifting the bpf_dynptr_set_rdonly is fine and keep this set as
> is. Just want to check if there is anything stopping it. However, imo, having
> one or two patches over is fine. The set is not difficult to follow.

All right. I will pile that one on. 16 makes it a nice even number.

[...]

>>> I have a high level question about the set. I assume the skb_data_move() in
>>> patch 2 will be useful in the future to preserve the metadata across the
>>> stack. Preserving the metadata across different tc progs (which this set does)
>>> is nice to have but it is not the end goal. Can you shed some light on the plan
>>> for building on top of this set?
>> Right. Starting at the highest level, I want to work toward preserving
>> the metadata on RX path first (ongoing), forward path next, and TX path
>> last.
>> On RX path, the end game is for sk_filter prog to be able to access
>> metadata thru dynptr. For that we need to know where the metadata
>> resides. I see two ways how we can tackle that:
>> A) We keep relying on metadata being in front of skb_mac_header().
>>     Fun fact - if you don't call any TC BPF helpers that touch
>>     skb->mac_header and don't have any tunnel or tagging devices on RX
>>     path, this works out of the box today. But we need to make sure that
>>     any call site that changes the MAC header offset, moves the
>>     metadata. I expect this approach will be a pain on TX path.
>> ... or ...
>> B) We track the metadata offset separately from MAC header offset
>>     This requires additional state, we need to store the metadata offset
>>     somewhere. However, in exchange for a couple bytes we gain some
>>     benefits:
>>     1. We don't need to move the metadata after skb_pull.
>>     2. We only need to move the metadata for skb_push if there's not
>>       enough space left, that is the gap between skb->data and where
>>       metadata ends is too small.
>>       (This means that anyone who is not using skb->data_meta on RX path
>>       but the skb_meta dynptr instead, can avoid any memmove's of the
>>       metadata itself.)
>
>
> I don't think I get this part. For example, bpf_dynptr_slice_rdwr(&meta_dynptr)
> should be treated like
> skb->data_meta also?

That's the thing. With dynptr we don't care where the metadata is
located. Hence, no need to move it before the prog runs, even if there
is a gap between the metadata and the MAC header, say, because of GRE
decap. If we track metadata separately the skb_metadata_end() could
become:

static inline void *skb_metadata_end(const struct sk_buff *skb)
{
	return skb->head + skb_shinfo(skb)->meta_end;
}

[..]

> Having a way to separately track the metadata start/end is useful.
> An unrelated dumb/lazy question, is it possible/lot-of-changes to put the
> metadata in the head (or after xdp_frame?) in the RX path?

We've been asking ourselves the same theoretical question. There are
at least a couple challenges to retrofit such change:

1. You'd need a way to track where the metadata ends in XDP. As I hear
   from Jesper, XDP metadata was intentionally placed right in front of
   the packet to avoid computing/loading another pointer.

2. You'd be moving the contents when growing the metadata with
   bpf_xdp_adjust_meta. Or you'd need to add a way to resize it by
   moving the end.

Not really something we've considered attacking ATM.

My gut feeling is that it will do us good to leave the metadata close to
the MAC header during the initial adoption phase to catch all the call
sites that push headers without moving the metadata and need fixing.

[...]