netdev - Re: [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from MAC header offset

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87ecnzj49h.fsf@cloudflare.com>
Date: Thu, 08 Jan 2026 20:25:30 +0100
From: Jakub Sitnicki <jakub@...udflare.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: bpf@...r.kernel.org,  netdev@...r.kernel.org,  "David S. Miller"
 <davem@...emloft.net>,  Eric Dumazet <edumazet@...gle.com>,  Paolo Abeni
 <pabeni@...hat.com>,  Alexei Starovoitov <ast@...nel.org>,  Daniel
 Borkmann <daniel@...earbox.net>,  Jesper Dangaard Brouer
 <hawk@...nel.org>,  John Fastabend <john.fastabend@...il.com>,  Stanislav
 Fomichev <sdf@...ichev.me>,  Simon Horman <horms@...nel.org>,  Andrii
 Nakryiko <andrii@...nel.org>,  Martin KaFai Lau <martin.lau@...ux.dev>,
  Eduard Zingerman <eddyz87@...il.com>,  Song Liu <song@...nel.org>,
  Yonghong Song <yonghong.song@...ux.dev>,  KP Singh <kpsingh@...nel.org>,
  Hao Luo <haoluo@...gle.com>,  Jiri Olsa <jolsa@...nel.org>,
  kernel-team@...udflare.com
Subject: Re: [PATCH bpf-next v3 00/17] Decouple skb metadata tracking from
 MAC header offset

On Thu, Jan 08, 2026 at 07:47 AM -08, Jakub Kicinski wrote:
> On Wed, 07 Jan 2026 15:28:00 +0100 Jakub Sitnicki wrote:
>> This series continues the effort to provide reliable access to xdp/skb
>> metadata from BPF context on the receive path. We have recently talked
>> about it at Plumbers [1].
>> 
>> Currently skb metadata location is tied to the MAC header offset:
>> 
>>   [headroom][metadata][MAC hdr][L3 pkt]
>>                       ^
>>                       skb_metadata_end = head + mac_header
>> 
>> This design breaks on L2 decapsulation (VLAN, GRE, etc.) when the MAC
>> offset is reset. The naive fix is to memmove metadata on every decap path,
>> but we can avoid this cost by tracking metadata position independently.
>> 
>> Introduce a dedicated meta_end field in skb_shared_info that records where
>> metadata ends relative to skb->head:
>> 
>>   [headroom][metadata][gap][MAC hdr][L3 pkt]
>>                      ^
>>                      skb_metadata_end = head + meta_end
>>                      
>> This allows BPF dynptr access (bpf_dynptr_from_skb_meta()) to work without
>> memmove. For skb->data_meta pointer access, which expects metadata
>> immediately before skb->data, make the verifier inject realignment code in
>> TC BPF prologue.
>
> I don't understand what semantics for the buffer layout you're trying
> to establish, we now have "headroom" and "gap"?
>
> 	[headroom][metadata][gap][packet]
>
> You're not solving the encap side either, skb_push() will still happily
> encroach on the metadata. Feel like duct tape, we can't fundamentally
> update the layout of the skb without updating all the helpers.
> metadata works perfectly fine for its intended use case - passing info
> about the frame from XDP offload to XDP and then to TC.

<joke>
Man, that makes me think I'm a terrible speaker.
Or you just missed my talk :-)
</joke>

You're right that metadata passing between XDP and TC works well in the
simple case when both BPF progs are attached to the same device.

However, the metadata doesn't get cleared when skb undergoes L2
decapsulation. So if you attach the TC prog to the tunnel device, things
break. (VLAN being the exception since we handle it specially in
skb_reorder_vlan_header).

Here's a simple example using veth + GRE tunnel:

# ip netns add tx
# ip link add rx type veth peer name tx netns tx
# ip link set dev rx up
# ip -n tx link set dev tx up
# ip addr add 10.0.0.1/24 dev rx
# ip -n tx addr add 10.0.0.2/24 dev tx
# ip tun add decap mode gre local 10.0.0.1 remote 10.0.0.2
# ip -n tx tun add encap mode gre local 10.0.0.2 remote 10.0.0.1
# ip link set dev decap up
# ip -n tx link set dev encap up
# ip addr add 192.0.2.1/24 dev decap
# ip -n tx addr add 192.0.2.2/24 dev encap
# cat progs.bpf.c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int set_tag(struct xdp_md *ctx)
{
        __u8 *data;
        __u32 *tag;

        if (bpf_xdp_adjust_meta(ctx, -sizeof(*tag)))
                return XDP_DROP;

        data = (typeof(data))(unsigned long)ctx->data;
        tag = (typeof(tag))(unsigned long)ctx->data_meta;

        if (tag + 1 > data)
                return XDP_DROP;

        *tag = 42;
        return XDP_PASS;
}

SEC("tcx/ingress")
int get_tag(struct __sk_buff *ctx)
{
        __u8 *data = (typeof(data))(unsigned long)ctx->data;
        __u32 *tag = (typeof(tag))(unsigned long)ctx->data_meta;

        if (tag + 1 > data)
                return TCX_DROP;

        bpf_printk("%u\n", *tag);
        return TCX_PASS;
}

const char _license[] SEC("license") = "GPL";
# bpftool prog loadall progs.bpf.o /sys/fs/bpf
# bpftool net attach xdp pinned /sys/fs/bpf/set_tag dev rx
# bpftool net attach tcx_ingress pinned /sys/fs/bpf/get_tag dev rx
# ip netns exec tx ping -c1 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.074 ms

--- 192.0.2.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.074/0.074/0.074/0.000 ms
# bpftool prog tracelog
     ksoftirqd/3-35      [003] ..s1.   592.051827: bpf_trace_printk: 42

            ping-251     [003] ..s2.   614.979169: bpf_trace_printk: 42

          <idle>-0       [003] ..s2.   620.212369: bpf_trace_printk: 42

^C# bpftool net attach tcx_ingress pinned /sys/fs/bpf/get_tag dev decap
# ip netns exec tx ping -c1 192.0.2.1
PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data.
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.043 ms

--- 192.0.2.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.043/0.043/0.043/0.000 ms
# bpftool prog tracelog
            ping-263     [003] ..s2.   681.823981: bpf_trace_printk: 524288

^C#

When reading metadata from TC attached to the GRE device, we get garbage
(524288 = 0x0008_0000, which is actually the GRE header).

This happens because after GRE decapsulation, skb_metadata_end points to
the inner MAC header location:
 
  [headroom][metadata][outer MAC][outer IP][GRE][inner MAC][...]
                                                ^
                                                skb_metadata_end =
                                                head + mac_header
 
So we effectively already have a [headroom][metadata][gap][packet]
layout today, with the access being broken.


Regarding skb_push() and encapsulation - you're absolutely right that
this needs to be addressed. I mention it briefly in the cover letter,
but I should have been clearer about the plan:

This series focuses on fixing the decap/skb_pull path first. The
encap/skb_push side is on the roadmap [1], and patch 10 updates
skb_postpush_data_move() helper to be plugged after skb_push() in
preparation for handling metadata movement on push.

I'm just tackling stuff one by one, though I'm happy to prepare an RFC
that covers both sides, if you'd prefer to see the complete picture
upfront before making the call. Let me know what works best for you.

If it is the overall approach that feels wrong, I'm definitely open to
discussing alternatives.

Thanks,
-jkbs

[1] Tenative roadmap presented at LPC:

* [x] Add bpf_dynptr_from_skb_meta()
* [x] Make TC bpf_skb_*() helpers preserve metadata (bug fix)
* [-] Make L2 decap preserve metadata (bug fix)  [work in progress]
* [ ] Allow-list bpf_dynptr_from_skb_meta() for other BPF prog types
* [ ] uAPI read access to metadata 
-=-=- Milestone: Metadata on RX path -=-=-
* [ ] [FWD path] Make L2 encap preserve metadata (bug fix)
* [ ] [TX path] uAPI write access to metadata
* [ ] [TX path] Alloc metadata from other BPF prog types