Message-ID: <aPKiC0jZV6kLBvIq@boxer>
Date: Fri, 17 Oct 2025 22:08:58 +0200
From: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
To: Toke Høiland-Jørgensen <toke@...hat.com>
CC: <bpf@...r.kernel.org>, <ast@...nel.org>, <daniel@...earbox.net>,
<hawk@...nel.org>, <ilias.apalodimas@...aro.org>, <lorenzo@...nel.org>,
<kuba@...nel.org>, <netdev@...r.kernel.org>, <magnus.karlsson@...el.com>,
<andrii@...nel.org>, <stfomichev@...il.com>, <aleksander.lobakin@...el.com>
Subject: Re: [PATCH v2 bpf 2/2] veth: update mem type in xdp_buff
On Fri, Oct 17, 2025 at 08:12:57PM +0200, Toke Høiland-Jørgensen wrote:
> Toke Høiland-Jørgensen <toke@...hat.com> writes:
>
> > Maciej Fijalkowski <maciej.fijalkowski@...el.com> writes:
> >
> >> On Fri, Oct 17, 2025 at 06:33:54PM +0200, Toke Høiland-Jørgensen wrote:
> >>> Maciej Fijalkowski <maciej.fijalkowski@...el.com> writes:
> >>>
> >>> > Veth calls skb_pp_cow_data(), which makes the underlying memory
> >>> > originate from the system page_pool. With CONFIG_DEBUG_VM=y and an
> >>> > XDP program that uses bpf_xdp_adjust_tail(), the following splat was
> >>> > observed:
> >>> >
> >>> > [ 32.204881] BUG: Bad page state in process test_progs pfn:11c98b
> >>> > [ 32.207167] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11c98b
> >>> > [ 32.210084] flags: 0x1fffe0000000000(node=0|zone=1|lastcpupid=0x7fff)
> >>> > [ 32.212493] raw: 01fffe0000000000 dead000000000040 ff11000123c9b000 0000000000000000
> >>> > [ 32.218056] raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
> >>> > [ 32.220900] page dumped because: page_pool leak
> >>> > [ 32.222636] Modules linked in: bpf_testmod(O) bpf_preload
> >>> > [ 32.224632] CPU: 6 UID: 0 PID: 3612 Comm: test_progs Tainted: G O 6.17.0-rc5-gfec474d29325 #6969 PREEMPT
> >>> > [ 32.224638] Tainted: [O]=OOT_MODULE
> >>> > [ 32.224639] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> >>> > [ 32.224641] Call Trace:
> >>> > [ 32.224644] <IRQ>
> >>> > [ 32.224646] dump_stack_lvl+0x4b/0x70
> >>> > [ 32.224653] bad_page.cold+0xbd/0xe0
> >>> > [ 32.224657] __free_frozen_pages+0x838/0x10b0
> >>> > [ 32.224660] ? skb_pp_cow_data+0x782/0xc30
> >>> > [ 32.224665] bpf_xdp_shrink_data+0x221/0x530
> >>> > [ 32.224668] ? skb_pp_cow_data+0x6d1/0xc30
> >>> > [ 32.224671] bpf_xdp_adjust_tail+0x598/0x810
> >>> > [ 32.224673] ? xsk_destruct_skb+0x321/0x800
> >>> > [ 32.224678] bpf_prog_004ac6bb21de57a7_xsk_xdp_adjust_tail+0x52/0xd6
> >>> > [ 32.224681] veth_xdp_rcv_skb+0x45d/0x15a0
> >>> > [ 32.224684] ? get_stack_info_noinstr+0x16/0xe0
> >>> > [ 32.224688] ? veth_set_channels+0x920/0x920
> >>> > [ 32.224691] ? get_stack_info+0x2f/0x80
> >>> > [ 32.224693] ? unwind_next_frame+0x3af/0x1df0
> >>> > [ 32.224697] veth_xdp_rcv.constprop.0+0x38a/0xbe0
> >>> > [ 32.224700] ? common_startup_64+0x13e/0x148
> >>> > [ 32.224703] ? veth_xdp_rcv_one+0xcd0/0xcd0
> >>> > [ 32.224706] ? stack_trace_save+0x84/0xa0
> >>> > [ 32.224709] ? stack_depot_save_flags+0x28/0x820
> >>> > [ 32.224713] ? __resched_curr.constprop.0+0x332/0x3b0
> >>> > [ 32.224716] ? timerqueue_add+0x217/0x320
> >>> > [ 32.224719] veth_poll+0x115/0x5e0
> >>> > [ 32.224722] ? veth_xdp_rcv.constprop.0+0xbe0/0xbe0
> >>> > [ 32.224726] ? update_load_avg+0x1cb/0x12d0
> >>> > [ 32.224730] ? update_cfs_group+0x121/0x2c0
> >>> > [ 32.224733] __napi_poll+0xa0/0x420
> >>> > [ 32.224736] net_rx_action+0x901/0xe90
> >>> > [ 32.224740] ? run_backlog_napi+0x50/0x50
> >>> > [ 32.224743] ? clockevents_program_event+0x1cc/0x280
> >>> > [ 32.224746] ? hrtimer_interrupt+0x31e/0x7c0
> >>> > [ 32.224749] handle_softirqs+0x151/0x430
> >>> > [ 32.224752] do_softirq+0x3f/0x60
> >>> > [ 32.224755] </IRQ>
> >>> >
> >>> > This happens because an xdp_rxq with its mem model set to
> >>> > MEM_TYPE_PAGE_SHARED was used when initializing the xdp_buff.
> >>> >
> >>> > Fix this by using the new helper xdp_convert_skb_to_buff() which,
> >>> > besides initializing and preparing the xdp_buff, will check whether
> >>> > the page used for the linear part of the xdp_buff comes from a
> >>> > page_pool. We assume that the linear data and the frags have the same
> >>> > memory provider, as the XDP API currently does not give us a way to
> >>> > distinguish them (the mem model is registered for the *whole* Rx
> >>> > queue, while here we are dealing with single-buffer granularity).
> >>> >
> >>> > In order to meet the skb layout expected by the new helper, pull the
> >>> > mac header before the conversion from skb to xdp_buff.
> >>> >
> >>> > However, that is not enough: before the xdp_buff is released out of
> >>> > veth via XDP_{TX,REDIRECT}, the mem type on the xdp_rxq associated
> >>> > with the xdp_buff is restored to its original model. We need to
> >>> > respect the previous setting at least until the buff is converted to
> >>> > a frame, as the frame carries the mem_type. Add a page_pool variant
> >>> > of veth_xdp_get() so that we avoid a refcount underflow when draining
> >>> > a page frag.
> >>> >
> >>> > Fixes: 0ebab78cbcbf ("net: veth: add page_pool for page recycling")
> >>> > Reported-by: Alexei Starovoitov <ast@...nel.org>
> >>> > Closes: https://lore.kernel.org/bpf/CAADnVQ+bBofJDfieyOYzSmSujSfJwDTQhiz3aJw7hE+4E2_iPA@mail.gmail.com/
> >>> > Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
> >>> > ---
> >>> > drivers/net/veth.c | 43 +++++++++++++++++++++++++++----------------
> >>> > 1 file changed, 27 insertions(+), 16 deletions(-)
> >>> >
> >>> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> >>> > index a3046142cb8e..eeeee7bba685 100644
> >>> > --- a/drivers/net/veth.c
> >>> > +++ b/drivers/net/veth.c
> >>> > @@ -733,7 +733,7 @@ static void veth_xdp_rcv_bulk_skb(struct veth_rq *rq, void **frames,
> >>> > }
> >>> > }
> >>> >
> >>> > -static void veth_xdp_get(struct xdp_buff *xdp)
> >>> > +static void veth_xdp_get_shared(struct xdp_buff *xdp)
> >>> > {
> >>> > struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> >>> > int i;
> >>> > @@ -746,12 +746,33 @@ static void veth_xdp_get(struct xdp_buff *xdp)
> >>> > __skb_frag_ref(&sinfo->frags[i]);
> >>> > }
> >>> >
> >>> > +static void veth_xdp_get_pp(struct xdp_buff *xdp)
> >>> > +{
> >>> > + struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> >>> > + int i;
> >>> > +
> >>> > + page_pool_ref_page(virt_to_page(xdp->data));
> >>> > + if (likely(!xdp_buff_has_frags(xdp)))
> >>> > + return;
> >>> > +
> >>> > + for (i = 0; i < sinfo->nr_frags; i++) {
> >>> > + skb_frag_t *frag = &sinfo->frags[i];
> >>> > +
> >>> > + page_pool_ref_page(netmem_to_page(frag->netmem));
> >>> > + }
> >>> > +}
> >>> > +
> >>> > +static void veth_xdp_get(struct xdp_buff *xdp)
> >>> > +{
> >>> > + xdp->rxq->mem.type == MEM_TYPE_PAGE_POOL ?
> >>> > + veth_xdp_get_pp(xdp) : veth_xdp_get_shared(xdp);
> >>> > +}
> >>> > +
> >>> > static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >>> > struct xdp_buff *xdp,
> >>> > struct sk_buff **pskb)
> >>> > {
> >>> > struct sk_buff *skb = *pskb;
> >>> > - u32 frame_sz;
> >>> >
> >>> > if (skb_shared(skb) || skb_head_is_locked(skb) ||
> >>> > skb_shinfo(skb)->nr_frags ||
> >>> > @@ -762,19 +783,9 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
> >>> > skb = *pskb;
> >>> > }
> >>> >
> >>> > - /* SKB "head" area always have tailroom for skb_shared_info */
> >>> > - frame_sz = skb_end_pointer(skb) - skb->head;
> >>> > - frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> >>> > - xdp_init_buff(xdp, frame_sz, &rq->xdp_rxq);
> >>> > - xdp_prepare_buff(xdp, skb->head, skb_headroom(skb),
> >>> > - skb_headlen(skb), true);
> >>> > + __skb_pull(*pskb, skb->data - skb_mac_header(skb));
> >>>
> >>> veth_xdp_rcv_skb() does:
> >>>
> >>> __skb_push(skb, skb->data - skb_mac_header(skb));
> >>> if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
> >>>
> >>> so how about just getting rid of that push instead of doing the opposite
> >>> pull straight after? :)
> >>
> >> Hi Toke,
> >>
> >> I believe this is done so that we get a proper headroom representation,
> >> which is needed for the XDP_PACKET_HEADROOM comparison. Maybe we could
> >> be smarter here and, for example, subtract the mac header length?
> >> However, I wanted to preserve the old behavior.
> >
> > Yeah, basically what we want is to check if the mac_header offset is
> > larger than the headroom. So the check could just be:
> >
> > skb->mac_header < XDP_PACKET_HEADROOM
> >
> > however, it may be better to use the helper, since that makes sure we
> > keep hitting the DEBUG_NET_WARN_ON_ONCE inside the helper... So:
> >
> > skb_mac_header(skb) - skb->head < XDP_PACKET_HEADROOM
> >
> > or, equivalently:
> >
> > skb_headroom(skb) - skb_mac_offset(skb) < XDP_PACKET_HEADROOM
> >
> > I think the first one is probably more readable, since skb_mac_offset()
> > is negative here, so the calculation looks off...
>
> Wait, veth_xdp_rcv_skb() calls skb_reset_mac_header() further down, so
> it expects skb->data to point to the mac header. So getting rid of the
> __skb_push() is not a good idea; but neither is doing the __skb_pull() as
> your patch does currently.

Oof. Correct.
>
> How about just making xdp_convert_skb_to_buff() agnostic to where
> skb->data is?
>
> headroom = skb_mac_header(skb) - skb->head;
> data_len = skb->data + skb->len - skb_mac_header(skb);
Could we just use skb->tail - skb_mac_header(skb) here?
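
i.e. something like this (untested; spelled with the skb_tail_pointer()
accessor, which points at the end of the linear data area):

	/* count only from the mac header to the end of the linear part */
	data_len = skb_tail_pointer(skb) - skb_mac_header(skb);
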
> xdp_prepare_buff(xdp, skb->head, headroom, data_len, true);
>
> should work in both cases, no?
>
> -Toke
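
fwiw, in veth terms (ignoring that the init/prepare moves into the
generic helper from patch 1/2), the whole conversion would then boil
down to something like this untested sketch:

	u32 headroom, data_len, frame_sz;

	/* SKB "head" area always has tailroom for skb_shared_info
	 * (same frame_sz computation the patch removes from veth)
	 */
	frame_sz = skb_end_pointer(skb) - skb->head;
	frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
	xdp_init_buff(xdp, frame_sz, &rq->xdp_rxq);

	/* derive both offsets from the mac header so the conversion no
	 * longer cares where skb->data currently points
	 */
	headroom = skb_mac_header(skb) - skb->head;
	data_len = skb->data + skb->len - skb_mac_header(skb);
	xdp_prepare_buff(xdp, skb->head, headroom, data_len, true);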

Anyway, I'm going to try out your suggestion over the weekend or on
Monday and will send a v3. Maybe input from someone else in a different
time zone will land by tomorrow. Thanks again :)