Message-ID: <aqti6c3imnaffenkgnnw5tnmjwrzw7g7pwbt47bvbgar2c4rbv@af4mch7msf3w>
Date: Tue, 12 Aug 2025 15:44:04 +0000
From: Dragos Tatulea <dtatulea@...dia.com>
To: Chris Arges <carges@...udflare.com>
Cc: netdev@...r.kernel.org, bpf@...r.kernel.org, 
	kernel-team <kernel-team@...udflare.com>, Jesper Dangaard Brouer <hawk@...nel.org>, tariqt@...dia.com, 
	saeedm@...dia.com, Leon Romanovsky <leon@...nel.org>, 
	Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>, Alexei Starovoitov <ast@...nel.org>, 
	Daniel Borkmann <daniel@...earbox.net>, John Fastabend <john.fastabend@...il.com>, 
	Simon Horman <horms@...nel.org>, Andrew Rzeznik <arzeznik@...udflare.com>, 
	Yan Zhai <yan@...udflare.com>
Subject: Re: [BUG] mlx5_core memory management issue

Hi Chris,

On Mon, Aug 11, 2025 at 08:37:56AM +0000, Dragos Tatulea wrote:
> Hi Chris,
> 
> Sorry for the late reply, I was on holiday.
> 
> On Thu, Aug 07, 2025 at 11:45:40AM -0500, Chris Arges wrote:
> > On 2025-07-24 17:01:16, Dragos Tatulea wrote:
> > > On Wed, Jul 23, 2025 at 01:48:07PM -0500, Chris Arges wrote:
> > > > 
> > > > Ok, we can reproduce this problem!
> > > > 
> > > > I tried to simplify this reproducer, but it seems like what's needed is:
> > > > - xdp program attached to mlx5 NIC
> > > > - cpumap redirect
> > > > - device redirect (map or just bpf_redirect)
> > > > - frame gets turned into an skb
> > > > Then, from another machine, send many flows of UDP traffic to trigger the problem.
> > > > 
> > > > I've put together a program that reproduces the issue here:
> > > > - https://github.com/arges/xdp-redirector
> > > >
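(For anyone following along: Chris's chain looks roughly like the
two-program sketch below. This is a minimal illustration only, not the
actual reproducer from the repo above; the map sizes, program names and
the fallback action are made up.)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_CPUMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

struct {
        __uint(type, BPF_MAP_TYPE_DEVMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, struct bpf_devmap_val);
} dev_map SEC(".maps");

/* Attached to the mlx5 NIC: bounce every frame to a cpumap CPU. */
SEC("xdp")
int xdp_redirect_cpu(struct xdp_md *ctx)
{
        return bpf_redirect_map(&cpu_map, 0, XDP_PASS);
}

/* Runs on the cpumap CPU: redirect again, to the device in dev_map
 * (a veth here), where the frame gets turned into an skb. */
SEC("xdp/cpumap")
int xdp_redirect_dev(struct xdp_md *ctx)
{
        return bpf_redirect_map(&dev_map, 0, XDP_PASS);
}

char _license[] SEC("license") = "GPL";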
> > > Much appreciated! I fumbled around initially, not managing to get
> > > traffic to the xdp_devmap stage. Further debugging revealed that GRO
> > > needs to be enabled on the veth devices for the XDP redirect to reach
> > > the xdp_devmap. After that I managed to reproduce your issue.
> > > 
> > > Now I can start looking into it.
> > > 
> > 
> > Dragos,
> > 
> > There was a similar reference counting issue identified in:
> > https://lore.kernel.org/all/20250801170754.2439577-1-kuba@kernel.org/
> > 
> > Part of the commit message mentioned:
> > > Unfortunately for fbnic since commit f7dc3248dcfb ("skbuff: Optimization
> > > of SKB coalescing for page pool") core _may_ actually take two extra
> > > pp refcounts, if one of them is returned before driver gives up the bias
> > > the ret < 0 check in page_pool_unref_netmem() will trigger.
> > 
> > In order to help debug the mlx5 issue caused by xdp redirection, I built a
> > kernel with commit f7dc3248dcfb reverted, but unfortunately I was still able
> > to reproduce the issue.
> Thanks for trying this.
> 
> > 
> > I am happy to try some other experiments, or if there are other ideas you have.
> >
> I am actively debugging the issue, but progress is slow as it is not an
> easy one. So far I have been able to trace it back to the fact that the
> page_pool is returning the same page twice on allocation, without a
> release in between. As this is quite weird, I think I still have to
> trace it back a few more steps to find the actual issue.
>
Ok, so I think I've found the issue: there's a code path that recycles
pages directly into the page_pool cache while running on a different
CPU than it should.

This happens when frames are dropped during the __dev_flush() of the
device map, which runs on the cpumap CPU. Here's the call graph:
-> cpu_map_bpf_prog_run()
  -> xdp_do_flush (on redirects)
    -> __dev_flush()
      -> bq_xmit_all()
        -> xdp_return_frame_rx_napi() (called on drop)
          -> page_pool_put_full_netmem(pp, page, true)
             (allow_direct always set to true)
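To spell out why the allow_direct=true part matters: the direct path
puts the page straight into the pool's alloc cache, which is a plain
array with no locking, so it is only safe to touch from the pool's own
NAPI CPU. Roughly (a simplified sketch of the fast path in
net/core/page_pool.c, not the exact upstream code):

/* allow_direct=true recycle path. pool->alloc is lockless: it assumes
 * every caller runs in the page_pool's NAPI context, i.e. on a single
 * CPU at a time. */
static bool page_pool_recycle_in_cache(struct page *page,
                                       struct page_pool *pool)
{
        if (pool->alloc.count == PP_ALLOC_CACHE_SIZE)
                return false;

        /* Racy from a foreign CPU: two writers can clobber a slot or
         * publish the same page twice, which would explain the double
         * allocation mentioned above. */
        pool->alloc.cache[pool->alloc.count++] = page;
        return true;
}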

Normally xdp_do_flush() is called by the driver, from the right NAPI
context. But for cpumap + redirect it is called from the cpumap CPU, so
returning frames in this context should be done with the "no direct"
flag set.

Could you try the patch below and check if you still get the crash? The
patch specifically fixes this flow, but I wonder if there are similar
places where this protection is missing.

Patch:

---
 kernel/bpf/devmap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 482d284a1553..484216c7454d 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -408,8 +408,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
        /* If not all frames have been transmitted, it is our
         * responsibility to free them
         */
+       xdp_set_return_frame_no_direct();
        for (i = sent; unlikely(i < to_send); i++)
                xdp_return_frame_rx_napi(bq->q[i]);
+       xdp_clear_return_frame_no_direct();
 
 out:
        bq->count = 0;
-- 
2.50.1


