Message-ID: <CAK3+h2xQFeVtkPb+Sr1k+E0Fre+8hi_QfWYd3ueK-2B1FgJmGA@mail.gmail.com>
Date: Mon, 1 Sep 2025 10:56:19 -0700
From: Vincent Li <vincent.mc.li@...il.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: netdev@...r.kernel.org, xdp-newbies@...r.kernel.org,
loongarch@...ts.linux.dev, Dragos Tatulea <dtatulea@...dia.com>,
Furong Xu <0x1207@...il.com>, Maxime Coquelin <mcoquelin.stm32@...il.com>,
Alexandre Torgue <alexandre.torgue@...s.st.com>, Huacai Chen <chenhuacai@...nel.org>,
Jakub Kicinski <kuba@...nel.org>, Mina Almasry <almasrymina@...gle.com>,
Philipp Stanner <phasta@...nel.org>, Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Qunqin Zhao <zhaoqunqin@...ngson.cn>, Yanteng Si <si.yanteng@...ux.dev>,
Andrew Lunn <andrew+netdev@...n.ch>
Subject: Re: [BUG?] driver stmmac reports page_pool_release_retry() stalled
pool shutdown every minute
Hi Jesper,
Thank you for your input!
On Mon, Sep 1, 2025 at 2:23 AM Jesper Dangaard Brouer <hawk@...nel.org> wrote:
>
> Hi Vincent,
>
> Thanks for reporting.
> Please see my instruction inlined below.
> Will appreciate if you reply inline below to my questions.
>
>
> On 01/09/2025 04.47, Vincent Li wrote:
> > Hi,
> >
> > I noticed once I attached a XDP program to a dwmac-loongson-pci
> > network device on a loongarch PC, the kernel logs stalled pool message
> > below every minute, it seems not to affect network traffic though. it
> > does not seem to be architecture dependent, so I decided to report
> > this to netdev and XDP mailing list in case there is a bug in stmmac
> > related network device with XDP.
> >
>
> Dragos (Cc'ed) gave a very detailed talk[1] about debugging page_pool
> leaks, that I highly recommend:
> [1]
> https://netdevconf.info/0x19/sessions/tutorial/diagnosing-page-pool-leaks.html
>
> Before doing kernel debugging with drgn, I have some easier steps, I
> want you to perform on your hardware (I cannot reproduce given I don't
> have this hardware).
I watched the video and read the slides. Running drgn would be difficult
for me because loongfire [0], the OS I am running, does not have proper
Python support. loongfire is a port of IPFire to the LoongArch
architecture. The kernel is the upstream stable release 6.15.9 plus a
backport of the LoongArch BPF trampoline, which is needed to support
xdp-tools. I run loongfire on a LoongArch PC for my home Internet
connection. I also tried to reproduce this issue on the LoongArch PC
with a Fedora desktop release and the same 6.15.9 kernel, but could not.
I am not sure whether it is only reproducible on a firewall/router-style
Linux OS with an stmmac device.
>
> First step is to check whether a socket has unprocessed packets stalled in
> its receive queue (Recv-Q). Use the command 'netstat -tapenu' and look at
> the "Recv-Q" column. If any socket/application has not emptied its Recv-Q,
> try restarting that service and see if the "stalled pool shutdown" message
> goes away.
Recv-Q shows 0 for every socket in 'netstat -tapenu':
[root@...ngfire ~]# netstat -tapenu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address    Foreign Address     State       User Inode PID/Program name
tcp        0      0 127.0.0.1:8953   0.0.0.0:*           LISTEN      0    10283 1896/unbound
tcp        0      0 0.0.0.0:53       0.0.0.0:*           LISTEN      0    10281 1896/unbound
tcp        0      0 0.0.0.0:22       0.0.0.0:*           LISTEN      0    8708  2823/sshd: /usr/sbi
tcp        0    272 192.168.9.1:22   192.168.9.13:58660  ESTABLISHED 0    8754  3004/sshd-session:
tcp6       0      0 :::81            :::*                LISTEN      0    7828  2841/httpd
tcp6       0      0 :::444           :::*                LISTEN      0    7832  2841/httpd
tcp6       0      0 :::1013          :::*                LISTEN      0    7836  2841/httpd
tcp6       0      0 10.0.0.229:444   192.168.9.13:58762  TIME_WAIT   0    0     -
udp        0      0 0.0.0.0:53       0.0.0.0:*                       0    10280 1896/unbound
udp        0      0 0.0.0.0:67       0.0.0.0:*                       0    10647 2803/dhcpd
udp        0      0 10.0.0.229:68    0.0.0.0:*                       0    8644  2659/dhcpcd: [BOOTP
udp        0      0 10.0.0.229:123   0.0.0.0:*                       0    8679  2757/ntpd
udp        0      0 192.168.9.1:123  0.0.0.0:*                       0    8678  2757/ntpd
udp        0      0 127.0.0.1:123    0.0.0.0:*                       0    8677  2757/ntpd
udp        0      0 0.0.0.0:123      0.0.0.0:*                       0    8670  2757/ntpd
udp        0      0 0.0.0.0:514      0.0.0.0:*                       0    5689  1864/syslogd
udp6       0      0 :::123           :::*                            0    8667  2757/ntpd
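
The Recv-Q check can also be scripted instead of read by eye. The following is only a minimal sketch, assuming the netstat output has been saved as text and is processed on a machine that has Python; the function name sockets_with_backlog is hypothetical and not part of any existing tool. It relies only on the second column of each tcp/udp row being Recv-Q, as in the output above.

```python
def sockets_with_backlog(netstat_text):
    """Return (proto, recv_q, local_address) for sockets with a non-zero Recv-Q.

    Expects the text printed by 'netstat -tapenu'. Header lines and
    wrapped continuation lines are ignored.
    """
    backlog = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Socket rows start with the protocol name (tcp, tcp6, udp, udp6).
        if len(fields) < 4 or not fields[0].startswith(("tcp", "udp")):
            continue
        # Column 2 is Recv-Q; skip anything that is not a plain number.
        if not fields[1].isdigit():
            continue
        recv_q = int(fields[1])
        if recv_q > 0:
            backlog.append((fields[0], recv_q, fields[3]))
    return backlog
```

An empty list means no socket is holding unread packets, which matches the output pasted above.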
> Second step is compiling the kernel with CONFIG_DEBUG_VM enabled. This will
> warn us if the driver leaked a page_pool controlled page without first
> "releasing" it correctly. See commit dba1b8a7ab68 ("mm/page_pool:
> catch page_pool memory leaks") for what the warning looks like.
> (P.S. CONFIG_DEBUG_VM has surprisingly low overhead, as long as
> you don't select any sub-options, so we choose to run with it in
> production).
>
I enabled CONFIG_DEBUG_VM and recompiled the kernel, but I see no kernel
warning about a leaked page. Could the stall message be a false positive?
[root@...ngfire ~]# grep 'CONFIG_DEBUG_VM=y' /boot/config-6.15.9-ipfire
CONFIG_DEBUG_VM=y
[root@...ngfire ~]# grep -E 'MEM_TYPE_PAGE_POOL|stalled' /var/log/kern.log
Sep 1 10:23:19 loongfire kernel: [ 7.484986] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
Sep 1 10:26:44 loongfire kernel: [ 212.514302] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
Sep 1 10:27:44 loongfire kernel: [ 272.911878] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 60 sec
Sep 1 10:28:44 loongfire kernel: [ 333.327876] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 120 sec
Sep 1 10:29:45 loongfire kernel: [ 393.743877] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 181 sec
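
A quick sanity check on these lines is to subtract the reported stall time from the kernel timestamp, which gives the uptime at which the pool shutdown began. The following is a minimal sketch in Python, meant to be run on any machine that has a copy of the log; the function name and the regular expression are illustrative assumptions, not an existing kernel tool.

```python
import re

# Matches the message format seen in the log above, e.g.
# "[ 272.911878] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 60 sec"
STALL_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] page_pool_release_retry\(\) stalled pool shutdown: "
    r"id (\d+), (\d+) inflight (\d+) sec"
)

def shutdown_start_times(log_text):
    """Return (pool_id, inflight, start_uptime_sec) for each stall message."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    results = []
    for uptime, pool_id, inflight, stalled in STALL_RE.findall(flat):
        start = float(uptime) - int(stalled)
        results.append((int(pool_id), int(inflight), start))
    return results
```

For the three stall messages above this gives a start of roughly 213 sec of uptime each time, shortly after the second "Register MEM_TYPE_PAGE_POOL" line at 212.514302. That would be consistent with the old pool being released when the RX queue was re-registered.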
> Third step is doing kernel debugging like Dragos did in [1].
>
> What kernel version are you using?
kernel 6.15.9
>
> In kernel v6.8 we (Kuba) silenced some of the cases. See commit
> be0096676e23 ("net: page_pool: mute the periodic warning for visible
> page pools").
> To Jakub/kuba can you remind us how to use the netlink tools that can
> help us inspect the page_pools active on the system?
>
>
> > xdp-filter load green0
> >
>
> Most drivers change memory model and reset the RX rings, when attaching
> XDP. So, it makes sense that the existing page_pool instances (per RXq)
> are freed and new allocated. Revealing any leaked or unprocessed
> page_pool pages.
>
>
> > Aug 31 19:19:06 loongfire kernel: [200871.855044] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
> > Aug 31 19:19:07 loongfire kernel: [200872.810587] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200399 sec
>
> It is very weird that a stall time of 200399 sec is reported. This
> indicates that the stall began *before* xdp-filter was
> attached. The uptime "200871.855044" indicates the leak happened 472 sec
> after booting this system.
>
I am not sure I pasted the previous log correctly, but the log I pasted
above this time should be correct.
> Have you seen these dmesg logs before attaching XDP?
I didn't see such a log before attaching XDP.
>
> This will help us know if this page_pool became "invisible" according to
> Kuba's change, if you run kernel >= v6.8.
>
>
> > Aug 31 19:20:07 loongfire kernel: [200933.226488] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200460 sec
> > Aug 31 19:21:08 loongfire kernel: [200993.642391]
> > page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight
> > 200520 sec
> > Aug 31 19:22:08 loongfire kernel: [201054.058292]
> > page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight
> > 200581 sec
> >
>
> Cc'ed some people that might have access to this hardware, can any of
> you reproduce?
>
> --Jesper
[0]: https://github.com/vincentmli/loongfire