netdev - Re: [BUG?] driver stmmac reports page_pool_release_retry() stalled pool shutdown every minute

Message-ID: <CAK3+h2xQFeVtkPb+Sr1k+E0Fre+8hi_QfWYd3ueK-2B1FgJmGA@mail.gmail.com>
Date: Mon, 1 Sep 2025 10:56:19 -0700
From: Vincent Li <vincent.mc.li@...il.com>
To: Jesper Dangaard Brouer <hawk@...nel.org>
Cc: netdev@...r.kernel.org, xdp-newbies@...r.kernel.org, 
	loongarch@...ts.linux.dev, Dragos Tatulea <dtatulea@...dia.com>, 
	Furong Xu <0x1207@...il.com>, Maxime Coquelin <mcoquelin.stm32@...il.com>, 
	Alexandre Torgue <alexandre.torgue@...s.st.com>, Huacai Chen <chenhuacai@...nel.org>, 
	Jakub Kicinski <kuba@...nel.org>, Mina Almasry <almasrymina@...gle.com>, 
	Philipp Stanner <phasta@...nel.org>, Ilias Apalodimas <ilias.apalodimas@...aro.org>, 
	Qunqin Zhao <zhaoqunqin@...ngson.cn>, Yanteng Si <si.yanteng@...ux.dev>, 
	Andrew Lunn <andrew+netdev@...n.ch>
Subject: Re: [BUG?] driver stmmac reports page_pool_release_retry() stalled
 pool shutdown every minute

Hi Jesper,

Thank you for your input!

On Mon, Sep 1, 2025 at 2:23 AM Jesper Dangaard Brouer <hawk@...nel.org> wrote:
>
> Hi Vincent,
>
> Thanks for reporting.
> Please see my instruction inlined below.
> Will appreciate if you reply inline below to my questions.
>
>
> On 01/09/2025 04.47, Vincent Li wrote:
> > Hi,
> >
> > I noticed that once I attached an XDP program to a dwmac-loongson-pci
> > network device on a LoongArch PC, the kernel logs the stalled pool message
> > below every minute. It does not seem to affect network traffic, though. It
> > does not seem to be architecture dependent, so I decided to report
> > this to the netdev and XDP mailing lists in case there is a bug in stmmac
> > related network devices with XDP.
> >
>
> Dragos (Cc'ed) gave a very detailed talk[1] about debugging page_pool
> leaks, that I highly recommend:
>   [1]
> https://netdevconf.info/0x19/sessions/tutorial/diagnosing-page-pool-leaks.html
>
> Before doing kernel debugging with drgn, I have some easier steps, I
> want you to perform on your hardware (I cannot reproduce given I don't
> have this hardware).

I watched the video and read the slides. I would have difficulty running
drgn because the loongfire OS [0] I am running lacks proper Python
support. loongfire is a port of IPFire to the LoongArch architecture. The
kernel is the upstream stable release 6.15.9 with a backport of the
LoongArch BPF trampoline to support xdp-tools. I run loongfire on a
LoongArch PC for my home Internet connection. I also tried to reproduce
this on the LoongArch PC running a Fedora desktop release with the same
6.15.9 kernel, but could not. I am not sure whether the issue is only
reproducible on a firewall/router-like Linux OS with an stmmac device.

>
> First step is to check if a socket has unprocessed packets stalled in
> its receive-queue (Recv-Q).  Use command 'netstat -tapenu' and look at
> column "Recv-Q".  If any socket/application has not emptied its Recv-Q
> try to restart this service and see if the "stalled pool shutdown" goes
> away.

Recv-Q shows 0 for every socket in the 'netstat -tapenu' output:

[root@...ngfire ~]# netstat -tapenu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name
tcp        0      0 127.0.0.1:8953          0.0.0.0:*               LISTEN      0          10283      1896/unbound
tcp        0      0 0.0.0.0:53              0.0.0.0:*               LISTEN      0          10281      1896/unbound
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      0          8708       2823/sshd: /usr/sbi
tcp        0    272 192.168.9.1:22          192.168.9.13:58660      ESTABLISHED 0          8754       3004/sshd-session:
tcp6       0      0 :::81                   :::*                    LISTEN      0          7828       2841/httpd
tcp6       0      0 :::444                  :::*                    LISTEN      0          7832       2841/httpd
tcp6       0      0 :::1013                 :::*                    LISTEN      0          7836       2841/httpd
tcp6       0      0 10.0.0.229:444          192.168.9.13:58762      TIME_WAIT   0          0          -
udp        0      0 0.0.0.0:53              0.0.0.0:*                           0          10280      1896/unbound
udp        0      0 0.0.0.0:67              0.0.0.0:*                           0          10647      2803/dhcpd
udp        0      0 10.0.0.229:68           0.0.0.0:*                           0          8644       2659/dhcpcd: [BOOTP
udp        0      0 10.0.0.229:123          0.0.0.0:*                           0          8679       2757/ntpd
udp        0      0 192.168.9.1:123         0.0.0.0:*                           0          8678       2757/ntpd
udp        0      0 127.0.0.1:123           0.0.0.0:*                           0          8677       2757/ntpd
udp        0      0 0.0.0.0:123             0.0.0.0:*                           0          8670       2757/ntpd
udp        0      0 0.0.0.0:514             0.0.0.0:*                           0          5689       1864/syslogd
udp6       0      0 :::123                  :::*                                0          8667       2757/ntpd
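
On a box with more sockets, the Recv-Q check could be automated with a
small filter along these lines. This is only a sketch: the function name
nonzero_recvq is made up for illustration, and it assumes the usual
'netstat -tapenu' layout where Recv-Q is the second column.

```shell
# Hypothetical helper (not part of any tool): print only the sockets whose
# Recv-Q (second column) is non-zero. Reads netstat-style output on stdin.
nonzero_recvq() {
    # Keep lines that start with a protocol name and carry a numeric,
    # non-zero second field; header and wrapped continuation lines are skipped.
    awk '$1 ~ /^(tcp|udp)/ && $2 ~ /^[0-9]+$/ && $2 > 0'
}

# Usage (empty output means no socket has unread data queued):
#   netstat -tapenu | nonzero_recvq
```

For the output above this prints nothing, since every Recv-Q is 0.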

> Second step is compiling the kernel with CONFIG_DEBUG_VM enabled. This will
> warn us if the driver leaked a page_pool controlled page, without
> first "releasing" it correctly.  See commit dba1b8a7ab68 ("mm/page_pool:
> catch page_pool memory leaks") for what the warning will look like.
>   (p.s. CONFIG_DEBUG_VM has surprisingly low overhead, as long as
> you don't select any sub-options, so we choose to run with this in
> production).
>

I added CONFIG_DEBUG_VM and recompiled the kernel, but there is no kernel
warning about a page leak. Could the stall message be a false positive?

[root@...ngfire ~]# grep 'CONFIG_DEBUG_VM=y' /boot/config-6.15.9-ipfire

CONFIG_DEBUG_VM=y

[root@...ngfire ~]# grep -E 'MEM_TYPE_PAGE_POOL|stalled' /var/log/kern.log

Sep  1 10:23:19 loongfire kernel: [    7.484986] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
Sep  1 10:26:44 loongfire kernel: [  212.514302] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
Sep  1 10:27:44 loongfire kernel: [  272.911878] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 60 sec
Sep  1 10:28:44 loongfire kernel: [  333.327876] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 120 sec
Sep  1 10:29:45 loongfire kernel: [  393.743877] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 181 sec
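
To follow how the stall progresses over time, the warnings can be boiled
down to pool id, inflight count and stall time. A rough sketch, assuming
the message format shown above; the helper name stalled_summary is
hypothetical, and it needs a grep that supports -o (GNU and busybox do):

```shell
# Hypothetical helper: summarize "stalled pool shutdown" warnings read on stdin.
stalled_summary() {
    # Join lines first so messages wrapped by a mail client still match,
    # then pull out the "id N, M inflight S sec" part of each warning.
    tr '\n' ' ' |
        grep -oE 'id [0-9]+, [0-9]+ inflight [0-9]+ sec' |
        awk '{ gsub(",", "", $2); print "pool=" $2, "inflight=" $3, "stalled_sec=" $5 }'
}

# Usage:
#   grep stalled /var/log/kern.log | stalled_summary
```

For the log above this gives one line per warning, each for pool 9 with 1
page inflight, at 60, 120 and 181 seconds.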

> Third step is doing kernel debugging like Dragos did in [1].
>
> What kernel version are you using?

kernel 6.15.9

>
> In kernel v6.8 we (Kuba) silenced some of the cases.  See commit
> be0096676e23 ("net: page_pool: mute the periodic warning for visible
> page pools").
> To Jakub/kuba can you remind us how to use the netlink tools that can
> help us inspect the page_pools active on the system?
>
>
> > xdp-filter load green0
> >
>
> Most drivers change memory model and reset the RX rings, when attaching
> XDP.  So, it makes sense that the existing page_pool instances (per RXq)
> are freed and new ones allocated, revealing any leaked or unprocessed
> page_pool pages.
>
>
> > Aug 31 19:19:06 loongfire kernel: [200871.855044] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
> > Aug 31 19:19:07 loongfire kernel: [200872.810587] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200399 sec
>
> It is very weird that a stall time of 200399 sec is reported. This
> indicates that this has been happening *before* the xdp-filter was
> attached. The uptime "200871.855044" indicates the leak happened 472 sec
> after booting this system.
>

I am not sure whether I pasted the previous log correctly, but the log
pasted in this message should be correct.
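
As a sanity check, the same subtraction can be applied to both logs. This
is only a sketch: shutdown_start is a made-up helper name, and it assumes
the reported stall time is counted from the moment the pool shutdown began.

```shell
# Hypothetical helper: uptime (in seconds) at which the pool shutdown began,
# given the uptime of a warning and the stall time that warning reports.
shutdown_start() {
    awk -v up="$1" -v stall="$2" 'BEGIN { printf "%.1f\n", up - stall }'
}

shutdown_start 272.911878 60          # first warning in the new log
shutdown_start 200872.810587 200399   # first warning in the earlier log
```

The new log gives about 212.9 s, which is right after the second "Register
MEM_TYPE_PAGE_POOL" message at 212.5 s, so under that assumption the stall
in this boot started when XDP was attached. The earlier log gives about
473.8 s, close to the estimate quoted above.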

> Have you seen these dmesg logs before attaching XDP?

I didn't see such a log before attaching XDP.

>
> This will help us know if this page_pool became "invisible" according to
> Kuba's change, if you run kernel >= v6.8.
>
>
> > Aug 31 19:20:07 loongfire kernel: [200933.226488] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200460 sec
> > Aug 31 19:21:08 loongfire kernel: [200993.642391] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200520 sec
> > Aug 31 19:22:08 loongfire kernel: [201054.058292] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200581 sec
> >
>
> Cc'ed some people that might have access to this hardware, can any of
> you reproduce?
>
> --Jesper

[0]: https://github.com/vincentmli/loongfire