Message-ID: <0b697d49-adbb-48d5-bbfa-f90c79fb3a4d@kernel.org>
Date: Mon, 1 Sep 2025 11:23:03 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Vincent Li <vincent.mc.li@...il.com>, netdev@...r.kernel.org,
 xdp-newbies@...r.kernel.org, loongarch@...ts.linux.dev,
 Dragos Tatulea <dtatulea@...dia.com>, Furong Xu <0x1207@...il.com>
Cc: Maxime Coquelin <mcoquelin.stm32@...il.com>,
 Alexandre Torgue <alexandre.torgue@...s.st.com>,
 Huacai Chen <chenhuacai@...nel.org>, Jakub Kicinski <kuba@...nel.org>,
 Mina Almasry <almasrymina@...gle.com>, Philipp Stanner <phasta@...nel.org>,
 Ilias Apalodimas <ilias.apalodimas@...aro.org>,
 Qunqin Zhao <zhaoqunqin@...ngson.cn>, Yanteng Si <si.yanteng@...ux.dev>,
 Andrew Lunn <andrew+netdev@...n.ch>
Subject: Re: [BUG?] driver stmmac reports page_pool_release_retry() stalled
 pool shutdown every minute

Hi Vincent,

Thanks for reporting.
Please see my instructions inline below.
I would appreciate it if you reply inline to my questions.


On 01/09/2025 04.47, Vincent Li wrote:
> Hi,
> 
> I noticed once I attached a XDP program to a dwmac-loongson-pci
> network device on a loongarch PC, the kernel logs stalled pool message
> below every minute, it seems  not to affect network traffic though. it
> does not seem to be architecture dependent, so I decided to report
> this to netdev and XDP mailing list in case there is a bug in stmmac
> related network device with XDP.
> 

Dragos (Cc'ed) gave a very detailed talk[1] about debugging page_pool
leaks, which I highly recommend:
  [1] https://netdevconf.info/0x19/sessions/tutorial/diagnosing-page-pool-leaks.html

Before doing kernel debugging with drgn, there are some easier steps I
would like you to perform on your hardware (I cannot reproduce this, as
I don't have this hardware).

First step is to check whether a socket has unprocessed packets stalled
in its receive queue (Recv-Q).  Use the command 'netstat -tapenu' and look
at the column "Recv-Q".  If any socket/application has not emptied its
Recv-Q, try restarting that service and see if the "stalled pool
shutdown" message goes away.
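
If netstat is not installed, the same check can be sketched in Python.
This is only a rough sketch under my own assumptions: it reads the
/proc/net/{tcp,tcp6,udp,udp6} tables and reports rows with a non-zero
rx_queue field. The helper name parse_recvq() is made up for
illustration, and the value means different things per socket type
(unread bytes for connected TCP, backlog for listeners, allocated
receive memory for UDP).

```python
import os

# /proc tables that expose a tx_queue:rx_queue column per socket.
PROC_FILES = ("/proc/net/tcp", "/proc/net/tcp6",
              "/proc/net/udp", "/proc/net/udp6")


def parse_recvq(text):
    """Return (local_addr_hex, local_port, rx_queue, inode) for every
    socket row whose rx_queue is non-zero. parse_recvq is a hypothetical
    helper name, not an existing tool."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with "<slot>:"; the header row does not.
        if len(fields) < 10 or not fields[0].endswith(":"):
            continue
        addr_hex, port_hex = fields[1].rsplit(":", 1)
        rx_queue = int(fields[4].split(":")[1], 16)  # "tx:rx" in hex
        if rx_queue > 0:
            rows.append((addr_hex, int(port_hex, 16), rx_queue,
                         int(fields[9])))
    return rows


if __name__ == "__main__":
    for path in PROC_FILES:
        if not os.path.exists(path):
            continue
        with open(path) as f:
            for addr, port, rxq, inode in parse_recvq(f.read()):
                print(f"{path}: local={addr}:{port} "
                      f"rx_queue={rxq} inode={inode}")
```

The inode can be matched against 'ls -l /proc/<pid>/fd' to find the
owning process.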

Second step is compiling the kernel with CONFIG_DEBUG_VM enabled. This
will warn us if the driver leaked a page_pool controlled page without
first "releasing" it correctly.  See commit dba1b8a7ab68 ("mm/page_pool:
catch page_pool memory leaks") for what the warning will look like.
  (p.s. CONFIG_DEBUG_VM has surprisingly low overhead, as long as
you don't select any sub-options, so we choose to run with this in
production).
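
To see whether the running kernel already has this enabled, something
like the sketch below should work. It assumes the config is available
either as /proc/config.gz (needs CONFIG_IKCONFIG_PROC) or as
/boot/config-<release>, which depends on the distro. The function names
are made up for illustration.

```python
import gzip
import os
import platform


def debug_vm_options(config_text):
    """Return {option: value} for CONFIG_DEBUG_VM and any
    CONFIG_DEBUG_VM_* sub-option that is set in the config text."""
    enabled = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skips "# CONFIG_FOO is not set" lines
        name, value = line.split("=", 1)
        if name == "CONFIG_DEBUG_VM" or name.startswith("CONFIG_DEBUG_VM_"):
            enabled[name] = value
    return enabled


def read_kernel_config():
    """Return the running kernel's config text, or None if not found.
    Both locations are assumptions about how the kernel was packaged."""
    if os.path.exists("/proc/config.gz"):
        with gzip.open("/proc/config.gz", "rt") as f:
            return f.read()
    path = "/boot/config-" + platform.release()
    if os.path.exists(path):
        with open(path) as f:
            return f.read()
    return None


if __name__ == "__main__":
    text = read_kernel_config()
    if text is None:
        print("kernel config not found")
    else:
        print(debug_vm_options(text) or "CONFIG_DEBUG_VM is not set")
```

If only CONFIG_DEBUG_VM=y shows up, without any CONFIG_DEBUG_VM_*
entries, that matches the low-overhead setup described above.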

Third step is doing kernel debugging like Dragos did in [1].

What kernel version are you using?

In kernel v6.8 we (Kuba) silenced some of the cases.  See commit
be0096676e23 ("net: page_pool: mute the periodic warning for visible
page pools").
Jakub (Kuba), can you remind us how to use the netlink tools that can
help us inspect the page_pools active on the system?


> xdp-filter load green0
> 

Most drivers change the memory model and reset the RX rings when
attaching XDP.  So, it makes sense that the existing page_pool instances
(per RXq) are freed and new ones allocated, which reveals any leaked or
unprocessed page_pool pages.


> Aug 31 19:19:06 loongfire kernel: [200871.855044] dwmac-loongson-pci 0000:00:03.0 green0: Register MEM_TYPE_PAGE_POOL RxQ-0
> Aug 31 19:19:07 loongfire kernel: [200872.810587] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200399 sec

It is very weird that a stall time of 200399 sec is reported. This
indicates that this has been happening *before* the xdp-filter was
attached. The uptime "200871.855044" indicates that the leak happened
about 472 sec after booting this system.
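
The arithmetic behind that estimate can be sketched as below: subtract
the reported stall time from the dmesg timestamp of the same message.
The regex and the stall_start() helper are my own illustration, written
only against the log format quoted in this thread. Using the timestamp
of the stall message itself gives roughly 473 sec, which agrees with the
estimate above to within a second or so.

```python
import re

# Matches the message format quoted in this thread, e.g.
# "[200872.810587] page_pool_release_retry() stalled pool shutdown:
#  id 9, 1 inflight 200399 sec"
STALL_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+page_pool_release_retry\(\) "
    r"stalled pool shutdown: id (?P<id>\d+), "
    r"(?P<inflight>\d+) inflight (?P<sec>\d+) sec"
)


def stall_start(line):
    """Return (pool_id, inflight, uptime_when_stall_began) for one log
    line, or None if the line is not a stall message. stall_start is a
    hypothetical helper name."""
    m = STALL_RE.search(line)
    if m is None:
        return None
    began = float(m["uptime"]) - int(m["sec"])
    return int(m["id"]), int(m["inflight"]), began


if __name__ == "__main__":
    line = ("Aug 31 19:19:07 loongfire kernel: [200872.810587] "
            "page_pool_release_retry() stalled pool shutdown: "
            "id 9, 1 inflight 200399 sec")
    pool_id, inflight, began = stall_start(line)
    print(f"pool id {pool_id}: {inflight} page(s) inflight, "
          f"stall began ~{began:.0f} sec after boot")
```

Running this over all the stall lines should give nearly the same start
time for pool id 9 each time, if it really is one long-running stall.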

Have you seen these dmesg logs before attaching XDP?

This will help us know if this page_pool became "invisible" according to
Kuba's change, if you run kernel >= v6.8.


> Aug 31 19:20:07 loongfire kernel: [200933.226488] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200460 sec
> Aug 31 19:21:08 loongfire kernel: [200993.642391] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200520 sec
> Aug 31 19:22:08 loongfire kernel: [201054.058292] page_pool_release_retry() stalled pool shutdown: id 9, 1 inflight 200581 sec
> 

I have Cc'ed some people who might have access to this hardware. Can any
of you reproduce this?

--Jesper
