Message-ID: <15c40e1a-4207-4de5-9a22-e991368310f3@linux.dev>
Date: Wed, 21 Jan 2026 10:17:51 +0800
From: Leon Hwang <leon.hwang@...ux.dev>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, Jesper Dangaard Brouer <hawk@...nel.org>,
 Ilias Apalodimas <ilias.apalodimas@...aro.org>,
 Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu
 <mhiramat@...nel.org>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
 "David S . Miller" <davem@...emloft.net>, Eric Dumazet
 <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
 Simon Horman <horms@...nel.org>, kerneljasonxing@...il.com,
 lance.yang@...ux.dev, jiayuan.chen@...ux.dev, linux-kernel@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org, Leon Huang Fu <leon.huangfu@...pee.com>
Subject: Re: [PATCH net-next v4] page_pool: Add page_pool_release_stalled
 tracepoint



On 21/1/26 07:29, Jakub Kicinski wrote:
> On Tue, 20 Jan 2026 11:16:20 +0800 Leon Hwang wrote:
>> I encountered the 'pr_warn()' messages during Mellanox NIC flapping on a
>> system using the 'mlx5_core' driver (kernel 6.6). The root cause turned
>> out to be an application-level issue: the IBM/sarama “Client SeekBroker
>> Connection Leak” [1].
> 
> The scenario you are describing matches the situations we run into 
> at Meta. With the upstream kernel you can find that the pages are
> leaking based on stats, and if you care use drgn to locate them
> (in the recv queue).
> 

Thanks, that makes sense.

drgn indeed sounds helpful for locating the pages once it is confirmed
that the inflight pages are being held by the socket receive queue.

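For the archive, this is roughly the kind of drgn sketch I had in
mind (untested; 'SK_ADDR' is a placeholder for a 'struct sock *'
address obtained separately, and the field layout is from my reading
of the 6.6 sources):

  # drgn sketch: flag page-pool pages sitting in a socket's
  # receive queue. Run as 'drgn script.py'; drgn provides the
  # 'prog' global. Assumes a 6.6-era layout (skb_frag_t is a
  # struct bio_vec) and a 64-bit build where skb_shinfo() lives
  # at skb->head + skb->end.
  from drgn import Object, cast

  SK_ADDR = 0xffff888100000000  # placeholder 'struct sock *' address

  # PP_SIGNATURE is POISON_POINTER_DELTA + 0x40; with the default
  # x86_64 CONFIG_ILLEGAL_POINTER_VALUE that is 0xdead000000000040.
  PP_SIGNATURE = 0xdead000000000040

  sk = Object(prog, "struct sock *", SK_ADDR)
  head = sk.sk_receive_queue.address_of_().value_()

  skb = sk.sk_receive_queue.next
  while skb.value_() != head:
      shinfo = cast("struct skb_shared_info *", skb.head + skb.end)
      for i in range(shinfo.nr_frags):
          page = shinfo.frags[i].bv_page
          # Same pp_magic test the kernel applies to decide whether
          # a page belongs to a page pool.
          if (page.pp_magic.value_() & ~0x3) == PP_SIGNATURE:
              print(f"skb {skb.value_():#x} frag {i}: pp page "
                    f"{page.value_():#x} pool {page.pp.value_():#x}")
      skb = skb.next
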
Before reaching that point, however, it was quite difficult to pinpoint
where those inflight pages were stuck. I was wondering whether there is
any other handy tool or method to help locate them earlier.

> The 6.6 kernel did not have page pool stats. I feel quite odd about
> adding more uAPI because someone is running a 2+ years old kernel 
> and doesn't have access to the already existing facilities.

After checking the code again, I realized that the 6.6 kernel does
have page pool stats support.

Unfortunately, CONFIG_PAGE_POOL_STATS was not enabled in our Shopee
deployment, which is why those facilities were not available to us.

In any case, I understand your concern. I won’t pursue adding this
tracepoint further if it’s not something you’d like to see upstream.

Thanks,
Leon

