Message-ID: <15c40e1a-4207-4de5-9a22-e991368310f3@linux.dev>
Date: Wed, 21 Jan 2026 10:17:51 +0800
From: Leon Hwang <leon.hwang@...ux.dev>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, Jesper Dangaard Brouer <hawk@...nel.org>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu
<mhiramat@...nel.org>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
"David S . Miller" <davem@...emloft.net>, Eric Dumazet
<edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>, kerneljasonxing@...il.com,
lance.yang@...ux.dev, jiayuan.chen@...ux.dev, linux-kernel@...r.kernel.org,
linux-trace-kernel@...r.kernel.org, Leon Huang Fu <leon.huangfu@...pee.com>
Subject: Re: [PATCH net-next v4] page_pool: Add page_pool_release_stalled tracepoint

On 21/1/26 07:29, Jakub Kicinski wrote:
> On Tue, 20 Jan 2026 11:16:20 +0800 Leon Hwang wrote:
>> I encountered the 'pr_warn()' messages during Mellanox NIC flapping on a
>> system using the 'mlx5_core' driver (kernel 6.6). The root cause turned
>> out to be an application-level issue: the IBM/sarama “Client SeekBroker
>> Connection Leak” [1].
>
> The scenario you are describing matches the situations we run into
> at Meta. With the upstream kernel you can find that the pages are
> leaking based on stats, and if you care use drgn to locate them
> (in the recv queue).
>
Thanks, that makes sense.

drgn indeed sounds helpful for locating the pages once it is confirmed
that the inflight pages are being held by the socket receive queue.

Before reaching that point, however, it was quite difficult to pinpoint
where those inflight pages were stuck. I was wondering whether there is
any other handy tool or method to help locate them earlier.

> The 6.6 kernel did not have page pool stats. I feel quite odd about
> adding more uAPI because someone is running a 2+ years old kernel
> and doesn't have access to the already existing facilities.
After checking the code again, I realized that the 6.6 kernel does
have page pool stats support.
Unfortunately, CONFIG_PAGE_POOL_STATS was not enabled in our Shopee
deployment, which is why those facilities were not available to us.
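
For reference, a minimal sketch of how one might check whether the option is enabled on a given host. It assumes the kernel config is exposed at /proc/config.gz (which requires CONFIG_IKCONFIG_PROC) or at /boot/config-<release>, as many distributions ship it. The helper names config_option_state and read_kernel_config are hypothetical, chosen only for illustration.

```python
import gzip
import os
import platform


def config_option_state(config_text, option):
    """Return the value of a kernel config option ('y', 'm', ...).

    Returns None when the option is absent or appears only as
    '# CONFIG_FOO is not set'.
    """
    prefix = option + "="
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


def read_kernel_config():
    """Return the running kernel's config text, or None if not found.

    Assumption: the config lives in one of the two common locations below.
    """
    if os.path.exists("/proc/config.gz"):
        with gzip.open("/proc/config.gz", "rt") as f:
            return f.read()
    path = "/boot/config-" + platform.release()
    if os.path.exists(path):
        with open(path) as f:
            return f.read()
    return None


if __name__ == "__main__":
    text = read_kernel_config()
    if text is None:
        print("kernel config not found in the assumed locations")
    else:
        state = config_option_state(text, "CONFIG_PAGE_POOL_STATS")
        print("CONFIG_PAGE_POOL_STATS =", state)
```

A result of None means the option is either not set or not present in that kernel's config at all.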

In any case, I understand your concern. I won’t pursue adding this
tracepoint further if it’s not something you’d like to see upstream.

Thanks,
Leon