Message-ID: <aXNAjOjkCqOESf5y@pengutronix.de>
Date: Fri, 23 Jan 2026 10:34:04 +0100
From: Oleksij Rempel <o.rempel@...gutronix.de>
To: Mohsin Bashir <mohsin.bashr@...il.com>
Cc: netdev@...r.kernel.org, alexanderduyck@...com, alok.a.tiwari@...cle.com,
andrew+netdev@...n.ch, andrew@...n.ch, chuck.lever@...cle.com,
davem@...emloft.net, donald.hunter@...il.com, edumazet@...gle.com,
gal@...dia.com, horms@...nel.org, idosch@...dia.com,
jacob.e.keller@...el.com, kernel-team@...a.com,
kory.maincent@...tlin.com, kuba@...nel.org, lee@...ger.us,
pabeni@...hat.com, vadim.fedorenko@...ux.dev, kernel@...gutronix.de
Subject: Re: [PATCH net-next 0/3] net: ethtool: Track TX pause storm
Hi Mohsin, hi Jakub!
On Thu, Jan 22, 2026 at 11:21:55AM -0800, Mohsin Bashir wrote:
> With TX pause enabled, if a device cannot deliver received frames to
> the stack (e.g., during a system hang), it may generate excessive pause
> frames causing a pause storm. This series updates the uAPI to track TX
> pause storm events as part of the pause stats (p1), adds pause storm
> protection support for fbnic (p2), and leverages p1 to provide
> observability into these events for fbnic (p3).
>
> Mohsin Bashir (3):
> net: ethtool: Track pause storm events
> eth: fbnic: Add protection against pause storm
> eth: fbnic: Fetch TX pause storm stats
>
Very interesting series! Before the Christmas break, I started looking into a
similar issue on different hardware (LAN78xx USB Ethernet), which has counters
but no automatic pause storm mitigation in silicon. For reference, here is the
RFC patch set I'm working on:
https://lore.kernel.org/all/20260123090741.1566469-1-o.rempel@pengutronix.de/
In my implementation, I added the detection logic to the driver but exposed it
via the devlink health interface.
I have a few questions regarding the definitions and the architectural approach
here.
- Definition of "Pause Storm"
It seems fbnic defines a storm as a continuous assertion of the pause signal
(duration-based): the code defines FBNIC_RXB_PAUSE_STORM_UNIT_WR with a 10 us
granularity and a 500 ms threshold.
Does this hardware detector handle "flapping" (a high rate of distinct pause
frames), or is it strictly for a stuck-high condition (deadlock/hang)? If the RX
queue drains slightly and fills up again every 100 ms, does the internal timer
reset? If so, a "stuttering" storm might bypass this protection; unless, of
course, the primary goal of this feature is only to keep the link partner's TX
watchdog from firing (saving the neighbor from a reset).
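
To make the "flapping" case concrete, here is a minimal sketch of a
software-side, rate-based detector. All names (my_priv, my_read_tx_pause_cnt,
MY_PAUSE_RATE_THRESH, the sampling window) are placeholders I made up for
illustration; this is not fbnic or LAN78xx code:

#include <linux/jiffies.h>
#include <linux/types.h>
#include <linux/workqueue.h>

#define MY_PAUSE_SAMPLE_MS     100     /* sampling window */
#define MY_PAUSE_RATE_THRESH   1000    /* pause frames per window */

struct my_priv {
        struct delayed_work pause_mon;
        u64 last_tx_pause;
        bool storm_active;
};

/* HW-specific TX pause frame counter read, placeholder only. */
static u64 my_read_tx_pause_cnt(struct my_priv *priv);

static void my_pause_monitor(struct work_struct *work)
{
        struct my_priv *priv = container_of(to_delayed_work(work),
                                            struct my_priv, pause_mon);
        u64 cnt = my_read_tx_pause_cnt(priv);

        /* Rate-based criterion: many distinct pause frames within a short
         * window count as a storm even if the pause signal never stays
         * asserted for 500 ms in one continuous stretch.
         */
        priv->storm_active = (cnt - priv->last_tx_pause) > MY_PAUSE_RATE_THRESH;
        priv->last_tx_pause = cnt;

        schedule_delayed_work(&priv->pause_mon,
                              msecs_to_jiffies(MY_PAUSE_SAMPLE_MS));
}

In fbnic's case something equivalent could presumably sit next to the existing
500 ms duration check, so that both a stuck-high and a stuttering storm are
covered.
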
- Policy and Tunability
The current implementation enforces an auto-recovery policy that is invisible
to userspace. By resetting the storm protection in fbnic_service_task (which
runs every 1 s), you effectively enforce a ~50% duty cycle on a hung system:
roughly 500 ms of pause assertion until the hardware breaker trips, then
dropping until the next service-task run re-arms pause generation.
Since this logic decides when to stop flow control and start dropping packets,
shouldn't these thresholds be configurable?
I understand that in a complete OS crash, software cannot intervene, and we
rely entirely on the hardware's 500ms safety breaker. However, for partial
stalls (livelocks, soft lockups) where the service task might still run, the
hardcoded "force normal" policy might not be ideal.
Furthermore, having the ability to disable auto-recovery is critical for
testing. If we want to develop/validate a driver for the link partner that
detects incoming pause storms, we need a way to generate a sustained storm from
this device without the safety breaker constantly resetting it. A "controllable
reference" is very valuable here.
In my LAN78xx work, I used the devlink health reporter's .recover callback.
This lets userspace control the policy (devlink health set ... auto_recover
true/false), i.e. decide whether the driver should auto-reset the MAC/PHY or
wait for manual intervention. Baking the "force normal" logic into the service
task removes this control.
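
To make the comparison concrete, this is roughly the shape of that approach.
It is only a sketch: the reporter name, struct layout and my_* helpers are
simplified placeholders, not the actual code from my RFC:

#include <linux/err.h>
#include <net/devlink.h>

struct my_priv {
        struct devlink_health_reporter *rx_stall_rep;
        /* ... plus the detector state from the previous sketch ... */
};

/* Driver-specific datapath/MAC/PHY reset, placeholder only. */
static int my_reset_datapath(struct my_priv *priv);

static int my_rx_stall_recover(struct devlink_health_reporter *reporter,
                               void *priv_ctx, struct netlink_ext_ack *extack)
{
        struct my_priv *priv = devlink_health_reporter_priv(reporter);

        return my_reset_datapath(priv);
}

static const struct devlink_health_reporter_ops my_rx_stall_ops = {
        .name    = "rx_stall",
        .recover = my_rx_stall_recover,
};

/* Typically called from probe. */
static int my_register_rx_stall_reporter(struct my_priv *priv,
                                         struct devlink *devlink)
{
        priv->rx_stall_rep =
                devlink_health_reporter_create(devlink, &my_rx_stall_ops,
                                               0 /* graceful period */, priv);
        return PTR_ERR_OR_ZERO(priv->rx_stall_rep);
}

/* Called from the detector once a storm/stall is recognised.  devlink only
 * invokes .recover while auto-recovery is enabled for the reporter.
 */
static void my_report_rx_stall(struct my_priv *priv)
{
        devlink_health_report(priv->rx_stall_rep,
                              "TX pause storm / RX stall detected", NULL);
}

With that in place the policy stays in userspace: "devlink health set <dev>
reporter rx_stall auto_recover false" keeps the stalled state for inspection
(or for generating a sustained storm on purpose), while "true" restores the
automatic reset.
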
- Architecture
This feels very similar to the existing ndo_tx_timeout watchdog, but for the RX
side. We are essentially trying to handle an "RX Timeout" or "RX Stall" where
the device cannot push packets to the host.
I assume the 500ms hardware threshold is specifically chosen to be well below
the standard 5-second ndo_tx_timeout of link partners, preventing a hung node
from causing watchdog resets on its neighbors. I've observed cases where
triggering a full ndo_tx_timeout reset on a stalled device leaves the hardware
in a partially reconfigured "zombie" state, unable to receive anything
thereafter.
Have you considered whether we should approach this as a central networking
mechanism (an "RX Watchdog")? I am debating this for my own driver:
- Option A: Export needed metrics/events to userspace (Devlink/Ethtool) and let
a userspace agent decide when to reset.
- Option B: Handle it entirely in the kernel (like this patch), but potentially
standardize the "RX Stall" behavior so we don't end up with different
heuristics in every driver (rough sketch below).
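
To illustrate what I mean by Option B, here is a purely hypothetical sketch of
a core-side helper. Nothing below exists today; rx_stall_watchdog,
rx_stall_check and the ndo_rx_timeout hook mentioned in the comment are
invented only to show the shape, analogous to the existing TX watchdog:

#include <linux/jiffies.h>
#include <linux/netdevice.h>

/* Per-netdev state the core could keep, analogous to the TX watchdog. */
struct rx_stall_watchdog {
        unsigned long stall_start;  /* jiffies when pause went solid */
        unsigned long timeo;        /* threshold, well below the 5 s TX timeout */
};

/* Core-side check, driven e.g. from a periodic timer or from the driver's
 * pause-storm interrupt.  Warns once per stall; this would also be the place
 * to call a standardized driver hook (a hypothetical ndo_rx_timeout) or to
 * file a devlink health report as in the sketch above.
 */
static void rx_stall_check(struct net_device *dev,
                           struct rx_stall_watchdog *wd, bool pause_asserted)
{
        if (!pause_asserted) {
                wd->stall_start = 0;
                return;
        }

        if (!wd->stall_start) {
                wd->stall_start = jiffies;
                return;
        }

        if (time_after(jiffies, wd->stall_start + wd->timeo)) {
                netdev_warn(dev, "RX stall: pause asserted for %u ms\n",
                            jiffies_to_msecs(jiffies - wd->stall_start));
                wd->stall_start = jiffies;  /* rate-limit further reports */
        }
}

Either way, standardizing the detection and reporting would keep us from
growing a different heuristic and policy in every driver.
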
Best Regards,
Oleksij
--
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |