Message-ID: <20260123104031.16d914e4@kernel.org>
Date: Fri, 23 Jan 2026 10:40:31 -0800
From: Jakub Kicinski <kuba@...nel.org>
To: Oleksij Rempel <o.rempel@...gutronix.de>
Cc: Mohsin Bashir <mohsin.bashr@...il.com>, netdev@...r.kernel.org,
alexanderduyck@...com, alok.a.tiwari@...cle.com, andrew+netdev@...n.ch,
andrew@...n.ch, chuck.lever@...cle.com, davem@...emloft.net,
donald.hunter@...il.com, edumazet@...gle.com, gal@...dia.com,
horms@...nel.org, idosch@...dia.com, jacob.e.keller@...el.com,
kernel-team@...a.com, kory.maincent@...tlin.com, lee@...ger.us,
pabeni@...hat.com, vadim.fedorenko@...ux.dev, kernel@...gutronix.de
Subject: Re: [PATCH net-next 0/3] net: ethtool: Track TX pause storm
On Fri, 23 Jan 2026 12:28:13 +0100 Oleksij Rempel wrote:
> Here is a TL;DR summary of my questions regarding the pause storm logic
> :)
Eh, did you get AI to help write the full version? :) So much text :)
> - Does the 500ms hardware timer reset on "flapping" pause signals? If so,
> a stuttering storm might still crash the link partner (tx watchdog
> timeout).
Yes, any discontinuity resets the timer AFAIU. Mohsin, keep me honest.
There's a conflict here between respecting user configuration (pause
enabled) vs safety of the network. We're trying to err on the side of
respecting the config. We haven't seen any stutter, yet.
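To make the stutter concern concrete, here is a minimal sketch of the timer behavior as described above (any discontinuity resets the count). It is a toy model, not the fbnic hardware logic: the function name, the 1 ms sampling step, and the sample patterns are assumptions made purely for illustration; only the 500ms threshold comes from this thread.

```python
STORM_THRESHOLD_MS = 500  # the 500ms timer mentioned in the thread


def detect_storm(pause_samples, step_ms=1, threshold_ms=STORM_THRESHOLD_MS):
    """Return the time in ms at which a storm would be flagged, or None.

    pause_samples: iterable of booleans, one per step_ms, True while
    TX pause is asserted. Hypothetical model: the counter only tracks
    *continuous* assertion and resets on any gap.
    """
    continuous_ms = 0
    for i, asserted in enumerate(pause_samples):
        if asserted:
            continuous_ms += step_ms
            if continuous_ms >= threshold_ms:
                return (i + 1) * step_ms
        else:
            # Any discontinuity resets the timer.
            continuous_ms = 0
    return None


# Solid storm: pause asserted for 600 ms straight.
solid = [True] * 600

# Stuttering storm: 499 ms asserted, 1 ms gap, repeated ten times.
# Pause is asserted 99.8% of the time, yet no run reaches the threshold.
stutter = ([True] * 499 + [False]) * 10

print(detect_storm(solid))
print(detect_storm(stutter))
```

Under this model the solid pattern is flagged once the threshold is reached, while the stuttering pattern is never flagged even though the link partner is paused almost continuously.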
> - The auto-recovery (service task) enforces a fixed policy. Can we make
> this configurable? I used devlink health (.recover) to let userspace
> decide between auto-reset or manual intervention.
There is already a tunable for this exact feature but for PFC:
ETHTOOL_PFC_PREVENTION_TOUT. Should be trivial to add the same thing for
non-PFC pause. But we didn't want to open the uAPI can of worms unless
there's a clear ask and consensus. We don't need tuning (or so we
think), and there was some talk about not adding uAPI for fbnic because
it's a "private device".
> - Should we standardize an "RX Watchdog" mechanism in the core instead of
> or in addition to driver-specific stats?
Our primary use case is a hard-wedged machine: a Linux crash, a dead
kexec, or a UEFI issue. So it must be the device that implements the
logic.
Florian was proposing a hook to auto-disable pause from the crash
notifier. It sounds like your use case is closer to that?
> - If the main case where we will run into a tx pause storm is an OS crash,
> what instance will be able to read these stats? Are they preserved on
> reboot or kexec?
Good question! I was wondering the same thing. In the end I couldn't
figure out which behavior would be less confusing. We want to make sure
that the stat never increments on a live system; if machines come out
of boot with a non-zero value, some alerting system could fire.
OTOH, as you say, we may want to know that it did happen while the
machine was out. So IDK. The fbnic implementation starts with 0.