Message-ID: <20260123141527.358506c6@kernel.org>
Date: Fri, 23 Jan 2026 14:15:27 -0800
From: Jakub Kicinski <kuba@...nel.org>
To: Oleksij Rempel <o.rempel@...gutronix.de>
Cc: Mohsin Bashir <mohsin.bashr@...il.com>, netdev@...r.kernel.org,
alexanderduyck@...com, alok.a.tiwari@...cle.com, andrew+netdev@...n.ch,
andrew@...n.ch, chuck.lever@...cle.com, davem@...emloft.net,
donald.hunter@...il.com, edumazet@...gle.com, gal@...dia.com,
horms@...nel.org, idosch@...dia.com, jacob.e.keller@...el.com,
kernel-team@...a.com, kory.maincent@...tlin.com, lee@...ger.us,
pabeni@...hat.com, vadim.fedorenko@...ux.dev
Subject: Re: [PATCH net-next 1/3] net: ethtool: Track pause storm events
On Fri, 23 Jan 2026 22:27:19 +0100 Oleksij Rempel wrote:
> > + -
> > + name: tx-pause-storm-events
> > + type: u64
> > + doc: >-
> > + TX pause storm event count. Increments each time device
> > + detects that its pause assertion condition has been true
> > + for too long for normal operation. As a result, the device
> > + has temporarily disabled its own Pause TX function to
> > + protect the network from itself.
> > + This counter should never increment under normal overload
> > + conditions; it indicates catastrophic failure like an OS
> > + crash. The rate of incrementing is implementation specific.
>
> Hm, we already have the tx pause frame counters. So, the anomaly is
> visible to the user anyway (even if it isn't explicitly labeled as an
> anomaly).
We are trying to prove a negative here, that's why we need a new
counter. As the doc says, this counter should stay at zero under
normal conditions, which is how we show that a storm was never
actually detected. Another thing to keep in mind is that we're talking
about metric collection at scale, so sampling every 1 to 5 minutes.
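
To illustrate the point about proving a negative, here is a minimal
sketch of what a collector could do with two samples taken a few minutes
apart. The Sample type, its field names and storm_detected() are
hypothetical and not part of any existing tool; only the counter name
comes from the patch. Counter wrap is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One periodic reading of a port's counters (hypothetical layout)."""
    tx_pause_frames: int
    tx_pause_storm_events: int


def storm_detected(prev: Sample, cur: Sample) -> bool:
    """True if the device reported at least one storm event in the interval."""
    return cur.tx_pause_storm_events - prev.tx_pause_storm_events > 0


# Three made-up samples, a few minutes apart.
idle = Sample(tx_pause_frames=1_000, tx_pause_storm_events=0)
busy = Sample(tx_pause_frames=250_000, tx_pause_storm_events=0)
stuck = Sample(tx_pause_frames=500_000, tx_pause_storm_events=1)

# Both intervals sent roughly 250k pause frames, so the frame counter
# alone cannot tell ordinary overload apart from a storm.
busy_interval = storm_detected(idle, busy)
stuck_interval = storm_detected(busy, stuck)
print(busy_interval, stuck_interval)
```

With a dedicated counter the check is a plain "did it move", with no
per-device threshold on the pause frame rate.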
> What is not visible to the user is when HW or SW disables flow control.
> Maybe that is what the counter should represent and be named? Would
> tx-pause-auto-disabled-events make sense?
According to our existing uAPI for PFC, "pause storm" is the term of art.
> The reason I do not like tx-pause-storm-events is that the meaning is
> device specific; the user has to read the device manual to know what it
> actually means.
>
> tx-pause-auto-disabled-events can be reused in more cases - every time
> we try to pause flow control for some reason.
TBH I feel like you may be overestimating your ability to do anything
like that in SW here. The silicon can detect this with cycle accuracy:
FIFO pressure that is never relieved. In SW you have to poll, and if you
can poll, why not just read the packets from the FIFO and let the pipe move?
On the "device manual" point, pause frame counts as an estimate of
congestion are also quite useless for comparing device to device. You
have to "read the manual" there, too. Different devices use different
pause quanta, so to speak.
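
As a rough illustration of why raw pause frame counts do not compare
across devices: in IEEE 802.3 one pause quantum is 512 bit times, so the
time requested by a frame depends on the quanta value the device puts in
the frame and on the link speed. The sketch below assumes only that
definition; the function names and the example quanta values are made up
for illustration.

```python
PAUSE_QUANTUM_BITS = 512  # one pause quantum is 512 bit times (IEEE 802.3)


def pause_seconds(quanta: int, link_bps: float) -> float:
    """Pause time requested by a single pause frame, in seconds."""
    return quanta * PAUSE_QUANTUM_BITS / link_bps


def total_pause_seconds(frames: int, quanta: int, link_bps: float) -> float:
    """Upper bound on requested pause time for a number of frames.

    It is only an upper bound because a later frame replaces the
    remaining time of an earlier one instead of adding to it.
    """
    return frames * pause_seconds(quanta, link_bps)


# Two hypothetical 10 Gb/s devices that both report 1000 TX pause frames.
dev_a = total_pause_seconds(1000, 0xFFFF, 10e9)  # always asks for the maximum
dev_b = total_pause_seconds(1000, 0x0100, 10e9)  # asks for short pauses
print(f"device A: up to {dev_a:.4f} s, device B: up to {dev_b:.4f} s")
```

Same frame count, but the requested pause time differs by a factor of
about 256, which is why the count needs the manual to interpret.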