Message-ID: <20250718174737.1d1177cd@kernel.org>
Date: Fri, 18 Jul 2025 17:47:37 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Tariq Toukan <tariqt@...dia.com>
Cc: Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller"
<davem@...emloft.net>, Jiri Pirko <jiri@...nulli.us>, Jiri Pirko
<jiri@...dia.com>, Saeed Mahameed <saeed@...nel.org>, Gal Pressman
<gal@...dia.com>, "Leon Romanovsky" <leon@...nel.org>, Shahar Shitrit
<shshitrit@...dia.com>, "Donald Hunter" <donald.hunter@...il.com>, Jonathan
Corbet <corbet@....net>, "Brett Creeley" <brett.creeley@....com>, Michael
Chan <michael.chan@...adcom.com>, Pavan Chebbi <pavan.chebbi@...adcom.com>,
Cai Huoqing <cai.huoqing@...ux.dev>, Tony Nguyen
<anthony.l.nguyen@...el.com>, "Przemek Kitszel"
<przemyslaw.kitszel@...el.com>, Sunil Goutham <sgoutham@...vell.com>, Linu
Cherian <lcherian@...vell.com>, Geetha sowjanya <gakula@...vell.com>, Jerin
Jacob <jerinj@...vell.com>, hariprasad <hkelam@...vell.com>, "Subbaraya
Sundeep" <sbhatta@...vell.com>, Saeed Mahameed <saeedm@...dia.com>, Mark
Bloch <mbloch@...dia.com>, Ido Schimmel <idosch@...dia.com>, Petr Machata
<petrm@...dia.com>, Manish Chopra <manishc@...vell.com>,
<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<linux-doc@...r.kernel.org>, <intel-wired-lan@...ts.osuosl.org>,
<linux-rdma@...r.kernel.org>
Subject: Re: [PATCH net-next 0/5] Expose grace period delay for devlink
health reporter
On Thu, 17 Jul 2025 19:07:17 +0300 Tariq Toukan wrote:
> Currently, the devlink health reporter initiates the grace period
> immediately after recovering an error, which blocks further recovery
> attempts until the grace period concludes. Since additional errors
> are not generally expected during this short interval, any new error
> reported during the grace period is not only rejected but also causes
> the reporter to enter an error state that requires manual intervention.
>
> This approach poses a problem in scenarios where a single root cause
> triggers multiple related errors in quick succession - for example,
> a PCI issue affecting multiple hardware queues. Because these errors
> are closely related and occur rapidly, it is more effective to handle
> them together rather than handling only the first one reported and
> blocking any subsequent recovery attempts. Furthermore, setting the
> reporter to an error state in this context can be misleading, as these
> multiple errors are manifestations of a single underlying issue, making
> it unlike the general case where additional errors are not expected
> during the grace period.
>
> To resolve this, introduce a configurable grace period delay attribute
> to the devlink health reporter. This delay starts when the first error
> is recovered and lasts for a user-defined duration. Once this grace
> period delay expires, the actual grace period begins. After the grace
> period ends, a new reported error will start the same flow again.
>
> Timeline summary:
>
> ----|--------|------------------------------/----------------------/--
> error is  error is      grace period delay          grace period
> reported  recovered    (recoveries allowed)     (recoveries blocked)
>
> With grace period delay, create a time window during which recovery
> attempts are permitted, allowing all reported errors to be handled
> sequentially before the grace period starts. Once the grace period
> begins, it prevents any further error recoveries until it ends.
We are rate limiting recoveries; the "networking solution" to the
problem you're describing would be to introduce a burst size,
some kind of poor man's token bucket filter.
Could you say more about what designs were considered and why this
one was chosen?
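
For reference, the burst-size idea could look roughly like the sketch
below. This is only an illustration, not devlink code: the struct, the
function names and the millisecond timestamps are hypothetical, and the
caller is assumed to pass a monotonic clock. One token is earned per
period_ms, up to burst tokens, and each recovery spends one token.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical token bucket for rate limiting recoveries. */
struct recovery_bucket {
	uint64_t period_ms;      /* time needed to earn one token */
	uint32_t burst;          /* maximum number of stored tokens */
	uint32_t tokens;         /* recoveries currently allowed */
	uint64_t last_refill_ms; /* time up to which refills are accounted */
};

static void recovery_bucket_init(struct recovery_bucket *b,
				 uint64_t period_ms, uint32_t burst,
				 uint64_t now_ms)
{
	b->period_ms = period_ms;
	b->burst = burst;
	b->tokens = burst;       /* start full: a first burst is allowed */
	b->last_refill_ms = now_ms;
}

/* Returns true and spends a token if a recovery may run at now_ms.
 * Assumes now_ms never goes backwards.
 */
static bool recovery_bucket_allow(struct recovery_bucket *b, uint64_t now_ms)
{
	if (b->period_ms == 0)
		return true;     /* assumption: 0 means "no rate limit" */

	if (b->tokens < b->burst) {
		uint64_t earned = (now_ms - b->last_refill_ms) / b->period_ms;

		if (earned >= b->burst - b->tokens) {
			/* Bucket is full again; extra idle time is not banked. */
			b->tokens = b->burst;
			b->last_refill_ms = now_ms;
		} else {
			/* Keep the fractional remainder for the next refill. */
			b->tokens += (uint32_t)earned;
			b->last_refill_ms += earned * b->period_ms;
		}
	} else {
		/* A full bucket earns nothing; the clock starts on first use. */
		b->last_refill_ms = now_ms;
	}

	if (b->tokens == 0)
		return false;

	b->tokens--;
	return true;
}
```

With burst set to 1 this allows at most one recovery per period_ms,
which is roughly the limit the grace period imposes today; a larger
burst lets a cluster of related errors recover back to back before
the limit kicks in.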