linux-kernel - Re: [PATCH net-next 0/5] Expose grace period delay for devlink health reporter

Message-ID: <20250718174737.1d1177cd@kernel.org>
Date: Fri, 18 Jul 2025 17:47:37 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: Tariq Toukan <tariqt@...dia.com>
Cc: Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
 Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller"
 <davem@...emloft.net>, Jiri Pirko <jiri@...nulli.us>, Jiri Pirko
 <jiri@...dia.com>, Saeed Mahameed <saeed@...nel.org>, Gal Pressman
 <gal@...dia.com>, "Leon Romanovsky" <leon@...nel.org>, Shahar Shitrit
 <shshitrit@...dia.com>, "Donald Hunter" <donald.hunter@...il.com>, Jonathan
 Corbet <corbet@....net>, "Brett Creeley" <brett.creeley@....com>, Michael
 Chan <michael.chan@...adcom.com>, Pavan Chebbi <pavan.chebbi@...adcom.com>,
 Cai Huoqing <cai.huoqing@...ux.dev>, Tony Nguyen
 <anthony.l.nguyen@...el.com>, "Przemek Kitszel"
 <przemyslaw.kitszel@...el.com>, Sunil Goutham <sgoutham@...vell.com>, Linu
 Cherian <lcherian@...vell.com>, Geetha sowjanya <gakula@...vell.com>, Jerin
 Jacob <jerinj@...vell.com>, hariprasad <hkelam@...vell.com>, "Subbaraya
 Sundeep" <sbhatta@...vell.com>, Saeed Mahameed <saeedm@...dia.com>, Mark
 Bloch <mbloch@...dia.com>, Ido Schimmel <idosch@...dia.com>, Petr Machata
 <petrm@...dia.com>, Manish Chopra <manishc@...vell.com>,
 <netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
 <linux-doc@...r.kernel.org>, <intel-wired-lan@...ts.osuosl.org>,
 <linux-rdma@...r.kernel.org>
Subject: Re: [PATCH net-next 0/5] Expose grace period delay for devlink
 health reporter

On Thu, 17 Jul 2025 19:07:17 +0300 Tariq Toukan wrote:
> Currently, the devlink health reporter initiates the grace period
> immediately after recovering an error, which blocks further recovery
> attempts until the grace period concludes. Since additional errors
> are not generally expected during this short interval, any new error
> reported during the grace period is not only rejected but also causes
> the reporter to enter an error state that requires manual intervention.
> 
> This approach poses a problem in scenarios where a single root cause
> triggers multiple related errors in quick succession - for example,
> a PCI issue affecting multiple hardware queues. Because these errors
> are closely related and occur rapidly, it is more effective to handle
> them together rather than handling only the first one reported and
> blocking any subsequent recovery attempts. Furthermore, setting the
> reporter to an error state in this context can be misleading, as these
> multiple errors are manifestations of a single underlying issue, making
> it unlike the general case where additional errors are not expected
> during the grace period.
> 
> To resolve this, introduce a configurable grace period delay attribute
> to the devlink health reporter. This delay starts when the first error
> is recovered and lasts for a user-defined duration. Once this grace
> period delay expires, the actual grace period begins. After the grace
> period ends, a new reported error will start the same flow again.
> 
> Timeline summary:
> 
> ----|--------|------------------------------/----------------------/--
> error is  error is    grace period delay          grace period
> reported  recovered  (recoveries allowed)     (recoveries blocked)
> 
> The grace period delay creates a time window during which recovery
> attempts are permitted, allowing all reported errors to be handled
> sequentially before the grace period starts. Once the grace period
> begins, it prevents any further error recoveries until it ends.
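
The timeline quoted above can be sketched as a small state machine. This is
only an illustration of the described behaviour, not the devlink
implementation: the class and field names (GraceDelayReporter, grace_delay,
grace_period, may_recover) are hypothetical, times are plain seconds passed
in by the caller, and the exact handling of the window boundaries is an
assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GraceDelayReporter:
    """Hypothetical model of the proposed flow; not kernel code."""
    grace_delay: float    # window after the first recovery where recoveries are allowed
    grace_period: float   # window after the delay where recoveries are blocked
    first_recovery: Optional[float] = None  # time of the recovery that started the flow

    def may_recover(self, now: float) -> bool:
        """Return True if a recovery attempt at time `now` is permitted."""
        if self.first_recovery is None:
            # First error ever seen: recover it and start the delay window.
            self.first_recovery = now
            return True
        since = now - self.first_recovery
        if since < self.grace_delay:
            # Inside the grace period delay: related errors are still recovered.
            return True
        if since < self.grace_delay + self.grace_period:
            # Inside the grace period proper: recoveries are blocked.
            return False
        # Grace period is over: this error starts the same flow again.
        self.first_recovery = now
        return True


reporter = GraceDelayReporter(grace_delay=2.0, grace_period=10.0)
decisions = [reporter.may_recover(t) for t in (0.0, 1.0, 1.5, 3.0, 11.9, 12.0, 13.0)]
```

With a 2 second delay and a 10 second grace period, the errors at 0.0, 1.0
and 1.5 are recovered, the ones at 3.0 and 11.9 are blocked, and the error
at 12.0 starts a new flow.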

We are rate limiting recoveries; the "networking solution" to the
problem you're describing would be to introduce a burst size,
some kind of poor man's token bucket filter.
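
For illustration, a minimal token bucket with a burst size might look like
the sketch below. Nothing here is an existing devlink API: RecoveryBucket,
burst, refill_period and try_recover are hypothetical names, and the
refill rate of one token per refill_period seconds is an assumption chosen
to mimic one recovery per grace period.

```python
from dataclasses import dataclass


@dataclass
class RecoveryBucket:
    """Hypothetical token bucket limiting recoveries; not devlink code."""
    burst: int             # maximum number of back-to-back recoveries
    refill_period: float   # seconds needed to earn one token back
    tokens: float = 0.0
    last: float = 0.0      # time of the previous call, in seconds

    def __post_init__(self) -> None:
        self.tokens = float(self.burst)  # start with a full bucket

    def try_recover(self, now: float) -> bool:
        """Return True if a recovery may run at time `now` (seconds)."""
        if now > self.last:
            # Refill in proportion to elapsed time, capped at the burst size.
            earned = (now - self.last) / self.refill_period
            self.tokens = min(float(self.burst), self.tokens + earned)
            self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0   # spend one token on this recovery
            return True
        return False             # bucket empty: recovery is rate limited


bucket = RecoveryBucket(burst=3, refill_period=10.0)
burst_results = [bucket.try_recover(t) for t in (0.0, 0.1, 0.2, 0.3)]
later_results = [bucket.try_recover(t) for t in (10.3, 10.4)]
```

Here a burst of three closely spaced errors is recovered, the fourth is
rejected, and one more recovery becomes available roughly one
refill_period later.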

Could you say more about what designs were considered and why this
one was chosen?