Message-ID: <3bf6714b-46d7-45ad-9d15-f5ce9d4b74e4@gmail.com>
Date: Sun, 27 Jul 2025 14:00:11 +0300
From: Tariq Toukan <ttoukan.linux@...il.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Tariq Toukan <tariqt@...dia.com>, Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>, Andrew Lunn <andrew+netdev@...n.ch>,
"David S. Miller" <davem@...emloft.net>, Jiri Pirko <jiri@...nulli.us>,
Jiri Pirko <jiri@...dia.com>, Saeed Mahameed <saeed@...nel.org>,
Gal Pressman <gal@...dia.com>, Leon Romanovsky <leon@...nel.org>,
Shahar Shitrit <shshitrit@...dia.com>,
Donald Hunter <donald.hunter@...il.com>, Jonathan Corbet <corbet@....net>,
Brett Creeley <brett.creeley@....com>,
Michael Chan <michael.chan@...adcom.com>,
Pavan Chebbi <pavan.chebbi@...adcom.com>, Cai Huoqing
<cai.huoqing@...ux.dev>, Tony Nguyen <anthony.l.nguyen@...el.com>,
Przemek Kitszel <przemyslaw.kitszel@...el.com>,
Sunil Goutham <sgoutham@...vell.com>, Linu Cherian <lcherian@...vell.com>,
Geetha sowjanya <gakula@...vell.com>, Jerin Jacob <jerinj@...vell.com>,
hariprasad <hkelam@...vell.com>, Subbaraya Sundeep <sbhatta@...vell.com>,
Saeed Mahameed <saeedm@...dia.com>, Mark Bloch <mbloch@...dia.com>,
Ido Schimmel <idosch@...dia.com>, Petr Machata <petrm@...dia.com>,
Manish Chopra <manishc@...vell.com>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
intel-wired-lan@...ts.osuosl.org, linux-rdma@...r.kernel.org
Subject: Re: [PATCH net-next 0/5] Expose grace period delay for devlink health
reporter
On 25/07/2025 3:10, Jakub Kicinski wrote:
> On Thu, 24 Jul 2025 13:46:08 +0300 Tariq Toukan wrote:
>> Design alternatives considered:
>>
>> 1. Recover all queues upon any error:
>> A brute-force approach that recovers all queues on any error.
>> While simple, it is overly aggressive and disrupts unaffected queues
>> unnecessarily. Also, because this is handled entirely within the
>> driver, it leads to a driver-specific implementation rather than a
>> generic one.
>>
>> 2. Per-queue reporter:
>> This design would isolate recovery handling per SQ or RQ, effectively
>> removing interdependencies between queues. While conceptually clean,
>> it introduces significant scalability challenges as the number of
>> queues grows, as well as synchronization challenges across multiple
>> reporters.
>>
>> 3. Error aggregation with delayed handling:
>> Errors arriving during the grace period are saved and processed after
>> it ends. While this addresses the issue of related errors whose
>> recovery is aborted because the grace period has already started, it
>> adds complexity due to synchronization needs and contradicts the
>> assumption that no errors should occur during a healthy system’s
>> grace period. It also undermines the important role of the grace
>> period in preventing an infinite loop of immediate error detection
>> following recovery; in such cases we want to stop.
>>
>> 4. Allowing a fixed burst of errors before starting grace period:
>> Allows a set number of recoveries before the grace period begins.
>> However, it also requires limiting the time window in which the
>> burst is counted, and once such a window is introduced the separate
>> burst threshold adds little, so it was dropped to keep the design
>> simple.
>
> We're talking about burst on order of 100s, right?
It can be, typically up to O(num_cpus).
> The implementation
> is quite simple: store an array the size of the burst in which you can
> save recovery timestamps (in a circular fashion). On error, count
> how many entries are in the past N msecs.
>
I get your suggestion, and I agree that it's also pretty simple to
implement and that it tolerates bursts (a sketch of how I read it is
below).
However, I think it softens the role of the grace period too much. Its
main disadvantage is that it tolerates non-bursts just as well: it
cannot distinguish a genuine burst from a sustained error rate.
IMO the current grace_period serves multiple goals, among them:
a. Let the auto-recovery mechanism handle errors as long as they are
followed by long-enough "healthy" intervals.
b. Break the infinite loop of auto-recoveries when the "healthy"
interval is not long enough, and raise a flag to mark the need for
admin intervention.
In your proposal, goal (b) no longer holds: it won't prevent the
infinite auto-recovery loop on a buggy system that fails at a constant
rate of up to X failures per N msecs.
One can argue that this can be addressed by increasing the grace_period,
i.e. a system running today with grace_period=N is intuitively moved to
burst_size=X and grace_period=X*N. But increasing the grace_period by
such a large factor over-enforces and hurts legitimate auto-recoveries.
Again, the main point is that it lacks the ability to properly
distinguish between 1. a "burst" followed by a healthy interval, and
2. a buggy system with a steady rate of repeated errors.
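To attach some made-up numbers: say grace_period N = 1000 msec and
burst X = 10. A buggy device that fails once every 150 msec keeps only
~7 recoveries in any 1000-msec window, so the sliding-window check
happily auto-recovers forever, whereas today the second error already
lands inside the grace period and raises the flag. Compensating with
burst_size=10 and grace_period=10*N=10000 msec does break that loop,
but then once a genuine burst has consumed the budget, every error in
the following ~10 seconds is refused, even one that arrives after a
healthy interval far longer than the original 1-second grace period.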
> It's a clear generalization of the current scheme, which can be
> thought of as having an array of size 1 (only the most recent
> recovery time is saved).
>
It is a simple generalization indeed.
But I don't agree it's a better filter.
>> The grace period delay design was chosen for its simplicity and
>> precision in addressing the problem at hand. It effectively captures
>> the temporal correlation of related errors and aligns with the original
>> intent of the grace period as a stabilization window where further
>> errors are unexpected, and if they do occur, they indicate an abnormal
>> system state.
>
> Admittedly part of what I find extremely confusing when thinking about
> this API is that the period when recovery is **not** allowed is called
> "grace period".
Absolutely.
We discussed this very point internally. The existing name is
confusing, but we won't propose renaming it at this stage.
> Now we add something called "grace period delay" in
> some places in the code referred to as "reporter_delay"..
>
> It may be more palatable if we named the first period "error burst
> period" and, well, the latter I suppose it's too late to rename..
It can be named after what it achieves (it allows handling of more
errors) or after what it is (a shift of the grace_period). I'm fine
with either; I don't have a strong preference.
I'd call it grace_period if we didn't have one already :)
Please let me know which name you prefer. A small sketch of how I read
the "shift" is at the end of this mail, in case it helps.
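P.S. In case it helps the naming discussion, this is how I'd describe
the check with the delay, reading it as a pure shift of the
grace_period (illustrative pseudo-kernel-C only, not the actual patch
code; details such as whether recoveries inside the delay refresh the
timestamp are glossed over):

#include <linux/jiffies.h>
#include <linux/types.h>

/*
 * Illustrative only: the "delay" shifts the start of the grace period,
 * so errors arriving right after a recovery (the burst) are still
 * auto-recovered, and the no-auto-recovery window begins only later.
 */
static bool auto_recovery_allowed(unsigned long last_recovery_ts,
				  unsigned long delay_msecs,
				  unsigned long grace_period_msecs)
{
	unsigned long start = last_recovery_ts + msecs_to_jiffies(delay_msecs);
	unsigned long end = start + msecs_to_jiffies(grace_period_msecs);
	unsigned long now = jiffies;

	if (time_before(now, start))
		return true;	/* still inside the burst window */

	if (time_before(now, end))
		return false;	/* inside the (shifted) grace period:
				 * unexpected error, require admin action */

	return true;		/* a healthy interval has passed */
}

With delay_msecs=0 this degenerates to the existing behavior.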