[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <bca3440c-9279-58a6-377f-6a4fdcccdf1f@nvidia.com>
Date: Tue, 9 Mar 2021 16:06:49 +0200
From: Eran Ben Elisha <eranbe@...dia.com>
To: Jakub Kicinski <kuba@...nel.org>
CC: <netdev@...r.kernel.org>, <jiri@...nulli.us>, <saeedm@...dia.com>,
<andrew.gospodarek@...adcom.com>, <jacob.e.keller@...el.com>,
<guglielmo.morandin@...adcom.com>, <eugenem@...com>,
<eranbe@...lanox.com>
Subject: Re: [RFC] devlink: health: add remediation type
On 3/8/2021 7:59 PM, Jakub Kicinski wrote:
> On Mon, 8 Mar 2021 09:16:00 -0800 Jakub Kicinski wrote:
>>>> + DLH_REMEDY_BAD_PART,
>>> BAD_PART probably indicates that the reporter (or any command line
>>> execution) cannot recover the issue.
>>> As the suggested remedy is static per reporter's recover method, it
>>> doesn't make sense for one to set a recover method that by design cannot
>>> recover successfully.
>>>
>>> Maybe we should extend devlink_health_reporter_state with POWER_CYCLE,
>>> REIMAGE and BAD_PART? To indicate the user that for a successful
>>> recovery, it should run a non-devlink-health operation?
>>
>> Hm, export and extend devlink_health_reporter_state? I like that idea.
>
> Trying to type it up it looks less pretty than expected.
>
> Let's looks at some examples.
>
> A queue reporter, say "rx", resets the queue dropping all outstanding
> buffers. As previously mentioned when the normal remediation fails user
> is expected to power cycle the machine or maybe swap the card. The
> device itself does not have a crystal ball.
Not sure, reopen the queue, or reinit the driver might also be good in
case of issue in the SW/HW queue context for example. But I agree that
RX reporter can't tell from its perspective what further escalation is
needed in case its local defined operations failed.
>
> A management FW reporter "fw", has a auto recovery of FW reset
> (REMEDY_RESET). On failure -> power cycle.
>
> An "io" reporter (PCI link had to be trained down) can only return
> a hardware failure (we should probably have a HW failure other than
> BAD_PART for this).
>
> Flash reporters - the device will know if the flash had a bad block
> or the entire part is bad, so probably can have 2 reporters for this.
>
> Most of the reporters would only report one "action" that can be
> performed to fix them. The cartesian product of ->recovery types vs
> manual recovery does not seem necessary. And drivers would get bloated
> with additional boilerplate of returning ERROR_NEED_POWER_CYCLE for
> _all_ cases with ->recovery. Because what else would the fix be if
> software-initiated reset didn't work?
>
OK, I see your point.
If I got you right, this is the conclusions so far:
1. Each reporter with recover callback will have to supply a remedy
definition.
2. We shouldn't have POWER_CYCLE, REIMAGE and BAD_PART as a remedy,
because these are not valid reporter recover flows in any case.
3. If a reporter will fail to recover, its status shall remain as error,
and it is out of the reporter's scope to advise the administrator on
further actions.
Powered by blists - more mailing lists