netdev - Re: [RFC] devlink: health: add remediation type


Message-ID: <bca3440c-9279-58a6-377f-6a4fdcccdf1f@nvidia.com>
Date:   Tue, 9 Mar 2021 16:06:49 +0200
From:   Eran Ben Elisha <eranbe@...dia.com>
To:     Jakub Kicinski <kuba@...nel.org>
CC:     <netdev@...r.kernel.org>, <jiri@...nulli.us>, <saeedm@...dia.com>,
        <andrew.gospodarek@...adcom.com>, <jacob.e.keller@...el.com>,
        <guglielmo.morandin@...adcom.com>, <eugenem@...com>,
        <eranbe@...lanox.com>
Subject: Re: [RFC] devlink: health: add remediation type



On 3/8/2021 7:59 PM, Jakub Kicinski wrote:
> On Mon, 8 Mar 2021 09:16:00 -0800 Jakub Kicinski wrote:
>>>> +	DLH_REMEDY_BAD_PART,
>>> BAD_PART probably indicates that the reporter (or any command line
>>> execution) cannot recover the issue.
>>> As the suggested remedy is static per reporter's recover method, it
>>> doesn't make sense for one to set a recover method that by design cannot
>>> recover successfully.
>>>
>>> Maybe we should extend devlink_health_reporter_state with POWER_CYCLE,
>>> REIMAGE and BAD_PART? To indicate to the user that for a successful
>>> recovery, they should run a non-devlink-health operation?
>>
>> Hm, export and extend devlink_health_reporter_state? I like that idea.
> 
> Trying to type it up it looks less pretty than expected.
> 
> Let's look at some examples.
> 
> A queue reporter, say "rx", resets the queue, dropping all outstanding
> buffers. As previously mentioned, when the normal remediation fails the
> user is expected to power cycle the machine or maybe swap the card. The
> device itself does not have a crystal ball.

Not sure. Reopening the queue or reinitializing the driver might also
help, for example when the issue is in the SW/HW queue context. But I
agree that the RX reporter can't tell from its perspective what further
escalation is needed if its locally defined operations fail.

> 
> A management FW reporter, "fw", has an auto recovery of FW reset
> (REMEDY_RESET). On failure -> power cycle.
> 
> An "io" reporter (PCI link had to be trained down) can only return
> a hardware failure (we should probably have a HW failure other than
> BAD_PART for this).
> 
> Flash reporters - the device will know if the flash had a bad block
> or the entire part is bad, so probably can have 2 reporters for this.
> 
> Most of the reporters would only report one "action" that can be
> performed to fix them. The cartesian product of ->recovery types vs
> manual recovery does not seem necessary. And drivers would get bloated
> with additional boilerplate of returning ERROR_NEED_POWER_CYCLE for
> _all_ cases with ->recovery. Because what else would the fix be if
> software-initiated reset didn't work?
> 

OK, I see your point.

If I got you right, these are the conclusions so far:
1. Each reporter with a recover callback will have to supply a remedy
definition.
2. We shouldn't have POWER_CYCLE, REIMAGE and BAD_PART as remedies,
because these are not valid reporter recover flows in any case.
3. If a reporter fails to recover, its status shall remain error,
and it is out of the reporter's scope to advise the administrator on
further actions.
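
To make the three points concrete, here is a rough user-space sketch. It
is only a model of the idea, not the devlink API: every name in it
(sketch_remedy, sketch_reporter, sketch_reporter_validate and so on) is
hypothetical, and the remedy values are placeholders rather than the
enum values proposed in the RFC.

```c
#include <stddef.h>

/* Hypothetical remedy types: only actions that a reporter's own recover
 * callback can perform. No POWER_CYCLE, REIMAGE or BAD_PART here (point 2). */
enum sketch_remedy {
	SKETCH_REMEDY_NONE,		/* reporter has no recover callback */
	SKETCH_REMEDY_QUEUE_RESET,	/* e.g. an "rx" queue reporter */
	SKETCH_REMEDY_FW_RESET,		/* e.g. a management "fw" reporter */
};

enum sketch_state {
	SKETCH_STATE_HEALTHY,
	SKETCH_STATE_ERROR,
};

struct sketch_reporter {
	const char *name;
	enum sketch_remedy remedy;	/* static, one per reporter */
	int (*recover)(struct sketch_reporter *r);	/* 0 on success */
	enum sketch_state state;
};

/* Point 1: a reporter that has a recover callback must declare a remedy. */
int sketch_reporter_validate(const struct sketch_reporter *r)
{
	if (r->recover != NULL && r->remedy == SKETCH_REMEDY_NONE)
		return -1;
	return 0;
}

/* Point 3: if recovery fails, the reporter simply stays in the error
 * state; it gives no advice on what the administrator should do next. */
void sketch_report_and_recover(struct sketch_reporter *r)
{
	r->state = SKETCH_STATE_ERROR;
	if (r->recover != NULL && r->recover(r) == 0)
		r->state = SKETCH_STATE_HEALTHY;
}

/* Example callbacks: one recovery that succeeds, one that fails. */
int sketch_recover_ok(struct sketch_reporter *r)
{
	(void)r;
	return 0;
}

int sketch_recover_fail(struct sketch_reporter *r)
{
	(void)r;
	return -1;
}
```

With this shape, a reporter whose recover callback fails ends up in the
error state and nothing more, so drivers would not need per-reporter
boilerplate for escalation hints.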