netdev - Re: [RFC] devlink: health: add remediation type


Message-ID: <20210309145209.0e05608d@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date:   Tue, 9 Mar 2021 14:52:09 -0800
From:   Jakub Kicinski <kuba@...nel.org>
To:     Eran Ben Elisha <eranbe@...dia.com>
Cc:     <netdev@...r.kernel.org>, <jiri@...nulli.us>, <saeedm@...dia.com>,
        <andrew.gospodarek@...adcom.com>, <jacob.e.keller@...el.com>,
        <guglielmo.morandin@...adcom.com>, <eugenem@...com>,
        <eranbe@...lanox.com>
Subject: Re: [RFC] devlink: health: add remediation type

On Tue, 9 Mar 2021 16:06:49 +0200 Eran Ben Elisha wrote:
> On 3/8/2021 7:59 PM, Jakub Kicinski wrote:
> >> Hm, export and extend devlink_health_reporter_state? I like that idea.  
> > 
> > Trying to type it up it looks less pretty than expected.
> > 
> > Let's look at some examples.
> > 
> > A queue reporter, say "rx", resets the queue dropping all outstanding
> > buffers. As previously mentioned when the normal remediation fails user
> > is expected to power cycle the machine or maybe swap the card. The
> > device itself does not have a crystal ball.  
> 
> Not sure, reopen the queue, or reinit the driver might also be good in 
> case of issue in the SW/HW queue context for example. But I agree that 
> RX reporter can't tell from its perspective what further escalation is 
> needed in case its local defined operations failed.

Right, the point being if normal remediation fails collect a full
system dump and do the big hammer remediation (power cycle or reinit 
if user wants to try that).

> > A management FW reporter "fw", has an auto recovery of FW reset
> > (REMEDY_RESET). On failure -> power cycle.
> > 
> > An "io" reporter (PCI link had to be trained down) can only return
> > a hardware failure (we should probably have a HW failure other than
> > BAD_PART for this).
> > 
> > Flash reporters - the device will know if the flash had a bad block
> > or the entire part is bad, so probably can have 2 reporters for this.
> > 
> > Most of the reporters would only report one "action" that can be
> > performed to fix them. The cartesian product of ->recovery types vs
> > manual recovery does not seem necessary. And drivers would get bloated
> > with additional boilerplate of returning ERROR_NEED_POWER_CYCLE for
> > _all_ cases with ->recovery. Because what else would the fix be if
> > software-initiated reset didn't work?
> 
> OK, I see your point.
> 
> If I got you right, these are the conclusions so far:
> 1. Each reporter with recover callback will have to supply a remedy 
> definition.
> 2. We shouldn't have POWER_CYCLE, REIMAGE and BAD_PART as a remedy, 
> because these are not valid reporter recover flows in any case.
> 3. If a reporter will fail to recover, its status shall remain as error, 
> and it is out of the reporter's scope to advise the administrator on 
> further actions.

I was actually intending to go back to the original proposal, mostly 
as is (plus the KICK).

Indeed the intent is that if local remediation fails or is unavailable
and reporter is in failed state - power cycle or other manual
intervention is needed. So we can drop the POWER_CYCLE remedy and leave
it implicit.
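
To make the implicit rule concrete, here is a rough sketch of how a
consumer of reporter state might derive "manual intervention needed"
without any POWER_CYCLE remedy being reported. Every identifier below
(the sketch_* enums, struct and helper) is made up for illustration and
is not an existing devlink definition; only the RESET and KICK remedy
names come from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration only; not the kernel's devlink API. */
enum sketch_remedy {
	SKETCH_REMEDY_NONE,	/* reporter has no local remediation */
	SKETCH_REMEDY_KICK,	/* the KICK remedy mentioned in the thread */
	SKETCH_REMEDY_RESET,	/* software-initiated reset, e.g. FW reset */
};

enum sketch_state {
	SKETCH_STATE_HEALTHY,
	SKETCH_STATE_ERROR,
};

struct sketch_reporter {
	const char *name;
	enum sketch_remedy remedy;	/* the one action the reporter can take */
	enum sketch_state state;
	bool recovery_failed;		/* last local remediation attempt failed */
};

/*
 * Power cycle or other manual intervention is implied when the reporter
 * is in the error state and local remediation either failed or does not
 * exist. No dedicated POWER_CYCLE remedy value is required for this.
 */
static bool sketch_needs_manual_intervention(const struct sketch_reporter *r)
{
	assert(r != NULL);

	if (r->state != SKETCH_STATE_ERROR)
		return false;

	return r->remedy == SKETCH_REMEDY_NONE || r->recovery_failed;
}
```

With this shape a driver only declares its single remedy per reporter,
and the "what else would the fix be" case falls out of the state rather
than out of extra per-driver boilerplate.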

But how are you suggesting we handle BAD_PART and REIMAGE? Still
extending the health status or a separate mechanism from dl-health?