Message-ID: <20210309145209.0e05608d@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date:   Tue, 9 Mar 2021 14:52:09 -0800
From:   Jakub Kicinski <kuba@...nel.org>
To:     Eran Ben Elisha <eranbe@...dia.com>
Cc:     <netdev@...r.kernel.org>, <jiri@...nulli.us>, <saeedm@...dia.com>,
        <andrew.gospodarek@...adcom.com>, <jacob.e.keller@...el.com>,
        <guglielmo.morandin@...adcom.com>, <eugenem@...com>,
        <eranbe@...lanox.com>
Subject: Re: [RFC] devlink: health: add remediation type

On Tue, 9 Mar 2021 16:06:49 +0200 Eran Ben Elisha wrote:
> On 3/8/2021 7:59 PM, Jakub Kicinski wrote:
> >> Hm, export and extend devlink_health_reporter_state? I like that idea.  
> > 
> > Trying to type it up, it looks less pretty than expected.
> > 
> > Let's look at some examples.
> > 
> > A queue reporter, say "rx", resets the queue, dropping all outstanding
> > buffers. As previously mentioned, when the normal remediation fails the
> > user is expected to power cycle the machine or maybe swap the card. The
> > device itself does not have a crystal ball.
> 
> Not sure; reopening the queue or reinitializing the driver might also be
> good in case of an issue in the SW/HW queue context, for example. But I
> agree that the RX reporter can't tell from its perspective what further
> escalation is needed if its locally defined operations fail.

Right, the point being that if normal remediation fails, collect a full
system dump and do the big-hammer remediation (power cycle, or reinit
if the user wants to try that).
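
(For illustration, a minimal sketch of what such an "rx" reporter could
look like on the driver side. The foo_* names are made up for the
example; struct devlink_health_reporter_ops and
devlink_health_reporter_priv() are the existing devlink health APIs.)

/* Hypothetical driver code; foo_* names are invented. */
static int foo_rx_recover(struct devlink_health_reporter *reporter,
			  void *priv_ctx, struct netlink_ext_ack *extack)
{
	struct foo_rx_queue *q = devlink_health_reporter_priv(reporter);

	/* Normal remediation: reset the queue, dropping all outstanding
	 * buffers. A non-zero return leaves the reporter in error state;
	 * manual intervention (power cycle, card swap) is then implied.
	 */
	return foo_rx_queue_reset(q);
}

static const struct devlink_health_reporter_ops foo_rx_reporter_ops = {
	.name    = "rx",
	.recover = foo_rx_recover,
};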

> > A management FW reporter, "fw", has an auto recovery of FW reset
> > (REMEDY_RESET). On failure -> power cycle.
> > 
> > An "io" reporter (PCI link had to be trained down) can only return
> > a hardware failure (we should probably have a HW failure other than
> > BAD_PART for this).
> > 
> > Flash reporters - the device will know if the flash had a bad block
> > or if the entire part is bad, so we can probably have two reporters
> > for this.
> > 
> > Most of the reporters would only report one "action" that can be
> > performed to fix them. The Cartesian product of ->recover types vs
> > manual recovery does not seem necessary. And drivers would get bloated
> > with the additional boilerplate of returning ERROR_NEED_POWER_CYCLE for
> > _all_ cases with ->recover. Because what else would the fix be if a
> > software-initiated reset didn't work?
> 
> OK, I see your point.
> 
> If I got you right, these are the conclusions so far:
> 1. Each reporter with a recover callback will have to supply a remedy 
> definition.
> 2. We shouldn't have POWER_CYCLE, REIMAGE and BAD_PART as remedies, 
> because these are not valid reporter recover flows in any case.
> 3. If a reporter fails to recover, its status shall remain error, 
> and it is out of the reporter's scope to advise the administrator on 
> further actions.

I was actually intending to go back to the original proposal, mostly 
as is (plus the KICK).

Indeed the intent is that if local remediation fails or is unavailable
and the reporter is in the failed state, a power cycle or other manual
intervention is needed. So we can drop the POWER_CYCLE remedy and leave
it implicit; see the sketch below.
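
(To make that concrete - a rough sketch of what "the original proposal
plus KICK" could look like. The enum values and the .remedy field are
illustrative only, reconstructed from this discussion rather than taken
from any merged patch.)

/* Hypothetical uAPI sketch; names are not final. */
enum devlink_health_reporter_remedy {
	DLH_REMEDY_NONE,	/* no automatic remediation */
	DLH_REMEDY_KICK,	/* restart/kick the object, no state lost */
	DLH_REMEDY_RESET,	/* reset the object, outstanding state lost */
};

A reporter would then advertise its single remedy next to its recover
callback, e.g. via a new field in devlink_health_reporter_ops:

static const struct devlink_health_reporter_ops foo_fw_reporter_ops = {
	.name    = "fw",
	.remedy  = DLH_REMEDY_RESET,	/* hypothetical new field */
	.recover = foo_fw_reset,	/* made-up driver callback */
};

If .recover fails there is nothing further to advertise - power cycle
(or other manual intervention) stays implicit, as above.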

But how are you suggesting we handle BAD_PART and REIMAGE? Still by
extending the health status, or via a separate mechanism outside of
dl-health?
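
(For reference, the state enum in question - today internal to the
kernel in include/net/devlink.h - only knows healthy and error.
Exporting and extending it could look roughly like this; the two
ERROR_* additions are purely illustrative:)

enum devlink_health_reporter_state {
	DEVLINK_HEALTH_REPORTER_STATE_HEALTHY,
	DEVLINK_HEALTH_REPORTER_STATE_ERROR,
	/* hypothetical additions discussed in this thread: */
	DEVLINK_HEALTH_REPORTER_STATE_ERROR_NEED_REIMAGE,
	DEVLINK_HEALTH_REPORTER_STATE_ERROR_BAD_PART,
};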
