netdev - Re: [RFC] devlink: health: add remediation type


Message-ID: <20210308095950.3cede742@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date:   Mon, 8 Mar 2021 09:59:50 -0800
From:   Jakub Kicinski <kuba@...nel.org>
To:     Eran Ben Elisha <eranbe@...dia.com>
Cc:     <netdev@...r.kernel.org>, <jiri@...nulli.us>, <saeedm@...dia.com>,
        <andrew.gospodarek@...adcom.com>, <jacob.e.keller@...el.com>,
        <guglielmo.morandin@...adcom.com>, <eugenem@...com>,
        <eranbe@...lanox.com>
Subject: Re: [RFC] devlink: health: add remediation type

On Mon, 8 Mar 2021 09:16:00 -0800 Jakub Kicinski wrote:
> > > +	DLH_REMEDY_BAD_PART,    
> > BAD_PART probably indicates that the reporter (or any command line 
> > execution) cannot recover the issue.
> > As the suggested remedy is static per reporter's recover method, it 
> > doesn't make sense for one to set a recover method that by design cannot 
> > recover successfully.
> > 
> > Maybe we should extend devlink_health_reporter_state with POWER_CYCLE, 
> > REIMAGE and BAD_PART? To indicate the user that for a successful 
> > recovery, it should run a non-devlink-health operation?  
> 
> Hm, export and extend devlink_health_reporter_state? I like that idea.

Trying to type it up, it looks less pretty than expected.

Let's look at some examples.

A queue reporter, say "rx", resets the queue, dropping all outstanding
buffers. As previously mentioned, when the normal remediation fails the
user is expected to power cycle the machine or maybe swap the card. The
device itself does not have a crystal ball.

A management FW reporter, "fw", has an auto recovery of FW reset
(REMEDY_RESET). On failure -> power cycle.

An "io" reporter (PCI link had to be trained down) can only return 
a hardware failure (we should probably have a HW failure other than
BAD_PART for this).

Flash reporters - the device will know if the flash had a bad block
or the entire part is bad, so we can probably have 2 reporters for this.
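
To make the examples above concrete, here is a rough sketch (not taken
from the RFC patch) of reporters that each carry one static remedy.
Only DLH_REMEDY_BAD_PART and a reset remedy are named in this thread;
the other enum values, the struct, the "flash_block"/"flash_part"
reporter names and the lookup helper are assumptions made up purely
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical remedy list; only RESET and BAD_PART come from the thread. */
enum dlh_remedy {
	DLH_REMEDY_NONE,
	DLH_REMEDY_RESET,	/* software-initiated reset (queue or FW) */
	DLH_REMEDY_HW_FAILURE,	/* assumed: HW failure other than BAD_PART */
	DLH_REMEDY_BAD_BLOCK,	/* assumed: a single bad flash block */
	DLH_REMEDY_BAD_PART,	/* the entire part is bad */
};

/* Hypothetical descriptor: one remedy per reporter, fixed at creation. */
struct dlh_reporter_desc {
	const char *name;
	enum dlh_remedy remedy;
};

static const struct dlh_reporter_desc dlh_reporters[] = {
	{ "rx",          DLH_REMEDY_RESET },
	{ "fw",          DLH_REMEDY_RESET },
	{ "io",          DLH_REMEDY_HW_FAILURE },
	{ "flash_block", DLH_REMEDY_BAD_BLOCK },
	{ "flash_part",  DLH_REMEDY_BAD_PART },
};

/* Return the static remedy of a reporter, or DLH_REMEDY_NONE if unknown. */
static enum dlh_remedy dlh_remedy_of(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < sizeof(dlh_reporters) / sizeof(dlh_reporters[0]); i++)
		if (strcmp(dlh_reporters[i].name, name) == 0)
			return dlh_reporters[i].remedy;
	return DLH_REMEDY_NONE;
}
```

Splitting flash into two reporters keeps the "one remedy per reporter"
property instead of making a single reporter choose at runtime.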

Most of the reporters would only report one "action" that can be
performed to fix them. The Cartesian product of ->recovery types vs
manual recovery does not seem necessary. And drivers would get bloated
with additional boilerplate of returning ERROR_NEED_POWER_CYCLE for
_all_ cases with ->recovery. Because what else would the fix be if
software-initiated reset didn't work?
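
One way to picture the argument, as a sketch only: if a failed
software recovery always implies a power cycle, that follow-up can be
derived in one generic place rather than returned by every driver.
The enum and helper below are hypothetical names, not existing
devlink API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: what the user has to do by hand, if anything. */
enum dlh_manual_step {
	DLH_MANUAL_NONE,
	DLH_MANUAL_POWER_CYCLE,
};

/*
 * Derive the manual follow-up from the outcome of the software recovery.
 * has_recover: the reporter provides a software recovery callback.
 * recover_err: 0 on success, negative errno-style value on failure.
 */
static enum dlh_manual_step dlh_manual_step_after(bool has_recover,
						  int recover_err)
{
	/* A reporter without a recovery callback cannot fail recovery. */
	assert(has_recover || recover_err == 0);

	if (has_recover && recover_err != 0)
		return DLH_MANUAL_POWER_CYCLE;
	return DLH_MANUAL_NONE;
}
```

With something like this the driver's recovery callback just returns
its usual error code and needs no per-case
ERROR_NEED_POWER_CYCLE boilerplate.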