netdev - Re: [RFC] devlink: health: add remediation type


Message-ID: <f242ed68-d31b-527d-562f-c5a35123861a@intel.com>
Date:   Tue, 9 Mar 2021 15:44:22 -0800
From:   Jacob Keller <jacob.e.keller@...el.com>
To:     Jakub Kicinski <kuba@...nel.org>,
        Eran Ben Elisha <eranbe@...dia.com>
Cc:     netdev@...r.kernel.org, jiri@...nulli.us, saeedm@...dia.com,
        andrew.gospodarek@...adcom.com, guglielmo.morandin@...adcom.com,
        eugenem@...com, eranbe@...lanox.com
Subject: Re: [RFC] devlink: health: add remediation type

On 3/9/2021 2:52 PM, Jakub Kicinski wrote:
> On Tue, 9 Mar 2021 16:18:58 +0200 Eran Ben Elisha wrote:
>>>> DLH_REMEDY_LOCAL_FIX: associated component will undergo a local
>>>> un-harmful fix attempt.
>>>> (e.g. look for lost interrupt in mlx5e_tx_reporter_timeout_recover())
>>>
>>> Should we make it more specific? Maybe DLH_REMEDY_STALL: device stall
>>> detected, resumed by re-triggering processing, without reset?
>>
>> Sounds good.
> 
> FWIW I ended up calling it:
> 
> + * @DLH_REMEDY_KICK: device stalled, processing will be re-triggered
> 
>>>> The assumption here is that a reporter's recovery function has one
>>>> remedy. But it can have a few remedies and escalate between them. Did you
>>>> consider a bitmask?  
>>>
>>> Yes, I tried to explain in the commit message. If we wanted to support
>>> escalating remediations we'd also need separate counters etc. I think
>>> having a health reporter per remediation should actually work fairly
>>> well.  
>>
>> That would require reporter's recovery procedure failure to trigger 
>> health flow for other reporter.
>> So we can find ourselves with 2 RX reporters, sharing the same diagnose 
>> and dump callbacks, and each has other recovery flow.
>> Seems a bit counterintuitive.
> 
> Let's talk about particular cases. Otherwise it's too easy to
> misunderstand each other. I can't think of any practical case
> where escalation makes sense.
> 
>> Maybe, per reporter, exposing a counter per each supported remedy is not 
>> that bad?
> 
> It's a large change to the uAPI, and it makes vendors more likely 
> to lump different problems under a single reporter (although I take
> your point that it may cause over-splitting, but if we have to choose
> between the two my preference is "too granular").
> 


I also prefer keeping it granular and allowing only a single "remedy"
per reporter. If that remedy fails, I like the idea of having some way
to indicate possible "hammer" remedies as some sort of extended status.

i.e. our reporter can automatically try one remedy that is known to be
effective, and if that fails it could report an extended status that
indicates "we still failed to recover, and we think the problem might
be fixed with RELOAD/REBOOT/REIMAGE".

But I would keep those 3 larger remedies, which require user
intervention, out of the set of regular remedies, and instead have some
other way to indicate that they might help?
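
To make that concrete, here is a rough userspace sketch in C. It is not
kernel code and not a proposed uAPI. Only DLH_REMEDY_KICK comes from
this thread; struct dlh_reporter, the DLH_HINT_* flags, dlh_report()
and demo_kick() are names made up for illustration. Each reporter has
exactly one automatic remedy, and the "hammer" remedies only ever show
up as hints in an extended status after that remedy has failed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Automatic remedies. Only DLH_REMEDY_KICK is taken from the thread. */
enum dlh_remedy {
	DLH_REMEDY_NONE,
	DLH_REMEDY_KICK,	/* device stalled, processing will be re-triggered */
};

/* Hypothetical hints: need user intervention, never run automatically. */
enum dlh_hint {
	DLH_HINT_RELOAD  = 1u << 0,
	DLH_HINT_REBOOT  = 1u << 1,
	DLH_HINT_REIMAGE = 1u << 2,
};

/* Hypothetical reporter: one remedy, one set of counters. */
struct dlh_reporter {
	const char *name;
	enum dlh_remedy remedy;		/* the single automatic remedy */
	unsigned int hints;		/* advertised only after a failed recovery */
	bool (*recover)(void *priv);	/* returns true if the remedy worked */
	unsigned int recover_count;
	unsigned int fail_count;
	unsigned int ext_status;	/* 0 while healthy, else DLH_HINT_* mask */
};

/*
 * Run the reporter's one remedy. On failure there is no escalation:
 * the hints are only published in ext_status for the administrator.
 */
static unsigned int dlh_report(struct dlh_reporter *r, void *priv)
{
	assert(r != NULL && r->recover != NULL);

	r->recover_count++;
	if (r->recover(priv)) {
		r->ext_status = 0;
		return 0;
	}

	r->fail_count++;
	r->ext_status = r->hints;
	return r->ext_status;
}

/* Demo callback: priv points to a bool saying whether the kick helps. */
static bool demo_kick(void *priv)
{
	return *(bool *)priv;
}
```

A TX reporter set up with .remedy = DLH_REMEDY_KICK, .hints =
DLH_HINT_RELOAD | DLH_HINT_REBOOT and .recover = demo_kick would return
0 from dlh_report() when the kick works, and the RELOAD | REBOOT mask
when it does not. Nothing larger than the kick runs without the user
asking for it.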

I really don't think escalation makes a lot of sense: it's far more
complicated, and as an administrator I am not sure I want a remedy
that could have larger impacts, like resetting the device, if that
could cause other issues...

Thanks,
Jake