netdev - Re: [RFC] devlink: health: add remediation type

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20210309145206.43091cdb@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date:   Tue, 9 Mar 2021 14:52:06 -0800
From:   Jakub Kicinski <kuba@...nel.org>
To:     Eran Ben Elisha <eranbe@...dia.com>
Cc:     <netdev@...r.kernel.org>, <jiri@...nulli.us>, <saeedm@...dia.com>,
        <andrew.gospodarek@...adcom.com>, <jacob.e.keller@...el.com>,
        <guglielmo.morandin@...adcom.com>, <eugenem@...com>,
        <eranbe@...lanox.com>
Subject: Re: [RFC] devlink: health: add remediation type

On Tue, 9 Mar 2021 16:18:58 +0200 Eran Ben Elisha wrote:
> >> DLH_REMEDY_LOCAL_FIX: associated component will undergo a local
> >> un-harmful fix attempt.
> >> (e.g look for lost interrupt in mlx5e_tx_reporter_timeout_recover())  
> > 
> > Should we make it more specific? Maybe DLH_REMEDY_STALL: device stall
> > detected, resumed by re-trigerring processing, without reset?  
> 
> Sounds good.

FWIW I ended up calling it:

+ * @DLH_REMEDY_KICK: device stalled, processing will be re-triggered

> >> The assumption here is that a reporter's recovery function has one
> >> remedy. But it can have few remedies and escalate between them. Did you
> >> consider a bitmask?  
> > 
> > Yes, I tried to explain in the commit message. If we wanted to support
> > escalating remediations we'd also need separate counters etc. I think
> > having a health reporter per remediation should actually work fairly
> > well.  
> 
> That would require reporter's recovery procedure failure to trigger 
> health flow for other reporter.
> So we can find ourselves with 2 RX reporters, sharing the same diagnose 
> and dump callbacks, and each has other recovery flow.
> Seems a bit counterintuitive.

Let's talk about particular cases. Otherwise it's too easy to
misunderstand each other. I can't think of any practical case
where escalation makes sense.

> Maybe, per reporter, exposing a counter per each supported remedy is not 
> that bad?

It's a large change to the uAPI, and it makes vendors more likely 
to lump different problems under a single reporter (although I take
your point that it may cause over-splitting, but if we have to choose
between the two my preference is "too granular").