netdev - Re: [RFC] devlink: health: add remediation type

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210308091600.5f686fcd@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date:   Mon, 8 Mar 2021 09:16:00 -0800
From:   Jakub Kicinski <kuba@...nel.org>
To:     Eran Ben Elisha <eranbe@...dia.com>
Cc:     <netdev@...r.kernel.org>, <jiri@...nulli.us>, <saeedm@...dia.com>,
        <andrew.gospodarek@...adcom.com>, <jacob.e.keller@...el.com>,
        <guglielmo.morandin@...adcom.com>, <eugenem@...com>,
        <eranbe@...lanox.com>
Subject: Re: [RFC] devlink: health: add remediation type

On Sun, 7 Mar 2021 17:59:58 +0200 Eran Ben Elisha wrote:
> On 3/6/2021 4:42 AM, Jakub Kicinski wrote:
> > Currently devlink health does not give user any clear information
> > of what kind of remediation ->recover callback will perform. This
> > makes it difficult to understand the impact of enabling auto-
> > -remediation, and the severity of the error itself.
> > 
> > To allow users to make more informed decision, as well as stretch
> > the applicability of devlink health beyond what can be fixed by
> > resetting things - add a new remediation type attribute.
> > 
> > Note that we only allow one remediation type per reporter, this
> > is intentional. devlink health is not built for mixing issues
> > of different severity into one reporter since it only maintains
> > one dump, of the first event and a single error counter.
> > Nudging vendors towards categorizing issues beyond coarse
> > groups is an added bonus.
> > 
> > Signed-off-by: Jakub Kicinski <kuba@...nel.org>
> > ---
> >   include/net/devlink.h        |  2 ++
> >   include/uapi/linux/devlink.h | 30 ++++++++++++++++++++++++++++++
> >   net/core/devlink.c           |  7 +++++--
> >   3 files changed, 37 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/net/devlink.h b/include/net/devlink.h
> > index 853420db5d32..70b5fd6a8c0d 100644
> > --- a/include/net/devlink.h
> > +++ b/include/net/devlink.h
> > @@ -664,6 +664,7 @@ enum devlink_health_reporter_state {
> >   /**
> >    * struct devlink_health_reporter_ops - Reporter operations
> >    * @name: reporter name
> > + * remedy: severity of the remediation required
> >    * @recover: callback to recover from reported error
> >    *           if priv_ctx is NULL, run a full recover
> >    * @dump: callback to dump an object
> > @@ -674,6 +675,7 @@ enum devlink_health_reporter_state {
> >   
> >   struct devlink_health_reporter_ops {
> >   	char *name;
> > +	enum devlink_health_reporter_remedy remedy;
> >   	int (*recover)(struct devlink_health_reporter *reporter,
> >   		       void *priv_ctx, struct netlink_ext_ack *extack);
> >   	int (*dump)(struct devlink_health_reporter *reporter,
> > diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
> > index f6008b2fa60f..bd490c5536b1 100644
> > --- a/include/uapi/linux/devlink.h
> > +++ b/include/uapi/linux/devlink.h
> > @@ -534,6 +534,9 @@ enum devlink_attr {
> >   	DEVLINK_ATTR_RELOAD_ACTION_STATS,       /* nested */
> >   
> >   	DEVLINK_ATTR_PORT_PCI_SF_NUMBER,	/* u32 */
> > +
> > +	DEVLINK_ATTR_HEALTH_REPORTER_REMEDY,	/* u32 */
> > +
> >   	/* add new attributes above here, update the policy in devlink.c */
> >   
> >   	__DEVLINK_ATTR_MAX,
> > @@ -608,4 +611,31 @@ enum devlink_port_fn_opstate {
> >   	DEVLINK_PORT_FN_OPSTATE_ATTACHED,
> >   };
> >   
> > +/**
> > + * enum devlink_health_reporter_remedy - severity of remediation procedure
> > + * @DLH_REMEDY_NONE: transient error, no remediation required  
> DLH_REMEDY_LOCAL_FIX: associated component will undergo a local 
> un-harmful fix attempt.
> (e.g look for lost interrupt in mlx5e_tx_reporter_timeout_recover())

Should we make it more specific? Maybe DLH_REMEDY_STALL: device stall
detected, resumed by re-trigerring processing, without reset?

> > + * @DLH_REMEDY_COMP_RESET: associated device component (e.g. device queue)
> > + *			will be reset
> > + * @DLH_REMEDY_RESET: full device reset, will result in temporary unavailability
> > + *			of the device, device configuration should not be lost
> > + * @DLH_REMEDY_REINIT: device will be reinitialized and configuration lost
> > + * @DLH_REMEDY_POWER_CYCLE: device requires a power cycle to recover
> > + * @DLH_REMEDY_REIMAGE: device needs to be reflashed
> > + * @DLH_REMEDY_BAD_PART: indication of failing hardware, device needs to be
> > + *			replaced
> > + *
> > + * Used in %DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, categorizes the health reporter
> > + * by the severity of the required remediation, and indicates the remediation
> > + * type to the user if it can't be applied automatically (e.g. "reimage").
> > + */  
> The assumption here is that a reporter's recovery function has one 
> remedy. But it can have few remedies and escalate between them. Did you 
> consider a bitmask?

Yes, I tried to explain in the commit message. If we wanted to support
escalating remediations we'd also need separate counters etc. I think
having a health reporter per remediation should actually work fairly
well.

The major case where escalations would be immediately useful would be
cases where normal remediation failed therefore we need a hard reset.
But in those cases I think having the health reporter in a failed state
should be a sufficient indication to automation to take a machine out
of service and power cycle it.

> > +enum devlink_health_reporter_remedy {
> > +	DLH_REMEDY_NONE = 1,
> > +	DLH_REMEDY_COMP_RESET,
> > +	DLH_REMEDY_RESET,
> > +	DLH_REMEDY_REINIT,
> > +	DLH_REMEDY_POWER_CYCLE,
> > +	DLH_REMEDY_REIMAGE,  
> In general, I don't expect a reported to perform POWER_CYCLE or REIMAGE 
> as part of the recovery.

Right, these are extending the use of health reporters beyond what can
be remediated automatically.

> > +	DLH_REMEDY_BAD_PART,  
> BAD_PART probably indicates that the reporter (or any command line 
> execution) cannot recover the issue.
> As the suggested remedy is static per reporter's recover method, it 
> doesn't make sense for one to set a recover method that by design cannot 
> recover successfully.
> 
> Maybe we should extend devlink_health_reporter_state with POWER_CYCLE, 
> REIMAGE and BAD_PART? To indicate the user that for a successful 
> recovery, it should run a non-devlink-health operation?

Hm, export and extend devlink_health_reporter_state? I like that idea.

> > diff --git a/net/core/devlink.c b/net/core/devlink.c
> > index 737b61c2976e..73eb665070b9 100644
> > --- a/net/core/devlink.c
> > +++ b/net/core/devlink.c
> > @@ -6095,7 +6095,8 @@ __devlink_health_reporter_create(struct devlink *devlink,
> >   {
> >   	struct devlink_health_reporter *reporter;
> >   
> > -	if (WARN_ON(graceful_period && !ops->recover))
> > +	if (WARN_ON(graceful_period && !ops->recover) ||
> > +	    WARN_ON(!ops->remedy))  
> Here you fail every reported that doesn't have remedy set, not only the 
> ones with recovery callback

Right, DLH_REMEDY_NONE doesn't require a callback. Some health
issues can be remedied by the device e.g. ECC errors or something
along those lines. Obviously non-RFC patch will have to come with
driver changes.

> >   		return ERR_PTR(-EINVAL);
> >   
> >   	reporter = kzalloc(sizeof(*reporter), GFP_KERNEL);
> > @@ -6263,7 +6264,9 @@ devlink_nl_health_reporter_fill(struct sk_buff *msg,
> >   	if (!reporter_attr)
> >   		goto genlmsg_cancel;
> >   	if (nla_put_string(msg, DEVLINK_ATTR_HEALTH_REPORTER_NAME,
> > -			   reporter->ops->name))
> > +			   reporter->ops->name) ||
> > +	    nla_put_u32(msg, DEVLINK_ATTR_HEALTH_REPORTER_REMEDY,
> > +			reporter->ops->remedy))  
> Why not new if clause like all other attributes later.

Eh.

> >   		goto reporter_nest_cancel;
> >   	if (nla_put_u8(msg, DEVLINK_ATTR_HEALTH_REPORTER_STATE,
> >   		       reporter->health_state))
> >