[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <25fb28f1-03f9-8307-8d9b-22f81f94dfcd@nvidia.com>
Date: Tue, 9 Mar 2021 16:18:58 +0200
From: Eran Ben Elisha <eranbe@...dia.com>
To: Jakub Kicinski <kuba@...nel.org>
CC: <netdev@...r.kernel.org>, <jiri@...nulli.us>, <saeedm@...dia.com>,
<andrew.gospodarek@...adcom.com>, <jacob.e.keller@...el.com>,
<guglielmo.morandin@...adcom.com>, <eugenem@...com>,
<eranbe@...lanox.com>
Subject: Re: [RFC] devlink: health: add remediation type
On 3/8/2021 7:16 PM, Jakub Kicinski wrote:
> On Sun, 7 Mar 2021 17:59:58 +0200 Eran Ben Elisha wrote:
>> On 3/6/2021 4:42 AM, Jakub Kicinski wrote:
>>> Currently devlink health does not give user any clear information
>>> of what kind of remediation ->recover callback will perform. This
>>> makes it difficult to understand the impact of enabling auto-
>>> -remediation, and the severity of the error itself.
>>>
>>> To allow users to make more informed decision, as well as stretch
>>> the applicability of devlink health beyond what can be fixed by
>>> resetting things - add a new remediation type attribute.
>>>
>>> Note that we only allow one remediation type per reporter, this
>>> is intentional. devlink health is not built for mixing issues
>>> of different severity into one reporter since it only maintains
>>> one dump, of the first event and a single error counter.
>>> Nudging vendors towards categorizing issues beyond coarse
>>> groups is an added bonus.
>>>
>>> Signed-off-by: Jakub Kicinski <kuba@...nel.org>
>>> ---
>>> include/net/devlink.h | 2 ++
>>> include/uapi/linux/devlink.h | 30 ++++++++++++++++++++++++++++++
>>> net/core/devlink.c | 7 +++++--
>>> 3 files changed, 37 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/net/devlink.h b/include/net/devlink.h
>>> index 853420db5d32..70b5fd6a8c0d 100644
>>> --- a/include/net/devlink.h
>>> +++ b/include/net/devlink.h
>>> @@ -664,6 +664,7 @@ enum devlink_health_reporter_state {
>>> /**
>>> * struct devlink_health_reporter_ops - Reporter operations
>>> * @name: reporter name
>>> + * remedy: severity of the remediation required
>>> * @recover: callback to recover from reported error
>>> * if priv_ctx is NULL, run a full recover
>>> * @dump: callback to dump an object
>>> @@ -674,6 +675,7 @@ enum devlink_health_reporter_state {
>>>
>>> struct devlink_health_reporter_ops {
>>> char *name;
>>> + enum devlink_health_reporter_remedy remedy;
>>> int (*recover)(struct devlink_health_reporter *reporter,
>>> void *priv_ctx, struct netlink_ext_ack *extack);
>>> int (*dump)(struct devlink_health_reporter *reporter,
>>> diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
>>> index f6008b2fa60f..bd490c5536b1 100644
>>> --- a/include/uapi/linux/devlink.h
>>> +++ b/include/uapi/linux/devlink.h
>>> @@ -534,6 +534,9 @@ enum devlink_attr {
>>> DEVLINK_ATTR_RELOAD_ACTION_STATS, /* nested */
>>>
>>> DEVLINK_ATTR_PORT_PCI_SF_NUMBER, /* u32 */
>>> +
>>> + DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, /* u32 */
>>> +
>>> /* add new attributes above here, update the policy in devlink.c */
>>>
>>> __DEVLINK_ATTR_MAX,
>>> @@ -608,4 +611,31 @@ enum devlink_port_fn_opstate {
>>> DEVLINK_PORT_FN_OPSTATE_ATTACHED,
>>> };
>>>
>>> +/**
>>> + * enum devlink_health_reporter_remedy - severity of remediation procedure
>>> + * @DLH_REMEDY_NONE: transient error, no remediation required
>> DLH_REMEDY_LOCAL_FIX: associated component will undergo a local
>> un-harmful fix attempt.
>> (e.g look for lost interrupt in mlx5e_tx_reporter_timeout_recover())
>
> Should we make it more specific? Maybe DLH_REMEDY_STALL: device stall
> detected, resumed by re-trigerring processing, without reset?
Sounds good.
>
>>> + * @DLH_REMEDY_COMP_RESET: associated device component (e.g. device queue)
>>> + * will be reset
>>> + * @DLH_REMEDY_RESET: full device reset, will result in temporary unavailability
>>> + * of the device, device configuration should not be lost
>>> + * @DLH_REMEDY_REINIT: device will be reinitialized and configuration lost
>>> + * @DLH_REMEDY_POWER_CYCLE: device requires a power cycle to recover
>>> + * @DLH_REMEDY_REIMAGE: device needs to be reflashed
>>> + * @DLH_REMEDY_BAD_PART: indication of failing hardware, device needs to be
>>> + * replaced
>>> + *
>>> + * Used in %DEVLINK_ATTR_HEALTH_REPORTER_REMEDY, categorizes the health reporter
>>> + * by the severity of the required remediation, and indicates the remediation
>>> + * type to the user if it can't be applied automatically (e.g. "reimage").
>>> + */
>> The assumption here is that a reporter's recovery function has one
>> remedy. But it can have few remedies and escalate between them. Did you
>> consider a bitmask?
>
> Yes, I tried to explain in the commit message. If we wanted to support
> escalating remediations we'd also need separate counters etc. I think
> having a health reporter per remediation should actually work fairly
> well.
That would require reporter's recovery procedure failure to trigger
health flow for other reporter.
So we can find ourselves with 2 RX reporters, sharing the same diagnose
and dump callbacks, and each has other recovery flow.
Seems a bit counterintuitive.
Maybe, per reporter, exposing a counter per each supported remedy is not
that bad?
>
> The major case where escalations would be immediately useful would be
> cases where normal remediation failed therefore we need a hard reset.
> But in those cases I think having the health reporter in a failed state
> should be a sufficient indication to automation to take a machine out
> of service and power cycle it.
>
>>> +enum devlink_health_reporter_remedy {
>>> + DLH_REMEDY_NONE = 1,
>>> + DLH_REMEDY_COMP_RESET,
>>> + DLH_REMEDY_RESET,
>>> + DLH_REMEDY_REINIT,
>>> + DLH_REMEDY_POWER_CYCLE,
>>> + DLH_REMEDY_REIMAGE,
>> In general, I don't expect a reported to perform POWER_CYCLE or REIMAGE
>> as part of the recovery.
>
> Right, these are extending the use of health reporters beyond what can
> be remediated automatically.
>
>>> + DLH_REMEDY_BAD_PART,
>> BAD_PART probably indicates that the reporter (or any command line
>> execution) cannot recover the issue.
>> As the suggested remedy is static per reporter's recover method, it
>> doesn't make sense for one to set a recover method that by design cannot
>> recover successfully.
>>
>> Maybe we should extend devlink_health_reporter_state with POWER_CYCLE,
>> REIMAGE and BAD_PART? To indicate the user that for a successful
>> recovery, it should run a non-devlink-health operation?
>
> Hm, export and extend devlink_health_reporter_state? I like that idea.
>
>>> diff --git a/net/core/devlink.c b/net/core/devlink.c
>>> index 737b61c2976e..73eb665070b9 100644
>>> --- a/net/core/devlink.c
>>> +++ b/net/core/devlink.c
>>> @@ -6095,7 +6095,8 @@ __devlink_health_reporter_create(struct devlink *devlink,
>>> {
>>> struct devlink_health_reporter *reporter;
>>>
>>> - if (WARN_ON(graceful_period && !ops->recover))
>>> + if (WARN_ON(graceful_period && !ops->recover) ||
>>> + WARN_ON(!ops->remedy))
>> Here you fail every reported that doesn't have remedy set, not only the
>> ones with recovery callback
>
> Right, DLH_REMEDY_NONE doesn't require a callback. Some health
> issues can be remedied by the device e.g. ECC errors or something
> along those lines. Obviously non-RFC patch will have to come with
> driver changes.
>
>>> return ERR_PTR(-EINVAL);
>>>
>>> reporter = kzalloc(sizeof(*reporter), GFP_KERNEL);
>>> @@ -6263,7 +6264,9 @@ devlink_nl_health_reporter_fill(struct sk_buff *msg,
>>> if (!reporter_attr)
>>> goto genlmsg_cancel;
>>> if (nla_put_string(msg, DEVLINK_ATTR_HEALTH_REPORTER_NAME,
>>> - reporter->ops->name))
>>> + reporter->ops->name) ||
>>> + nla_put_u32(msg, DEVLINK_ATTR_HEALTH_REPORTER_REMEDY,
>>> + reporter->ops->remedy))
>> Why not new if clause like all other attributes later.
>
> Eh.
>
Currently, each nla_put_* is under its own if clause.
>>> goto reporter_nest_cancel;
>>> if (nla_put_u8(msg, DEVLINK_ATTR_HEALTH_REPORTER_STATE,
>>> reporter->health_state))
>>>
>
Powered by blists - more mailing lists