[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8d61628c-9ca7-13ac-2dcd-97ecc9378a9e@nvidia.com>
Date: Thu, 11 Mar 2021 16:23:09 +0200
From: Eran Ben Elisha <eranbe@...dia.com>
To: <f242ed68-d31b-527d-562f-c5a35123861a@...el.com>,
<netdev@...r.kernel.org>
CC: <jiri@...nulli.us>, <saeedm@...dia.com>,
<andrew.gospodarek@...adcom.com>, <jacob.e.keller@...el.com>,
<guglielmo.morandin@...adcom.com>, <eugenem@...com>,
<eranbe@...lanox.com>, Jakub Kicinski <kuba@...nel.org>,
Aya Levin <ayal@...dia.com>, Moshe Shemesh <moshe@...dia.com>
Subject: Re: [RFC net-next v2 3/3] devlink: add more failure modes
On 3/11/2021 5:26 AM, Jakub Kicinski wrote:
>>> Pending vendors adding the right reporters. <<
Would you like Nvidia to reply with the remedy per reporter or to
actually prepare the patch?
>
> Extend the applicability of devlink health reporters
> beyond what can be locally remedied. Add failure modes
> which require re-flashing the NVM image or HW changes.
>
> The expectation is that driver will call
> devlink_health_reporter_state_update() to put hardware
> health reporters into bad state.
>
> Signed-off-by: Jakub Kicinski <kuba@...nel.org>
> ---
> include/uapi/linux/devlink.h | 7 +++++++
> net/core/devlink.c | 3 +--
> 2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
> index 8cd1508b525b..f623bbc63489 100644
> --- a/include/uapi/linux/devlink.h
> +++ b/include/uapi/linux/devlink.h
> @@ -617,10 +617,17 @@ enum devlink_port_fn_opstate {
> * @DL_HEALTH_STATE_ERROR: error state, running health reporter's recovery
> * may fix the issue, otherwise user needs to try
> * power cycling or other forms of reset
> + * @DL_HEALTH_STATE_BAD_IMAGE: device's non-volatile memory needs
> + * to be re-written, usually due to block corruption
> + * @DL_HEALTH_STATE_BAD_HW: hardware errors detected, device, host
> + * or the connection between the two may be at fault
> */
> enum devlink_health_state {
> DL_HEALTH_STATE_HEALTHY,
> DL_HEALTH_STATE_ERROR,
> +
> + DL_HEALTH_STATE_BAD_IMAGE,
> + DL_HEALTH_STATE_BAD_HW,
> };
>
> /**
> diff --git a/net/core/devlink.c b/net/core/devlink.c
> index 09d77d43ff63..4a9fa6288a4a 100644
> --- a/net/core/devlink.c
> +++ b/net/core/devlink.c
> @@ -6527,8 +6527,7 @@ void
> devlink_health_reporter_state_update(struct devlink_health_reporter *reporter,
> enum devlink_health_state state)
> {
> - if (WARN_ON(state != DL_HEALTH_STATE_HEALTHY &&
> - state != DL_HEALTH_STATE_ERROR))
> + if (WARN_ON(state > DL_HEALTH_STATE_BAD_HW))
> return;
>
> if (reporter->health_state == state)
>
devlink_health_reporter_recover() requires an update as well.
something like:
@@ -6346,8 +6346,15 @@ devlink_health_reporter_recover(struct
devlink_health_reporter *reporter,
{
int err;
- if (reporter->health_state == DL_HEALTH_STATE_HEALTHY)
+ switch (reporter->health_state) {
+ case DL_HEALTH_STATE_HEALTHY:
return 0;
+ case DL_HEALTH_STATE_ERROR:
+ break;
+ case DL_HEALTH_STATE_BAD_IMAGE:
+ case DL_HEALTH_STATE_BAD_HW:
+ return -EOPNOTSUPP;
+ }
if (!reporter->ops->recover)
return -EOPNOTSUPP;
Powered by blists - more mailing lists