netdev - Re: [PATCH iwl-next v1] ice: fw and port health status

Message-ID: <dd4cdc06-23a4-4e3a-abcf-f7fd12f33622@intel.com>
Date: Fri, 22 Nov 2024 11:33:07 -0800
From: Tony Nguyen <anthony.l.nguyen@...el.com>
To: Konrad Knitter <konrad.knitter@...el.com>,
	<intel-wired-lan@...ts.osuosl.org>
CC: <przemyslaw.kitszel@...el.co>, <netdev@...r.kernel.org>,
	<kuba@...nel.org>, <pabeni@...hat.com>, <dumazet@...gle.com>,
	<davem@...emloft.net>, <andrew+netdev@...n.ch>, Sharon Haroni
	<sharon.haroni@...el.com>, Nicholas Nunley <nicholas.d.nunley@...el.com>,
	Brett Creeley <brett.creeley@....com>
Subject: Re: [PATCH iwl-next v1] ice: fw and port health status



On 11/18/2024 2:48 AM, Konrad Knitter wrote:
> Firmware generates events for global events or port specific events.
> 
> Driver shall subscribe for health status events from firmware on supported
> FW versions >= 1.7.6.
> Driver shall expose those under specific health reporter, two new
> reporters are introduced:
> - FW health reporter shall represent global events (problems with the
> image, recovery mode);
> - Port health reporter shall represent port-specific events (module
> failure).
> 
> Firmware only reports problems when those are detected, it does not store
> active fault list.
> Driver will hold only last global and last port-specific event.
> Driver will report all events via devlink health report,
> so in case of multiple events of the same source they can be reviewed
> using devlink autodump feature.
> 
> $ devlink health
> 
> pci/0000:b1:00.3:
>    reporter fw
>      state healthy error 0 recover 0 auto_dump true
>    reporter port
>      state error error 1 recover 0 last_dump_date 2024-03-17
> 	last_dump_time 09:29:29 auto_dump true
> 
> $ devlink health diagnose pci/0000:b1:00.3 reporter port
> 
>    Syndrome: 262
>    Description: Module is not present.
>    Possible Solution: Check that the module is inserted correctly.
>    Port Number: 0
> 
> Tested on Intel Corporation Ethernet Controller E810-C for SFP
> 
> Co-developed-by: Sharon Haroni <sharon.haroni@...el.com>
> Signed-off-by: Sharon Haroni <sharon.haroni@...el.com>
> Co-developed-by: Nicholas Nunley <nicholas.d.nunley@...el.com>
> Signed-off-by: Nicholas Nunley <nicholas.d.nunley@...el.com>
> Co-developed-by: Brett Creeley <brett.creeley@....com>
> Signed-off-by: Brett Creeley <brett.creeley@....com>
> Signed-off-by: Konrad Knitter <konrad.knitter@...el.com>
> ---
>   .../net/ethernet/intel/ice/devlink/health.c   | 290 +++++++++++++++++-
>   .../net/ethernet/intel/ice/devlink/health.h   |  12 +
>   .../net/ethernet/intel/ice/ice_adminq_cmd.h   |  87 ++++++
>   drivers/net/ethernet/intel/ice/ice_common.c   |  37 +++
>   drivers/net/ethernet/intel/ice/ice_common.h   |   2 +
>   drivers/net/ethernet/intel/ice/ice_main.c     |   3 +
>   drivers/net/ethernet/intel/ice/ice_type.h     |   5 +
>   7 files changed, 429 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/devlink/health.c b/drivers/net/ethernet/intel/ice/devlink/health.c
> index c7a8b8c9e1ca..4e6c6891e207 100644
> --- a/drivers/net/ethernet/intel/ice/devlink/health.c
> +++ b/drivers/net/ethernet/intel/ice/devlink/health.c
> @@ -1,13 +1,272 @@
>   // SPDX-License-Identifier: GPL-2.0
>   /* Copyright (c) 2024, Intel Corporation. */
>   
> -#include "health.h"
>   #include "ice.h"
> +#include "ice_adminq_cmd.h" /* for enum ice_aqc_health_status_elem */
> +#include "health.h"

Is there a reason you're re-ordering health.h?

>   #include "ice_ethtool_common.h"
>   
>   #define ICE_DEVLINK_FMSG_PUT_FIELD(fmsg, obj, name) \
>   	devlink_fmsg_put(fmsg, #name, (obj)->name)
>   
> +#define ICE_HEALTH_STATUS_DATA_SIZE 2
> +
> +struct ice_health_status {
> +	enum ice_aqc_health_status code;
> +	const char *description;
> +	const char *solution;
> +	const char *data_label[ICE_HEALTH_STATUS_DATA_SIZE];
> +};
> +
> +/**

Wrong style, should be '/*'

drivers/net/ethernet/intel/ice/devlink/health.c:22: warning: This 
comment starts with '/**', but isn't a kernel-doc comment. Refer 
Documentation/doc-guide/kernel-doc.rst

> + * In addition to the health status codes provided below, the firmware might
> + * generate Health Status Codes that are not pertinent to the end-user.
> + * For instance, Health Code 0x1002 is triggered when the command fails.
> + * Such codes should be disregarded by the end-user.
> + * The below lookup requires to be sorted by code.
> + */
> +
> +static const char *const ice_common_port_solutions =
> +	"Check your cable connection. Change or replace the module or cable. Manually set speed and duplex.";
> +static const char *const ice_port_number_label = "Port Number";
> +static const char *const ice_update_nvm_solution = "Update to the latest NVM image.";

...

> +static void ice_describe_status_code(struct devlink_fmsg *fmsg,
> +				     struct ice_aqc_health_status_elem *hse)
> +{
> +	static const char *const aux_label[] = { "Aux Data 1", "Aux Data 2" };
> +	const struct ice_health_status *health_code;
> +	u32 internal_data[2];
> +	u16 status_code;
> +
> +	status_code = le16_to_cpu(hse->health_status_code);
> +
> +	devlink_fmsg_put(fmsg, "Syndrome", status_code);
> +	if (status_code != 0) {

if (status_code) {...

> +		internal_data[0] = le32_to_cpu(hse->internal_data1);
> +		internal_data[1] = le32_to_cpu(hse->internal_data2);
> +
> +		health_code = ice_get_health_status(status_code);
> +
> +		if (!health_code)
> +			return;

Please don't separate the error check with a newline. Other occurrences 
in this patch as well, please fix those too.
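
For illustration only, a minimal userspace sketch of the layout being asked
for: the NULL check sits directly under the call it belongs to. Every demo_*
name and the table contents are hypothetical, except that 0x0106 (262) and its
description come from the devlink output quoted above. The body of
ice_get_health_status() is elided in this reply, so the bsearch() lookup here
only assumes what the quoted comment says, namely that the table is sorted by
code.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ice_health_status; not from the patch. */
struct demo_health_status {
	uint16_t code;
	const char *description;
};

/* Must stay sorted by code, because bsearch() relies on that ordering. */
static const struct demo_health_status demo_table[] = {
	{ 0x0101, "Demo entry A." },		/* made-up value */
	{ 0x0106, "Module is not present." },	/* syndrome 262 from the example */
	{ 0x0506, "Demo entry B." },		/* made-up value */
};

static int demo_cmp(const void *key, const void *elem)
{
	uint16_t code = *(const uint16_t *)key;
	const struct demo_health_status *e = elem;

	return (code > e->code) - (code < e->code);
}

static const struct demo_health_status *demo_get_health_status(uint16_t code)
{
	return bsearch(&code, demo_table,
		       sizeof(demo_table) / sizeof(demo_table[0]),
		       sizeof(demo_table[0]), demo_cmp);
}

/* Returns the description, or NULL for codes the table does not know. */
static const char *demo_describe(uint16_t status_code)
{
	const struct demo_health_status *health_code;

	health_code = demo_get_health_status(status_code);
	if (!health_code)	/* no blank line between call and check */
		return NULL;

	return health_code->description;
}
```
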

> +
> +		devlink_fmsg_string_pair_put(fmsg, "Description", health_code->description);
> +
> +		if (health_code->solution)
> +			devlink_fmsg_string_pair_put(fmsg, "Possible Solution",
> +						     health_code->solution);
> +
> +		for (int i = 0; i < ICE_HEALTH_STATUS_DATA_SIZE; i++) {
> +			if (internal_data[i] != ICE_AQC_HEALTH_STATUS_UNDEFINED_DATA)
> +				devlink_fmsg_u32_pair_put(fmsg,
> +							  health_code->data_label[i] ?
> +							  health_code->data_label[i] :
> +							  aux_label[i],
> +							  internal_data[i]);
> +		}
> +	}
> +}
> +

...

> +void ice_process_health_status_event(struct ice_pf *pf, struct ice_rq_event_info *event)
> +{
> +	const struct ice_aqc_health_status_elem *health_info;
> +	const struct ice_health_status *health_code;
> +	u16 status_code, count;
> +
> +	health_info = (struct ice_aqc_health_status_elem *)event->msg_buf;
> +	count = le16_to_cpu(event->desc.params.get_health_status.health_status_count);
> +
> +	if (count > (event->buf_len / sizeof(*health_info))) {
> +		dev_err(ice_pf_to_dev(pf), "Received a health status event with invalid element count\n");
> +		return;
> +	}
> +
> +	for (int i = 0; i < count; i++) {
> +		status_code = le16_to_cpu(health_info->health_status_code);
> +		health_code = ice_get_health_status(status_code);

Looks like the scope of these vars can be reduced to this loop.
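
A rough userspace sketch of what the narrower scope could look like, assuming
a simplified element that carries only the status code. The demo_* names and
the struct layout are hypothetical; the real element is defined in
ice_adminq_cmd.h and is not shown in this reply.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified element; not the real AQ structure. */
struct demo_elem {
	uint8_t health_status_code[2];	/* little-endian on the wire */
};

/* Userspace stand-in for le16_to_cpu() on a raw two-byte field. */
static uint16_t demo_le16_to_cpu(const uint8_t b[2])
{
	return (uint16_t)(b[0] | (b[1] << 8));
}

/* Counts the elements that carry a non-zero status code. */
static size_t demo_count_reported(const struct demo_elem *elems, size_t count)
{
	size_t reported = 0;

	for (size_t i = 0; i < count; i++) {
		/* Declared here because nothing outside the loop uses it. */
		uint16_t status_code =
			demo_le16_to_cpu(elems[i].health_status_code);

		if (status_code)
			reported++;
	}

	return reported;
}
```
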

> +
> +		if (health_code) {
> +			switch (health_info->event_source) {
> +			case ICE_AQC_HEALTH_STATUS_GLOBAL:
> +				pf->health_reporters.fw_status = *health_info;
> +				devlink_health_report(pf->health_reporters.fw,
> +						      "FW syndrome reported", NULL);
> +				break;
> +			case ICE_AQC_HEALTH_STATUS_PF:
> +			case ICE_AQC_HEALTH_STATUS_PORT:
> +				pf->health_reporters.port_status = *health_info;
> +				devlink_health_report(pf->health_reporters.port,
> +						      "Port syndrome reported", NULL);
> +				break;
> +			default:
> +				dev_err(ice_pf_to_dev(pf), "Health code with unknown source\n");
> +			}
> +		} else {
> +			u32 data1, data2;
> +			u16 source;
> +
> +			source = le16_to_cpu(health_info->event_source);
> +			data1 = le32_to_cpu(health_info->internal_data1);
> +			data2 = le32_to_cpu(health_info->internal_data2);
> +			dev_dbg(ice_pf_to_dev(pf),
> +				"Received internal health status code 0x%08x, source: 0x%08x, data1: 0x%08x, data2: 0x%08x",
> +				status_code, source, data1, data2);
> +		}
> +		health_info++;
> +	}
> +}

...

> @@ -27,15 +29,21 @@ enum ice_mdd_src {
>    * struct ice_health - stores ice devlink health reporters and accompanied data
>    * @tx_hang: devlink health reporter for tx_hang event
>    * @mdd: devlink health reporter for MDD detection event
> + * @fw: devlink health reporter for FW Health Status events
> + * @port: devlink health reporter for Port Health Status events

These should follow the order of the struct members, i.e. 'mdd' should sit
in between 'fw' and 'port'.

>    * @tx_hang_buf: pre-allocated place to put info for Tx hang reporter from
>    *               non-sleeping context
>    * @tx_ring: ring that the hang occured on
>    * @head: descriptior head
>    * @intr: interrupt register value
>    * @vsi_num: VSI owning the queue that the hang occured on
> + * @fw_status: buffer for last received FW Status event
> + * @port_status: buffer for last received Port Status event
>    */
>   struct ice_health {
> +	struct devlink_health_reporter *fw;
>   	struct devlink_health_reporter *mdd;
> +	struct devlink_health_reporter *port;
>   	struct devlink_health_reporter *tx_hang;
>   	struct_group_tagged(ice_health_tx_hang_buf, tx_hang_buf,
>   		struct ice_tx_ring *tx_ring;

...

> +/**
> + * ice_is_fw_health_report_supported

drivers/net/ethernet/intel/ice/ice_common.c:6052: warning: missing 
initial short description on line:
  * ice_is_fw_health_report_supported

> + * @hw: pointer to the hardware structure
> + *
> + * Return true if firmware supports health status reports,

'Return' isn't recognized; it should be 'Return:'.

drivers/net/ethernet/intel/ice/ice_common.c:6059: warning: No 
description found for return value of 'ice_is_fw_health_report_supported'
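
As a sketch of a comment shape that kernel-doc should accept (short
description on the first line, 'Return:' with a colon, set off by a blank
comment line), here is a standalone userspace example. The demo_* names and
the struct are hypothetical; the 1.7.6 threshold is taken from the commit
message, and the tuple comparison is only an assumption about how such a
minimum-version check typically works, not the driver's actual helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the FW API version fields of the HW struct. */
struct demo_hw {
	uint8_t api_maj_ver;
	uint8_t api_min_ver;
	uint8_t api_patch;
};

/**
 * demo_is_fw_health_report_supported - check for FW health report support
 * @hw: pointer to the (demo) hardware structure
 *
 * Return: true if the firmware API version is at least 1.7.6, false otherwise.
 */
static bool demo_is_fw_health_report_supported(const struct demo_hw *hw)
{
	/* Compare major, then minor, then patch against 1.7.6. */
	if (hw->api_maj_ver != 1)
		return hw->api_maj_ver > 1;
	if (hw->api_min_ver != 7)
		return hw->api_min_ver > 7;

	return hw->api_patch >= 6;
}
```
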


> + * false otherwise
> + */
> +bool ice_is_fw_health_report_supported(struct ice_hw *hw)
> +{
> +	return ice_is_fw_api_min_ver(hw, ICE_FW_API_HEALTH_REPORT_MAJ,
> +				     ICE_FW_API_HEALTH_REPORT_MIN,
> +				     ICE_FW_API_HEALTH_REPORT_PATCH);
> +}
> +
> +/**
> + * ice_aq_set_health_status_cfg - Configure FW health events
> + * @hw: pointer to the HW struct
> + * @event_source: type of diagnostic events to enable
> + *
> + * Configure the health status event types that the firmware will send to this
> + * PF. The supported event types are: PF-specific, all PFs, and global.
> + * Return: 0 on success, negative error code otherwise.

IMO a newline separating the Return: would make it easier to
differentiate.

Thanks,
Tony

> + */
> +int ice_aq_set_health_status_cfg(struct ice_hw *hw, u8 event_source)