[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <a7904c43-01c7-4f9c-a1f9-e0a7ce2db532@ans.pl>
Date: Sat, 31 Aug 2024 23:28:03 -0700
From: Krzysztof Olędzki <ole@....pl>
To: Ido Schimmel <idosch@...dia.com>, Tariq Toukan <tariqt@...dia.com>,
Yishai Hadas <yishaih@...dia.com>
Cc: Michal Kubecek <mkubecek@...e.cz>, Jakub Kicinski <kuba@...nel.org>,
Andrew Lunn <andrew@...n.ch>,
"netdev@...r.kernel.org"
<netdev@...r.kernel.org>
Subject: [mlx4] Mellanox ConnectX2 (MHQH29C aka 26428) and module diagnostic
support (ethtool -m) issues
Hi,
I noticed that module diagnostic on Mellanox ConnectX2 NIC (MHQH29Caka 26428 aka 15b3:673c, FW version 2.10.0720) behaves in somehow strange ways.
1. For SFP modules the driver is able to read the first page but not the 2nd one:
[ 318.082923] mlx4_core 0000:01:00.0: MLX4_CMD_MAD_IFC Get Module info attr(ff60) port(1) i2c_addr(51) offset(0) size(48): Response Mad Status(71c) - invalid I2C slave address
[ 318.082936] mlx4_en: eth1: mlx4_get_module_info i(0) offset(256) bytes_to_read(128) - FAILED (0xfffff8e4)
However, as the driver intentionally tries mask the problem [1], ethtool reports "Optical diagnostics support" being available and shows completely wrong information [2].
Removing the workaround allows ethtool to recognize the problem and handle everything correctly [3]:
---- cut here ----
--- a/drivers/net/ethernet/mellanox/mlx4/port.c 2024-07-27 02:34:11.000000000 -0700
+++ b/drivers/net/ethernet/mellanox/mlx4/port.c 2024-08-31 21:57:11.211612505 -0700
@@ -2197,14 +2197,7 @@
0xFF60, port, i2c_addr, offset, size,
ret, cable_info_mad_err_str(ret));
- if (i2c_addr == I2C_ADDR_HIGH &&
- MAD_STATUS_2_CABLE_ERR(ret) == CABLE_INF_I2C_ADDR)
- /* Some SFP cables do not support i2c slave
- * address 0x51 (high page), abort silently.
- */
- ret = 0;
- else
- ret = -ret;
+ ret = -ret;
goto out;
}
cable_info = (struct mlx4_cable_info *)outm
---- cut here ----
However, we end up with a strange "netlink error: Unknown error 1820" error because mlx4_get_module_info returns -0x71c (0x71c is 1820 in decimal).
This can be fixed with returning -EIO instead of ret, either in mlx4_get_module_info() or perhaps better mlx4_en_get_module_eeprom() from en_ethtool.c:
---- cut here ----
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c 2024-07-27 02:34:11.000000000 -0700
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c 2024-08-31 21:52:50.370553218 -0700
@@ -2110,7 +2110,7 @@
en_err(priv,
"mlx4_get_module_info i(%d) offset(%d) bytes_to_read(%d) - FAILED (0x%x)\n",
i, offset, ee->len - i, ret);
- return ret;
+ return -EIO;
}
i += ret;
---- cut here ----
BTW: it is also possible to augment the error reporting in ethtool/sfpid.c:
---- cut here ----
- if (ret)
+ if (ret) {
+ fprintf(stderr, "Failed to read Page A2h.\n");
goto out;
+ }
---- cut here ----
With all the above changes, we now get:
---- cut here ----
Identifier : 0x03 (SFP)
Extended identifier : 0x04 (GBIC/SFP defined by 2-wire interface ID)
(...)
Date code : <REDACTED>
netlink error: Input/output error
Failed to read Page A2h.
---- cut here ----
So, the first question is if above set of fixes makes sense, give that ethtool handles this correctly? If so, I'm happy to send the fixes.
The second question is if not being able to read Page A2h and "invalid I2C slave address" is a due to a bug in the driver or a HW (firmware?) limitation and if something can be done to address this?
2. For a QSFP module (which works in CX3/CX3Pro), handling "ethtool -m" seems to be completely broken.
With QSFP module in port #2 (eth2), for the first attempt (ethtool -m eth2):
mlx4_core 0000:01:00.0: MLX4_CMD_MAD_IFC Get Module info attr(ff60) port(2) i2c_addr(50) offset(0) size(48): Response Mad Status(41c) - the connected cable has no EPROM (passive copper cable)
mlx4_en: eth2: mlx4_get_module_info i(0) offset(0) bytes_to_read(128) - FAILED (0xfffffbe4)
However, if I first try run "ethtool -m eth1" with a SFP module installed in port #1, and then immediately "ethtool -m eth2", I end up getting the information for the SFP module:
# ethtool -m eth2
Identifier : 0x03 (SFP)
Extended identifier : 0x04 (GBIC/SFP defined by 2-wire interface ID)
(...)
I this case, I even get the same "invalid I2C slave address" error:
mlx4_core 0000:01:00.0: MLX4_CMD_MAD_IFC Get Module info attr(ff60) port(2) i2c_addr(51) offset(0) size(48): Response Mad Status(71c) - invalid I2C slave address
If I immediately run "ethtool -m eth1" I get:
mlx4_core 0000:01:00.0: MLX4_CMD_MAD_IFC Get Module info attr(ff60) port(1) i2c_addr(50) offset(224) size(32): Response Mad Status(61c) - invalid device_address or size (that is, size equals 0 or address+size is greater than 256)
mlx4_en: eth1: mlx4_get_module_info i(96) offset(224) bytes_to_read(32) - FAILED (0xfffff9e4)
Alternatively, if I remove SFP module from port #1 and run "ethtool -m eth2", I get:
[ 1071.945737] mlx4_core 0000:01:00.0: MLX4_CMD_MAD_IFC Get Module ID attr(ff60) port(2) i2c_addr(50) offset(0) size(1): Response Mad Status(31c) - cable is not connected
At this point, running "ethtool -m eth1" produces one of:
*)
mlx4_core 0000:01:00.0: MLX4_CMD_MAD_IFC Get Module ID attr(ff60) port(2) i2c_addr(50) offset(0) size(1): Response Mad Status(41c) - the connected cable has no EPROM (passive copper cable)
*)
mlx4_core 0000:01:00.0: MLX4_CMD_MAD_IFC Get Module info attr(ff60) port(2) i2c_addr(50) offset(128) size(48): Response Mad Status(41c) - the connected cable has no EPROM (passive copper cable)
mlx4_en: eth2: mlx4_get_module_info i(0) offset(128) bytes_to_read(128) - FAILED (0xfffffbe4)
*)
mlx4_core 0000:01:00.0: MLX4_CMD_MAD_IFC Get Module ID attr(ff60) port(2) i2c_addr(50) offset(0) size(1): Response Mad Status(41c) - the connected cable has no EPROM (passive copper cable)
*)
mlx4_core 0000:01:00.0: MLX4_CMD_MAD_IFC Get Module info attr(ff60) port(2) i2c_addr(50) offset(0) size(48): Response Mad Status(41c) - the connected cable has no EPROM (passive copper cable)
mlx4_en: eth2: mlx4_get_module_info i(0) offset(0) bytes_to_read(128) - FAILED (0xfffffbe4)
*)
mlx4_core 0000:01:00.0: MLX4_CMD_MAD_IFC Get Module ID attr(ff60) port(2) i2c_addr(50) offset(0) size(1): Response Mad Status(41c) - the connected cable has no EPROM (passive copper cable)
I wonder if in this situation we are communicating with a wrong device or returning some stale data from kernel memory or the firmware?
Thanks,
Krzysztof
[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx4/port.c#n2200
[2]
Identifier : 0x03 (SFP)
Extended identifier : 0x04 (GBIC/SFP defined by 2-wire interface ID)
(...)
Optical diagnostics support : Yes
Laser bias current : 0.000 mA
Laser output power : 0.0000 mW / -inf dBm
Receiver signal average optical power : 0.0000 mW / -inf dBm
Module temperature : 0.00 degrees C / 32.00 degrees F
Module voltage : 0.0000 V
Alarm/warning flags implemented : Yes
Laser bias current high alarm : Off
Laser bias current low alarm : Off
Laser bias current high warning : Off
Laser bias current low warning : Off
Laser output power high alarm : Off
Laser output power low alarm : Off
Laser output power high warning : Off
Laser output power low warning : Off
Module temperature high alarm : Off
Module temperature low alarm : Off
Module temperature high warning : Off
Module temperature low warning : Off
Module voltage high alarm : Off
Module voltage low alarm : Off
Module voltage high warning : Off
Module voltage low warning : Off
Laser rx power high alarm : Off
Laser rx power low alarm : Off
Laser rx power high warning : Off
Laser rx power low warning : Off
Laser bias current high alarm threshold : 0.000 mA
Laser bias current low alarm threshold : 0.000 mA
Laser bias current high warning threshold : 0.000 mA
Laser bias current low warning threshold : 0.000 mA
Laser output power high alarm threshold : 0.0000 mW / -inf dBm
Laser output power low alarm threshold : 0.0000 mW / -inf dBm
Laser output power high warning threshold : 0.0000 mW / -inf dBm
Laser output power low warning threshold : 0.0000 mW / -inf dBm
Module temperature high alarm threshold : 0.00 degrees C / 32.00 degrees F
Module temperature low alarm threshold : 0.00 degrees C / 32.00 degrees F
Module temperature high warning threshold : 0.00 degrees C / 32.00 degrees F
Module temperature low warning threshold : 0.00 degrees C / 32.00 degrees F
Module voltage high alarm threshold : 0.0000 V
Module voltage low alarm threshold : 0.0000 V
Module voltage high warning threshold : 0.0000 V
Module voltage low warning threshold : 0.0000 V
Laser rx power high alarm threshold : 0.0000 mW / -inf dBm
Laser rx power low alarm threshold : 0.0000 mW / -inf dBm
Laser rx power high warning threshold : 0.0000 mW / -inf dBm
Laser rx power low warning threshold : 0.0000 mW / -inf dBm
[3]
# ethtool -m eth1
Identifier : 0x03 (SFP)
Extended identifier : 0x04 (GBIC/SFP defined by 2-wire interface ID)
Connector : 0x07 (LC)
Transceiver codes : 0x10 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Transceiver type : 10G Ethernet: 10G Base-SR
Encoding : 0x06 (64B/66B)
BR, Nominal : 10300MBd
Rate identifier : 0x00 (unspecified)
Length (SMF,km) : 0km
Length (SMF) : 0m
Length (50um) : 80m
Length (62.5um) : 30m
Length (Copper) : 0m
Length (OM3) : 300m
Laser wavelength : 850nm
Vendor name : IBM-Avago
Vendor OUI : <REDACTED>
Vendor PN : <REDACTED>
Vendor rev : G2.3
Option values : 0x00 0x1a
Option : RX_LOS implemented
Option : TX_FAULT implemented
Option : TX_DISABLE implemented
BR margin, max : 0%
BR margin, min : 0%
Vendor SN : <REDACTED>
Date code : <REDACTED>
netlink error: Unknown error 1820
Powered by blists - more mailing lists