Message-ID: <dabeaf9c-fe6c-464e-a647-815e51ec33ce@ans.pl>
Date: Wed, 11 Sep 2024 23:46:11 -0700
From: Krzysztof Olędzki <ole@....pl>
To: Ido Schimmel <idosch@...dia.com>
Cc: gal@...dia.com, Tariq Toukan <tariqt@...dia.com>,
Yishai Hadas <yishaih@...dia.com>, Michal Kubecek <mkubecek@...e.cz>,
Jakub Kicinski <kuba@...nel.org>, Andrew Lunn <andrew@...n.ch>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [mlx4] Mellanox ConnectX2 (MHQH29C aka 26428) and module
diagnostic support (ethtool -m) issues
Good morning Ido,
On 05.09.2024 at 09:23, Ido Schimmel wrote:
> On Wed, Sep 04, 2024 at 09:47:04PM -0700, Krzysztof Olędzki wrote:
>> This BTW looks like another problem:
>>
>> # ethtool -m eth1 hex on offset 254 length 1
>> Offset Values
>> ------ ------
>> 0x00fe: 00
>>
>> # ethtool -m eth1 hex on offset 255 length 1
>> Cannot get Module EEPROM data: Unknown error 1564
>>
>> mlx4_core 0000:01:00.0: MLX4_CMD_MAD_IFC Get Module info attr(ff60) port(1) i2c_addr(50) offset(255) size(1): Response Mad Status(61c) - invalid device_address or size (that is, size equals 0 or address+size is greater than 256)
>> mlx4_en: eth1: mlx4_get_module_info i(0) offset(255) bytes_to_read(1) - FAILED (0xfffff9e4)
>>
>> With the netlink interface, ethtool seems to be only asking for the first 128 bytes, which works:
>
> Yes. The upper 128 bytes are reserved so sff8079_show_all_nl() doesn't
> bother querying them. Explains why you don't see this error with
> netlink.
>
> Regarding the runtime "--disable-netlink" patch, I personally don't mind
> and Andrew seems in favor, so please post a proper patch and let's see
> what Michal says.
Thanks. Will send it, let's see how it goes.
> Regarding the patch that unmasks the I2C address error, I would target
> it at net-next as it doesn't really fix a bug (ethtool already displays
> what it can).
Well, I would argue that it does fix a bug to some extent, because otherwise
"ethtool -m" shows wrong DOM data, instead of just not showing it at all.
Also, judging from the date of the commit [1], I would say it is likely that
the workaround was actually added to address the limitation of the firmware
that was there at that time.
That said, definitely a net-next target, same for the other three patches
that I just sent.
> Thinking about it, I believe it would be more worthwhile
> to implement the much simpler get_module_eeprom_by_page() ethtool
> operation in mlx4 (I can help with the review). It would've helped
> avoiding the current issue (kernel will return an error) and the
> previous bug [1] you encountered with the legacy operations.
Sure, I can also try to work on that one. Would mlx5/core/en_ethtool.c
be a good example of how this should be implemented?
> Regarding the fact that these modules work properly with CX3, but not
> with CX2 (which uses the same driver), it really seems like a HW/FW
> problem and unfortunately I can't help with that.
Yes, understood. Initially I was hoping for this to be either a bug in
the driver, or something that could be quirked there, but after spending
the last week playing with both the kernel and the firmware, I am convinced
that it is a firmware issue (at least for the "offset(255)" and the "A2h"
problems) and that both can only be fixed there. At this point, I cannot
count how many times I used mlxburn/flint and the "reset" button after
crashing either the NIC FW or the kernel, but I think I now understand
where the problem is and even have some ideas on how to fix it...
For the "offset(255)" issue, I am confident this is an off-by-one error
(address+size is compared to 255 instead of 256) that most likely never
got fixed in the CX2 firmware, at least not in the publicly available
versions. :( I was able to reproduce almost exactly the same behavior
on CX3 by flashing a very, very ancient firmware version.
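A minimal C sketch of the suspected off-by-one follows. All names here are
hypothetical and this is not the firmware's actual code; it only shows how
comparing address+size against 255 instead of 256 still accepts a 1-byte read
at offset 254 but rejects the same read at offset 255.

```c
#include <stdbool.h>
#include <stddef.h>

/* Bytes addressable at one I2C address (offsets 0..255). */
#define MODULE_PAGE_SIZE 256

/* Expected check: the last byte read must still lie inside the page. */
static bool range_ok_expected(size_t offset, size_t size)
{
	return size != 0 && offset + size <= MODULE_PAGE_SIZE;
}

/*
 * Suspected check (assumption, not taken from any firmware source):
 * the upper bound is 255 instead of 256, so the last byte is unreachable.
 */
static bool range_ok_suspected(size_t offset, size_t size)
{
	return size != 0 && offset + size <= MODULE_PAGE_SIZE - 1;
}
```

With the suspected check, a read of offset 254 with length 1 passes and a read
of offset 255 with length 1 fails, which matches the two ethtool invocations
quoted above.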
For the "A2h" issue, it looks like a FW limitation, as the old (CX2) version
of the firmware only allows access to the 0x50 i2c address, instead of 0x50 (for A0h)
*and* 0x51 (for A2h). I was also able to reproduce it with the ancient CX3 FW.
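For reference, the relation between the SFF names and the i2c addresses can be
sketched in C as below. The helper names are hypothetical, and the "old
firmware" predicate is only a model of the behavior described above, not
actual firmware logic: A0h and A2h are 8-bit bus addresses, and dropping the
R/W bit gives the 7-bit addresses 0x50 and 0x51.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert an 8-bit SFF bus address (A0h, A2h) to a 7-bit i2c address. */
static uint8_t sff_to_i2c_addr(uint8_t sff_addr)
{
	return sff_addr >> 1;
}

/*
 * Model of the observed old-firmware behavior (assumption): only the
 * 0x50 address is accepted, so the diagnostics at A2h (0x51) are
 * unreachable.
 */
static bool old_fw_accepts(uint8_t i2c_addr)
{
	return i2c_addr == 0x50;
}
```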
The QSFP error is a big mystery that most likely will never be solved (at least
not by me), as I have neither the knowledge nor the documentation to make any
progress there and, after all, CX2 is technically retro hardware at this point:
"ConnectX-2 is End-of-Life since 2015 and End-of-Service since 2017" says
it all and is 100% spot on. Still, fun to use! ;)
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=32a173c7f9e9ec2b87142f67e1478cd20084a45b
Thanks,
Krzysztof