Message-Id: <20241127184803.75086499e71c6b1588a4fb5a@paranoici.org>
Date: Wed, 27 Nov 2024 18:48:03 +0100
From: Francesco Poli <invernomuto@...anoici.org>
To: Leon Romanovsky <leonro@...dia.com>
Cc: Uwe Kleine-König <ukleinek@...ian.org>,
<1086520@...s.debian.org>, Mark Zhang <markzhang@...dia.com>,
<linux-rdma@...r.kernel.org>, <netdev@...r.kernel.org>
Subject: Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start
On Mon, 25 Nov 2024 21:38:37 +0200 Leon Romanovsky wrote:
> On Mon, Nov 25, 2024 at 07:54:43PM +0100, Francesco Poli wrote:
[...]
> > I will try to continue to bisect by testing the resulting kernels on a
> > compute node: there's no OpenSM there and it cannot run anyway, if
> > there's another OpenSM on the same InfiniBand network.
> > However, I can check whether those issm* symlinks are created in
> > /sys/class/infiniband_mad/
> > I really hope that this is enough to pinpoint the first bad
> > commit...
>
> Yes, these symlinks should be there. Your test scenario is correct one.
OK, I have completed the bisect on a compute node without OpenSM, by
looking at the issm* symlinks, as I said.
See below.
>
> >
> > Any better ideas?
>
> I think that commit: 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port")
> is the one which is causing to troubles, which leads me to suspect FW.
[...]
Thanks to your guess about the possibly troublesome commit, the bisect was completed in a few steps:
$ git checkout 2a5db20fa532
$ make -j 12 my_defconfig bindeb-pkg
[installed this version on a compute node test image and rebooted
one compute node with that image: the InfiniBand network was
working for that node (no surprise, since OpenSM was running
on the head node), but no issm* symlink was created; please note
that, surprisingly, the Ethernet network was not working: the
Ethernet interfaces were not found by the kernel...]
root@...e # ls -altrF /sys/class/infiniband_mad/
total 0
drwxr-xr-x 60 root root 0 Nov 26 17:06 ../
lrwxrwxrwx 1 root root 0 Nov 26 17:06 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
-r--r--r-- 1 root root 4096 Nov 26 17:06 abi_version
lrwxrwxrwx 1 root root 0 Nov 26 17:06 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
drwxr-xr-x 2 root root 0 Nov 26 17:08 ./
$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[65528cfb21fdb68de8ae6dccae19af180d93e143] net/mlx5: mlx5_ifc update for multi-plane support
$ make -j 12 my_defconfig bindeb-pkg
[installed this version on the compute node test image and rebooted
one compute node with that image: the InfiniBand network was again
working for that node, and the issm* symlinks were created;
the Ethernet network was again not working for that node...]
root@...e # ls -altrF /sys/class/infiniband_mad/
total 0
drwxr-xr-x 60 root root 0 Nov 26 17:31 ../
lrwxrwxrwx 1 root root 0 Nov 26 17:31 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
-r--r--r-- 1 root root 4096 Nov 26 17:31 abi_version
lrwxrwxrwx 1 root root 0 Nov 26 17:31 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
lrwxrwxrwx 1 root root 0 Nov 26 17:36 issm1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/issm1/
lrwxrwxrwx 1 root root 0 Nov 26 17:36 issm0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/issm0/
drwxr-xr-x 2 root root 0 Nov 26 17:36 ./
$ git bisect good
2a5db20fa532198639671713c6213f96ff285b85 is the first bad commit
commit 2a5db20fa532198639671713c6213f96ff285b85
Author: Mark Zhang <markzhang@...dia.com>
Date: Sun Jun 16 19:08:35 2024 +0300

    RDMA/mlx5: Add support to multi-plane device and port

    When multi-plane is supported, a logical port, which is aggregation of
    multiple physical plane ports, is exposed for data transmission.
    Compared with a normal mlx5 IB port, this logical port supports all
    functionalities except Subnet Management.

    Signed-off-by: Mark Zhang <markzhang@...dia.com>
    Link: https://lore.kernel.org/r/7e37c06c9cb243be9ac79930cd17053903785b95.1718553901.git.leon@kernel.org
    Signed-off-by: Leon Romanovsky <leonro@...dia.com>

 drivers/infiniband/hw/mlx5/main.c | 60 +++++++++++++++++++++----
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 +
 drivers/net/ethernet/mellanox/mlx5/core/vport.c | 1 +
 include/linux/mlx5/driver.h | 1 +
 4 files changed, 55 insertions(+), 9 deletions(-)
In other words, bingo! Your guess looks correct: the first bad commit
is the one you mentioned.
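For reference, the check performed by eye at each bisect step (are there
any issm* symlinks in /sys/class/infiniband_mad/?) could be sketched as a
small shell helper. This is only a minimal sketch: the function names
count_issm_links and issm_verdict are hypothetical, and the rule "at least
one issm* symlink means good" is an assumption based on the two listings
above, not something the kernel documents as a test.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from any existing tool).

# Print the number of issm* symlinks found in a directory.
# The directory defaults to the sysfs path used in the listings above.
count_issm_links() {
    dir="${1:-/sys/class/infiniband_mad}"
    n=0
    for entry in "$dir"/issm*; do
        # If the glob matches nothing it stays literal; -L filters that out.
        if [ -L "$entry" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Print "good" if at least one issm* symlink exists, "bad" otherwise.
# Assumption: this mirrors the manual good/bad decision made during the bisect.
issm_verdict() {
    if [ "$(count_issm_links "$1")" -gt 0 ]; then
        echo good
    else
        echo bad
    fi
}
```

On a freshly booted compute node, calling issm_verdict with no argument
would print the word that was then passed by hand to "git bisect" on the
build machine; the reboot itself still has to be done manually.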
Now, I will try to upgrade the firmware of the InfiniBand NICs, as you
suggested, and check whether this solves the issue with the recent
Linux kernel versions.
Please confirm that the procedure to be followed is the one described in
<https://docs.nvidia.com/networking/display/ubuntu2204/firmware+burning>
Thanks for your time and patience, and for all the help you are kindly
providing! :-)
--
http://www.inventati.org/frx/
There's not a second to spare! To the laboratory!
..................................................... Francesco Poli .
GnuPG key fpr == CA01 1147 9CD2 EFDF FB82 3925 3E1C 27E1 1F69 BFFE