Message-ID: <acpo6ocggcl66fjdllk5zrfs2vwiivpetd5ierdek5ruxvdbyl@tfbc3mfnp23o>
Date: Wed, 4 Dec 2024 17:37:05 +0100
From: Uwe Kleine-König <ukleinek@...ian.org>
To: Francesco Poli <invernomuto@...anoici.org>
Cc: Leon Romanovsky <leonro@...dia.com>, 
	1086520@...s.debian.org, Mark Zhang <markzhang@...dia.com>, linux-rdma@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to start

Hello Francesco,

On Wed, Nov 27, 2024 at 10:04:13PM +0200, Leon Romanovsky wrote:
> On Wed, Nov 27, 2024 at 06:48:03PM +0100, Francesco Poli wrote:
> > On Mon, 25 Nov 2024 21:38:37 +0200 Leon Romanovsky wrote:
> > 
> > > On Mon, Nov 25, 2024 at 07:54:43PM +0100, Francesco Poli wrote:
> > [...]
> > > > I will try to continue to bisect by testing the resulting kernels on a
> > > > compute node: there's no OpenSM there and it cannot run anyway, if
> > > > there's another OpenSM on the same InfiniBand network.
> > > > However, I can check whether those issm* symlinks are created in
> > > > /sys/class/infiniband_mad/ 
> > > > I really hope that this is enough to pinpoint the first bad
> > > > commit...
> > > 
> > > Yes, these symlinks should be there. Your test scenario is the correct one.
> > 
> > OK, I have completed the bisect on a compute node without OpenSM, by
> > looking at the issm* symlinks, as I said.
> > 
> > See below.
> > 
> > > 
> > > > 
> > > > Any better ideas?
> > > 
> > > I think that commit 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port")
> > > is the one causing the trouble, which leads me to suspect FW.
> > [...]
> > 
> > Thanks to your guess about the possibly troublesome commit, the bisect was completed in a few steps:
> > 
> >   $ git checkout 2a5db20fa532
> >   $ make -j 12 my_defconfig bindeb-pkg
> >   
> >   [install this version on a compute node test image and reboot
> >   one compute node with that image: the InfiniBand network was
> >   working for that node, that's no surprise, since OpenSM was running
> >   on the head node, but no issm* symlink was created; please note
> >   that, surprisingly, the Ethernet network was not working, I mean
> >   that the Ethernet interfaces were not found by the kernel...]
> >   
> >   root@...e # ls -altrF /sys/class/infiniband_mad/
> >   total 0
> >   drwxr-xr-x 60 root root    0 Nov 26 17:06 ../
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
> >   -r--r--r--  1 root root 4096 Nov 26 17:06 abi_version
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
> >   drwxr-xr-x  2 root root    0 Nov 26 17:08 ./
> >   
> >   $ git bisect bad
> >   Bisecting: 0 revisions left to test after this (roughly 0 steps)
> >   [65528cfb21fdb68de8ae6dccae19af180d93e143] net/mlx5: mlx5_ifc update for multi-plane support
> >   $ make -j 12 my_defconfig bindeb-pkg
> >   
> >   [install this version on the compute node test image and reboot
> >   one compute node with that image: the InfiniBand network was again
> >   working for that node, issm* symlinks were created;
> >   the Ethernet network was again not working for that node...]
> >   
> >   root@...e # ls -altrF /sys/class/infiniband_mad/
> >   total 0
> >   drwxr-xr-x 60 root root    0 Nov 26 17:31 ../
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
> >   -r--r--r--  1 root root 4096 Nov 26 17:31 abi_version
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/issm1/
> >   lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/issm0/
> >   drwxr-xr-x  2 root root    0 Nov 26 17:36 ./
> >   
> >   $ git bisect good
> >   2a5db20fa532198639671713c6213f96ff285b85 is the first bad commit
> >   commit 2a5db20fa532198639671713c6213f96ff285b85
> >   Author: Mark Zhang <markzhang@...dia.com>
> >   Date:   Sun Jun 16 19:08:35 2024 +0300
> >   
> >       RDMA/mlx5: Add support to multi-plane device and port
> >   
> >       When multi-plane is supported, a logical port, which is aggregation of
> >       multiple physical plane ports, is exposed for data transmission.
> >       Compared with a normal mlx5 IB port, this logical port supports all
> >       functionalities except Subnet Management.
> >   
> >       Signed-off-by: Mark Zhang <markzhang@...dia.com>
> >       Link: https://lore.kernel.org/r/7e37c06c9cb243be9ac79930cd17053903785b95.1718553901.git.leon@kernel.org
> >       Signed-off-by: Leon Romanovsky <leonro@...dia.com>
> >   
> >    drivers/infiniband/hw/mlx5/main.c               | 60 +++++++++++++++++++++----
> >    drivers/infiniband/hw/mlx5/mlx5_ib.h            |  2 +
> >    drivers/net/ethernet/mellanox/mlx5/core/vport.c |  1 +
> >    include/linux/mlx5/driver.h                     |  1 +
> >    4 files changed, 55 insertions(+), 9 deletions(-)
> > 
> > 
> > In other words, bingo!, your guess looks correct, the first bad commit
> > is the one you mentioned.
> > 
> > 
> > Now, I will try to upgrade the firmware of the InfiniBand NICs, as you
> > suggested, and check whether this solves the issue with the recent
> > Linux kernel versions.
> > 
> > Please confirm that the procedure to be followed is the one described in
> > <https://docs.nvidia.com/networking/display/ubuntu2204/firmware+burning>
> 
> Yes, it looks like the correct procedure.
> If you don't upgrade the FW, this diff will achieve the same result for you:
> 
> diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
> index c2314797afc9..110ce177c305 100644
> --- a/drivers/infiniband/hw/mlx5/main.c
> +++ b/drivers/infiniband/hw/mlx5/main.c
> @@ -2846,7 +2846,7 @@ static int mlx5_ib_get_plane_num(struct mlx5_core_dev *mdev, u8 *num_plane)
>         if (err)
>                 return err;
> 
> -       *num_plane = vport_ctx.num_plane;
> +       *num_plane = (vport_ctx.num_plane > 1) ? vport_ctx.num_plane : 0;
>         return 0;
>  }
> 
> The cause of your issue is that in some FW versions vport_ctx.num_plane
> was 1 rather than 0 for devices which don't support that mode, while the
> driver treats any non-zero value as meaning the mode is supported.

I wonder if you could test a firmware upgrade or the above patch. It
would be nice to know whether there is still something for us (= the
Debian kernel team) to do here.
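
For checking the result after either test, something along these lines
might help. It is only a sketch: it looks for issm* entries in
/sys/class/infiniband_mad/, the same thing the ls output quoted above was
used for during the bisect. The function name has_issm_links and its
optional directory argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: report whether any issm* entries exist in the umad class directory.
# has_issm_links and its directory argument are illustrative assumptions.
has_issm_links() {
    dir="${1:-/sys/class/infiniband_mad}"
    for entry in "$dir"/issm*; do
        # If the glob matches nothing, the pattern is left unexpanded,
        # so confirm the entry really exists (-L also catches dangling links).
        if [ -e "$entry" ] || [ -L "$entry" ]; then
            return 0
        fi
    done
    return 1
}

if has_issm_links; then
    echo "issm* entries present"
else
    echo "no issm* entries (the failing case seen in this bug)"
fi
```

Passing a directory as the first argument only serves to try the function
on a scratch directory; without an argument it checks the real sysfs path.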

If everything is fine for you, I'd like to close this bug.

Best regards
Uwe
