Message-Id: <20241127184803.75086499e71c6b1588a4fb5a@paranoici.org>
Date: Wed, 27 Nov 2024 18:48:03 +0100
From: Francesco Poli <invernomuto@...anoici.org>
To: Leon Romanovsky <leonro@...dia.com>
Cc: Uwe Kleine-König <ukleinek@...ian.org>,
 <1086520@...s.debian.org>, Mark Zhang <markzhang@...dia.com>,
 <linux-rdma@...r.kernel.org>, <netdev@...r.kernel.org>
Subject: Re: Bug#1086520: linux-image-6.11.2-amd64: makes opensm fail to
 start

On Mon, 25 Nov 2024 21:38:37 +0200 Leon Romanovsky wrote:

> On Mon, Nov 25, 2024 at 07:54:43PM +0100, Francesco Poli wrote:
[...]
> > I will try to continue to bisect by testing the resulting kernels on a
> > compute node: there's no OpenSM there and it cannot run anyway, if
> > there's another OpenSM on the same InfiniBand network.
> > However, I can check whether those issm* symlinks are created in
> > /sys/class/infiniband_mad/ 
> > I really hope that this is enough to pinpoint the first bad
> > commit...
> 
> Yes, these symlinks should be there. Your test scenario is correct one.

OK, I have completed the bisect on a compute node without OpenSM, by
looking at the issm* symlinks, as I said.

See below.

> 
> > 
> > Any better ideas?
> 
> I think that commit: 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port")
> is the one which is causing to troubles, which leads me to suspect FW.
[...]

Thanks to your guess about the possibly troublesome commit, the bisect was completed in a few steps:

  $ git checkout 2a5db20fa532
  $ make -j 12 my_defconfig bindeb-pkg
  
  [installed this version on a compute node test image and rebooted
  one compute node with that image: the InfiniBand network was
  working for that node (no surprise, since OpenSM was running
  on the head node), but no issm* symlink was created; please note
  that, surprisingly, the Ethernet network was not working: the
  Ethernet interfaces were not found by the kernel...]
  
  root@...e # ls -altrF /sys/class/infiniband_mad/
  total 0
  drwxr-xr-x 60 root root    0 Nov 26 17:06 ../
  lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
  -r--r--r--  1 root root 4096 Nov 26 17:06 abi_version
  lrwxrwxrwx  1 root root    0 Nov 26 17:06 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
  drwxr-xr-x  2 root root    0 Nov 26 17:08 ./
  
  $ git bisect bad
  Bisecting: 0 revisions left to test after this (roughly 0 steps)
  [65528cfb21fdb68de8ae6dccae19af180d93e143] net/mlx5: mlx5_ifc update for multi-plane support
  $ make -j 12 my_defconfig bindeb-pkg
  
  [installed this version on the compute node test image and rebooted
  one compute node with that image: the InfiniBand network was again
  working for that node, and the issm* symlinks were created; the
  Ethernet network was again not working for that node...]
  
  root@...e # ls -altrF /sys/class/infiniband_mad/
  total 0
  drwxr-xr-x 60 root root    0 Nov 26 17:31 ../
  lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/umad0/
  -r--r--r--  1 root root 4096 Nov 26 17:31 abi_version
  lrwxrwxrwx  1 root root    0 Nov 26 17:31 umad1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/umad1/
  lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm1 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.1/infiniband_mad/issm1/
  lrwxrwxrwx  1 root root    0 Nov 26 17:36 issm0 -> ../../devices/pci0000:00/0000:00:01.1/0000:01:00.0/infiniband_mad/issm0/
  drwxr-xr-x  2 root root    0 Nov 26 17:36 ./
  
  $ git bisect good
  2a5db20fa532198639671713c6213f96ff285b85 is the first bad commit
  commit 2a5db20fa532198639671713c6213f96ff285b85
  Author: Mark Zhang <markzhang@...dia.com>
  Date:   Sun Jun 16 19:08:35 2024 +0300
  
      RDMA/mlx5: Add support to multi-plane device and port
  
      When multi-plane is supported, a logical port, which is aggregation of
      multiple physical plane ports, is exposed for data transmission.
      Compared with a normal mlx5 IB port, this logical port supports all
      functionalities except Subnet Management.
  
      Signed-off-by: Mark Zhang <markzhang@...dia.com>
      Link: https://lore.kernel.org/r/7e37c06c9cb243be9ac79930cd17053903785b95.1718553901.git.leon@kernel.org
      Signed-off-by: Leon Romanovsky <leonro@...dia.com>
  
   drivers/infiniband/hw/mlx5/main.c               | 60 +++++++++++++++++++++----
   drivers/infiniband/hw/mlx5/mlx5_ib.h            |  2 +
   drivers/net/ethernet/mellanox/mlx5/core/vport.c |  1 +
   include/linux/mlx5/driver.h                     |  1 +
   4 files changed, 55 insertions(+), 9 deletions(-)


In other words, bingo! Your guess looks correct: the first bad commit
is the one you mentioned.
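
For reference, the good/bad test applied at each bisect step above
(are there any issm* entries in /sys/class/infiniband_mad/?) could be
scripted roughly as follows. This is only a minimal sketch: the
function name check_issm and its optional directory argument are
hypothetical, introduced here for illustration, and it is not what was
actually run on the node, where the directory was inspected by hand
with ls.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tool): succeed if at
# least one issm* entry exists in the given directory, fail otherwise.
# The directory defaults to the sysfs class directory inspected above.
check_issm() {
    dir="${1:-/sys/class/infiniband_mad}"
    for entry in "$dir"/issm*; do
        # If the glob matches nothing, $entry is the literal pattern,
        # which does not exist, so the loop falls through to "return 1".
        # -L also accepts a dangling symlink as a match.
        if [ -e "$entry" ] || [ -L "$entry" ]; then
            return 0    # issm* present: would count as "good"
        fi
    done
    return 1            # no issm* entries: would count as "bad"
}
```

On the rebooted compute node, something like
"check_issm && echo good || echo bad" would then give the verdict to
pass to git bisect on the build machine.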


Now, I will try to upgrade the firmware of the InfiniBand NICs, as you
suggested, and check whether this solves the issue with the recent
Linux kernel versions.

Please confirm that the procedure to be followed is the one described in
<https://docs.nvidia.com/networking/display/ubuntu2204/firmware+burning>

Thanks for your time and patience, and for all the help you are kindly
providing!   :-)


-- 
 http://www.inventati.org/frx/
 There's not a second to spare! To the laboratory!
..................................................... Francesco Poli .
 GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE
