Date: Sun, 26 May 2024 15:35:37 +0300
From: Shay Drori <shayd@...dia.com>
To: "Berger, Michal" <michal.berger@...el.com>, "netdev@...r.kernel.org"
	<netdev@...r.kernel.org>, <moshe@...dia.com>
Subject: Re: Kernel panic triggered while removing mlx5_core devices from the
 pci bus

Hi Michal.

Can you please try the change below [1]?
We tried it locally and it seems to solve the issue.

Thanks,
Shay Drory

[1]
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 6574c145dc1e..459a836a5d9c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1298,6 +1298,9 @@ static int mlx5_function_teardown(struct mlx5_core_dev *dev, bool boot)
         if (!err)
                 mlx5_function_disable(dev, boot);
+       else
+               mlx5_stop_health_poll(dev, boot);
+
         return err;
}



On 24/05/2024 11:07, Berger, Michal wrote:
> Kernel: 6.7.0, 6.8.8 (fedora builds)
> Devices: MT27710 Family [ConnectX-4 Lx] (0x1015), fw_ver: 14.23.1020
> rdma-core: 44.0
> 
> We have a small test which performs a somewhat controlled hotplug of the net device on the pci bus (via sysfs). The affected device is part of the nvmf-rdma setup running in SPDK context (see https://github.com/spdk/spdk/blob/master/test/nvmf/target/device_removal.sh). Sometimes (unfortunately it is not reproducible on every run), when the device is removed, the kernel hits an
> Oops. With our panic setup this is followed by a kernel reboot, but if we allow the kernel to continue, it eventually deadlocks.
> 
> This happens across different systems using the same set of NICs. Example of these oops attached.
> 
> Just to note, we previously had the same issue under older kernels (e.g. 6.1), all reported here https://bugzilla.kernel.org/show_bug.cgi?id=218288. Bump to 6.7.0 helped to reduce the frequency
> of this issue but unfortunately it's still there.
> 
> Any hints on how to tackle this issue would be appreciated.
> 
> Regards,
> Michal
