netdev - Kernel crash after FLR reset of a ConnectX-5 PF in switchdev mode

Message-ID: <90e1efad457f40c1f9f7b8cb56852072d8ea00fd.camel@linux.ibm.com>
Date:   Tue, 11 Apr 2023 17:11:11 +0200
From:   Niklas Schnelle <schnelle@...ux.ibm.com>
To:     Saeed Mahameed <saeedm@...dia.com>,
        Leon Romanovsky <leon@...nel.org>
Cc:     Gerd Bayer <gbayer@...ux.ibm.com>,
        "alexander.sschmidt" <alexander.sschmidt@...ux.ibm.com>,
        Alexandra Winter <wintera@...ux.ibm.com>,
        netdev@...r.kernel.org
Subject: Kernel crash after FLR reset of a ConnectX-5 PF in switchdev mode

Hi Saeed, Hi Leon,

While testing PCI recovery with a ConnectX-5 card (MT28800, fw
16.35.1012) and vanilla 6.3-rc4/5/6 on s390 I've run into a kernel
crash (stacktrace attached) when the card is in switchdev mode. No
crash occurs and the recovery succeeds in legacy mode (with VFs). I
found that the same crash occurs also with a simple Function Level
Reset instead of the s390 specific PCI recovery, see instructions
below. From the looks of it I think this could affect non-s390 too but
I don't have a proper x86 test system with a ConnectX card to test
with.

Anyway, I tried to analyze further but got stuck after figuring out
that in mlx5e_remove(), deep down from mlx5_fw_fatal_reporter_err_work()
(see trace), the mlx5e_dev->priv pointer is valid but the struct it
points to contains only zeros. It was previously zeroed by
mlx5_mdev_uninit(), and this then leads to a NULL pointer access.
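
As a rough user-space illustration of the failure mode described above
(not driver code): a pointer can stay valid while the struct behind it
has been wiped, so any member pointer read from it is NULL. The names
fake_core_dev, fake_priv, fake_uninit() and fake_remove() are
hypothetical stand-ins for this sketch; the real mlx5 structures and
teardown paths are far more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins, NOT the real mlx5 structures. */
struct fake_core_dev {
	int id;
};

struct fake_priv {
	struct fake_core_dev *mdev;
	void *netdev;
};

/* Models one teardown path zeroing the private data in place. */
static void fake_uninit(struct fake_priv *priv)
{
	memset(priv, 0, sizeof(*priv));
}

/*
 * Models a second teardown path that still holds the pointer.
 * Returns the device id, or -1 if the struct was already wiped.
 * Without the mdev check, priv->mdev->id would dereference NULL
 * (assuming all-zero bytes represent a null pointer, as on common ABIs).
 */
static int fake_remove(struct fake_priv *priv)
{
	assert(priv != NULL);	/* the pointer itself is still valid */
	if (!priv->mdev)
		return -1;
	return priv->mdev->id;
}
```

Calling fake_remove() before and after fake_uninit() on the same object
shows the two outcomes: the id while the struct is intact, and -1 once it
has been zeroed.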

The crash itself can be prevented by the following debug patch though
clearly this is not a proper fix:

--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -6012,6 +6012,10 @@ static void mlx5e_remove(struct auxiliary_device *adev)
        struct mlx5e_priv *priv = mlx5e_dev->priv;
        pm_message_t state = {};

+       if (!priv->mdev) {
+               pr_err("%s with zeroed mlx5e_dev->priv\n", __func__);
+               return;
+       }
        mlx5_core_uplink_netdev_set(priv->mdev, NULL);
        mlx5e_dcbnl_delete_app(priv);
        unregister_netdev(priv->netdev);

With that I then tried to track down why mlx5_mdev_uninit() is called
and this might actually be s390 specific in that this happens during
the removal of the VF which on s390 causes extra hot unplug events for
the VFs (our virtualized PCI hotplug is per-PCI function) resulting in
the following call trace:

...
zpci_bus_remove_device()
   zpci_iov_remove_virtfn()
      pci_iov_remove_virtfn()
         pci_stop_and_remove_bus_device()
            pci_stop_bus_device()
               device_release_driver_internal()
                  pci_device_remove()
                     remove_one()
                        mlx5_mdev_uninit()

Then again I would expect that on other architectures VFs become at
least unresponsive during a FLR of the PF. I'm not sure whether that
also leads to calls to remove_one() though.

As another data point I tried the same on the default Ubuntu 22.04
generic 5.15 kernel and there no crash occurs so this might be a newer
issue.

Also, I did test with and without the patch I sent recently for
skipping the wait in mlx5_health_wait_pci_up() but that made no
difference.

Do you have any hints on how to debug this further? Could you also
check whether this occurs on other architectures?

Thanks,
Niklas

Reproduced with (0004:00:00.0 being the first PF of a ConnectX-5):

$ devlink dev eswitch set pci/0004:00:00.0 mode switchdev
$ devlink dev eswitch set pci/0004:00:00.1 mode switchdev

The next 2 lines are needed on s390 due to an unrelated issue with smfs mode.

$ devlink dev param set pci/0004:00:00.0 name flow_steering_mode value dmfs cmode runtime
$ devlink dev param set pci/0004:00:00.1 name flow_steering_mode value dmfs cmode runtime
$ echo 1 > /sys/bus/pci/devices/0004:00:00.0/sriov_numvfs
# Check the reset method is FLR though others might also cause this
$ cat /sys/bus/pci/devices/0004:00:00.0/reset_method
flr

Then trigger the crash (it occurs after a bit of recovery; it may be racy, though it hits
pretty consistently for me):

$ echo 1 > /sys/bus/pci/devices/0004:00:00.0/reset

View attachment "mlx5_switchdev_reset_crash_backtrace.txt" of type "text/plain" (11932 bytes)