netdev - Re: Kernel crash after FLR reset of a ConnectX-5 PF in switchdev mode

Message-ID: <a699edef43990b6403884097e5eb923b411b36f6.camel@linux.ibm.com>
Date:   Wed, 19 Apr 2023 13:47:12 +0200
From:   Niklas Schnelle <schnelle@...ux.ibm.com>
To:     Saeed Mahameed <saeedm@...dia.com>
Cc:     Saeed Mahameed <saeed@...nel.org>,
        Leon Romanovsky <leon@...nel.org>,
        Gerd Bayer <gbayer@...ux.ibm.com>,
        "alexander.sschmidt" <alexander.sschmidt@...ux.ibm.com>,
        Alexandra Winter <wintera@...ux.ibm.com>,
        netdev@...r.kernel.org, rrameshbabu@...dia.com, gal@...dia.com,
        moshe@...dia.com, shayd@...dia.com
Subject: Re: Kernel crash after FLR reset of a ConnectX-5 PF in switchdev
 mode

On Fri, 2023-04-14 at 15:27 -0700, Saeed Mahameed wrote:
> On 14 Apr 09:12, Niklas Schnelle wrote:
> > On Thu, 2023-04-13 at 15:02 -0700, Saeed Mahameed wrote:
> > > On 13 Apr 14:02, Leon Romanovsky wrote:
> > > > On Tue, Apr 11, 2023 at 05:11:11PM +0200, Niklas Schnelle wrote:
> > > > > Hi Saeed, Hi Leon,
> > > > > 
> > > > > While testing PCI recovery with a ConnectX-5 card (MT28800, fw
> > > > > 16.35.1012) and vanilla 6.3-rc4/5/6 on s390 I've run into a kernel
> > > > > crash (stacktrace attached) when the card is in switchdev mode. No
> > > > > crash occurs and the recovery succeeds in legacy mode (with VFs). I
> > > > > found that the same crash occurs also with a simple Function Level
> > > > > Reset instead of the s390 specific PCI recovery, see instructions
> > > > > below. From the looks of it I think this could affect non-s390 too but
> > > > > I don't have a proper x86 test system with a ConnectX card to test
> > > > > with.
> > > > > 
> > > > > Anyway, I tried to analyze further but got stuck after figuring out
> > > > > that in mlx5e_remove() deep down from mlx5_fw_fatal_reporter_err_work()
> > > > > (see trace) the mlx5e_dev->priv pointer is valid but the pointed to
> > > > > struct only contains zeros as it was previously zeroed by
> > > > > mlx5_mdev_uninit() which then leads to a NULL pointer access.
> > > > > 
> > > > > The crash itself can be prevented by the following debug patch though
> > > > > clearly this is not a proper fix:
> > > > > 
> > > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > > > @@ -6012,6 +6012,10 @@ static void mlx5e_remove(struct auxiliary_device
> > > > > *adev)
> > > > >         struct mlx5e_priv *priv = mlx5e_dev->priv;
> > > > >         pm_message_t state = {};
> > > > > 
> > > > > +       if (!priv->mdev) {
> > > > > +               pr_err("%s with zeroed mlx5e_dev->priv\n", __func__);
> > > > > +               return;
> > > > > +       }
> > > > >         mlx5_core_uplink_netdev_set(priv->mdev, NULL);
> > > > >         mlx5e_dcbnl_delete_app(priv);
> > > > >         unregister_netdev(priv->netdev);
> > > > > 
> > > > > With that I then tried to track down why mlx5_mdev_uninit() is called
> > > > > and this might actually be s390 specific in that this happens during
> > > > > the removal of the VF which on s390 causes extra hot unplug events for
> > > > > the VFs (our virtualized PCI hotplug is per-PCI function) resulting in
> > > > > the following call trace:
> > > > > 
> > > > > ...
> > > > > zpci_bus_remove_device()
> > > > >    zpci_iov_remove_virtfn()
> > > > >       pci_iov_remove_virtfn()
> > > > >          pci_stop_and_remove_bus_device()
> > > > >             pci_stop_bus_device()
> > > > >                device_release_driver_internal()
> > > > >                   pci_device_remove()
> > > > >                      remove_one()
> > > > >                         mlx5_mdev_uninit()
> > > > > 
> > > > > Then again I would expect that on other architectures VFs become at
> > > > > > least unresponsive during a FLR of the PF not sure if that also lead to
> > > > > calls to remove_one() though.
> > > > > 
> > > > > As another data point I tried the same on the default Ubuntu 22.04
> > > > > generic 5.15 kernel and there no crash occurs so this might be a newer
> > > > > issue.
> > > > > 
> > > > > Also, I did test with and without the patch I sent recently for
> > > > > skipping the wait in mlx5_health_wait_pci_up() but that made no
> > > > > difference.
> > > > > 
> > > > > Any hints on how to debug this further and could you try to see if this
> > > > > occurs on other architectures as well?
> > > > 
> > > > My guess that the splash, which complains about missing mutex_init(), is an outcome of these failures:
> > > > [ 1375.771395] mlx5_core 0004:00:00.0 eth0 (unregistering): vport 1 error -67 reading stats
> > > > [ 1376.151345] mlx5_core 0004:00:00.0: mlx5e_init_nic_tx:5376:(pid 1505): create tises failed, -67
> > > > [ 1376.238808] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: new profile init failed, -67
> > > > [ 1376.243746] mlx5_core 0004:00:00.0: mlx5e_init_rep_tx:1101:(pid 1505): create tises failed, -67
> > > > [ 1376.328623] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: failed to rollback to orig profile,
> > > 
> > > Yes, I also agree with Leon, if rollback fails this could be fatal to mlx5e
> > > aux device removal as we don't have a way to check the state of the mlx5e
> > > priv, We always assume it is up as long as the aux is up, which is wrong
> > > only in case of this un-expected error flow.
> > > 
> > > If we just add a flag and skip mlx5e_remove, then we will end up with
> > > dangling netdev and some other resources as the cleanup wasn't complete..
> > > 
> > > I need to dive deeper to figure out a proper solution, I will create an internal
> > > ticket to track this and help provide a solution soon, hopefully.
> > > 
> > 
> > Thank you for looking into this, do you have an idea what got us into
> > this unexpected error flow. This occurs very reliably for me but I'm
> > not sure if it is s390 specific or just caused by the switchdev setup.
> > It's also unexpected to me that the code reports -ENOLINK does that
> > refer to the PCIe link here or to the representor device being
> > disconnected?
> > 
> 
> I believe this is not related to s390, and should happen on  x86 as well, 
> I just learned yesterday that you already filed this issue through our
> support and we already have an assignee working on this, let's work through
> the support ticket and reduce clutter on the mailing list, I am sure we will
> come up with a patch very soon and will all learn what went wrong :) ..
> I have a clear idea what the issue is, but the solution may require a
> bit of refactoring.. 
> 
> -Saeed.

Sounds good and thank you!

Niklas