linux-kernel - RE: [REGRESSION] mlx5: Driver remove during hot unplug is broken

Message-ID: <AM0PR05MB4866585FF543DA370E78B992D1670@AM0PR05MB4866.eurprd05.prod.outlook.com>
Date:   Wed, 8 Jul 2020 10:43:43 +0000
From:   Parav Pandit <parav@...lanox.com>
To:     Niklas Schnelle <schnelle@...ux.ibm.com>,
        Saeed Mahameed <saeedm@...lanox.com>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "ubraun@...ux.ibm.com" <ubraun@...ux.ibm.com>,
        "kgraul@...ux.ibm.com" <kgraul@...ux.ibm.com>,
        "raspl@...ibm.com" <raspl@...ibm.com>
Subject: RE: [REGRESSION] mlx5: Driver remove during hot unplug is broken

Hi Niklas,

> From: Niklas Schnelle <schnelle@...ux.ibm.com>
> Sent: Monday, June 15, 2020 3:32 PM
> 
> Hello Saeed,
> 
> On 6/13/20 12:01 AM, Saeed Mahameed wrote:
> > On Fri, 2020-06-12 at 15:09 +0200, Niklas Schnelle wrote:
> >> Hello Parav, Hello Saeed,
> >>
> ... snip ...
> >>
> >> So without really knowing anything about these functions I would
> >> guess that with the device still registered the drained queue does
> >> not remain empty as new entries are added.
> >> Does that sound plausible to you?
> >>
> >
> > I don't think it is related, maybe this is similar to some issues
> > addressed lately by Shay's patches:
> >
> > https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-2-saeedm@...lanox.com/
> >
> > https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-3-saeedm@...lanox.com/
> >
> > net/mlx5: drain health workqueue in case of driver load error
> > net/mlx5: Fix fatal error handling during device load
> 
> I agree with your similarity assessment, especially for the first commit.
> These do not fix the issue, though: with mainline v5.8-rc1, which has both,
> I'm still getting a hang over 50% of the time with the following detach
> sequence on z/VM:
> 
> vmcp detach pcif <mlx_fid>; echo 1 > /proc/cio_settle
> 
> Since commit 41798df9bfca ("net/mlx5: Drain wq first during PCI
> device removal") no longer reverts cleanly, I used the following diff to
> move the mlx5_drain_health_wq(dev) call after mlx5_unregister_devices(dev).
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index 8b658908f044..63a196fd8e68 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -1382,8 +1382,8 @@ static void remove_one(struct pci_dev *pdev)
> 
>         devlink_reload_disable(devlink);
>         mlx5_crdump_disable(dev);
> -       mlx5_drain_health_wq(dev);
>         mlx5_unload_one(dev, true);
> +       mlx5_drain_health_wq(dev);
>         mlx5_pci_close(dev);
>         mlx5_mdev_uninit(dev);
>         mlx5_devlink_free(devlink);
> 
> 
> Note that this changed order also matches the call order in
> mlx5_pci_err_detected().
> With that change I've now done over two dozen detachments with varying
> time between attach and detach to have the driver at different stages of
> initialization.
> With the change all worked without a hitch.
> 
> Best regards,
> Niklas Schnelle
> >

Sorry for the late response.
Yes, this looks good; I reached the same conclusion in my analysis.
With the latest code, mlx5_pci_close() already calls mlx5_drain_health_wq(), so the additional call in remove_one() is redundant and should simply be removed.
It would be really helpful if you could verify the hunk below in your setup.
You still need patch 42ea9f1b5c6 in your tree.

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 8b658908f044..ebec2318dbc4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1382,7 +1382,6 @@ static void remove_one(struct pci_dev *pdev)

        devlink_reload_disable(devlink);
        mlx5_crdump_disable(dev);
-       mlx5_drain_health_wq(dev);
        mlx5_unload_one(dev, true);
        mlx5_pci_close(dev);
        mlx5_mdev_uninit(dev);
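
For readers following the ordering argument, here is a minimal userspace sketch of the hypothesis Niklas raised earlier in the thread: a queue drained while the device is still registered may not stay empty. It is a toy model only and does not reproduce the real mlx5 health workqueue; every name in it (toy_dev, toy_health_event, toy_drain, toy_unload, toy_remove) is hypothetical and exists purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, NOT kernel code. All names are hypothetical. */
struct toy_dev {
	bool registered;	/* device can still generate health work */
	int pending;		/* number of queued health work items */
};

/* A health event only queues work while the device is registered. */
static void toy_health_event(struct toy_dev *d)
{
	if (d->registered)
		d->pending++;
}

/* Flush everything that is queued right now. */
static void toy_drain(struct toy_dev *d)
{
	d->pending = 0;
}

/* After unload, no new work can be queued. */
static void toy_unload(struct toy_dev *d)
{
	d->registered = false;
}

enum toy_order { TOY_DRAIN_THEN_UNLOAD, TOY_UNLOAD_THEN_DRAIN };

/*
 * Tear the device down in the given order, with one health event
 * arriving between the two steps. Returns the number of work items
 * still pending once teardown has finished.
 */
static int toy_remove(enum toy_order order)
{
	struct toy_dev d = { .registered = true, .pending = 1 };

	if (order == TOY_DRAIN_THEN_UNLOAD) {
		toy_drain(&d);
		toy_health_event(&d);	/* lands after the drain */
		toy_unload(&d);
	} else {
		toy_unload(&d);
		toy_health_event(&d);	/* ignored: already unloaded */
		toy_drain(&d);
	}

	assert(d.pending >= 0);
	return d.pending;
}
```

In this model, toy_remove(TOY_DRAIN_THEN_UNLOAD) leaves one item pending, while toy_remove(TOY_UNLOAD_THEN_DRAIN) leaves none. The second order corresponds to the hunk above: once the explicit call is dropped from remove_one(), the only drain left is the one inside mlx5_pci_close(), which runs after mlx5_unload_one().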