netdev - Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken

Message-ID: <26dedb23-819f-8121-6e04-72677110f3cc@linux.ibm.com>
Date:   Mon, 15 Jun 2020 12:01:32 +0200
From:   Niklas Schnelle <schnelle@...ux.ibm.com>
To:     Saeed Mahameed <saeedm@...lanox.com>,
        Parav Pandit <parav@...lanox.com>
Cc:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        ubraun@...ux.ibm.com, kgraul@...ux.ibm.com, raspl@...ibm.com
Subject: Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken

Hello Saeed,

On 6/13/20 12:01 AM, Saeed Mahameed wrote:
> On Fri, 2020-06-12 at 15:09 +0200, Niklas Schnelle wrote:
>> Hello Parav, Hello Saeed,
>>
... snip ...
>>
>> So without really knowing anything about these functions I would
>> guess that with the device still registered the drained
>> queue does not remain empty as new entries are added.
>> Does that sound plausible to you?
>>
> 
> I don't think it is related, maybe this is similar to some issues
> addressed lately by Shay's patches:
> 
> https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-2-saeedm@mellanox.com/
> https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-3-saeedm@mellanox.com/
> 
> net/mlx5: drain health workqueue in case of driver load error
> net/mlx5: Fix fatal error handling during device load

I agree with your similarity assessment, especially for the first commit.
These do not fix the issue, though: with mainline v5.8-rc1, which has
both, I'm still getting a hang over 50% of the time with the following
detach sequence on z/VM:

vmcp detach pcif <mlx_fid>; echo 1 > /proc/cio_settle

Since commit 41798df9bfca ("net/mlx5: Drain wq first during PCI device removal")
no longer reverts cleanly, I used the following diff to move mlx5_drain_health_wq(dev)
after the mlx5_unregister_devices(dev).

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 8b658908f044..63a196fd8e68 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1382,8 +1382,8 @@ static void remove_one(struct pci_dev *pdev)

        devlink_reload_disable(devlink);
        mlx5_crdump_disable(dev);
-       mlx5_drain_health_wq(dev);
        mlx5_unload_one(dev, true);
+       mlx5_drain_health_wq(dev);
        mlx5_pci_close(dev);
        mlx5_mdev_uninit(dev);
        mlx5_devlink_free(devlink);


Note that this changed order also matches the call order in mlx5_pci_err_detected().
With that change I've now done over two dozen detachments, varying the time between
attach and detach so that the driver is at different stages of initialization.
All of them worked without a hitch.
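
To illustrate why the ordering might matter, here is a toy userspace model of
the guess quoted above (that a queue drained while the device is still
registered does not stay empty). It is not mlx5 code: struct toy_dev and all
toy_* functions are made up for illustration, and the "late event" between the
two teardown steps is an assumption, not something observed in the traces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the mlx5 driver. */
struct toy_dev {
	bool registered;	/* consumers attached, may still queue work */
	size_t pending;		/* items sitting on the toy health queue */
};

/* Queue one work item; rejected once the device is unregistered. */
static bool toy_queue_work(struct toy_dev *d)
{
	if (!d->registered)
		return false;
	d->pending++;
	return true;
}

/* Process everything that is queued right now. */
static void toy_drain(struct toy_dev *d)
{
	d->pending = 0;
}

/* Detach consumers so that no new work can be queued. */
static void toy_unregister(struct toy_dev *d)
{
	d->registered = false;
}

/*
 * Tear down in one of the two orders and return the work left over.
 * A single late event (assumed) tries to queue work between the steps.
 */
static size_t toy_teardown(struct toy_dev *d, bool drain_first)
{
	if (drain_first) {
		toy_drain(d);
		toy_queue_work(d);	/* accepted: still registered */
		toy_unregister(d);
	} else {
		toy_unregister(d);
		toy_queue_work(d);	/* rejected: already unregistered */
		toy_drain(d);
	}
	return d->pending;
}
```

In this model, draining first leaves the late item queued after teardown, while
unregistering first leaves nothing behind. Whether the real hang has the same
cause is not established by this sketch.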

Best regards,
Niklas Schnelle
> 
>> Best regards,
>> Niklas Schnelle
>>
>> [0] dmesg output:
... snip ...
> 
> Shay's patches also came to avoid such command timeouts.
> 
>