lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 15 Jun 2020 12:01:32 +0200
From:   Niklas Schnelle <schnelle@...ux.ibm.com>
To:     Saeed Mahameed <saeedm@...lanox.com>,
        Parav Pandit <parav@...lanox.com>
Cc:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        ubraun@...ux.ibm.com, kgraul@...ux.ibm.com, raspl@...ibm.com
Subject: Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken

Hello Saeed,

On 6/13/20 12:01 AM, Saeed Mahameed wrote:
> On Fri, 2020-06-12 at 15:09 +0200, Niklas Schnelle wrote:
>> Hello Parav, Hello Saeed,
>>
... snip ...
>>
>> So without really knowing anything about these functions I would
>> guess that with the device still registered the drained
>> queue does not remain empty as new entries are added.
>> Does that sound plausible to you?
>>
> 
> I don't think it is related, maybe this is similar to some issues
> addressed lately by Shay's patches:
> 
> https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-2-saeedm@mellanox.com/
> https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-3-saeedm@mellanox.com/
> 
> net/mlx5: drain health workqueue in case of driver load error
> net/mlx5: Fix fatal error handling during device load

I agree with your similarity assessment especially for the first commit.
These do not fix the issue though, with mainline v5.8-rc1 which has
both I'm still getting a hang over 50% of the time with the following
detach sequence on z/VM:

vmcp detach pcif <mlx_fid>; echo 1 > /proc/cio_settle

Since now the commit 41798df9bfca ("net/mlx5: Drain wq first during PCI device removal")
no longer reverts cleanly I used the following diff to move the mlx5_drain_health_wq(dev)
after the mlx5_unregister_devices(dev).

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 8b658908f044..63a196fd8e68 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1382,8 +1382,8 @@ static void remove_one(struct pci_dev *pdev)

        devlink_reload_disable(devlink);
        mlx5_crdump_disable(dev);
-       mlx5_drain_health_wq(dev);
        mlx5_unload_one(dev, true);
+       mlx5_drain_health_wq(dev);
        mlx5_pci_close(dev);
        mlx5_mdev_uninit(dev);
        mlx5_devlink_free(devlink);


Note that this changed order also matches the call order in mlx5_pci_err_detected().
With that change I've now done over two dozen detachments with varying time between
attach and detach to have the driver at different stages of initialization.
With the change all worked without a hitch.

Best regards,
Niklas Schnelle
> 
>> Best regards,
>> Niklas Schnelle
>>
>> [0] dmesg output:
... snip ...
> 
> Shay's patches also came to avoid such command timeouts.
> 
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ