netdev - Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7660d8e0d2cb1fbd40cf89ea4c9a0eff4807157c.camel@mellanox.com>
Date:   Fri, 12 Jun 2020 22:01:56 +0000
From:   Saeed Mahameed <saeedm@...lanox.com>
To:     "schnelle@...ux.ibm.com" <schnelle@...ux.ibm.com>,
        Parav Pandit <parav@...lanox.com>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken

On Fri, 2020-06-12 at 15:09 +0200, Niklas Schnelle wrote:
> Hello Parav, Hello Saeed,
> 
> our CI system for IBM Z Linux found a hang[0] when hot unplugging a
> ConnectX-4 Lx VF from a z/VM guest
> in Linus' current tree and added during the merge window.
> Sadly it didn't happen all the time which sent me on the wrong path
> for two full git bisects.
> 
> Anyway, I've now tracked this down to the following commit which when
> reverted
> fixes the issue:
> 
> 41798df9bfca ("net/mlx5: Drain wq first during PCI device removal")
> 
> Looking at the diff I'd say the likely culprit is that before
> the commit the order of calls was:
> 
> mlx5_unregister_device(dev)
> mlx5_drain_health_wq(dev)
> 
> But with the commit it becomes
> 
> mlx5_drain_health_wq(dev)
> mlx5_unregister_device(dev)
> 
> So without really knowing anything about these functions I would
> guess that with the device still registered the drained
> queue does not remain empty as new entries are added.
> Does that sound plausible to you?
> 

I don't think it is related, maybe this is similar to some issues
addressed lately by Shay's patches:

https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-2-saeedm@mellanox.com/
https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-3-saeedm@mellanox.com/

net/mlx5: drain health workqueue in case of driver load error
net/mlx5: Fix fatal error handling during device load

> Best regards,
> Niklas Schnelle
> 
> [0] dmesg output:
> [   36.447442] mlx5_core 0000:00:00.0: poll_health:694:(pid 0): Fatal
> error 1 detected
> [   36.447450] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[0] 0xffffffff
> [   36.447453] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[1] 0xffffffff
> [   36.447455] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[2] 0xffffffff
> [   36.447458] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[3] 0xffffffff
> [   36.447461] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[4] 0xffffffff
> [   36.447463] mlx5_core 0000:00:00.0: print_health_info:375:(pid 0):
> assert_exit_ptr 0xffffffff
> [   36.447467] mlx5_core 0000:00:00.0: print_health_info:377:(pid 0):
> assert_callra 0xffffffff
> [   36.447471] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0):
> fw_ver 65535.65535.65535
> [   36.447475] mlx5_core 0000:00:00.0: print_health_info:381:(pid 0):
> hw_id 0xffffffff
> [   36.447478] mlx5_core 0000:00:00.0: print_health_info:382:(pid 0):
> irisc_index 255
> [   36.447492] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0):
> synd 0xff: unrecognized error
> [   36.447621] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0):
> ext_synd 0xffff
> [   36.447624] mlx5_core 0000:00:00.0: print_health_info:387:(pid 0):
> raw fw_ver 0xffffffff
> [   36.447885] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B,
> anc=0, erc=0, rsid=0
> [   36.447897] zpci: 0000:00:00.0: Event 0x303 reconfigured PCI
> function 0x514
> [   47.099220] mlx5_core 0000:00:00.0: poll_health:709:(pid 0):
> device's health compromised - reached miss count
> [   47.099228] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[0] 0xffffffff
> [   47.099231] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[1] 0xffffffff
> [   47.099234] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[2] 0xffffffff
> [   47.099236] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[3] 0xffffffff
> [   47.099239] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0):
> assert_var[4] 0xffffffff
> [   47.099241] mlx5_core 0000:00:00.0: print_health_info:375:(pid 0):
> assert_exit_ptr 0xffffffff
> [   47.099245] mlx5_core 0000:00:00.0: print_health_info:377:(pid 0):
> assert_callra 0xffffffff
> [   47.099249] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0):
> fw_ver 65535.65535.65535
> [   47.099253] mlx5_core 0000:00:00.0: print_health_info:381:(pid 0):
> hw_id 0xffffffff
> [   47.099256] mlx5_core 0000:00:00.0: print_health_info:382:(pid 0):
> irisc_index 255
> [   47.099327] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0):
> synd 0xff: unrecognized error
> [   47.099329] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0):
> ext_synd 0xffff
> [   47.099330] mlx5_core 0000:00:00.0: print_health_info:387:(pid 0):
> raw fw_ver 0xffffffff
> [  100.539106] mlx5_core 0000:00:00.0: wait_func:991:(pid 121):
> 2RST_QP(0x50a) timeout. Will cause a leak of a command resource
> [  100.539118] infiniband mlx5_0: destroy_qp_common:2525:(pid 121):
> mlx5_ib: modify QP 0x00072c to RESET failed
> [  141.499325] mlx5_core 0000:00:00.0: wait_func:991:(pid 32):
> QUERY_VPORT_COUNTER(0x770) timeout. Will cause a leak of a command
> resource
> [  161.978957] mlx5_core 0000:00:00.0: wait_func:991:(pid 121):
> DESTROY_QP(0x501) timeout. Will cause a leak of a command resource

Shay's patches also came to avoid such command timeouts.