netdev - Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <6323ccc1-c708-9a4f-aba8-6b3786136a92@linux.ibm.com>
Date:   Fri, 10 Jul 2020 10:34:50 +0200
From:   Niklas Schnelle <schnelle@...ux.ibm.com>
To:     Parav Pandit <parav@...lanox.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Shay Drory <shayd@...lanox.com>
Cc:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "ubraun@...ux.ibm.com" <ubraun@...ux.ibm.com>,
        "kgraul@...ux.ibm.com" <kgraul@...ux.ibm.com>,
        "raspl@...ibm.com" <raspl@...ibm.com>
Subject: Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken



On 7/9/20 8:34 PM, Parav Pandit wrote:
> On 7/9/2020 3:36 PM, Niklas Schnelle wrote:
>>
>> On 7/8/20 5:44 PM, Parav Pandit wrote:
... snip ..

> 
>>>
>> As is the patch above fixes the dereference but results in the same completion error
>> as current 5.8-rc4
> 
> Below patch should hopefully fix it. It fixes the bug introduced in
> commit 41798df9bfca.
> Additionally it fixes one small change of 42ea9f1b5c62.
> Please remove all previous changes and apply below changes.
> Hopefully this should resolve.
> 
> 
> From a260b2e6a6065a57c2fa621271483cd51d0a1abf Mon Sep 17 00:00:00 2001
> From: Parav Pandit <parav@...lanox.com>
> Date: Thu, 9 Jul 2020 20:57:13 +0300
> Subject: [PATCH] net/mlx5: Drain health wq after unregistering device
> 
... snip ....Hmm, we still seem to be missing something, this patch
does not yet restore the v5.7 behavior and still results
in a hang on e.g. echo 1 > /proc/cio_settle after a vmcp detach pcif <fid>.
In particular the PCI device is never removed.
I do agree however that your changes make sense so maybe something else
broke the part that finally gives up on recovering the device.

By the you might want to turn off automatic line wrapping in your mail client
the patch was corrupted.

See the dmesg output below for the actual dmesg and below that
for the output on v5.7

[  111.243608] mlx5_core 0000:00:00.0: poll_health:702:(pid 0): Fatal error 1 detected
[  111.243851] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[0] 0xffffffff
[  111.243858] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[1] 0xffffffff
[  111.243864] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[2] 0xffffffff
[  111.243871] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[3] 0xffffffff
[  111.243878] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[4] 0xffffffff
[  111.243885] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0): assert_exit_ptr 0xffffffff
[  111.243892] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0): assert_callra 0xffffffff
[  111.243902] mlx5_core 0000:00:00.0: print_health_info:388:(pid 0): fw_ver 65535.65535.65535
[  111.243909] mlx5_core 0000:00:00.0: print_health_info:389:(pid 0): hw_id 0xffffffff
[  111.243916] mlx5_core 0000:00:00.0: print_health_info:390:(pid 0): irisc_index 255
[  111.243925] mlx5_core 0000:00:00.0: print_health_info:391:(pid 0): synd 0xff: unrecognized error
[  111.243942] mlx5_core 0000:00:00.0: print_health_info:393:(pid 0): ext_synd 0xffff
[  111.243949] mlx5_core 0000:00:00.0: print_health_info:395:(pid 0): raw fw_ver 0xffffffff
[  111.244263] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=0
[  111.244291] mlx5_core 0000:00:00.0: PME# disabled
[  244.092384] INFO: task kmcheck:73 blocked for more than 122 seconds.
[  244.092626]       Not tainted 5.8.0-rc4-dirty #2
[  244.092631] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  244.092636] kmcheck         D10888    73      2 0x00000000
[  244.092648] Call Trace:
[  244.092659]  [<00000000088a1c46>] __schedule+0x2d6/0x5a8
[  244.092664]  [<00000000088a1f72>] schedule+0x5a/0x130
[  244.092671]  [<00000000088a8182>] schedule_timeout+0xfa/0x138
[  244.092677]  [<00000000088a37e8>] wait_for_completion+0xc0/0x110
[  244.092685]  [<0000000007d30798>] __flush_work+0xd8/0x120
[  244.092691]  [<0000000007d30f88>] __cancel_work_timer+0x150/0x208
[  244.092750]  [<000003ff801d5c4c>] mlx5_unload_one+0x64/0x148 [mlx5_core]
[  244.092770]  [<000003ff801d5dc2>] remove_one+0x52/0xd8 [mlx5_core]
[  244.092778]  [<0000000008532500>] pci_device_remove+0x40/0x90
[  244.092787]  [<00000000085a0012>] __device_release_driver+0x17a/0x220
[  244.092794]  [<00000000085a00f8>] device_release_driver+0x40/0x50
[  244.092812]  [<0000000008527dcc>] pci_stop_bus_device+0x94/0xc0
[  244.092817]  [<00000000085280c0>] pci_stop_and_remove_bus_device_locked+0x30/0x48
[  244.092822]  [<0000000007cfefd6>] __zpci_event_availability+0x12e/0x318
[  244.092829]  [<00000000088066fc>] chsc_process_event_information.constprop.0+0x214/0x228
[  244.092834]  [<000000000880daa8>] crw_collect_info+0x280/0x350
[  244.092840]  [<0000000007d38a96>] kthread+0x176/0x1a0
[  244.092844]  [<00000000088aa820>] ret_from_fork+0x24/0x2c
[  244.092863] 4 locks held by kmcheck/73:
[  244.092868]  #0: 0000000008db5e48 (crw_handler_mutex){+.+.}-{3:3}, at: crw_collect_info+0x250/0x350
[  244.092878]  #1: 0000000008d95668 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_stop_and_remove_bus_device_locked+0x26/0x48
[  244.092887]  #2: 00000000de431238 (&dev->mutex){....}-{3:3}, at: device_release_driver+0x32/0x50
[  244.092896]  #3: 00000000d152cbb8 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_unload_one+0x38/0x148 [mlx5_core]
[  244.092922] INFO: task kworker/u128:2:346 blocked for more than 122 seconds.
[  244.092932]       Not tainted 5.8.0-rc4-dirty #2
[  244.092936] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  244.092940] kworker/u128:2  D10848   346      2 0x02000000
[  244.092968] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
[  244.092974] Call Trace:
[  244.092979]  [<00000000088a1c46>] __schedule+0x2d6/0x5a8
[  244.092985]  [<00000000088a1f72>] schedule+0x5a/0x130
[  244.092990]  [<00000000088a2704>] schedule_preempt_disabled+0x2c/0x48
[  244.092996]  [<00000000088a49c2>] __mutex_lock+0x372/0x960
[  244.093002]  [<00000000088a4fe2>] mutex_lock_nested+0x32/0x40
[  244.093021]  [<000003ff801e0978>] mlx5_enter_error_state+0xa0/0x100 [mlx5_core]
[  244.093041]  [<000003ff801e0a0c>] mlx5_fw_fatal_reporter_err_work+0x34/0xb8 [mlx5_core]
[  244.093047]  [<0000000007d2f84c>] process_one_work+0x27c/0x478
[  244.093054]  [<0000000007d2faae>] worker_thread+0x66/0x368
[  244.093061]  [<0000000007d38a96>] kthread+0x176/0x1a0
[  244.093067]  [<00000000088aa820>] ret_from_fork+0x24/0x2c
[  244.093077] 3 locks held by kworker/u128:2/346:
[  244.093081]  #0: 00000000d230d948 ((wq_completion)mlx5_health0000:00:00.0){+.+.}-{0:0}, at: process_one_work+0x1dc/0x478
[  244.093089]  #1: 000003e000783e18 ((work_completion)(&health->fatal_report_work)){+.+.}-{0:0}, at: process_one_work+0x1dc/0x478
[  244.093097]  #2: 00000000d152cbb8 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_enter_error_state+0xa0/0x100 [mlx5_core]
[  244.093136]
               Showing all locks held in the system:
[  244.093147] 1 lock held by khungtaskd/27:
[  244.093153]  #0: 0000000008d16990 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire.constprop.0+0x0/0x50
[  244.093164] 4 locks held by kmcheck/73:
[  244.093170]  #0: 0000000008db5e48 (crw_handler_mutex){+.+.}-{3:3}, at: crw_collect_info+0x250/0x350
[  244.093178]  #1: 0000000008d95668 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_stop_and_remove_bus_device_locked+0x26/0x48
[  244.093187]  #2: 00000000de431238 (&dev->mutex){....}-{3:3}, at: device_release_driver+0x32/0x50
[  244.093199]  #3: 00000000d152cbb8 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_unload_one+0x38/0x148 [mlx5_core]
[  244.093221] 3 locks held by kworker/u128:2/346:
[  244.093225]  #0: 00000000d230d948 ((wq_completion)mlx5_health0000:00:00.0){+.+.}-{0:0}, at: process_one_work+0x1dc/0x478
[  244.093233]  #1: 000003e000783e18 ((work_completion)(&health->fatal_report_work)){+.+.}-{0:0}, at: process_one_work+0x1dc/0x478
[  244.093241]  #2: 00000000d152cbb8 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_enter_error_state+0xa0/0x100 [mlx5_core]

[  244.093274] =============================================

On v5.7 on the other hand the recovery flow is aborted and the device is released so
that the hard unplug can complete with the device completely removed:
(Note there are SMC messages in the below example that are not in the one above
but you can ignore them)

[78144.718600] mlx5_core 0001:00:00.0: poll_health:695:(pid 3222): Fatal error 1 detected
[78144.718608] mlx5_core 0001:00:00.0: print_health_info:373:(pid 3222): assert_var[0] 0xffffffff
[78144.718611] mlx5_core 0001:00:00.0: print_health_info:373:(pid 3222): assert_var[1] 0xffffffff
[78144.718613] mlx5_core 0001:00:00.0: print_health_info:373:(pid 3222): assert_var[2] 0xffffffff
[78144.718616] mlx5_core 0001:00:00.0: print_health_info:373:(pid 3222): assert_var[3] 0xffffffff
[78144.718619] mlx5_core 0001:00:00.0: print_health_info:373:(pid 3222): assert_var[4] 0xffffffff
[78144.718621] mlx5_core 0001:00:00.0: print_health_info:376:(pid 3222): assert_exit_ptr 0xffffffff
[78144.718625] mlx5_core 0001:00:00.0: print_health_info:378:(pid 3222): assert_callra 0xffffffff
[78144.718629] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=0
[78144.718634] mlx5_core 0001:00:00.0: print_health_info:381:(pid 3222): fw_ver 65535.65535.65535
[78144.718638] mlx5_core 0001:00:00.0: print_health_info:382:(pid 3222): hw_id 0xffffffff
[78144.718642] mlx5_core 0001:00:00.0: print_health_info:383:(pid 3222): irisc_index 255
[78144.718652] mlx5_core 0001:00:00.0: print_health_info:384:(pid 3222): synd 0xff: unrecognized error
[78144.718654] mlx5_core 0001:00:00.0: print_health_info:386:(pid 3222): ext_synd 0xffff
[78144.718656] mlx5_core 0001:00:00.0: print_health_info:388:(pid 3222): raw fw_ver 0xffffffff
[78144.718906] smc: removing ib device mlx5_0
[78144.719016] mlx5_core 0001:00:00.0: mlx5_health_try_recover:308:(pid 3174): handling bad device here
[78144.719020] mlx5_core 0001:00:00.0: mlx5_error_sw_reset:225:(pid 3174): start
[78145.754943] mlx5_core 0001:00:00.0: NIC IFC still 7 after 1000ms.
[78145.754947] mlx5_core 0001:00:00.0: mlx5_error_sw_reset:258:(pid 3174): end
[78206.864928] mlx5_core 0001:00:00.0: mlx5_health_try_recover:313:(pid 3174): health recovery flow aborted, PCI reads still not working
[78206.864955] mlx5_core 0001:00:00.0: mlx5_unload_one:1252:(pid 121): mlx5_unload_one: interface is down, NOP
[78206.905176] pci 0001:00:00.0: Removing from iommu group 1
[78206.906828] sclp_cmd: configure PCI I/O adapter failed: cmd=0x001b0001  response=0x09f0
[78206.906873] pci_bus 0001:00: busn_res: [bus 00] is released