Message-ID: <1d8966f1-bc62-3981-6889-cda6ef19ae44@linux.ibm.com>
Date: Thu, 9 Jul 2020 12:06:28 +0200
From: Niklas Schnelle <schnelle@...ux.ibm.com>
To: Parav Pandit <parav@...lanox.com>,
Saeed Mahameed <saeedm@...lanox.com>,
Shay Drory <shayd@...lanox.com>
Cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"ubraun@...ux.ibm.com" <ubraun@...ux.ibm.com>,
"kgraul@...ux.ibm.com" <kgraul@...ux.ibm.com>,
"raspl@...ibm.com" <raspl@...ibm.com>
Subject: Re: [REGRESSION] mlx5: Driver remove during hot unplug is broken
On 7/8/20 5:44 PM, Parav Pandit wrote:
... snip ...
>>
>
> It is likely because events_cleanup() freed the memory using kvfree() that the health recovery context is trying to access in the notifier chain.
>
> While reviewing I see a few more errors, as below.
> (a) mlx5_pci_err_detected() invokes mlx5_drain_health_wq() outside of intf_state_mutex.
> (b) mlx5_enter_error_state() in commit b6e0b6bebe0 reads and updates dev state outside of intf_state_mutex.
> (c) due to the drain_health_wq() introduction in mlx5_pci_close() in commit 42ea9f1b5c625, the health_wq is flushed twice.
> (d) priv->events is freed with kvfree() but allocated using kzalloc().
>
> This code certainly needs rework.
>
> In the meantime, to avoid this regression, I believe the below hunk eliminates the error introduced in patch 41798df9bfc.
> Will you please help test it?
>
> Shay and I reviewed the below patch.
> If it works I will get it through Saeed's tree and internal reviews.
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index ebec2318dbc4..529df5703737 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -785,11 +785,6 @@ static int mlx5_pci_init(struct mlx5_core_dev *dev, struct pci_dev *pdev,
>
> static void mlx5_pci_close(struct mlx5_core_dev *dev)
> {
> - /* health work might still be active, and it needs pci bar in
> - * order to know the NIC state. Therefore, drain the health WQ
> - * before removing the pci bars
> - */
> - mlx5_drain_health_wq(dev);
> iounmap(dev->iseg);
> pci_clear_master(dev->pdev);
> release_bar(dev->pdev);
> @@ -1235,6 +1230,7 @@ void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup)
> if (cleanup) {
> mlx5_unregister_device(dev);
> mlx5_devlink_unregister(priv_to_devlink(dev));
> + mlx5_drain_health_wq(dev);
I think with the above you can remove the mlx5_drain_health_wq(dev) call in remove_one()
(sketch at the end of this mail), as that calls mlx5_unload_one() with cleanup == true.
I also wonder whether it is a problem that the order of mlx5_devlink_unregister(priv_to_devlink(dev))
and mlx5_unregister_device(dev) is switched compared to the 5.7 code. That said, changing both
seems to result in a deadlock rather than the "leak of a command resource" (the locking pattern
is sketched below the quoted patch):
[ 48.082222] mlx5_core 0000:00:00.0: poll_health:702:(pid 0): Fatal error 1 detected
[ 48.082296] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[0] 0xffffffff
[ 48.082303] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[1] 0xffffffff
[ 48.082309] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[2] 0xffffffff
[ 48.082316] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[3] 0xffffffff
[ 48.082322] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[4] 0xffffffff
[ 48.082329] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0): assert_exit_ptr 0xffffffff
[ 48.082336] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0): assert_callra 0xffffffff
[ 48.082345] mlx5_core 0000:00:00.0: print_health_info:388:(pid 0): fw_ver 65535.65535.65535
[ 48.082353] mlx5_core 0000:00:00.0: print_health_info:389:(pid 0): hw_id 0xffffffff
[ 48.082360] mlx5_core 0000:00:00.0: print_health_info:390:(pid 0): irisc_index 255
[ 48.082374] mlx5_core 0000:00:00.0: print_health_info:391:(pid 0): synd 0xff: unrecognized error
[ 48.082390] mlx5_core 0000:00:00.0: print_health_info:393:(pid 0): ext_synd 0xffff
[ 48.082396] mlx5_core 0000:00:00.0: print_health_info:395:(pid 0): raw fw_ver 0xffffffff
[ 48.082824] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=0
[ 48.082852] mlx5_core 0000:00:00.0: PME# disabled
[ 244.103844] INFO: task kworker/u128:1:63 blocked for more than 122 seconds.
[ 244.104076] Not tainted 5.8.0-rc4-dirty #2
[ 244.104081] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 244.104086] kworker/u128:1 D13264 63 2 0x00000000
[ 244.104158] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
[ 244.104164] Call Trace:
[ 244.104174] [<0000000031cf1c46>] __schedule+0x2d6/0x5a8
[ 244.104180] [<0000000031cf1f72>] schedule+0x5a/0x130
[ 244.104186] [<0000000031cf2704>] schedule_preempt_disabled+0x2c/0x48
[ 244.104192] [<0000000031cf49c2>] __mutex_lock+0x372/0x960
[ 244.104198] [<0000000031cf4fe2>] mutex_lock_nested+0x32/0x40
[ 244.104219] [<000003ff801e4978>] mlx5_enter_error_state+0xa0/0x100 [mlx5_core]
[ 244.104238] [<000003ff801e4a0c>] mlx5_fw_fatal_reporter_err_work+0x34/0xb8 [mlx5_core]
[ 244.104249] [<000000003117f84c>] process_one_work+0x27c/0x478
[ 244.104255] [<000000003117faae>] worker_thread+0x66/0x368
[ 244.104275] [<0000000031188a96>] kthread+0x176/0x1a0
[ 244.104280] [<0000000031cfa820>] ret_from_fork+0x24/0x2c
[ 244.104285] 3 locks held by kworker/u128:1/63:
[ 244.104289] #0: 00000000d0392948 ((wq_completion)mlx5_health0000:00:00.0){+.+.}-{0:0}, at: process_one_work+0x1dc/0x478
[ 244.104298] #1: 000003e00295fe18 ((work_completion)(&health->fatal_report_work)){+.+.}-{0:0}, at: process_one_work+0x1dc/0x478
[ 244.104305] #2: 00000000d1b2cbb8 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_enter_error_state+0xa0/0x100 [mlx5_core]
[ 244.104328] INFO: task kmcheck:73 blocked for more than 122 seconds.
[ 244.104333] Not tainted 5.8.0-rc4-dirty #2
[ 244.104337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 244.104341] kmcheck D10888 73 2 0x00000000
[ 244.104350] Call Trace:
[ 244.104355] [<0000000031cf1c46>] __schedule+0x2d6/0x5a8
[ 244.104360] [<0000000031cf1f72>] schedule+0x5a/0x130
[ 244.104365] [<0000000031cf8182>] schedule_timeout+0xfa/0x138
[ 244.104370] [<0000000031cf37e8>] wait_for_completion+0xc0/0x110
[ 244.104374] [<0000000031180798>] __flush_work+0xd8/0x120
[ 244.104380] [<0000000031180f88>] __cancel_work_timer+0x150/0x208
[ 244.104398] [<000003ff801d9c3a>] mlx5_unload_one+0x52/0x148 [mlx5_core]
[ 244.104416] [<000003ff801d9dc2>] remove_one+0x52/0xd8 [mlx5_core]
[ 244.104422] [<0000000031982500>] pci_device_remove+0x40/0x90
[ 244.104429] [<00000000319f0012>] __device_release_driver+0x17a/0x220
[ 244.104433] [<00000000319f00f8>] device_release_driver+0x40/0x50
[ 244.104439] [<0000000031977dcc>] pci_stop_bus_device+0x94/0xc0
[ 244.104444] [<00000000319780c0>] pci_stop_and_remove_bus_device_locked+0x30/0x48
[ 244.104449] [<000000003114efd6>] __zpci_event_availability+0x12e/0x318
[ 244.104456] [<0000000031c566fc>] chsc_process_event_information.constprop.0+0x214/0x228
[ 244.104461] [<0000000031c5daa8>] crw_collect_info+0x280/0x350
[ 244.104466] [<0000000031188a96>] kthread+0x176/0x1a0
[ 244.104470] [<0000000031cfa820>] ret_from_fork+0x24/0x2c
[ 244.104475] 4 locks held by kmcheck/73:
[ 244.104479] #0: 0000000032205e48 (crw_handler_mutex){+.+.}-{3:3}, at: crw_collect_info+0x250/0x350
[ 244.104486] #1: 00000000321e5668 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_stop_and_remove_bus_device_locked+0x26/0x48
[ 244.104493] #2: 00000000d7fd1238 (&dev->mutex){....}-{3:3}, at: device_release_driver+0x32/0x50
[ 244.104553] #3: 00000000d1b2cbb8 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_unload_one+0x38/0x148 [mlx5_core]
> } else {
> mlx5_detach_device(dev);
> }
> @@ -1366,6 +1362,11 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *id)
> return 0;
>
> err_load_one:
> + /* health work might still be active, and it needs pci bar in
> + * order to know the NIC state. Therefore, drain the health WQ
> + * before removing the pci bars
> + */
> + mlx5_drain_health_wq(dev);
> mlx5_pci_close(dev);
> pci_init_err:
> mlx5_mdev_uninit(dev);
>
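Coming back to the hung task output above: to my reading this is the usual pattern of
waiting for a work item while holding a mutex that the work item itself needs, i.e.
intf_state_mutex is held across __cancel_work_timer() in mlx5_unload_one() while
mlx5_fw_fatal_reporter_err_work() blocks on the same mutex in mlx5_enter_error_state().
A minimal, generic sketch of that pattern (made-up names, not the actual mlx5 code):

#include <linux/mutex.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(state_mutex);		/* stands in for intf_state_mutex */
static struct work_struct fatal_work;		/* stands in for health->fatal_report_work */

static void fatal_work_fn(struct work_struct *work)
{
	mutex_lock(&state_mutex);		/* blocks, the unload path already holds it */
	/* ... enter error state ... */
	mutex_unlock(&state_mutex);
}

/* INIT_WORK(&fatal_work, fatal_work_fn) is assumed to have happened at probe time */
static void unload(void)
{
	mutex_lock(&state_mutex);
	/* if fatal_work_fn() is already running this waits for it, while it waits for us */
	cancel_work_sync(&fatal_work);
	mutex_unlock(&state_mutex);
}
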
As is, your patch above fixes the dereference but results in the same completion error
as with current 5.8-rc4:
[ 103.257202] mlx5_core 0000:00:00.0: poll_health:702:(pid 930): Fatal error 1 detected
[ 103.257405] mlx5_core 0000:00:00.0: print_health_info:380:(pid 930): assert_var[0] 0xffffffff
...
[ 114.268505] mlx5_core 0000:00:00.0: poll_health:717:(pid 0): device's health compromised - reached miss count
[ 114.268524] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[0] 0xffffffff
...
[ 167.308576] mlx5_core 0000:00:00.0: wait_func:1008:(pid 73): 2RST_QP(0x50a) timeout. Will cause a leak of a command resource
[ 167.308799] infiniband mlx5_0: destroy_qp_common:2367:(pid 73): mlx5_ib: modify QP 0x000724 to RESET failed
[ 198.028576] mlx5_core 0000:00:00.0: wait_func:1008:(pid 407): QUERY_VPORT_COUNTER(0x770) timeout. Will cause a leak of a command resource
[ 228.748700] mlx5_core 0000:00:00.0: wait_func:1008:(pid 73): DESTROY_QP(0x501) timeout. Will cause a leak of a command resource
[ 259.468867] mlx5_core 0000:00:00.0: wait_func:1008:(pid 407): QUERY_Q_COUNTER(0x773) timeout. Will cause a leak of a command resource
[ 290.189284] mlx5_core 0000:00:00.0: wait_func:1008:(pid 73): DESTROY_CQ(0x401) timeout. Will cause a leak of a command resource
[ 320.909972] mlx5_core 0000:00:00.0: wait_func:1008:(pid 407): QUERY_Q_COUNTER(0x773) timeout. Will cause a leak of a command resource
[ 351.628945] mlx5_core 0000:00:00.0: wait_func:1008:(pid 73): DESTROY_CQ(0x401) timeout. Will cause a leak of a command resource
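
For concreteness, the remove_one() change I was referring to at the top would be roughly
the following on top of your patch (hand-written sketch, context lines from memory and
possibly inexact):

 static void remove_one(struct pci_dev *pdev)
 {
 	struct mlx5_core_dev *dev = pci_get_drvdata(pdev);
 	...
-	mlx5_drain_health_wq(dev);
 	mlx5_unload_one(dev, true);	/* cleanup == true, so this now drains the health WQ */
 	...
 }

That way the health WQ should only be drained once on the remove path.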
>
>
>
... snip ...