Message-ID: <PH0PR12MB5481C863153142FEE8422279DCF89@PH0PR12MB5481.namprd12.prod.outlook.com>
Date: Mon, 25 Apr 2022 02:24:10 +0000
From: Parav Pandit <parav@...dia.com>
To: 张广辉 <zhang.guanghui@...tc.cn>,
Roi Dayan <roid@...dia.com>,
Saeed Mahameed <saeedm@...dia.com>,
Jason Gunthorpe <jgg@...dia.com>,
gregkh <gregkh@...uxfoundation.org>
CC: "linux-kernel " <linux-kernel@...r.kernel.org>,
stable <stable@...r.kernel.org>
Subject: RE: Fix a devlink AB-BA deadlock on net namespace deletion
Did you audit whether it is safe to traverse pernet_list without holding pernet_ops_rwsem?
When I reviewed this area for this issue several months ago, it appeared that pernet_ops_rwsem must be held while traversing pernet_list.
You also need to fix your mail client so that it sends patches as plain text.
From: 张广辉 <zhang.guanghui@...tc.cn>
Sent: Sunday, April 24, 2022 2:02 AM
To: 张广辉 <zhang.guanghui@...tc.cn>; Roi Dayan <roid@...dia.com>; Saeed Mahameed <saeedm@...dia.com>; Parav Pandit <parav@...dia.com>; Jason Gunthorpe <jgg@...dia.com>; gregkh <gregkh@...uxfoundation.org>
Cc: linux-kernel <linux-kernel@...r.kernel.org>; stable <stable@...r.kernel.org>
Subject: Fix a devlink AB-BA deadlock on net namespace deletion
Hi all,
Deleting a netns takes pernet_ops_rwsem and then takes devlink_mutex.
If the eswitch mode is changed to switchdev at the same time, that path holds devlink_mutex, unregisters a netdevice notifier, and then takes pernet_ops_rwsem.
So an AB-BA deadlock can occur. I have made a patch to fix the deadlock, and it works well. Please help with the review. Thanks.
Example sequence is:
$ ip netns add foo
$ ip netns del foo &
$ devlink dev eswitch set pci/0000:af:00.1 mode switchdev
Process A (netns deletion):                 Process B (eswitch mode set):
cleanup_net()                               genl_family_rcv_msg_doit
down_read(&pernet_ops_rwsem);               pre_doit
  <- A acquires pernet_ops_rwsem            devlink_nl_pre_doit
ops_pre_exit_list()                         mutex_lock(&devlink_mutex);
pre_exit()                                    <- B acquires devlink_mutex
devlink_pernet_pre_exit()                   devlink_nl_cmd_eswitch_set_doit
mutex_lock(&devlink_mutex);                 mlx5_devlink_eswitch_mode_set
  <- A blocks, devlink_mutex held by B      mlx5_lag_disable_change
                                            mlx5_disable_lag
                                            mlx5_rescan_drivers_locked
                                            device_del
                                            ...
                                            unregister_netdevice_notifier
                                            down_write(&pernet_ops_rwsem);
                                              <- B blocks, rwsem held by A
Trace of the task deleting the netns:
[ 248.061947] INFO: task kworker/u160:3:1179 blocked for more than 122 seconds.
[ 248.061953] Not tainted 5.15.13-0.el9.x86_64 #1
[ 248.061955] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 248.061956] task:kworker/u160:3 state:D stack: 0 pid: 1179 ppid: 2 flags:0x00004000
[ 248.061962] Workqueue: netns cleanup_net
[ 248.061970] Call Trace:
[ 248.061972] <TASK>
[ 248.061975] __schedule+0x200/0x540
[ 248.061982] schedule+0x44/0xa0
[ 248.061984] schedule_preempt_disabled+0xa/0x10
[ 248.061986] __mutex_lock.constprop.0+0x212/0x400
[ 248.061989] devlink_pernet_pre_exit+0x2a/0x140
[ 248.061994] cleanup_net+0x1d2/0x3a0
[ 248.061997] process_one_work+0x1e8/0x390
[ 248.062003] worker_thread+0x53/0x3c0
[ 248.062005] ? process_one_work+0x390/0x390
[ 248.062007] kthread+0x10c/0x130
[ 248.062011] ? set_kthread_struct+0x40/0x40
[ 248.062014] ret_from_fork+0x1f/0x30
[ 248.062020] </TASK>
Trace of the task changing the mode to switchdev:
[ 248.062078] task:devlink state:D stack: 0 pid: 8546 ppid: 8542 flags:0x00004000
[ 248.062081] Call Trace:
[ 248.062082] <TASK>
[ 248.062083] __schedule+0x200/0x540
[ 248.062087] ? free_msg+0x3f/0xb0 [mlx5_core]
[ 248.062156] schedule+0x44/0xa0
[ 248.062158] rwsem_down_write_slowpath+0x19c/0x3c0
[ 248.062165] unregister_netdevice_notifier+0x1c/0xb0
[ 248.062168] mlx5_ib_roce_cleanup+0x8a/0x110 [mlx5_ib]
[ 248.062184] mlx5r_remove+0x36/0x60 [mlx5_ib]
[ 248.062196] auxiliary_bus_remove+0x18/0x30
[ 248.062200] __device_release_driver+0x177/0x240
[ 248.062203] device_release_driver+0x24/0x30
[ 248.062205] bus_remove_device+0xd8/0x140
[ 248.062210] device_del+0x18b/0x400
[ 248.062213] mlx5_rescan_drivers_locked.part.0+0x7e/0x150 [mlx5_core]
[ 248.062267] mlx5_disable_lag+0x149/0x160 [mlx5_core]
[ 248.062318] mlx5_lag_disable_change+0x60/0xa0 [mlx5_core]
[ 248.062369] mlx5_devlink_eswitch_mode_set+0x4b/0x1a0 [mlx5_core]
[ 248.062436] devlink_nl_cmd_eswitch_set_doit+0xc1/0x150
[ 248.062440] genl_family_rcv_msg_doit+0xe7/0x150
[ 248.062445] genl_rcv_msg+0xdc/0x1e0
[ 248.062448] ? __devlink_port_phys_port_name_get+0x1e0/0x1e0
[ 248.062451] ? genl_get_cmd+0xd0/0xd0
[ 248.062454] netlink_rcv_skb+0x4e/0xf0
[ 248.062457] genl_rcv+0x24/0x40
[ 248.062460] netlink_unicast+0x1fe/0x2d0
[ 248.062463] netlink_sendmsg+0x24f/0x4b0
[ 248.062466] sock_sendmsg+0x5b/0x60
[ 248.062469] __sys_sendto+0xf0/0x160
[ 248.062473] ? handle_mm_fault+0xbf/0x280
[ 248.062478] ? do_user_addr_fault+0x1d0/0x670
[ 248.062482] __x64_sys_sendto+0x20/0x30
[ 248.062484] do_syscall_64+0x38/0x90
[ 248.062487] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 248.062492] RIP: 0033:0x7ff8cc469c3a
[ 248.062494] RSP: 002b:00007ffe06025e08 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 248.062497] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007ff8cc469c3a
[ 248.062499] RDX: 0000000000000038 RSI: 000055c261bf7440 RDI: 0000000000000003
[ 248.062501] RBP: 0000000000000000 R08: 00007ff8cc52d200 R09: 000000000000000c
[ 248.062502] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 248.062503] R13: 000055c261bf72a0 R14: 000055c260a01d5c R15: 000055c261bf7440
[ 248.062505] </TASK>
Patch details:
diff --git a/linux/net/core/net_namespace.c b/linux/net/core/net_namespace.c
index 202fa5eac..5c872db1f 100644
--- a/linux/net/core/net_namespace.c
+++ b/linux/net/core/net_namespace.c
@@ -576,6 +576,7 @@ static void cleanup_net(struct work_struct *work)
list_add_tail(&net->exit_list, &net_exit_list);
}
+ up_read(&pernet_ops_rwsem);
/* Run all of the network namespace pre_exit methods */
list_for_each_entry_reverse(ops, &pernet_list, list)
ops_pre_exit_list(ops, &net_exit_list);
@@ -596,7 +597,6 @@ static void cleanup_net(struct work_struct *work)
list_for_each_entry_reverse(ops, &pernet_list, list)
ops_free_list(ops, &net_exit_list);
- up_read(&pernet_ops_rwsem);
/* Ensure there are no outstanding rcu callbacks using this
* network namespace.