[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6c0a6b462200847d87221dcd7655b6a746012061.camel@nvidia.com>
Date: Wed, 7 May 2025 15:13:36 +0000
From: Cosmin Ratiu <cratiu@...dia.com>
To: "stfomichev@...il.com" <stfomichev@...il.com>
CC: "davem@...emloft.net" <davem@...emloft.net>,
"linux-kselftest@...r.kernel.org" <linux-kselftest@...r.kernel.org>, Dragos
Tatulea <dtatulea@...dia.com>, "sdf@...ichev.me" <sdf@...ichev.me>,
"kuba@...nel.org" <kuba@...nel.org>, "jiri@...nulli.us" <jiri@...nulli.us>,
"edumazet@...gle.com" <edumazet@...gle.com>, "pabeni@...hat.com"
<pabeni@...hat.com>, Saeed Mahameed <saeedm@...dia.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH net] net: Lock lower level devices when updating features
On Wed, 2025-05-07 at 14:35 +0000, Cosmin Ratiu wrote:
> On Tue, 2025-05-06 at 11:13 -0700, Stanislav Fomichev wrote:
> > On 05/06, Cosmin Ratiu wrote:
> >
> >
> > Right, but netdev_sync_lower_features calls lower's
> > __netdev_update_features
> > only for NETIF_F_UPPER_DISABLES. So it doesn't propagate all
> > features,
> > only LRO AFAICT.
>
> Got it, I didn't look into what netdev_sync_lower_features actually
> does besides noticing it can call __netdev_update_feature...
>
> In any case, please hold off with picking this patch up, it seems
> there's a possibility of a real deadlock. Here's the scenario:
>
> ============================================
> WARNING: possible recursive locking detected
> --------------------------------------------
> ethtool/44150 is trying to acquire lock:
> ffff8881364e8c80 (&dev_instance_lock_key#7){+.+.}-{4:4}, at:
> __netdev_update_features+0x31e/0xe20
>
> but task is already holding lock:
> ffff8881364e8c80 (&dev_instance_lock_key#7){+.+.}-{4:4}, at:
> ethnl_set_features+0xbc/0x4b0
> and the lock comparison function returns 0:
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(&dev_instance_lock_key#7);
> lock(&dev_instance_lock_key#7);
>
> *** DEADLOCK ***
>
> May be due to missing lock nesting notation
>
> 3 locks held by ethtool/44150:
> #0: ffffffff830e5a50 (cb_lock){++++}-{4:4}, at: genl_rcv+0x15/0x40
> #1: ffffffff830cf708 (rtnl_mutex){+.+.}-{4:4}, at:
> ethnl_set_features+0x88/0x4b0
> #2: ffff8881364e8c80 (&dev_instance_lock_key#7){+.+.}-{4:4}, at:
> ethnl_set_features+0xbc/0x4b0
>
> stack backtrace:
> Call Trace:
> <TASK>
> dump_stack_lvl+0x69/0xa0
> print_deadlock_bug.cold+0xbd/0xca
> __lock_acquire+0x163c/0x2f00
> lock_acquire+0xd3/0x2e0
> __mutex_lock+0x98/0xf10
> __netdev_update_features+0x31e/0xe20
> netdev_update_features+0x1f/0x60
> vlan_device_event+0x57d/0x930 [8021q]
> notifier_call_chain+0x3d/0x100
> netdev_features_change+0x32/0x50
> ethnl_set_features+0x17e/0x4b0
> genl_family_rcv_msg_doit+0xe0/0x130
> genl_rcv_msg+0x188/0x290
> [...]
>
> I'm not sure how to solve this yet...
> Cosmin.
If it's not clear, the problem is that:
1. the lower device is already ops locked
2. netdev_feature_change gets called.
3. __netdev_update_features gets called for the vlan (upper) dev.
4. It tries to acquire the same lock instance as 1 (this patch).
5. Deadlock.
One solution I can think of would be to run device notifiers for
changing features outside the lock, it doesn't seem like the netdev
lock has anything to do with what other devices might do with this
information.
This can be triggered from many scenarios, I have another similar stack
involving bonding.
What do you think?
Cosmin.
Powered by blists - more mailing lists