Message-ID: <19e12e6b5f04ba9e5b192001fbe31a3fc47d380a.camel@sipsolutions.net>
Date: Fri, 25 Mar 2022 13:04:23 +0100
From: Johannes Berg <johannes@...solutions.net>
To: William McVicker <willmcvicker@...gle.com>,
linux-wireless@...r.kernel.org
Cc: Marek Szyprowski <m.szyprowski@...sung.com>,
Kalle Valo <kvalo@...eaurora.org>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org,
Amitkumar Karwar <amitkarwar@...il.com>,
Ganapathi Bhat <ganapathi.bhat@....com>,
Xinming Hu <huxinming820@...il.com>, kernel-team@...roid.com,
Paolo Abeni <pabeni@...hat.com>
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd
Hi,
(Jakub, can you please see below? I wonder which option you prefer.)
>
> I found that we can hit this same ABBA deadlock within the nl80211 code
> before ever even calling into the vendor doit() function.
Hmm.
> The issue I found
> is caused by the way we unlock the RTNL mutex. Here is the call flow that
> leads to the deadlock:
>
> Thread 1                             Thread 2
>
> nl80211_pre_doit():
>   rtnl_lock()
>   wiphy_lock()                       nl80211_pre_doit():
>                                        rtnl_lock() // blocked by Thread 1
>   rtnl_unlock():
>     netdev_run_todo():
>       __rtnl_unlock()
>                                        <got RTNL lock>
>                                        wiphy_lock() // blocked by Thread 1
>       rtnl_lock(); // DEADLOCK
> doit()
> nl80211_post_doit():
>   wiphy_unlock();
>
> Basically, unlocking the RTNL within netdev_run_todo() gives another thread
> that is waiting for the RTNL in nl80211_pre_doit() a chance to grab the
> RTNL lock leading to the deadlock. I found that there are multiple
> instances where rtnl_lock() is called within netdev_run_todo(): a couple of
> times inside netdev_wait_allrefs() and directly by netdev_run_todo().
Have you actually observed this in practice?
It's true, but dynamically this only happens if you get into the
        while (!list_empty(&list)) {
                ...
        }
code in netdev_run_todo().
Which in itself can only be true if a netdev was unregistered and
netdev_run_todo() didn't run yet.
Now, since normally we go through rtnl_unlock(), it's highly likely that
we get into rtnl_lock() with the todo list being empty (though a non-empty
list isn't impossible, read on), so rtnl_unlock()/netdev_run_todo() won't
get into this branch, and then the deadlock cannot happen.
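To make that concrete, the shape of netdev_run_todo() is roughly the
following (paraphrased and simplified from memory, not the literal source
in net/core/dev.c):

        /* simplified paraphrase of netdev_run_todo(), only the structure
         * that matters here
         */
        void netdev_run_todo(void)
        {
                LIST_HEAD(list);

                /* snapshot the global todo list (netdevs whose
                 * unregistration still has to be finished) and drop
                 * the RTNL
                 */
                list_replace_init(&net_todo_list, &list);
                __rtnl_unlock();

                while (!list_empty(&list)) {
                        struct net_device *dev =
                                list_first_entry(&list, struct net_device,
                                                 todo_list);

                        list_del(&dev->todo_list);

                        /* waits for references to drop; takes and drops
                         * the RTNL again internally (and netdev_run_todo()
                         * itself does so as well in places), which is the
                         * rtnl_lock() in the Thread 1 column above
                         */
                        netdev_wait_allrefs(dev);
                        /* ... final cleanup ... */
                }
        }

So the loop body, and hence the re-acquisition of the RTNL, only runs if
something was actually left on the todo list.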
Now, it might be possible somewhere in the tree to unregister a netdev
and then unlock using __rtnl_unlock(), so the todo list isn't processed
until the next time, but __rtnl_unlock() isn't exported and none of the
users I found seemed to cause this problem.
On the other hand, people clearly thought this was worth worrying about:
there are already *two* almost identical implementations that avoid
this problem:
- rtnl_lock_unregistering
- rtnl_lock_unregistering_all
(identical except for the netns list they use, partial vs. all).
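Both boil down to the same pattern, roughly (simplified from memory; the
check in the middle is just a stand-in for the real loop over the netns
list):

        /* idea only: take the RTNL, but only keep it once no netdev in
         * the relevant netns list is still mid-unregistration, i.e. once
         * the todo list has been drained
         */
        static void rtnl_lock_unregistering_all(void)
        {
                DEFINE_WAIT_FUNC(wait, woken_wake_function);

                add_wait_queue(&netdev_unregistering_wq, &wait);
                for (;;) {
                        rtnl_lock();
                        /* stand-in for the real per-netns check */
                        if (!any_netns_has_unregistering_devices())
                                break;          /* keep the RTNL held */
                        __rtnl_unlock();
                        wait_woken(&wait, TASK_UNINTERRUPTIBLE,
                                   MAX_SCHEDULE_TIMEOUT);
                }
                remove_wait_queue(&netdev_unregistering_wq, &wait);
        }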
So we can avoid the potential deadlock in cfg80211 in a few ways:
1) export rtnl_lock_unregistering_all(), or maybe a variant after
   refactoring the two versions, to allow cfg80211 to use it; that way
   netdev_run_todo() can never have a non-empty todo list
2) export __rtnl_unlock() so cfg80211 can avoid running
   netdev_run_todo() in the unlock; personally I like this less because
   it might encourage random drivers to use it
3) completely rework cfg80211's locking, adding a separate mutex for
   the wiphy list so we don't need to acquire the RTNL at all here
   (unless the ops need it, but there's no issue if we don't drop it),
   something like https://p.sipsolutions.net/27d08e1f5881a793.txt
   (the idea is sketched below)
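For reference, the core idea of 3) is something like this (just a sketch
of the approach, not the actual patch at that URL; the mutex name is
illustrative):

        /* illustrative name, not necessarily what the patch uses */
        static DEFINE_MUTEX(cfg80211_wiphy_list_mtx);

        static int nl80211_pre_doit(const struct genl_ops *ops,
                                    struct sk_buff *skb,
                                    struct genl_info *info)
        {
                struct cfg80211_registered_device *rdev;

                /* look up and lock the wiphy under the new list mutex
                 * instead of the RTNL
                 */
                mutex_lock(&cfg80211_wiphy_list_mtx);
                rdev = cfg80211_get_dev_from_info(genl_info_net(info), info);
                if (IS_ERR(rdev)) {
                        mutex_unlock(&cfg80211_wiphy_list_mtx);
                        return PTR_ERR(rdev);
                }
                wiphy_lock(&rdev->wiphy);
                mutex_unlock(&cfg80211_wiphy_list_mtx);

                /* only take the RTNL when the op actually needs it */
                if (ops->internal_flags & NL80211_FLAG_NEED_RTNL)
                        rtnl_lock();

                /* ... rest of pre_doit ... */
                return 0;
        }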
I'm happy with 3) now (even if it took a couple of hours), so I think we
can go with it; we just need to go through all the possibilities.
johannes