[<prev] [next>] [day] [month] [year] [list]
Message-ID: <YjivHrPnzG7h2Sdf@google.com>
Date: Mon, 21 Mar 2022 17:00:14 +0000
From: William McVicker <willmcvicker@...gle.com>
To: Johannes Berg <johannes.berg@...el.com>,
linux-wireless@...r.kernel.org
Cc: Marek Szyprowski <m.szyprowski@...sung.com>,
Kalle Valo <kvalo@...eaurora.org>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org,
Amitkumar Karwar <amitkarwar@...il.com>,
Ganapathi Bhat <ganapathi.bhat@....com>,
Xinming Hu <huxinming820@...il.com>,
"Cc: Android Kernel" <kernel-team@...roid.com>
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd
On 03/21/2022, Will McVicker wrote:
> On Thu, Mar 17, 2022 at 10:09 AM <willmcvicker@...gle.com> wrote:
>
> > Hi,
> >
> > I wanted to report a deadlock that I'm hitting as a result of the upstream
> > commit a05829a7222e ("cfg80211: avoid holding the RTNL when calling the
> > driver"). I'm using the Pixel 6 with downstream version of the 5.15 kernel,
> > but I'm pretty sure this will happen on the upstream tip-of-tree kernel as
> > well.
> >
> > Basically, my wlan driver uses the wiphy_vendor_command ops to handle
> > a number of vendor specific operations. One of them in particular deletes
> > a cfg80211 interface. The deadlock happens when thread 1 tries to take the
> > RTNL lock before calling cfg80211_unregister_device() while thread 2 is
> > inside nl80211_pre_doit(), holding the RTNL lock, and waiting on
> > wiphy_lock().
> >
> > Here is the call flow:
> >
> > Thread 1: Thread 2:
> >
> > nl80211_pre_doit():
> > -> rtnl_lock()
> > nl80211_pre_doit():
> > -> rtnl_lock()
> > -> <blocked by Thread 1>
> > -> wiphy_lock()
> > -> rtnl_unlock()
> > -> <unblock Thread 1>
> > exit nl80211_pre_doit()
> > <Thread 2 got the RTNL lock>
> > -> wiphy_lock()
> > -> <blocked by Thread 1>
> > nl80211_doit()
> > -> nl80211_vendor_cmd()
> > -> rtnl_lock() <DEADLOCK>
> > -> cfg80211_unregister_device()
> > -> rtnl_unlock()
> >
> >
> > To be complete, here are the kernel call traces when the deadlock occurs:
> >
> > Thread 1 Call trace:
> > <Take rtnl before calling cfg80211_unregister_device()>
> > nl80211_vendor_cmd+0x210/0x218
> > genl_rcv_msg+0x3ac/0x45c
> > netlink_rcv_skb+0x130/0x168
> > genl_rcv+0x38/0x54
> > netlink_unicast_kernel+0xe4/0x1f4
> > netlink_unicast+0x128/0x21c
> > netlink_sendmsg+0x2d8/0x3d8
> >
> > Thread 2 Call trace:
> > <Take wiphy_lock>
> > nl80211_pre_doit+0x1b0/0x250
> > genl_rcv_msg+0x37c/0x45c
> > netlink_rcv_skb+0x130/0x168
> > genl_rcv+0x38/0x54
> > netlink_unicast_kernel+0xe4/0x1f4
> > netlink_unicast+0x128/0x21c
> > netlink_sendmsg+0x2d8/0x3d8
> >
> > I'm not an networking expert. So my main question is if I'm allowed to take
> > the RTNL lock inside the nl80211_vendor_cmd callbacks? If so, then
> > regardless of why I take it, we shouldn't be allowing this deadlock
> > situation, right?
> >
> > I hope that helps explain the issue. Let me know if you need any more
> > details.
> >
> > Thanks,
> > Will
> >
>
> Sorry my CC list got dropped. Adding the following:
>
> Kalle Valo <kvalo@...eaurora.org>
> "David S. Miller" <davem@...emloft.net>
> Jakub Kicinski <kuba@...nel.org>
> netdev@...r.kernel.org
> Amitkumar Karwar <amitkarwar@...il.com>
> Ganapathi Bhat <ganapathi.bhat@....com>
> Xinming Hu <huxinming820@...il.com>
> kernel-team@...roid.com
Sorry for the noise. The lists bounced due to html. Resending with mutt to make
sure everyone gets this message.
As an update, I was able to fix the deadlock by updating nl80211_pre_doit() to
not hold the RTNL lock while waiting to get the wiphy_lock. This allows us to
take the RTNL lock within nl80211_doit() and have parallel calls to
nl80211_doit(). Below is the logic I tested. Please let me know if I'm heading
in the right direction.
Thanks,
Will
diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index 686a69381731..bb4ad746509b 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -15227,7 +15227,24 @@ static int nl80211_pre_doit(const struct genl_ops *ops, struct sk_buff *skb,
}
if (rdev && !(ops->internal_flags & NL80211_FLAG_NO_WIPHY_MTX)) {
- wiphy_lock(&rdev->wiphy);
+ while (!mutex_trylock(&rdev->wiphy.mtx)) {
+ /* Holding the RTNL lock while waiting for the wiphy lock can lead to
+ * a deadlock within doit() ops that don't hold the RTNL in pre_doit. So
+ * we need to release the RTNL lock first while we wait for the wiphy
+ * lock.
+ */
+ rtnl_unlock();
+ wiphy_lock(&rdev->wiphy);
+
+ /* Once we get the wiphy_lock, we need to grab the RTNL lock. If we can't
+ * get it, then we need to unlock the wiphy to avoid a deadlock in
+ * pre_doit and then retry taking the locks again. */
+ if (!rtnl_trylock()) {
+ wiphy_unlock(&rdev->wiphy);
+ rtnl_lock();
+ } else
+ break;
+ }
/* we keep the mutex locked until post_doit */
__release(&rdev->wiphy.mtx);
}
Powered by blists - more mailing lists