netdev - Re: [BUG] deadlock in nl80211_vendor

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <YjivHrPnzG7h2Sdf@google.com>
Date:   Mon, 21 Mar 2022 17:00:14 +0000
From:   William McVicker <willmcvicker@...gle.com>
To:     Johannes Berg <johannes.berg@...el.com>,
        linux-wireless@...r.kernel.org
Cc:     Marek Szyprowski <m.szyprowski@...sung.com>,
        Kalle Valo <kvalo@...eaurora.org>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org,
        Amitkumar Karwar <amitkarwar@...il.com>,
        Ganapathi Bhat <ganapathi.bhat@....com>,
        Xinming Hu <huxinming820@...il.com>,
        "Cc: Android Kernel" <kernel-team@...roid.com>
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 03/21/2022, Will McVicker wrote:
> On Thu, Mar 17, 2022 at 10:09 AM <willmcvicker@...gle.com> wrote:
> 
> > Hi,
> >
> > I wanted to report a deadlock that I'm hitting as a result of the upstream
> > commit a05829a7222e ("cfg80211: avoid holding the RTNL when calling the
> > driver"). I'm using the Pixel 6 with downstream version of the 5.15 kernel,
> > but I'm pretty sure this will happen on the upstream tip-of-tree kernel as
> > well.
> >
> > Basically, my wlan driver uses the wiphy_vendor_command ops to handle
> > a number of vendor specific operations. One of them in particular deletes
> > a cfg80211 interface. The deadlock happens when thread 1 tries to take the
> > RTNL lock before calling cfg80211_unregister_device() while thread 2 is
> > inside nl80211_pre_doit(), holding the RTNL lock, and waiting on
> > wiphy_lock().
> >
> > Here is the call flow:
> >
> > Thread 1:                         Thread 2:
> >
> > nl80211_pre_doit():
> >   -> rtnl_lock()
> >                                       nl80211_pre_doit():
> >                                        -> rtnl_lock()
> >                                        -> <blocked by Thread 1>
> >   -> wiphy_lock()
> >   -> rtnl_unlock()
> >   -> <unblock Thread 1>
> > exit nl80211_pre_doit()
> >                                        <Thread 2 got the RTNL lock>
> >                                        -> wiphy_lock()
> >                                        -> <blocked by Thread 1>
> > nl80211_doit()
> >   -> nl80211_vendor_cmd()
> >       -> rtnl_lock() <DEADLOCK>
> >       -> cfg80211_unregister_device()
> >       -> rtnl_unlock()
> >
> >
> > To be complete, here are the kernel call traces when the deadlock occurs:
> >
> > Thread 1 Call trace:
> >     <Take rtnl before calling cfg80211_unregister_device()>
> >     nl80211_vendor_cmd+0x210/0x218
> >     genl_rcv_msg+0x3ac/0x45c
> >     netlink_rcv_skb+0x130/0x168
> >     genl_rcv+0x38/0x54
> >     netlink_unicast_kernel+0xe4/0x1f4
> >     netlink_unicast+0x128/0x21c
> >     netlink_sendmsg+0x2d8/0x3d8
> >
> > Thread 2 Call trace:
> >     <Take wiphy_lock>
> >     nl80211_pre_doit+0x1b0/0x250
> >     genl_rcv_msg+0x37c/0x45c
> >     netlink_rcv_skb+0x130/0x168
> >     genl_rcv+0x38/0x54
> >     netlink_unicast_kernel+0xe4/0x1f4
> >     netlink_unicast+0x128/0x21c
> >     netlink_sendmsg+0x2d8/0x3d8
> >
> > I'm not an networking expert. So my main question is if I'm allowed to take
> > the RTNL lock inside the nl80211_vendor_cmd callbacks? If so, then
> > regardless of why I take it, we shouldn't be allowing this deadlock
> > situation, right?
> >
> > I hope that helps explain the issue. Let me know if you need any more
> > details.
> >
> > Thanks,
> > Will
> >
> 
> Sorry my CC list got dropped. Adding the following:
> 
> Kalle Valo <kvalo@...eaurora.org>
> "David S. Miller" <davem@...emloft.net>
> Jakub Kicinski <kuba@...nel.org>
> netdev@...r.kernel.org
> Amitkumar Karwar <amitkarwar@...il.com>
> Ganapathi Bhat <ganapathi.bhat@....com>
> Xinming Hu <huxinming820@...il.com>
> kernel-team@...roid.com

Sorry for the noise. The lists bounced due to html. Resending with mutt to make
sure everyone gets this message.

As an update, I was able to fix the deadlock by updating nl80211_pre_doit() to
not hold the RTNL lock while waiting to get the wiphy_lock. This allows us to
take the RTNL lock within nl80211_doit() and have parallel calls to
nl80211_doit(). Below is the logic I tested. Please let me know if I'm heading
in the right direction.

Thanks,
Will

diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index 686a69381731..bb4ad746509b 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -15227,7 +15227,24 @@ static int nl80211_pre_doit(const struct genl_ops *ops, struct sk_buff *skb,
 	}
 
 	if (rdev && !(ops->internal_flags & NL80211_FLAG_NO_WIPHY_MTX)) {
-		wiphy_lock(&rdev->wiphy);
+		while (!mutex_trylock(&rdev->wiphy.mtx)) {
+			/* Holding the RTNL lock while waiting for the wiphy lock can lead to
+			 * a deadlock within doit() ops that don't hold the RTNL in pre_doit. So
+			 * we need to release the RTNL lock first while we wait for the wiphy
+			 * lock.
+			 */
+			rtnl_unlock();
+			wiphy_lock(&rdev->wiphy);
+
+			/* Once we get the wiphy_lock, we need to grab the RTNL lock. If we can't
+			 * get it, then we need to unlock the wiphy to avoid a deadlock in
+			 * pre_doit and then retry taking the locks again. */
+			if (!rtnl_trylock()) {
+				wiphy_unlock(&rdev->wiphy);
+				rtnl_lock();
+			} else
+				break;
+		}
 		/* we keep the mutex locked until post_doit */
 		__release(&rdev->wiphy.mtx);
 	}