netdev - Re: [BUG] deadlock in nl80211_vendor

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Yj4FFIXi//ivQC3X@google.com>
Date:   Fri, 25 Mar 2022 18:08:20 +0000
From:   William McVicker <willmcvicker@...gle.com>
To:     Johannes Berg <johannes@...solutions.net>
Cc:     Jakub Kicinski <kuba@...nel.org>, linux-wireless@...r.kernel.org,
        Marek Szyprowski <m.szyprowski@...sung.com>,
        Kalle Valo <kvalo@...eaurora.org>,
        "David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
        Amitkumar Karwar <amitkarwar@...il.com>,
        Ganapathi Bhat <ganapathi.bhat@....com>,
        Xinming Hu <huxinming820@...il.com>, kernel-team@...roid.com,
        Paolo Abeni <pabeni@...hat.com>
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 03/25/2022, Johannes Berg wrote:
> On Fri, 2022-03-25 at 09:49 -0700, Jakub Kicinski wrote:
> > On Fri, 25 Mar 2022 13:04:23 +0100 Johannes Berg wrote:
> > > So we can avoid the potential deadlock in cfg80211 in a few ways:
> > > 
> > >  1) export rtnl_lock_unregistering_all() or maybe a variant after
> > >     refactoring the two versions, to allow cfg80211 to use it, that way
> > >     netdev_run_todo() can never have a non-empty todo list
> > > 
> > >  2) export __rtnl_unlock() so cfg80211 can avoid running
> > >     netdev_run_todo() in the unlock, personally I like this less because
> > >     it might encourage random drivers to use it
> > > 
> > >  3) completely rework cfg80211's locking, adding a separate mutex for
> > >     the wiphy list so we don't need to acquire the RTNL at all here
> > >     (unless the ops need it, but there's no issue if we don't drop it),
> > >     something like https://p.sipsolutions.net/27d08e1f5881a793.txt
> > > 
> > > 
> > > I think I'm happy with 3) now (even if it took a couple of hours), so I
> > > think we can go with it, just need to go through all the possibilities.
> > 
> > I like 3) as well. FWIW a few places (e.g. mlx5, devlink, I think I've
> > seen more) had been converting to xarray for managing the "registered"
> > objects. It may be worth looking into if you're re-doing things, anyway.
> > 
> 
> That's not a bad idea, but I think I wouldn't want to backport that, so
> separately :) I don't think that fundamentally changes the locking
> properties though.
> 
> 
> Couple of more questions I guess: First, are we assuming that the
> cfg80211 code *is* actually broken, even if it looks like nothing can
> cause the situation, due to the empty todo list?

I'm able to reproduce this issue pretty easily with a Pixel 6 when I add
support to allow vendor commands to request for the RTNL. For this case, I just
delay unlocking the RTNL until nl80211_vendor_cmds() at which point I check the
flags to see if I should unlock before calling doit(). That allows me to run my
tests again and hit this issue. I imagine that I could hit this issue without
any changes if I re-work my vendor ops to not need the RTNL.

> 
> Given that we have rtnl_lock_unregistering() (and also
> rtnl_lock_unregistering_all()), it looks like we *do* in fact at least
> not want to make an assumption that no user of __rtnl_unlock() can have
> added a todo item.
> 
> I mean, there's technically yet *another* thing we could do - something
> like this:
> 
> [this doesn't compile, need to suitably make net_todo_list non-static]
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -95,6 +95,7 @@ void __rtnl_unlock(void)
>  
>         defer_kfree_skb_list = NULL;
>  
> +       WARN_ON(!list_empty(&net_todo_list));
>         mutex_unlock(&rtnl_mutex);
>  
>         while (head) {
> 
> and actually that would allow us to get rid of rtnl_lock_unregistering()
> and rtnl_lock_unregistering_all() simply because we'd actually guarantee
> the invariant that when the RTNL is freshly locked, the list is empty
> (by guaranteeing that it's always empty when it's unlocked, since it can
> only be added to under RTNL).
> 
> With some suitable commentary, that might also be a reasonable thing?
> __rtnl_unlock() is actually rather pretty rare, and not exported.
> 
> 
> However, if you don't like that ...
> 
> I've been testing with this patch, to make lockdep complain:
> 
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -9933,6 +9933,11 @@ void netdev_run_todo(void)
>         if (!list_empty(&list))
>                 rcu_barrier();
>  
> +#ifdef CONFIG_LOCKDEP
> +       rtnl_lock();
> +       __rtnl_unlock();
> +#endif
> +
>         while (!list_empty(&list)) {
>                 struct net_device *dev
>                         = list_first_entry(&list, struct net_device, todo_list);
> 
> 
> That causes lockdep to complain for cfg80211 even if the list *is* in
> fact empty.
> 
> Would you be open to adding something like that? Perhaps if I don't just
> do the easy rtnl_lock/unlock, but try to find the corresponding lockdep-
> only things to do there, to cause lockdep to do things without really
> locking? OTOH, the locking overhead of the RTNL we just unlocked is
> probably minimal, vs. the actual work *lockdep* is doing to track all
> this ...
> 
> Thanks,
> johannes

Let me know if you'd like me to test any patches out.

Thanks,
Will