linux-kernel - Re: [PATCH 05/14] mwifiex: re-register wiphy across reset

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 05 Jun 2017 18:54:18 +0300
From:   Kalle Valo <kvalo@...eaurora.org>
To:     Brian Norris <briannorris@...omium.org>
Cc:     Ganapathi Bhat <gbhat@...vell.com>,
        Nishant Sarmukadam <nishants@...vell.com>,
        linux-kernel@...r.kernel.org,
        Dmitry Torokhov <dmitry.torokhov@...il.com>,
        Amitkumar Karwar <amitkarwar@...il.com>,
        linux-wireless@...r.kernel.org,
        Johannes Berg <johannes@...solutions.net>
Subject: Re: [PATCH 05/14] mwifiex: re-register wiphy across reset

Brian Norris <briannorris@...omium.org> writes:

> On Thu, Jun 01, 2017 at 12:15:45PM +0300, Kalle Valo wrote:
>> Brian Norris <briannorris@...omium.org> writes:
>> 
>> > In general, it's helpful to use the same code for device removal as for
>> > device reset, as this tends to have fewer bugs. Let's move the wiphy
>> > unregistration code into the common reset and removal code.
>> >
>> > In particular, it's very hard to properly handle the reset sequence when
>> > something fails. Currently, if mwifiex_reinit_sw() fails, we've failed
>> > to unregister the associated wiphy, and so running something as simple
>> > as "iw phy" can trigger an OOPS, as the wiphy still has hooks back into
>> > freed mwifiex data structures. For example, KASAN complained:
>> >
>> > [... see reset fail for other reasons ...]
>> > [ 1184.821158] mwifiex_pcie 0000:01:00.0: info: dnld wifi firmware from 174948 bytes
>> > [ 1186.870914] mwifiex_pcie 0000:01:00.0: info: FW download over, size 608396 bytes
>> > [ 1187.685990] mwifiex_pcie 0000:01:00.0: WLAN FW is active
>> > [ 1187.692673] mwifiex_pcie 0000:01:00.0: cmd_wait_q terminated: -512
>> > [ 1187.699075] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister device
>> > [ 1187.713476] mwifiex: Failed to bring up adapter: -5
>> > [ 1187.718644] mwifiex_pcie 0000:01:00.0: reinit failed: -5
>> >
>> > [... run `iw phy` ...]
>> > [ 1212.902419] ==================================================================
>> > [ 1212.909806] BUG: KASAN: use-after-free in mwifiex_cfg80211_get_antenna+0x54/0xfc [mwifiex] at addr ffffffc0ad1a8028
>> > [ 1212.920246] Read of size 1 by task iw/3127
>> > [...]
>> > [ 1212.934946] page dumped because: kasan: bad access detected
>> > [...]
>> > [ 1212.950665] Call trace:
>> > [ 1212.953148] [<ffffffc00020a69c>] dump_backtrace+0x0/0x190
>> > [ 1212.958572] [<ffffffc00020a96c>] show_stack+0x20/0x28
>> > [ 1212.963648] [<ffffffc0005ce18c>] dump_stack+0xa4/0xcc
>> > [ 1212.968723] [<ffffffc0003c4430>] kasan_report+0x378/0x500
>> > [ 1212.974140] [<ffffffc0003c3358>] __asan_load1+0x44/0x4c
>> > [ 1212.979462] [<ffffffbffc2e8360>] mwifiex_cfg80211_get_antenna+0x54/0xfc [mwifiex]
>> > [ 1212.987131] [<ffffffbffc084fc4>] nl80211_send_wiphy+0x75c/0x2de0 [cfg80211]
>> > [ 1212.994246] [<ffffffbffc094f60>] nl80211_dump_wiphy+0x32c/0x438 [cfg80211]
>> > [ 1213.001149] [<ffffffc000ab6404>] genl_lock_dumpit+0x48/0x64
>> > [ 1213.006746] [<ffffffc000ab3474>] netlink_dump+0x178/0x398
>> > [ 1213.012171] [<ffffffc000ab3d18>] __netlink_dump_start+0x1bc/0x260
>> > [...]
>> >
>> > This all goes away if we just tear down the wiphy on the way down, and
>> > set it back up if/when we bring the device back up.
>> 
>> I don't know exactly what kind of reset this is about,
>
> Marvell firmwares are known to be quite buggy, and there are plenty of
> situations in which they crash (often resulting in a command timeout).

That is a common problem with firmwares :/

> The current best workaround for these is to essentially unwind the
> whole driver, reset the card, and reprobe the whole thing. See
> anywhere that the ->card_reset() callback is called.
>
> This has been around for a long time on SDIO, and you recently merged my
> changes to enable this for PCIe:
>
> 6d7d579a8243 mwifiex: pcie: add card_reset() support
>
>> but isn't this a
>> quite intrusive solution? For example, phy name changes because of this?
>
> Yes, it is a bit intrusive. But the whole process is intrusive, as it
> deletes all the virtual interfaces and loses your settings. This all
> relies on user space being prepared to clean up and reinitialize
> everything afterward.
>
> And yes, this causes a phy name change.
>
> In favor: this is what the SDIO reset code *used* to do, before this
> commit:
>
> c742e623e941 mwifiex: sdio card reset enhancement
>
> where the SDIO driver started using the half-baked reset solution
> written for PCIe.
>
> Lastly, I still need to analyze a few more things in this driver, but
> AFAICT, if we *don't* unregister the wiphy, we are exposed to quite a
> few more race conditions -- not just the easy-to-notice condition
> described above. What happens if the wiphy still processes cfg80211
> operations while we're still resetting the firmware? Much of the driver
> may not be prepared for this. At the moment, I can't find anything
> terribly wrong; if I slow down the reset (e.g., with msleep()s) I can
> just trigger complaints about "cmd node not available" or "card is
> removed", but I haven't yet found a true bug.
>
> That's not to say that there aren't such bugs out there. I'd still be
> willing to bet there are. And IMO, it seems wise to just do the same
> teardown/setup as one would do for (e.g.) 'rmmod', to prevent exposing
> *too* many new permutations of "wiphy is available but rest of the
> driver is torn down".

This feels like a sledge hammer approach causing all sort of problems
for user space and I really like the mac80211 approach more. For
example, if an ath10k firmware crash happens user only sees a few second
pause in data traffic and a warning in kernel log, otherwise everything
happens behind the scenes. Of course there are very likely races
somewhere but at least I haven't seen that many reports related to
firmware restart functionality.

> But if none of this is convincing to you, I can take a stab at a
> different solution.

I don't have any problem applying this patch but more about being
curious why doing it like this. And hopefully finding a less intrusive
solution in the future.

> BTW, since you're taking an interest in this code now, can I trouble you
> with a question? Looking at mwifiex_uninit_sw() (after this patchset),
> you can see a loop like this:
>
>         /* Stop data */
>         for (i = 0; i < adapter->priv_num; i++) {
>                 priv = adapter->priv[i];
>                 if (priv && priv->netdev) {
>                         mwifiex_stop_net_dev_queue(priv->netdev, adapter);
>                         if (netif_carrier_ok(priv->netdev))
>                                 netif_carrier_off(priv->netdev);
>                         netif_device_detach(priv->netdev);
>                 }
>         }
>
> That seems to be the only attempt to prevent user space from talking to
> the device while we proceed to shut down (mwifiex_shutdown_drv()). AIUI,
> that's wholly insufficient, and we need to actually stop all the virtual
> interfaces (and possibly the wiphy as well) first. I'm looking at trying
> to move the mwifiex_del_virtual_intf() loop up much further in this
> function (but there are other bugs preventing me from doing that yet).
>
> Does that sound like the right approach to you? I'm kinda figuring this
> should better mimic the mac80211 ieee80211_remove_interfaces()
> structure.

Johannes is much better person to answer this (CCed).

-- 
Kalle Valo