linux-kernel - Re: [4.20 PATCH] Revert "mwifiex: restructure rx_reorder_tbl

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20190308023401.GA121759@google.com>
Date:   Thu, 7 Mar 2019 18:34:03 -0800
From:   Brian Norris <briannorris@...omium.org>
To:     Ganapathi Bhat <gbhat@...vell.com>
Cc:     linux-kernel@...r.kernel.org,
        Amitkumar Karwar <amitkarwar@...il.com>,
        Nishant Sarmukadam <nishants@...vell.com>,
        Ganapathi Bhat <gbhat@...vell.com>,
        Xinming Hu <huxinming820@...il.com>,
        linux-wireless@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [4.20 PATCH] Revert "mwifiex: restructure rx_reorder_tbl_lock
 usage"

Hi again Ganapathi,

By the way, I was a little curious about what went wrong here, so I dug
in a little further:

On Fri, Nov 30, 2018 at 09:59:57AM -0800, Brian Norris wrote:
> This reverts commit 5188d5453bc9380ccd4ae1086138dd485d13aef2, because it
> introduced lock recursion:
> 
>   BUG: spinlock recursion on CPU#2, kworker/u13:1/395
>    lock: 0xffffffc0e28a47f0, .magic: dead4ead, .owner: kworker/u13:1/395, .owner_cpu: 2
>   CPU: 2 PID: 395 Comm: kworker/u13:1 Not tainted 4.20.0-rc4+ #2
>   Hardware name: Google Kevin (DT)
>   Workqueue: MWIFIEX_RX_WORK_QUEUE mwifiex_rx_work_queue [mwifiex]
>   Call trace:
>    dump_backtrace+0x0/0x140
>    show_stack+0x20/0x28
>    dump_stack+0x84/0xa4
>    spin_bug+0x98/0xa4
>    do_raw_spin_lock+0x5c/0xdc
>    _raw_spin_lock_irqsave+0x38/0x48
>    mwifiex_flush_data+0x2c/0xa4 [mwifiex]
>    call_timer_fn+0xcc/0x1c4
>    run_timer_softirq+0x264/0x4f0
>    __do_softirq+0x1a8/0x35c
>    do_softirq+0x54/0x64
>    netif_rx_ni+0xe8/0x120
>    mwifiex_recv_packet+0xfc/0x10c [mwifiex]
>    mwifiex_process_rx_packet+0x1d4/0x238 [mwifiex]
>    mwifiex_11n_dispatch_pkt+0x190/0x1ac [mwifiex]
>    mwifiex_11n_rx_reorder_pkt+0x28c/0x354 [mwifiex]

TL;DR: the problem was right here ^^^
where you started running mwifiex_11n_dispatch_pkt() (via
mwifiex_11n_scan_and_dispatch()) while holding a spinlock.

When you do that, you eventually call netif_rx_ni(), which specifically
defers to softirq contexts. Then, if you happen to have your flush timer
expiring just before that, you end up in mwifiex_flush_data(), which
also needs that spinlock.

There are a few possible ways to handle this:
(a) prevent processing softirqs in that context; e.g., with
    local_bh_disable(). This seems somewhat of a hack.
    (Side note: I think most of the locks in this driver really could be
    spin_lock_bh(), not spin_lock_irqsave() -- we don't really care
    about hardirq context for 99% of these locks.)
(b) restructure so that packet processing (e.g., netif_rx_ni()) is done
    outside of the spinlock.

It's actually not that hard to do (b). You can just queue your skb's up
in a temporary sk_buff_head list and process them all at once after
you've finished processing the reorder table. I have a local patch to do
this, and I might send it your way if I can give it a bit more testing.

Brian

>    mwifiex_process_sta_rx_packet+0x204/0x26c [mwifiex]
>    mwifiex_handle_rx_packet+0x15c/0x16c [mwifiex]
>    mwifiex_rx_work_queue+0x104/0x134 [mwifiex]
>    worker_thread+0x4cc/0x72c
>    kthread+0x134/0x13c
>    ret_from_fork+0x10/0x18
> 
> This was clearly not tested well at all. I simply performed 'wget' in a
> loop and it fell over within a few seconds.
> 
> Fixes: 5188d5453bc9 ("mwifiex: restructure rx_reorder_tbl_lock usage")
> Cc: <stable@...r.kernel.org>
> Cc: Ganapathi Bhat <gbhat@...vell.com>
> Signed-off-by: Brian Norris <briannorris@...omium.org>