lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Sun, 26 May 2019 15:58:19 -0400
From:   Brian Masney <masneyb@...tation.org>
To:     Arend Van Spriel <arend.vanspriel@...adcom.com>
Cc:     Adrian Hunter <adrian.hunter@...el.com>,
        Franky Lin <franky.lin@...adcom.com>,
        Hante Meuleman <hante.meuleman@...adcom.com>,
        Chi-Hsien Lin <chi-hsien.lin@...ress.com>,
        Wright Feng <wright.feng@...ress.com>, ulf.hansson@...aro.org,
        faiz_abbas@...com, linux-mmc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-arm-msm@...r.kernel.org,
        Kalle Valo <kvalo@...eaurora.org>,
        linux-wireless@...r.kernel.org,
        brcm80211-dev-list.pdl@...adcom.com,
        brcm80211-dev-list@...ress.com, netdev@...r.kernel.org
Subject: Re: Issue with Broadcom wireless in 5.2rc1 (was Re: [PATCH] mmc:
 sdhci: queue work after sdhci_defer_done())

On Sun, May 26, 2019 at 08:42:21PM +0200, Arend Van Spriel wrote:
> On 5/26/2019 2:21 PM, Brian Masney wrote:
> > + Broadcom wireless maintainers
> > 
> > On Fri, May 24, 2019 at 11:49:58AM -0400, Brian Masney wrote:
> > > On Fri, May 24, 2019 at 03:17:13PM +0300, Adrian Hunter wrote:
> > > > On 24/05/19 2:10 PM, Brian Masney wrote:
> > > > > WiFi stopped working on the LG Nexus 5 phone and the issue was bisected
> > > > > to the commit c07a48c26519 ("mmc: sdhci: Remove finish_tasklet") that
> > > > > moved from using a tasklet to a work queue. That patch also changed
> > > > > sdhci_irq() to return IRQ_WAKE_THREAD instead of finishing the work when
> > > > > sdhci_defer_done() is true. Change it to queue work to the complete work
> > > > > queue if sdhci_defer_done() is true so that the functionality is
> > > > > equilivent to what was there when the finish_tasklet was present. This
> > > > > corrects the WiFi breakage on the Nexus 5 phone.
> > > > > 
> > > > > Signed-off-by: Brian Masney <masneyb@...tation.org>
> > > > > Fixes: c07a48c26519 ("mmc: sdhci: Remove finish_tasklet")
> > > > > ---
> > > > > [ ... ]
> > > > > 
> > > > >   drivers/mmc/host/sdhci.c | 2 +-
> > > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> > > > > index 97158344b862..3563c3bc57c9 100644
> > > > > --- a/drivers/mmc/host/sdhci.c
> > > > > +++ b/drivers/mmc/host/sdhci.c
> > > > > @@ -3115,7 +3115,7 @@ static irqreturn_t sdhci_irq(int irq, void *dev_id)
> > > > >   			continue;
> > > > >   		if (sdhci_defer_done(host, mrq)) {
> > > > > -			result = IRQ_WAKE_THREAD;
> > > > > +			queue_work(host->complete_wq, &host->complete_work);
> > > > 
> > > > The IRQ thread has a lot less latency than the work queue, which is why it
> > > > is done that way.
> > > > 
> > > > I am not sure why you say this change is equivalent to what was there
> > > > before, nor why it fixes your problem.
> > > > 
> > > > Can you explain some more?
> > > 
> > > [ ... ]
> > > 
> > > drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c calls
> > > sdio_claim_host() and it appears to never return.
> > 
> > When the brcmfmac driver is loaded, the firmware is requested from disk,
> > and that's when the deadlock occurs in 5.2rc1. Specifically:
> > 
> > 1) brcmf_sdio_download_firmware() in
> >     drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c calls
> >     sdio_claim_host()
> > 
> > 2) brcmf_sdio_firmware_callback() is called and brcmf_sdiod_ramrw()
> >     tries to claim the host, but has to wait since its already claimed
> >     in #1 and the deadlock occurs.
> 
> This does not make any sense to me. brcmf_sdio_download_firmware() is called
> from brcmf_sdio_firmware_callback() so they are in the same context. So #2
> is not waiting for #1, but something else I would say. Also #2 calls
> sdio_claim_host() after brcmf_sdio_download_firmware has completed so
> definitely not waiting for #1.

I attached a patch that shows how I was able to determine what had
already claimed the host. It's messy; please don't judge me negatively
for this. :) Anyways, sdio_claim_host() is mostly a wrapper for
__mmc_claim_host() and there is a mmc_ctx structure that contains a
task struct. This context can be NULL. I added a description field
to the context structure and put the function name that claimed the
host in there. The mmc_host structure already contained a 'claimer'
member, so that made it easy.

I see the following messages in dmesg that shows what has already
claimed the host when loading the brcmfmac module in 5.2rc1:

cfg80211: Loading compiled-in X.509 certificates for regulatory database
cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
cfg80211: failed to load regulatory.db
brcmfmac: brcmf_fw_alloc_request: using brcm/brcmfmac4339-sdio for chip BCM4339/2
brcmfmac mmc1:0001:1: Direct firmware load for brcm/brcmfmac4339-sdio.lge,hammerhead.txt failed with error -2
------------[ cut here ]------------
WARNING: CPU: 0 PID: 5 at drivers/mmc/core/core.c:819 __mmc_claim_host+0x28c/0x2c0
Modules linked in: brcmfmac brcmutil cfg80211 dm_mod
CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.2.0-rc1-00175-g9899510d2cd1-dirty #420
Hardware name: Generic DT based system
Workqueue: events request_firmware_work_func
[<c03122dc>] (unwind_backtrace) from [<c030d5dc>] (show_stack+0x10/0x14)
[<c030d5dc>] (show_stack) from [<c0ac284c>] (dump_stack+0x78/0x8c)
[<c0ac284c>] (dump_stack) from [<c0321438>] (__warn.part.3+0xb8/0xd4)
[<c0321438>] (__warn.part.3) from [<c03215b0>] (warn_slowpath_null+0x44/0x4c)
[<c03215b0>] (warn_slowpath_null) from [<c093e608>] (__mmc_claim_host+0x28c/0x2c0)
[<c093e608>] (__mmc_claim_host) from [<bf115018>] (brcmf_sdiod_ramrw+0x9c/0x200 [brcmfmac])
[<bf115018>] (brcmf_sdiod_ramrw [brcmfmac]) from [<bf110508>] (brcmf_sdio_firmware_callback+0xe8/0x7b4 [brcmfmac])
[<bf110508>] (brcmf_sdio_firmware_callback [brcmfmac]) from [<bf108830>] (brcmf_fw_request_done+0xf0/0x110 [brcmfmac])
[<bf108830>] (brcmf_fw_request_done [brcmfmac]) from [<c081a4e8>] (request_firmware_work_func+0x4c/0x88)
[<c081a4e8>] (request_firmware_work_func) from [<c033c260>] (process_one_work+0x1fc/0x564)
[<c033c260>] (process_one_work) from [<c033ceb8>] (worker_thread+0x44/0x584)
[<c033ceb8>] (worker_thread) from [<c034226c>] (kthread+0x148/0x150)
[<c034226c>] (kthread) from [<c03010e8>] (ret_from_fork+0x14/0x2c)
Exception stack(0xee8bdfb0 to 0xee8bdff8)
dfa0:                                     00000000 00000000 00000000 00000000
dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
---[ end trace 4ab1b01efc876120 ]---
mmc_host mmc1: __mmc_claim_host: FIXME - before schedule() - descr=brcmf_sdiod_ramrw, claimer=brcmf_sdio_download_firmware

The 'after schedule()' line is not shown and WiFi doesn't work.

> > I tried to release the host before the firmware is requested, however
> > parts of brcmf_chip_set_active() needs the host to be claimed, and a
> > similar deadlock occurs in brcmf_sdiod_ramrw() if I claim the host
> > before calling brcmf_chip_set_active().
> > 
> > I started to look at moving the sdio_{claim,release}_host() calls out of
> > brcmf_sdiod_ramrw() but there's a fair number of callers, so I'd like to
> > get feedback about the best course of action here.
> 
> Long ago Franky reworked the sdio critical sections requiring sdio
> claim/release and I am pretty sure they are correct.
> 
> Could you try with lockdep kernel and see if that brings any more
> information. In the mean time I will update my dev branch to 5.2-rc1 and see
> if I can find any clues.

My .config has CONFIG_LOCKDEP_SUPPORT enabled. I haven't used lockdep
but my understanding is that it should print something in dmesg if a
deadlock occurs. I assume it won't pick up cases like this where
schedule() is called.

Brian

View attachment "0001-troubleshoot-broadcom-wireless-lockup-in-5.2rc1.patch" of type "text/x-diff" (18270 bytes)

Powered by blists - more mailing lists