linux-kernel - Re: BUG in mmc: core: Disable card detect during shutdown

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAPDyKFrjH8c=2LYkzj81jm7t-sy-EBs3AMzAS7M=LEHsh9qCCA@mail.gmail.com>
Date:   Fri, 3 Jun 2022 12:46:28 +0200
From:   Ulf Hansson <ulf.hansson@...aro.org>
To:     "H. Nikolaus Schaller" <hns@...delico.com>
Cc:     Discussions about the Letux Kernel <letux-kernel@...nphoenux.org>,
        kernel@...a-handheld.com, aTc <atc@...-p.org>,
        Tony Lindgren <tony@...mide.com>,
        linux-omap <linux-omap@...r.kernel.org>,
        linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-mmc@...r.kernel.org
Subject: Re: BUG in mmc: core: Disable card detect during shutdown

On Mon, 30 May 2022 at 18:55, H. Nikolaus Schaller <hns@...delico.com> wrote:
>
> Hi Ulf,
> users did report a strange issue that the OMAP5 based Pyra does not
> shutdown if a kernel 5.10.116 is used.
>
> Someone did a bisect and found that reverting
>
> 0d66b395210c5084c2b7324945062c1d1f95487a
>
> resp. upstream
>
> 66c915d09b942fb3b2b0cb2f56562180901fba17
>
> solves it.
>
> I could now confirm that it also happens with v5.18.0.
> But interestingly only on the Pyra handheld device and not
> on the omap5evm (which is supported by mainline).
>
> The symptom is:
>
> a) without revert
>
> root@...ux:~# poweroff
>
> Broadcast message from root@...ux (console) (Sat Jan  1 01:08:25 2000):
>
> The system is going down for system halt NOW!
> INIT: Sending processes the TERM signal
> root@...ux:~# [info] Using makefile-style concurrent boot in runlevel 0.
> [....] Stopping cgroup management proxy daemon: cgproxy[....] Stopping cgroup management daemon: cgmanager[....] Stop[ ok bluetooth: /usr/sbin/bluetoothd.
> [FAIL] Stopping ISC DHCP server: dhcpd failed!
> dhcpcd[3055]: sending signal 15 to pid 2976
> dhcpcd[3055]: waiting for pid 2976 to exit
> [ ok ] Shutting down ALSA...done.
> [ ok ] Asking all remaining processes to terminate...done.
> [ ok ] All processes ended within 2 seconds...done.
> [ ok [[c[....] Stopping enhanced syslogd: rsyslogd.
> [ ok ....] Deconfiguring network interfaces...done.
> ^[[c[info] Saving the system clock.
> [info] Hardware Clock updated to Sat Jan  1 01:08:30 UTC 2000.
> [ ok ] Deactivating swap...done.
> ^[[c[   77.289332] EXT4-fs (mmcblk0p2): re-mounted. Quota mode: none.
> [info] Will now halt.
>
> b) with reverting your patch
>
> root@...ux:~# uname -a
> Linux letux 5.18.0-letux-lpae+ #9678 SMP PREEMPT Mon May 30 18:02:28 CEST 2022 armv7l GNU/Linux
> root@...ux:~# poweroff
>
> Broadcast message from root@...ux (console) (Sat Jan  1 01:39:15 2000):
>
> The system is going down for system halt NOW!
> INIT: Sending processes the TERM signal
> root@...ux:~# [info] Using makefile-style concurrent boot in runlevel 0.
> [FAIL] Stopping cgroup management proxy daemon: cgproxy[....] Stopping ISC DHCP server: dhcpd failed!
> [....] Stopping cgroup management daemon: cgmanagerdhcpcd[3100]: sending signal 15 to pid 3013
> dhcpcd[3100]: waiting for pid 3013 to exit
> [ ok ] Stopping bluetooth: /usr/sbin/bluetoothd.
> [ ok ] Shutting down ALSA...done.
> [ ok ] Asking all remaining processes to terminate...done.
> [ ok ] All processes ended within 3 seconds...done.
> [ ok [[c[....] Stopping enhanced syslogd: rsyslogd.
> [ ok ....] Deconfiguring network interfaces...done.
> ^[[c[info] Saving the system clock.
> [info] Hardware Clock updated to Sat Jan  1 01:39:21 UTC 2000.
> [ ok ] Deactivating swap...done.
> ^[[c[   44.563256] EXT4-fs (mmcblk0p2): re-mounted. Quota mode: none.
> [info] Will now halt.
> [   46.917534] reboot: Power down
>
>
> What I suspect is that we have multiple mmc interfaces and have
> card detect wired up in the Pyra while it is ignored in the
> EVM. Is it possible that __mmc_stop_host() never returns in
> .shutdown_pre if card detect is set up (and potentially
> shut down earlier)?
>
> Setup of mmc is done in omap5-board-common.dtsi and omap5.dtsi.
>
> Out Pyra has a non-upstream device tree where we use
> omap5-board-common.dtsi and overwrite it by e.g.
>
> &mmc4 { /* second (u)SD slot (SDIO capable) */
>         status = "okay";
>         vmmc-supply = <&ldo2_reg>;
>         pinctrl-names = "default";
>         pinctrl-0 = <&mmc4_pins>;
>         bus-width = <4>;
>         cd-gpios = <&gpio3 13 GPIO_ACTIVE_LOW>; /* gpio3_77 */
>         wp-gpios = <&gpio3 15 GPIO_ACTIVE_HIGH>;        /* gpio3_79 */
> };
>
> But I have tried to remove the cd-gpois and wp-gpois. Or make the
> mmc interface being disabled (but I may not have catched everything
> in first place).
>
> Then I added some printk to mmc_stop_host() and __mmc_stop_host().
>
> mmc_stop_host() is not called but __mmc_stop_host() is called 4 times.
> There are 4 active MMC interfaces in the Pyra - 3 for (µ)SD slots
> and one for an SDIO WLAN module.
>
> Now it looks as if 3 of them are properly teared down (two of them
> seem to have host->slot.cd_irq >= 0) but on the fourth call
> cancel_delayed_work_sync(&host->detect); does not return. This is
> likely the location of the stall why we don't see a "reboot: Power down"
>
> Any ideas?

I guess the call to cancel_delayed_work_sync() in __mmc_stop_host()
hangs for one of the mmc hosts. This shouldn't happen - and indicates
that there is something else being wrong.

See more suggestions below.

>
> BR and thanks,
> Nikolaus
>
> printk hack:
>
> void __mmc_stop_host(struct mmc_host *host)
> {
> printk("%s 1\n", __func__);
>         if (host->slot.cd_irq >= 0) {
> printk("%s 2\n", __func__);
>                 mmc_gpio_set_cd_wake(host, false);
> printk("%s 3\n", __func__);
>                 disable_irq(host->slot.cd_irq);
> printk("%s 4\n", __func__);
>         }
>
>         host->rescan_disable = 1;
> printk("%s 5\n", __func__);

My guess is that it's the same mmc host that causes the hang. I
suggest you print the name of the host too, to verify that. Something
along the lines of the below.

printk("%s: %s 5\n", mmc_hostname(host), __func__);

>         cancel_delayed_work_sync(&host->detect);
> printk("%s 6\n", __func__);

Ditto.

> }
>
> resulting log:
>
> [info] Will now halt.
> [  282.780929] __mmc_stop_host 1
> [  282.784276] __mmc_stop_host 2
> [  282.787735] __mmc_stop_host 3
> [  282.791030] __mmc_stop_host 4
> [  282.794235] __mmc_stop_host 5
> [  282.797369] __mmc_stop_host 6
> [  282.800918] __mmc_stop_host 1
> [  282.804269] __mmc_stop_host 5
> [  282.807541] __mmc_stop_host 6
> [  282.810715] __mmc_stop_host 1
> [  282.813842] __mmc_stop_host 2
> [  282.816984] __mmc_stop_host 3
> [  282.820175] __mmc_stop_host 4
> [  282.823302] __mmc_stop_host 5
> [  282.826449] __mmc_stop_host 6
> [  282.830941] __mmc_stop_host 1
> [  282.834076] __mmc_stop_host 5
>
> --- here should be another __mmc_stop_host 6
> --- and reboot: Power down

When/if you figured out that it's the same host that hangs, you could
try to disable that host through the DTS files (add status =
"disabled" in the device node, for example) - and see if that works.

Kind regards
Uffe