linux-kernel - RE: Why CMA allocater fails if there is a signal pending?

Message-ID: <VI1PR04MB5327DE30DCECAD65554F64018B5F0@VI1PR04MB5327.eurprd04.prod.outlook.com>
Date:   Tue, 26 Mar 2019 02:21:22 +0000
From:   Peter Chen <peter.chen@....com>
To:     Florian Fainelli <f.fainelli@...il.com>,
        Russell King - ARM Linux admin <linux@...linux.org.uk>,
        Peter Chen <hzpeterchen@...il.com>
CC:     Michal Nazarewicz <mina86@...a86.com>,
        Andy Duan <fugang.duan@....com>,
        "linux-usb@...r.kernel.org" <linux-usb@...r.kernel.org>,
        lkml <linux-kernel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-arm-kernel@...ts.infradead.org" 
        <linux-arm-kernel@...ts.infradead.org>,
        Marek Szyprowski <m.szyprowski@...sung.com>
Subject: RE: Why CMA allocater fails if there is a signal pending?

 
> 
> On 3/25/19 3:26 AM, Russell King - ARM Linux admin wrote:
> > On Mon, Mar 25, 2019 at 04:37:09PM +0800, Peter Chen wrote:
> >> Hi Michal & Marek,
> >>
> >> I have hit an issue where a DMA allocation (backed by CMA) fails if a
> >> user signal, e.g. Ctrl+C, is pending. This causes the USB xHCI stack to
> >> fail to resume because dma_alloc_coherent fails. It is easy to reproduce
> >> if the user presses Ctrl+C during a suspend/resume test.
> >
> > It has been possible in the past for cma_alloc() to take seconds or
> > longer to allocate, depending on the size of the CMA area and the
> > number of pinned GFP_MOVABLE pages within the CMA area.  Whether that
> > is true of today's CMA or not, I don't know.
> >
> > It's probably there to allow such a situation to be recoverable, but
> > is not a good idea if we're expecting dma_alloc_*() not to fail in
> > those scenarios.
> >
> 
> This is a known issue that was discussed here before:
> 
> http://lists.infradead.org/pipermail/linux-arm-kernel/2014-November/299265.html
> 
> One issue is that the process responsible for putting the system to sleep, which
> is also the one being resumed (it can be as simple as your shell doing an
> 'echo "standby" > /sys/power/state'), can be killed, and that propagates
> throughout dpm_resume(). It is debatable whether the signal should be ignored
> or not, probably not.
> 
> You can work around this by wrapping your echo to /sys/power/state in a shell
> script that traps the signal and, say, does an exit 1. AFAIR there are many places
> where a dma_alloc_* allocation can fail, and not all drivers are designed to recover
> correctly.
 
Thanks, Florian.

This workaround does not help: the kernel receives the INT signal after the
freezable tasks have been frozen, and on resume the drivers' resume callbacks
run before the freezable application (echo mem > /sys/power/state) is thawed.
I added a signal trap to the script; its message is the last line of the
output below.
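
For reference, a wrapper of the kind suggested above could look roughly like
the sketch below. This is not the exact script used for the test; the function
name suspend_with_trap and its optional path argument are illustrative
assumptions, and only the trap message is taken from the output below.

```shell
#!/bin/sh
# Sketch only: trap SIGINT around the write that triggers suspend.
# suspend_with_trap and its optional path argument are hypothetical names.
suspend_with_trap() {
    state_file="${1:-/sys/power/state}"
    # Report the signal and exit non-zero, as suggested in the quoted reply.
    trap 'echo "signal INT received, script ending"; exit 1' INT
    # On a real system this write returns only after the system has resumed.
    echo mem > "$state_file"
    status=$?
    # Restore the default SIGINT handling.
    trap - INT
    return "$status"
}
```

Calling suspend_with_trap with no argument writes to /sys/power/state. The trap
handler can only run once the task has been thawed, which is why its message
shows up after "PM: suspend exit" in the output below, too late to affect the
allocation made during device resume.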

rtcwakeup.out: wakeup from "mem" using rtc0 at Fri Feb 22 21:42:14 2019
[  594.728338] PM: suspend entry (deep)
[  594.731970] PM: Syncing filesystems ... done.
[  594.740272] Freezing user space processes ... (elapsed 0.001 seconds) done.
[  594.748751] OOM killer disabled.
[  594.751995] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[  594.760660] Suspending console(s) (use no_console_suspend to debug)
^C^C^C[  595.437113] PM: suspend devices took 0.672 seconds
[  595.450647] Disabling non-boot CPUs ...
[  595.464590] CPU1: shutdown
[  595.464597] psci: CPU1 killed.
[  595.488493] CPU2: shutdown
[  595.507831] psci: Retrying again to check for CPU kill
[  595.507835] psci: CPU2 killed.
[  595.524423] CPU3: shutdown
[  595.543821] psci: Retrying again to check for CPU kill
[  595.543826] psci: CPU3 killed.
[  595.544247] fail to power on resource 289
[  595.544277] Enabling non-boot CPUs ...
[  595.545046] Detected VIPT I-cache on CPU1
[  595.545073] GICv3: CPU1: found redistributor 1 region 0:0x0000000051b20000
[  595.545113] CPU1: Booted secondary processor [410fd042]
[  595.545749]  cache: parent cpu1 should not be sleeping
[  595.545956] CPU1 is up
[  595.546654] Detected VIPT I-cache on CPU2
[  595.546673] GICv3: CPU2: found redistributor 2 region 0:0x0000000051b40000
[  595.546698] CPU2: Booted secondary processor [410fd042]
[  595.547036]  cache: parent cpu2 should not be sleeping
[  595.547213] CPU2 is up
[  595.547910] Detected VIPT I-cache on CPU3
[  595.547927] GICv3: CPU3: found redistributor 3 region 0:0x0000000051b60000
[  595.547953] CPU3: Booted secondary processor [410fd042]
[  595.548293]  cache: parent cpu3 should not be sleeping
[  595.548490] CPU3 is up
[  596.511052] usb usb1: root hub lost power or was reset
[  596.511060] usb usb2: root hub lost power or was reset
[  596.513302] cma: cma_alloc: alloc failed, req-size: 1 pages, ret: -4
[  596.723913] hub 1-0:1.0: hub_ext_port_status failed (err = -32)
[  596.723917] hub 2-0:1.0: hub_ext_port_status failed (err = -32)
[  596.724010] hub 2-0:1.0: hub_ext_port_status failed (err = -32)
[  596.724044] usb usb2-port1: cannot disable (err = -32)
[  596.727600] PM: resume devices took 1.156 seconds
[  596.887164] OOM killer enabled.
[  596.890337] Restarting tasks ... done.
[  596.897893] PM: suspend exit
signal INT received, script ending

Peter