lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAPDyKFoEHFpvvDgkuMSyyTMyCsmXv_Psg9uO+BjXcP0QSeE4EA@mail.gmail.com>
Date:   Wed, 10 Jan 2018 10:26:31 +0100
From:   Ulf Hansson <ulf.hansson@...aro.org>
To:     "Rafael J. Wysocki" <rjw@...ysocki.net>
Cc:     Geert Uytterhoeven <geert@...ux-m68k.org>,
        Linux PM <linux-pm@...r.kernel.org>,
        Kevin Hilman <khilman@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Lukas Wunner <lukas@...ner.de>,
        Yoshihiro Shimoda <yoshihiro.shimoda.uh@...esas.com>,
        Linux-Renesas <linux-renesas-soc@...r.kernel.org>,
        USB list <linux-usb@...r.kernel.org>
Subject: Re: [PATCH v2] PM / runtime: Rework pm_runtime_force_suspend/resume()

On 9 January 2018 at 17:28, Rafael J. Wysocki <rjw@...ysocki.net> wrote:
> On Tuesday, January 9, 2018 5:03:18 PM CET Rafael J. Wysocki wrote:
>> On Tue, Jan 9, 2018 at 4:30 PM, Geert Uytterhoeven <geert@...ux-m68k.org> wrote:
>> > Hi Rafael,
>> >
>> > CC usb
>> >
>> > On Tue, Jan 9, 2018 at 4:00 PM, Rafael J. Wysocki <rafael@...nel.org> wrote:
>> >> On Tue, Jan 9, 2018 at 2:37 PM, Geert Uytterhoeven <geert@...ux-m68k.org> wrote:
>> >>> On Wed, Jan 3, 2018 at 12:06 PM, Rafael J. Wysocki <rjw@...ysocki.net> wrote:
>> >>>> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>> >>>>
>> >>>> One of the limitations of pm_runtime_force_suspend/resume() is that
>> >>>> if a parent driver wants to use these functions, all of its child
>> >>>> drivers have to do that too because of the parent usage counter
>> >>>> manipulations necessary to get the correct state of the parent during
>> >>>> system-wide transitions to the working state (system resume).
>> >>>> However, that limitation turns out to be artificial, so remove it.
>> >>>>
>> >>>> Namely, pm_runtime_force_suspend() only needs to update the children
>> >>>> counter of its parent (if there's is a parent) when the device can
>> >>>> stay in suspend after the subsequent system resume transition, as
>> >>>> that counter is correct already otherwise.  Now, if the parent's
>> >>>> children counter is not updated, it is not necessary to increment
>> >>>> the parent's usage counter in that case any more, as long as the
>> >>>> children counters of devices are checked along with their usage
>> >>>> counters in order to decide whether or not the devices may be left
>> >>>> in suspend after the subsequent system resume transition.
>> >>>>
>> >>>> Accordingly, modify pm_runtime_force_suspend() to only call
>> >>>> pm_runtime_set_suspended() for devices whose usage and children
>> >>>> counters are at the "no references" level (the runtime PM status
>> >>>> of the device needs to be updated to "suspended" anyway in case
>> >>>> this function is called once again for the same device during the
>> >>>> transition under way), drop the parent usage counter incrementation
>> >>>> from it and update pm_runtime_force_resume() to compensate for these
>> >>>> changes.
>> >>>>
>> >>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>> >>>
>> >>> This patch causes a regression during system resume on Renesas Salvator-XS
>> >>> with R-Car H3 ES2.0:
>> >>
>> >> I have dropped it for now, but we need to address the underlying issue.
>> >>
>> >>>     SError Interrupt on CPU3, code 0xbf000002 -- SError
>> >>>     SError Interrupt on CPU2, code 0xbf000002 -- SError
>> >>>     CPU: 3 PID: 1769 Comm: kworker/u16:13 Not tainted
>> >>> 4.15.0-rc7-arm64-renesas-05338-gf14cf570a813c9ca-dirty #97
>> >>>     CPU: 2 PID: 1774 Comm: kworker/u16:18 Not tainted
>> >>> 4.15.0-rc7-arm64-renesas-05338-gf14cf570a813c9ca-dirty #97
>> >>>     Hardware name: Renesas Salvator-X 2nd version board based on
>> >>> r8a7795 ES2.0+ (DT)
>> >>>     Hardware name: Renesas Salvator-X 2nd version board based on
>> >>> r8a7795 ES2.0+ (DT)
>> >>>     Workqueue: events_unbound async_run_entry_fn
>> >>>     Workqueue: events_unbound async_run_entry_fn
>> >>>     pstate: 60000005 (nZCv daif -PAN -UAO)
>> >>>     pstate: 60000005 (nZCv daif -PAN -UAO)
>> >>>     pc : rcar_gen3_phy_usb2_init+0x34/0xf8
>> >>>     pc : rcar_gen3_phy_usb2_init+0x34/0xf8
>> >>>     lr : phy_init+0x64/0xcc
>> >>>     lr : phy_init+0x64/0xcc
>> >>>     ...
>> >>>     Kernel panic - not syncing: Asynchronous SError Interrupt
>> >>>
>> >>> Note that before, it printed a warning instead:
>> >>>
>> >>>     Enabling runtime PM for inactive device (ee0a0200.usb-phy) with
>> >>> active children
>> >>>     WARNING: CPU: 0 PID: 1741 at drivers/base/power/runtime.c:1300
>> >>> pm_runtime_enable+0x94/0xd8
>> >>>
>> >>> Reverting commit 0408584d580d4a2c ("PM / runtime: Rework
>> >>> pm_runtime_force_suspend/resume()") fixes the crash.
>> >>>
>> >>> Note that applying Ulf's "[PATCH v2 0/3] phy: core: Re-work runtime PM
>> >>> deployment and fix an issue"
>> >>> (https://www.spinics.net/lists/linux-renesas-soc/msg21719.html) instead
>> >>> does not fix the crash.
>> >>
>> >> OK
>> >>
>> >>> With more debug code added, it seems the EHCI module clocks (701-703) are
>> >>> enabled earlier than before. I guess this triggers the workqueue to perform
>> >>> an operation while another related device (HSUSB 704?) is still disabled?
>> >>
>> >> Probably.
>> >>
>> >> Likely a device that wasn't resumed before resumes now and that causes
>> >> the issue to appear.
>> >>
>> >> I'm wondering if adding the ignore_children check to the patch helps.
>> >> If not, there clearly is a resume ordering issue that is papered over
>> >> by the current code.
>> >
>> > Something fishy is going on. Status of the USB PHYs differ before/after
>> > system suspend, according to /sys/kernel/debug/pm_genpd/pm_genpd_summary:
>> >
>> > -    /devices/platform/soc/ee0a0200.usb-phy              active
>> > -    /devices/platform/soc/ee0c0200.usb-phy              active
>> > -    /devices/platform/soc/ee080200.usb-phy              active
>> > +    /devices/platform/soc/ee0a0200.usb-phy              suspended
>> > +    /devices/platform/soc/ee0c0200.usb-phy              suspended
>> > +    /devices/platform/soc/ee080200.usb-phy              suspended
>>
>> Yeah.
>>
>> That's because of the pm_runtime_force_suspend/resume() called by
>> genpd.  These functions generally may cause devices active before
>> system suspend to be left in suspend after it.  That generally is a
>> good idea if the device was not really in use before the system
>> suspend, but there is a problem that the driver of it may not be
>> prepared for that to happen (and also the way to determine the "not
>> really in use" case may not be as expected by the driver).
>
> So below is something to try (untested).
>
> I know that Ulf will be protesting, but that's what I would do unless there
> are really *really* good reasons not to.

I am not protesting! :-)

Instead I think we are rather in agreement that we are concerned about
that genpd are invoking pm_runtime_force_suspend|resume().

>
>
> ---
>  drivers/base/power/domain.c |   34 +++++++++++++++++++++-------------
>  1 file changed, 21 insertions(+), 13 deletions(-)
>
> Index: linux-pm/drivers/base/power/domain.c
> ===================================================================
> --- linux-pm.orig/drivers/base/power/domain.c
> +++ linux-pm/drivers/base/power/domain.c
> @@ -1048,8 +1048,9 @@ static int genpd_finish_suspend(struct d
>         if (ret)
>                 return ret;
>
> -       if (genpd->dev_ops.stop && genpd->dev_ops.start) {
> -               ret = pm_runtime_force_suspend(dev);
> +       if (genpd->dev_ops.stop && genpd->dev_ops.start &&
> +           !pm_runtime_status_suspended(dev)) {
> +               ret = genpd_stop_dev(genpd, dev);

Something like this existed there before, but because of other
internal genpd reasons. It's an option but there are issues with it.

I think it's fragile because invoking the ->stop() callback may
trigger calls to "disable" resources (like clocks, pinctrls, etc) for
the device. Doing this at ->suspend_noirq() may be too late, as that
subsystem/driver for the resource(s) may already be system suspended.
This is the classic problem of suspending (and later resuming) devices
in in-correct order.

Today we deal with this, by drivers/subsystem assigning the right
level of callback, as well as making sure the "dpm_list" is sorted
correctly (yeah, we have device links as well).

The point is, we can go for this solution, but is it good enough?

Another option, is to simply to remove (and not replace with something
else) the calls to pm_runtime_force_ suspend|resume() from genpd. This
moves the entire responsibility to the driver, to put its device into
low power state during system suspend, but with the benefit of that it
can decide to use the correct level of suspend/resume callbacks.

No matter how we do it, changing this may introduce some potential
regressions from a power consumption point of view, however although
only for those SoCs that uses the ->start|stop() callbacks. To
mitigate these power regressions, some drivers needs to be fixed, but
that seems inevitable anyway, doesn't it?

[...]

Kind regards
Uffe

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ