lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 18 Feb 2019 13:12:54 +0100
From:   "Rafael J. Wysocki" <rafael@...nel.org>
To:     Jon Hunter <jonathanh@...dia.com>
Cc:     Ulf Hansson <ulf.hansson@...aro.org>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux PM <linux-pm@...r.kernel.org>,
        Daniel Vetter <daniel@...ll.ch>,
        Lukas Wunner <lukas@...ner.de>,
        Andrzej Hajda <a.hajda@...sung.com>,
        Russell King - ARM Linux <linux@...linux.org.uk>,
        Lucas Stach <l.stach@...gutronix.de>,
        Linus Walleij <linus.walleij@...aro.org>,
        Thierry Reding <thierry.reding@...il.com>,
        Laurent Pinchart <laurent.pinchart@...asonboard.com>,
        Marek Szyprowski <m.szyprowski@...sung.com>,
        linux-tegra <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH 2/2] driver core: Fix possible supplier PM-usage counter imbalance

 On Fri, Feb 15, 2019 at 5:44 PM Jon Hunter <jonathanh@...dia.com> wrote:
>
>
> On 15/02/2019 14:37, Ulf Hansson wrote:
> > On Fri, 15 Feb 2019 at 12:00, Jon Hunter <jonathanh@...dia.com> wrote:
> >>
> >> Hi Rafael,
> >>
> >> On 12/02/2019 12:08, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> >>>
> >>> If a stateless device link to a certain supplier with
> >>> DL_FLAG_PM_RUNTIME set in the flags is added and then removed by the
> >>> consumer driver's probe callback, the supplier's PM-runtime usage
> >>> counter will be nonzero after that which effectively causes the
> >>> supplier to remain "always on" going forward.
> >>>
> >>> Namely, device_link_add() called to add the link invokes
> >>> device_link_rpm_prepare() which notices that the consumer driver is
> >>> probing, so it increments the supplier's PM-runtime usage counter
> >>> with the assumption that the link will stay around until
> >>> pm_runtime_put_suppliers() is called by driver_probe_device(),
> >>> but if the link goes away before that point, the supplier's
> >>> PM-runtime usage counter will remain nonzero.
> >>>
> >>> To prevent that from happening, first rework pm_runtime_get_suppliers()
> >>> and pm_runtime_put_suppliers() to use the rpm_active refounts of device
> >>> links and make the latter only drop rpm_active and the supplier's
> >>> PM-runtime usage counter for each link by one, unless rpm_active is
> >>> one already for it.  Next, modify device_link_add() to bump up the
> >>> new link's rpm_active refcount and the suppliers PM-runtime usage
> >>> counter by two, to prevent pm_runtime_put_suppliers(), if it is
> >>> called subsequently, from suspending the supplier prematurely (in
> >>> case its PM-runtime usage counter goes down to 0 in there).
> >>>
> >>> Due to the way rpm_put_suppliers() works, this change does not
> >>> affect runtime suspend of the consumer ends of new device links (or,
> >>> generally, device links for which DL_FLAG_PM_RUNTIME has just been
> >>> set).
> >>>
> >>> Fixes: e2f3cd831a28 ("driver core: Fix handling of runtime PM flags in device_link_add()")
> >>> Reported-by: Ulf Hansson <ulf.hansson@...aro.org>
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> >>> ---
> >>>
> >>> Note that the issue had been there before commit e2f3cd831a28, but it was
> >>> overlooked by that commit and this change is a fix on top of it, so make
> >>> the Fixes: tag point to commit e2f3cd831a28 (instead of an earlier one
> >>> that the patch will not be applicable to).
> >> I noticed that yesterday's and today's -next were no longer booting on
> >> one of our Tegra boards (Tegra210 Jetson TX2) because networking is
> >> failing. The ethernet chip is a USB device and looking at the bootlogs I
> >> can see that the Tegra XHCI driver is failing ...
> >>
> >>  tegra-xusb 70090000.usb: xHCI host controller not responding, assume dead
> >>  tegra-xusb 70090000.usb: HC died; cleaning up
> >>
> >> The Tegra XHCI driver uses multiple power-domains and uses
> >> device_link_add() to attach them. So now I am wondering if there is
> >> something that we have got wrong in our implementation. However, I don't
> >> see the device being probed deferred on boot or anything like that.
> >>
> >> The driver in question is drivers/usb/host/xhci-tegra.c and we add the
> >> links in the function tegra_xusb_powerdomain_init() which is before RPM
> >> is enabled. Let me know if you have any thoughts.
> >
> > If you are willing to help debugging then I am offering my assistance.
> >
> > I would start by enabling CONFIG_PM_ADVANCED_DEBUG, which gives you
> > some more information about the runtime PM state of the device, like
> > the usage count for example.
> > I would also add a couple of prints in
> > tegra_xusb_runtime_suspend|resume() and in the ->power_on|off()
> > callbacks for the corresponding genpds, to see when those gets called.
>
> From the bootlog I see ...
>
> [    4.445827] tegra_xusb_runtime_resume-788
> [    4.508799] tegra-xusb 70090000.usb: Firmware timestamp: 2015-08-10 09:47:54 UTC

This message comes from tegra_xusb_load_firmware() in
tegra_xusb_probe() which is after the pm_runtime_get_sync().

If the device was PM-runtime-suspended before, the
pm_runtime_get_sync() will runtime-resume and reference-count the
suppliers in addition to resuming the device.  In that case
pm_runtime_put_suppliers() will suspend the suppliers, so there is a
bug in there.

What happens is that the links are new when pm_runtime_get_sync() runs
and so their rpm_active refcounts are one.  After the
pm_runtime_get_sync() they are two and pm_runtime_put_suppliers() will
drop them by one and drop the PM-runtime usage counter of each of them
by one, so they will become zero and the suppliers will suspend.

Passing DL_FLAG_RPM_ACTIVE to device_link_add() should help, but IMO
things should also work without that.

> [    4.516223] tegra-xusb 70090000.usb: xHCI Host Controller
> [    4.521622] tegra-xusb 70090000.usb: new USB bus registered, assigned bus number 1

This comes from usb_add_hcd()

> [    4.530087] tegra-xusb 70090000.usb: hcc params 0x0184f525 hci version 0x100 quirks 0x0000000000010010
> [    4.539398] tegra-xusb 70090000.usb: irq 69, io mem 0x70090000
> [    4.553671] tegra-xusb 70090000.usb: xHCI Host Controller
> [    4.559064] tegra-xusb 70090000.usb: new USB bus registered, assigned bus number 2

Like this.

> [    4.566622] tegra-xusb 70090000.usb: Host supports USB 3.0  SuperSpeed

And this if from xhci_gen_setup(), so probe returns around this point.

> [    4.595393] tegra-pmc: tegra_genpd_power_off-673: xusbc
> [    4.600672] tegra-pmc: tegra_genpd_power_off-673: xusba

And this appears to be done by pm_runtime_put_suppliers().

Hmm, I need to think how to fix this.  Maybe we'll need to revert
$subject patch and do something else, we'll see (later today).

Powered by blists - more mailing lists