lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3332408.jE0xQCEvom@workhorse>
Date: Thu, 04 Sep 2025 14:50:13 +0200
From: Nicolas Frattaroli <nicolas.frattaroli@...labora.com>
To: Ulf Hansson <ulf.hansson@...aro.org>
Cc: linux-pm@...r.kernel.org, linux-kernel@...r.kernel.org,
 kernel@...labora.com, linux-rockchip@...ts.infradead.org,
 Cristian Ciocaltea <cristian.ciocaltea@...labora.com>,
 Sebastian Reichel <sebastian.reichel@...labora.com>,
 Heiko Stuebner <heiko@...ech.de>
Subject: Re: [PATCH] pmdomian: core: don't unset stay_on during sync_state

On Thursday, 4 September 2025 11:17:01 Central European Summer Time Ulf Hansson wrote:
> On Tue, 2 Sept 2025 at 20:23, Nicolas Frattaroli
> <nicolas.frattaroli@...labora.com> wrote:
> >
> > This reverts commit de141a9aa52d6b2fbeb63f98975c2c72276f0878.
> 
> I can't find this commit hash. What tree are you using when testing this?
> 
> Are you trying to revert 0e789b491ba04c31de5c71249487593e386baa67 ?

Probably, I did the revert on a rebased branch and then rebased the revert
onto v6.17-rc4 so it likely is the wrong hash here. I'll fix this in v2 if
there is a v2 (it might actually become a different patch, see huge text
below, sorry!)

> 
> >
> > On RK3576, the UFS controller's power domain has a quirk that requires
> > it to stay enabled, infrastricture for which was added in Commit
> > cd3fa304ba5c ("pmdomain: core: Introduce dev_pm_genpd_rpm_always_on()").
> >
> > Unfortunately, Commit de141a9aa52d ("pmdomain: core: Leave powered-on
> > genpds on until sync_state") appears to break this quirk wholesale. The
> > result is that RK3576 devices with the UFS controller enabled but unused
> > will freeze once pmdomain shuts off unused domains.
> >
> > Revert it until a better fix can be found.
> 
> This sounds a bit vague to me, can you please clarify and elaborate a
> bit more so I can try to help.
> 
> What does "UFS controller enabled but unused" actually mean? Has the
> UFS controller driver been probed successfully and thus its
> corresponding device been attached to its PM domain?

It means the UFS controller driver has probed, but does not find a
UFS storage chip connected to it, and therefore reports a probe
failure. This is a possibility on single-board computers like the
Radxa ROCK 4D, where the UFS storage is a separate module that plugs
into some headers.

> Moreover, the behaviour of dev_pm_genpd_rpm_always_on() is orthogonal
> to what 0e789b491ba0 ("pmdomain: core: Leave powered-on genpds on
> until sync_state") brings along with its corresponding sync_state
> series for genpd [1]. Again, more information is needed to understand
> what goes wrong.

The reason why Rockchip's UFS driver needs this function is that once
the RK3576_PD_USB power domain is turned on on the RK3576 SoC, it should
not be toggled off again until a whole SoC suspend/resume cycle because
the off/on operation is seemingly not idempotent. This does not preclude
turning off the power domain if the device isn't used at all, e.g. the DT
node is absent. This is why the affected PDs are not marked as always-on
in the PD init data for the SoC, as the device driver is the best place to
set this during runtime.

The way I got to this commit is through a bisect between the UFS node
being enabled on the ROCK 4D (commit 00abee2b18342d6c2f6f37225682fa7ca0d33142)
and v6.17-rc4. The bisect landed on the pmdomain merge commit
(commit fc8f5028eb0cc5aee0501a99f59a04f748fbff1c) as the first bad commit,
so I checked out v6.16, cherry-picked the UFS node enablement, made sure it
works fine with that, and then rebased the 44 pmdomain commits that this
merge commit merged onto this base. I verified that the tip of that then
exhibited the faulty behaviour, namely that my ROCK 4D was locking up
some time after boot, right after the kernel log message

    [   33.756516] vdd_npu_s0: disabling

so presumably when unused regulators and domains were being disabled.
Aside note: setting vdd_npu_s0 to always-on also works to work around
the issue, and I'm not quite sure why, because this regulator is not
used for anything right now so this may be some peculiar SoC silicon
design where VDD_NPU is leaking into the part that the UFS PD actually
should be gating, preventing the lockup on accident.

Anyway, so I did a bisect between the UFS introduction and the rebased
tip of the pmdomain branch, and landed on the commit I'm reverting.

The problem exhibits itself not when the affect PD is first turned off,
but when the NPU regulator is turned off. How could this be relevant to
power domains at all? I have no idea.

I admit that I don't understand the commit I'm reverting, as it talks of
keeping powered-on genpds on, but the code sets a member called "stay_on"
to false.

However, reverting it 100% reproducibly fixes the observed lockup. The
lockup does not occur if an UFS module is connected to the SBC.

It seems `dev_pm_genpd_rpm_always_on` is never run if UFS experiences
a probe failure, so I'm not entirely sure how this specific commit
changes the behaviour in a way that makes it unhappy. I agree the
solution is probably not a revert here. Also, adding an unconditional
`dev_pm_genpd_rpm_always_on` in the failing UFS probe path doesn't
work, likely because the driver is unbound. Maybe this is a complete
red herring and `dev_pm_genpd_rpm_always_on` is unrelated.

The only other device that uses RK3576_PD_USB is the usb_drd0_dwc3
usb controller. This node is not enabled in my rebase-pd-onto-ufs-enable
branch, so even assuming the problem is that the usb driver is missing
the same `dev_pm_genpd_rpm_always_on` call, that shouldn't matter
becuase nothing else is using that power domain. Unless, of course, our
description of that power domain is incomplete, which is possible.

The problem may be that this was always racey and the genpd changes
just made the race go wrong more often.

Another observation: my kernel log during afflicted bootups contains

    rockchip-pm-domain 27380000.power-management:power-controller: sync_state() pending due to 2a2d0000.ufshc

A further observation: pd_ignore_unused does not fix it. Looking at
the revert and experimenting, removing the `stay_on = false` is not
what actually fixes my problem, but removing the
`#ifndef CONFIG_PM_GENERIC_DOMAINS_OF`, even without setting
pd_ignore_unused, fixes my issue.

This is probably a big enough clue to suss out what's going on; that
I *need* "[    2.868987] PM: genpd: Disabling unused power domains"
to run quite early before the regulator that feeds VDD_NPU is disabled
after the bootup completes unless an UFS module is present or the UFS
controller is disabled makes me think this is another peculiarity of
the RK3576 power domains hardware, because none of this I just wrote
sounds like the words of a sane human.

Kind regards,
Nicolas Frattaroli

> 
> Kind regards
> Uffe
> 
> [1]
> https://lore.kernel.org/all/20250701114733.636510-1-ulf.hansson@linaro.org/
> 
> >
> > Fixes: de141a9aa52d ("pmdomain: core: Leave powered-on genpds on until sync_state")
> > Signed-off-by: Nicolas Frattaroli <nicolas.frattaroli@...labora.com>
> > ---
> >  drivers/pmdomain/core.c | 4 ----
> >  1 file changed, 4 deletions(-)
> >
> > diff --git a/drivers/pmdomain/core.c b/drivers/pmdomain/core.c
> > index 0006ab3d078972cc72a6dd22a2144fb31443e3da..4eba30c7c2fabcb250444fee27d7554473a4d0c2 100644
> > --- a/drivers/pmdomain/core.c
> > +++ b/drivers/pmdomain/core.c
> > @@ -1357,7 +1357,6 @@ static int genpd_runtime_resume(struct device *dev)
> >         return ret;
> >  }
> >
> > -#ifndef CONFIG_PM_GENERIC_DOMAINS_OF
> >  static bool pd_ignore_unused;
> >  static int __init pd_ignore_unused_setup(char *__unused)
> >  {
> > @@ -1393,7 +1392,6 @@ static int __init genpd_power_off_unused(void)
> >         return 0;
> >  }
> >  late_initcall_sync(genpd_power_off_unused);
> > -#endif
> >
> >  #ifdef CONFIG_PM_SLEEP
> >
> > @@ -3494,7 +3492,6 @@ void of_genpd_sync_state(struct device_node *np)
> >         list_for_each_entry(genpd, &gpd_list, gpd_list_node) {
> >                 if (genpd->provider == of_fwnode_handle(np)) {
> >                         genpd_lock(genpd);
> > -                       genpd->stay_on = false;
> >                         genpd_power_off(genpd, false, 0);
> >                         genpd_unlock(genpd);
> >                 }
> > @@ -3522,7 +3519,6 @@ static void genpd_provider_sync_state(struct device *dev)
> >
> >         case GENPD_SYNC_STATE_SIMPLE:
> >                 genpd_lock(genpd);
> > -               genpd->stay_on = false;
> >                 genpd_power_off(genpd, false, 0);
> >                 genpd_unlock(genpd);
> >                 break;
> >
> > ---
> > base-commit: 5cc61f86dff464a63b6a6e4758f26557fda4d494
> > change-id: 20250902-rk3576-lockup-regression-5b1f1fb7ff21
> >
> > Best regards,
> > --
> > Nicolas Frattaroli <nicolas.frattaroli@...labora.com>
> >
> 





Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ